Our current working groups are listed below.
Single-cell genomics has had a transformative impact on basic biology and biomedical research (Regev et al., 2017). What is still missing to enable robust solutions in clinical trials, health research and translation is the ability to comprehensively capture all metadata associated with individual cells (Puntambekar et al., 2021). Metadata in this context is highly multi-layered and complex, tightly intertwining technical and biological (sample-level) metadata. Addressing these requirements will require new standards and technical solutions to document, annotate and query properties across scales: from cellular identity, state and behavior under disease perturbation, through technical covariates such as sample location and sequencing depth, to tissue state and patient information, including clinical covariates, other genomics and imaging modalities, and disease progression.
CellTrack builds on the track record of the Stegle and Theis labs who have pioneered computational methods to analyze single-cell data in biomedical settings and have contributed to major international consortia such as the Human Cell Atlas (HCA). We will also leverage existing research and infrastructures established at HMGU/DKFZ, which allow for managing, processing and sharing genomic data. The activities in this project will directly feed into highly visible national and international infrastructure activities, most notably the German Human Genome-Phenome Archive – a national genomics platform funded by the NFDI, and scVerse – a community platform to derive core infrastructure and interoperable software for key analytics tasks in single-cell genomics.
While single-cell genomics is rapidly entering common biological and biomedical use, the field still lacks consistent integration with data management beyond count matrices, and even more so with metadata management – arguably because of differences in scale (cell vs. patient) and scope (research vs. clinics). To address these issues, we propose (1) a metadata schema, (2) an implementation and (3) use cases for robustly tracking, storing and managing metadata in single-cell genomics.
The overall goal of CellTrack is to provide a consistent encoding of genomic metadata, thereby reducing many of the common errors related to identifier mapping.
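As an illustration of what such a layered schema could look like, the following sketch uses plain Python dataclasses to separate sample-level, assay-level and per-cell metadata and to link them through stable identifiers. All class names, fields and example values are assumptions for illustration only, not the CellTrack schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative only: class and field names are assumptions, not the CellTrack schema.

@dataclass
class SampleMetadata:
    """Biological (sample-level) metadata layer."""
    donor_id: str                      # stable donor identifier
    tissue: str                        # e.g. an UBERON term
    disease: Optional[str] = None      # e.g. a MONDO term, None for healthy controls
    clinical_covariates: dict = field(default_factory=dict)

@dataclass
class AssayMetadata:
    """Technical metadata layer."""
    sample_id: str                     # links back to the sample
    assay: str                         # e.g. "10x 3' v3"
    sequencing_depth: int              # mean reads per cell
    batch: Optional[str] = None

@dataclass
class CellRecord:
    """Per-cell record tying identifiers across layers."""
    cell_id: str
    sample_id: str
    cell_type: Optional[str] = None    # e.g. a Cell Ontology term
    cell_state: Optional[str] = None

# A single consistent identifier space is what avoids identifier-mapping errors.
sample = SampleMetadata(donor_id="D001", tissue="lung")
assay = AssayMetadata(sample_id="D001-S1", assay="10x 3' v3", sequencing_depth=50_000)
cell = CellRecord(cell_id="D001-S1-AAACCTG", sample_id="D001-S1")
```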
At most laser-plasma research laboratories, the type and format of experimental data and metadata are heterogeneous and complex. The data originate from several distinct sources and occur at various levels and at different times during an experimental campaign. Some metadata, such as project names and proposal IDs, usually appear quite early in an experimental project, while exact time stamps of experimental data or acquisition parameters of diagnostics are generated at runtime. Similarly, data can originate as early as target pre-characterization, during the experiment’s setup phase and diagnostic calibration runs, or ultimately from actual laser shots. Furthermore, the configuration and status of deployed diagnostics, and therefore the experimental arrangement, are often subject to change during a campaign, sometimes at very short notice and without prior planning, as a consequence of experimental results. Beyond this, there is a strong need for better data integration and enrichment in the field of high-intensity laser-plasma physics in an international context. This has become clear during several online events, e.g. the LPA Online Workshop on Machine Learning and Control Systems, the Laserlab-Europe – ELI – CASUS workshop or NFDI events. Setting out from this status quo, and given their leading expertise in laser-driven experiments, HZDR, HI Jena and GSI will develop a metadata standard for the high-intensity laser-plasma community, with an initial emphasis on ion facilities during this project.
Proposed Work
- Glossary and Ontology: A significant effort is required to conceive a sensible, widely applicable dictionary of concepts for data and metadata associated with the field. A reasonably close exchange with the worldwide community is planned via project observers and dedicated workshops. lead: HI Jena
- Technical preparation: Prepare the openPMD standard and its API for custom hierarchies and datasets in general, and demonstrate interoperability between NeXus and openPMD in particular. openPMD is a meta-standard originally developed as a data format for a high-performance simulation code and has recently been adopted by other simulation codes, enabling interoperability and easing analysis efforts thanks to the software ecosystem that already accompanies openPMD. There is a strong interest in increasing and facilitating the exchange between simulations and experiments within the laser-plasma community. NeXus, on the other hand, is a metadata standard for experiments in the photon and neutron science community. We plan to overcome the present boundaries between the two standards (a minimal usage sketch follows this list). lead: HZDR
- Application of the new metadata standard to concrete cases at the participating centers for rapid evaluation:
  GSI: There is no widely used metadata format yet; extend the PHELIX database (PSDB) towards the new standard and generate data and metadata.
  HI Jena: Conduct a pilot beamtime and generate experimental data with metadata.
  HZDR: Apply the new standard to research data at HZDR and generate a data object on RODARE to demonstrate FAIR access.
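As a rough illustration of the technical preparation described above, the following sketch uses the openpmd-api Python bindings to write a placeholder measurement together with experiment-level metadata attributes. The file layout, attribute names and values are assumptions for illustration, not part of the standard to be developed.

```python
import numpy as np
import openpmd_api as io

# Minimal sketch with the openpmd-api Python bindings; all attribute names
# and values below are illustrative, not part of the standard to be developed.
series = io.Series("shot_data_%T.h5", io.Access.create)
series.set_attribute("facility", "example-laser-facility")   # hypothetical metadata
series.set_attribute("proposalID", "P-0000")                  # hypothetical metadata

it = series.iterations[0]
spectrum = it.meshes["ion_spectrum"][io.Mesh_Record_Component.SCALAR]

counts = np.zeros(256, dtype=np.float64)                      # placeholder detector data
spectrum.reset_dataset(io.Dataset(counts.dtype, counts.shape))
spectrum.store_chunk(counts)
spectrum.set_attribute("diagnostic", "Thomson parabola")      # hypothetical metadata

series.flush()
del series  # closes the file
```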
Photoelectron emission spectroscopy (PES) has matured into a versatile tool for characterizing the electronic properties of novel quantum materials. While historically PES was used to access the density of states of materials in one-dimensional energy scans, data sets nowadays provide detailed views of band dispersions and topologies and have become indispensable across modern materials-based sciences. The latest innovation in this field – photoelectron momentum microscopy (MM) – applies the principles of high-resolution imaging to record tomographic sections of the electronic structure in a high-dimensional parameter space. Despite the rapid worldwide adoption of MM as a universal materials characterization tool, there is currently no universal scheme to describe the highly diverse set of parameters – for example, but not limited to, the momentum vector, energy, electron spin and light polarization states – that characterize an MM experiment. Implementing the findable, accessible, interoperable and reusable (FAIR) principles in momentum microscopy mandates new metadata schemes that describe the abundance of experimental parameters and thus link measured data voxels to the electronic properties of a material. The aim of M³eta is to establish such an extensible and sustainable metadata scheme for momentum microscopy, stored in a structured file together with the measured data voxels. This will be the basis for an automated and interactive tool-chain that interprets the stored metadata and uses this information to reconstruct views of the multi-dimensional electronic structure of a material.
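To illustrate the general idea of storing data voxels and descriptive metadata side by side in one structured file, the sketch below uses h5py to write a placeholder momentum-microscopy scan with a few experimental parameters as attributes. Group names, attribute names and values are assumptions, not the M³eta scheme.

```python
import numpy as np
import h5py

# Sketch only: group and attribute names are assumptions, not the M³eta scheme.
voxels = np.zeros((128, 128, 64), dtype=np.float32)  # placeholder (k_x, k_y, E) data

with h5py.File("mm_scan.h5", "w") as f:
    entry = f.create_group("entry")
    data = entry.create_dataset("data/counts", data=voxels, compression="gzip")
    data.attrs["axes"] = ["k_x", "k_y", "energy"]

    # Experimental parameters that a tool-chain could later interpret
    instrument = entry.create_group("instrument")
    instrument.attrs["photon_energy_eV"] = 21.2        # hypothetical value
    instrument.attrs["light_polarization"] = "p"       # hypothetical value
    instrument.attrs["spin_resolved"] = False

    sample = entry.create_group("sample")
    sample.attrs["name"] = "example quantum material"  # hypothetical value
    sample.attrs["temperature_K"] = 30.0
```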
Manufacturing of composite parts involves multiple process steps, from the production of semi-finished materials to their processing and assembly. At each level of production, large datasets can be produced to trace back each state of the composite material and relate it to the final quality of the manufactured structure. With the help of the recently developed data management system shepard, the MEMAS project aims at storing and connecting these manufacturing data in a persistent way. It focuses particularly on the standardization, collection and annotation of metadata and on their automatic transfer into a simulation environment to estimate the actual structural performance. Taking into account potential defects resulting from the manufacturing techniques or induced by the surrounding environment will improve finite-element methods and their accuracy. Furthermore, the developed tools will support the manufacturing field and highlight the consequences of manufacturing parameters on the structural behaviour, enabling adjustments of the process parameters after each produced part. Finally, the persistent and structured storage of research data and their metadata in the form of FAIR Digital Objects or with the help of DataCrates will support long-term data analysis and a deeper understanding of manufacturing techniques.
To this end, software solutions will be developed for two exemplary manufacturing processes, tape laying and additive manufacturing, and combined into a general toolchain. The potential of the developed methodology will be tested by performing mechanical tests on representative parts and comparing the results with the numerically predicted behaviour. All acquired experimental, manufacturing and simulation data, metadata formats and scientific results will be shared with the HMC community via open-source solutions such as the Zenodo platform.
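As a purely illustrative sketch of how metadata from successive manufacturing steps might be linked by identifiers and handed on to a simulation pre-processor, the example below serializes two hypothetical process-step records to JSON. The field names and values are assumptions and do not reflect the shepard data model.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

# Illustrative sketch; field names are assumptions, not the shepard data model.

@dataclass
class ProcessStep:
    step_id: str                 # persistent identifier of this manufacturing step
    predecessor: Optional[str]   # link to the previous step, giving full traceability
    process: str                 # e.g. "tape laying" or "additive manufacturing"
    parameters: dict = field(default_factory=dict)      # recorded machine settings
    observed_defects: List[str] = field(default_factory=list)  # annotations for simulation

steps = [
    ProcessStep("step-001", None, "tape laying",
                {"laydown_speed_mm_s": 50, "tool_temperature_C": 210}),
    ProcessStep("step-002", "step-001", "additive manufacturing",
                {"layer_height_mm": 0.2}, ["local porosity near insert"]),
]

# Serialize so a finite-element pre-processor could pick up the defect annotations.
with open("part_0001_metadata.json", "w") as fh:
    json.dump([asdict(s) for s in steps], fh, indent=2)
```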
Currently, data collection is mostly done by the instrument teams on various ESA, NASA, JAXA and other agencies’ missions. Different data products, often even from the same satellite mission, use different formats and rarely follow the standard metadata practices accepted in more coordinated communities such as the atmospheric and oceanic sciences. Moreover, data versions and attributes, instrument PIDs and workflows are not properly recorded, which makes reproduction of results practically impossible. As a consequence of this lack of standardization in both data access and format, the accessibility and reusability of data provided by satellite missions with budgets of up to several hundred million Euros is substantially limited. As an example, NASA’s flagship Van Allen Probes mission included a number of instruments, and each of the instrument teams used different metadata standards as well as different data formats. Reconstruction of the historical behavior of the radiation belts is even more complicated, as most of the historical data are written in binary formats, sometimes with little documentation.
Similarly, the quantification of precipitating fluxes from radiation measurements, needed as input for atmospheric models, is often difficult, as the properties relevant for estimating precipitation quantities are either not provided or difficult to obtain. The situation is somewhat similar for ionospheric observational data, which are growing exponentially. Numerous ionospheric measurements from GNSS satellites, science missions such as COSMIC I and COSMIC II, and now commercial fleets such as Spire provide a vast amount of data described in a variety of metadata formats.
Initial efforts have been made to introduce standardization for radiation belt physics. The Committee on Space Research (COSPAR) Panel on Radiation Belt Environment Modeling (PRBEM) developed the “Standard file format guidelines for particle count rate” for data stored in CDF format. NASA’s Space Physics Data Facility (SPDF) makes use of these guidelines for several products, but uses different formats for different communities of data providers and stakeholders. The format contains attributes that can hold metadata describing the data content, but it does not hold information about workflows, nor does it make use of persistent identifiers. For ionospheric sciences, DLR Neustrelitz pioneered the introduction of formats for the ionospheric community during its involvement in the CHAMP and GRACE satellite missions as an operator for the generation and distribution of ionospheric products. Later, DLR’s involvement in several national (SWACI, IMPC) and EU projects such as ESPAS and PITHIA-NRF led to the development of the first preparatory standards for ionospheric products. The increasing use of data assimilation and machine learning, which require vast amounts of data from different sources, makes this project most timely.
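As an illustration of how such CDF-based metadata can be inspected programmatically, the sketch below uses the cdflib Python package to read global and per-variable attributes (ISTP-style names such as UNITS and CATDESC). The file name and attribute selection are hypothetical, and a recent cdflib version is assumed.

```python
import cdflib

# Sketch: the file name is hypothetical; a PRBEM/ISTP-style file carries its
# metadata in global and per-variable CDF attributes.
cdf = cdflib.CDF("rbsp_example_counts.cdf")

info = cdf.cdf_info()                 # lists zVariables, CDF version, etc.
global_attrs = cdf.globalattsget()    # mission, instrument, PI, data version, ...
print(global_attrs.get("Mission_group"), global_attrs.get("Data_version"))

for var in info.zVariables:           # inspect per-variable metadata
    attrs = cdf.varattsget(var)
    print(var, attrs.get("UNITS"), attrs.get("CATDESC"))
```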
The main goal of PATOF is to make the data of a number of participating experiments fully publicly available and FAIR as far as possible. The work builds on first experience gained at the Mainz A4 nucleon structure experiment (1999-2012).
Here, the analysis, reorganisation, preparation and subsequent publication of the A4 data and the A4 analysis environment according to FAIR principles shall be achieved. The lessons learned at A4 will then be applied to other experiments at DESY (ALPS II, LUXE) and Mainz (PRIMA, P2), collectively called the APPLe experiments. In the process, the project also aims to produce a general and living cookbook – or at least a first collection of recipes – on how to create metadata for PUNCH experiments and how to make their data FAIR.
The cookbook will capture the methodology for making individual experiment-specific metadata schemas FAIR. Another output is the “FAIR metadata factory”, i.e. a process to create a naturally evolved metadata schema for different experiments by extending the DataCite schema without discarding the original metadata concepts.
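As an illustration of the “extend rather than discard” idea, the sketch below builds a DataCite-style record in Python and keeps hypothetical experiment-specific fields alongside the DataCite core. The extension field names and values are assumptions, not the actual PATOF or A4 schema.

```python
import json

# Core fields follow the DataCite metadata schema; the "a4:*" entries are
# hypothetical experiment-specific extensions, not the actual PATOF schema.
record = {
    "identifier": {"identifier": "10.XXXX/example", "identifierType": "DOI"},  # placeholder DOI
    "creators": [{"creatorName": "A4 Collaboration"}],
    "titles": [{"title": "A4 parity-violating asymmetry data (example record)"}],
    "publisher": "Example repository",
    "publicationYear": "2024",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    # Domain metadata kept alongside the DataCite core instead of being discarded:
    "a4:beamEnergy_MeV": 855,                 # hypothetical extension field
    "a4:targetMaterial": "liquid hydrogen",   # hypothetical extension field
    "a4:runPeriod": "1999-2012",              # hypothetical extension field
}

print(json.dumps(record, indent=2))
```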
The HMC-funded STAMPLATE project aims to implement and establish the SensorThings API (STA) of the Open Geospatial Consortium (OGC) as a consistent, modern and lightweight data interface for time-series data. Using representative use cases from all seven research centers of the Helmholtz Research Field Earth & Environment, we ensure the transferability and applicability of our solutions for a wide range of measurement systems. Our project thus makes a decisive contribution towards a digital ecosystem and an interlinked, consistent, and FAIR research data infrastructure tailored towards time-series data from the environmental sciences.
Time-series data are crucial sources of reference information in all environmental sciences. Beyond research applications, the consistent and timely publication of such data is increasingly important for monitoring and issuing warnings, especially given the growing frequency of climatic extreme events. In this context, the seven Centers of the Helmholtz Research Field Earth and Environment (E&E) operate some of the largest environmental measurement infrastructures worldwide. These infrastructures range from terrestrial observation systems in the TERENO observatories and ship-borne sensors to airborne and space-based systems, such as those integrated into the IAGOS infrastructure.
In order to streamline and standardize the use of the huge amounts of data from these infrastructures, the seven Centers have jointly initiated the STAMPLATE project. This initiative aims to adopt the Open Geospatial Consortium (OGC) SensorThings API (STA) as a consistent and modern interface tailored to time-series data. We evaluate STA for representative use cases from the environmental sciences and enhance the core data model with additional crucial metadata such as data quality, data provenance and extended sensor metadata. We further integrate STA as the central data interface into community-based tools for, e.g., data visualization, data access, QA/QC and the management of observation systems. By connecting the different STA endpoints of the participating research Centers, we establish an interlinked research data infrastructure (RDI) and a digital ecosystem around the OGC SensorThings API tailored towards environmental time-series data.
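As an illustration of how such an STA endpoint can be queried, the sketch below uses Python and the requests library against a hypothetical SensorThings API base URL. Entity names and query options follow the OGC SensorThings API, while the URL and IDs are placeholders.

```python
import requests

# Hypothetical endpoint and IDs; the URL pattern follows the OGC SensorThings API.
BASE = "https://example.org/sta/v1.1"

# List a few Things (e.g. measurement stations) together with their Datastreams
things = requests.get(f"{BASE}/Things", params={"$top": 5, "$expand": "Datastreams"}).json()
for thing in things.get("value", []):
    print(thing["name"], [ds["name"] for ds in thing.get("Datastreams", [])])

# Fetch the latest observations of one datastream, newest first
obs = requests.get(
    f"{BASE}/Datastreams(1)/Observations",
    params={"$orderby": "phenomenonTime desc", "$top": 10},
).json()
for o in obs.get("value", []):
    print(o["phenomenonTime"], o["result"])
```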
With our project, we further want to promote STA for similar applications and communities beyond our research field. Ultimately, our goal is to provide an important building block towards fostering a more open, FAIR (Findable, Accessible, Interoperable, and Reusable), and harmonized research data landscape in the field of environmental sciences.