Our current working groups are listed below.
In an ever-changing world, field surveys, inventories and monitoring data are essential for prediction of biodiversity responses to global drivers such as land use and climate change. This knowledge provides the basis for appropriate management. However, field biodiversity data collected across terrestrial, freshwater and marine realms are highly complex and heterogeneous. The successful integration and re-use of such data depends on how FAIR (Findable, Accessible, Interoperable, Reusable) they are.
ADVANCE aims at underpinning rich metadata generation with interoperable metadata standards using semantic artefacts. These are tools allowing humans and machines to locate, access and understand (meta)data, thus facilitating integration and reuse of biodiversity monitoring data across terrestrial, freshwater and marine realms.
To this end, we revised, adapted and expanded existing metadata standards, thesauri and vocabularies. We focused on the most comprehensive database of biodiversity monitoring schemes in Europe (DaEuMon) as the base for building a metadata schema that implements quality control and complies with the FAIR principles.
In a further step, we will use biodiversity data to test, refine and illustrate the strength of the concept in cases of real use. ADVANCE thus complements semantic artefacts of the Hub Earth & Environment and other initiatives for FAIR biodiversity research, enabling assessments of the relationships between biodiversity across realms and associated environmental conditions. Moreover, it will facilitate future collaborations, joint projects and data-driven studies among biodiversity scientists of the Helmholtz Association and beyond.
Modern Earth sciences produce a continuously increasing amount of data. These data consist of measurements/observations and descriptive information (metadata) and include semantic classifications (semantics). Depending on the geoscientific parameter, metadata are stored in a variety of different databases, standards and semantics, which hampers interoperability by limiting data access and exchange, searchability and comparability. Examples of common data types with very different structures and metadata needs are maps, geochemical data derived from field samples, or time series measured with a sensor at a point, such as precipitation or soil moisture.
So far, there is a large gap between the capabilities of databases to capture metadata and their practical use. ALAMEDA is designed as a modular, structured metadata management platform for the curation, compilation, administration, visualization, storage and sharing of meta information from lab, field and modelling datasets. As a pilot application for stable isotope and soil moisture data, ALAMEDA will enable searching, accessing and comparing meta information across organizational, system and domain boundaries.
ALAMEDA covers five major categories: observations & measurements, sample & data history, sensors & devices, methods & processing, and environmental characteristics (spatial & temporal). These categories are hierarchically structured, interlinkable and filled with specific metadata attributes (e.g. name, date, location, methods for sample preparation, measurement and data processing). For the pilot, all meta information will be provided by existing and well-established data management tools (e.g. mDIS, Medusa).
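A minimal sketch of how such a hierarchical record could look, using hypothetical category and attribute names (the actual ALAMEDA schema may differ):

```python
import json

# Hypothetical, illustrative ALAMEDA-style metadata record for one soil moisture
# observation; category and attribute names are assumptions, not the real schema.
soil_moisture_record = {
    "observation_measurement": {
        "parameter": "soil moisture",
        "unit": "m^3/m^3",
        "timestamp": "2023-06-01T12:00:00Z",
    },
    "sensor_device": {
        "sensor_type": "TDR probe",
        "serial_number": "SN-12345",
    },
    "methods_processing": {
        "calibration": "factory default",
        "processing_level": "quality controlled",
    },
    "environmental_characteristics": {
        "location": {"lat": 51.35, "lon": 12.43, "depth_m": 0.1},
        "site": "example test site",
    },
}

print(json.dumps(soil_moisture_record, indent=2))
```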
In ALAMEDA, all information is brought together and will be available via web interfaces. Furthermore, the project focuses on features such as metadata curation with intuitive graphical user interfaces, the adoption of well-established standards, the use of domain-controlled vocabularies and the provision of interfaces for a standards-based dissemination of aggregated information. Finally, ALAMEDA should be integrated into the DataHub (Hub-Terra).
A general photovoltaic device and materials database compliant with the FAIR principles is expected to greatly benefit research and development of solar cells. Because data are currently heterogeneous across labs working on a variety of different materials and cell concepts, database development should be accompanied by ontology development. Based on a recently published literature database for perovskite solar cells, we have started developing an ontology for these devices and materials which could be extended to further photovoltaic applications. In order to facilitate data management at the lab scale and to allow easy upload of data and metadata to the database, electronic lab notebooks customized for perovskite solar cell research are being developed in cooperation with the NFDI-FAIRmat project.
Single-cell genomics has had a transformative impact on basic biology and biomedical research (Regev et al., 2017). What is missing to enable robust solutions in clinical trials, health research and translation is to comprehensively capture all metadata associated with individual cells (Puntambekar et al., 2021). Metadata in this context are highly multi-layered and complex, tightly intertwining technical and biological (sample-level) metadata. Addressing these requirements will require new standards and technical solutions to document, annotate and query properties across scales: from cellular identity, state and behavior under disease perturbation, through technical covariates such as sample location and sequencing depth, to tissue state and patient information, including clinical covariates, other genomics and imaging modalities, and disease progression.
CellTrack builds on the track record of the Stegle and Theis labs who have pioneered computational methods to analyze single-cell data in biomedical settings and have contributed to major international consortia such as the Human Cell Atlas (HCA). We will also leverage existing research and infrastructures established at HMGU/DKFZ, which allow for managing, processing and sharing genomic data. The activities in this project will directly feed into highly visible national and international infrastructure activities, most notably the German Human Genome-Phenome Archive – a national genomics platform funded by the NFDI, and scVerse – a community platform to derive core infrastructure and interoperable software for key analytics tasks in single-cell genomics.
While single-cell genomics is quickly approaching common biological and biomedical use, the field still lacks consistent integration with data management beyond count matrices, and even more so with metadata management – arguably due to differences in scale (cell vs. patient) and scope (research vs. clinics). To address these issues, we propose to (1) build a metadata schema, (2) provide an implementation, and (3) develop use cases for robustly tracking, storing and managing metadata in single-cell genomics.
The overall goal of CellTrack is to provide a consistent encoding of genomic metadata, thereby reducing many of the common errors related to identifier mapping.
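As an illustration of the kind of consistent encoding the project targets, the sketch below attaches sample-level and technical metadata to a single-cell count matrix using the widely used AnnData structure from the single-cell ecosystem; the field names are assumptions for illustration, not the CellTrack schema itself.

```python
import anndata as ad
import numpy as np

# Toy count matrix: 4 cells x 3 genes (illustrative only).
counts = np.random.poisson(1.0, size=(4, 3)).astype("float32")
adata = ad.AnnData(X=counts)

# Cell-level (biological + technical) metadata; column names are illustrative.
adata.obs["patient_id"] = ["P001", "P001", "P002", "P002"]
adata.obs["tissue"] = ["lung", "lung", "lung", "lung"]
adata.obs["disease_state"] = ["tumor", "normal", "tumor", "normal"]
adata.obs["sequencing_depth"] = counts.sum(axis=1)

# Dataset-level metadata, e.g. placeholder identifiers for samples and protocols.
adata.uns["sample_metadata"] = {
    "biosample_accession": "SAMEA0000000",   # placeholder identifier
    "protocol": "10x 3' v3",
    "consent": "research-only",
}

adata.write_h5ad("celltrack_example.h5ad")
```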
The seismological community has for decades promoted the standardisation of formats and services as well as open data policies, making easy data exchange an asset for this community. Data are thus made largely Findable and Accessible as well as Interoperable and Reusable, with enhancements expected for the latter two. However, this strict and technical domain-specific standardisation may complicate the sharing of more exotic data within the domain itself and hinder interoperability across the earth science community. Within eFAIRs, leveraging the know-how of the major OBS park operators and seismological data curators within the Helmholtz Association, we aim to facilitate the integration of special datasets from the ocean floor, enhancing interoperability and reusability.
To achieve this goal, the seismological data archive of GFZ, in close collaboration with AWI and GEOMAR and supported by IPGP, has created dedicated workflows for OBS data curation. In particular, in close interaction with AWI, new datasets have been archived, defining a new workflow that is being translated into guidelines for the community. Domain-specific software has been modified to allow OBS data inclusion with specific additional metadata. Among these metadata, persistent identifiers of the instruments in use have for the first time been included from the AWI sensor information system. Next steps will enlarge the portfolio of keywords and standard vocabularies in use to facilitate data discovery by scientists from different domains. Finally, we plan to adopt the developed workflows for routine OBS data management.
Metadata can be recorded, stored, and published very efficiently by using electronic lab notebooks (ELNs), which are a key prerequisite for a comprehensive documentation of research processes. However, the interdisciplinarity of modern research groups creates the need to use different ELNs or to use data interoperably across different ELNs. Despite the manifold ELNs on the market, an interface between ELNs has not yet been achieved due to missing (metadata) standards but also missing collaboration efforts.
In this project, an interface for metadata transfer between the open-source ELNs Chemotion (development led at KIT) and Herbie (developed at Hereon) will be created. The development will be demonstrated for the use case of polymer membrane research. This tool will improve the interoperability and reusability of metadata and thus support all ELN users relying on Chemotion and Herbie by enriching their datasets with data from the complementary ELN. The project will also aim to generalize the specific process via a general guideline and the implementation of available standards.
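A minimal sketch of what such a transfer could look like at the metadata level: a record exported from one ELN is mapped onto the field names expected by the other. All field names and the mapping table are hypothetical; the actual Chemotion and Herbie schemas will differ.

```python
# Hypothetical field mapping between two ELN metadata schemas.
# Field names are illustrative and not the actual Chemotion or Herbie keys.
CHEMOTION_TO_HERBIE = {
    "sample_name": "specimen_label",
    "molecular_weight": "polymer_molar_mass",
    "solvent": "casting_solvent",
    "created_at": "preparation_date",
}

def map_record(chemotion_record: dict) -> dict:
    """Translate a (hypothetical) Chemotion export into Herbie-style keys,
    keeping unmapped fields under a dedicated namespace."""
    herbie_record = {}
    for key, value in chemotion_record.items():
        target = CHEMOTION_TO_HERBIE.get(key)
        if target:
            herbie_record[target] = value
        else:
            herbie_record.setdefault("chemotion_extra", {})[key] = value
    return herbie_record

example = {
    "sample_name": "PM-042",
    "molecular_weight": 120000,
    "solvent": "NMP",
    "batch": "B7",
}
print(map_record(example))
```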
The developments will generate a direct benefit for the growing community of scientists using the two selected ELNs, as the ELN extensions and their metadata schemas will be adapted in interaction with the scientific community.
“FAIR Workflows to establish IGSN for Samples in the Helmholtz Association” (FAIR WISH) is a joint project between the Helmholtz Centres GFZ, AWI and Hereon funded within the HMC Project Cohort 2020 of the Helmholtz Metadata Collaboration Platform (HMC).
The International Generic Sample Number (IGSN) is a globally unique and persistent identifier (PID) for physical samples and collections with a discovery function on the internet. IGSNs make it possible to directly link data and publications with the samples they originate from and thus close one of the last gaps in the full provenance of research results.
FAIR WISH will (1) develop standardised and discipline-specific IGSN metadata schemes for different sample types within the research field Earth and Environment (EaE) that complement the core IGSN metadata schema; and (2) develop workflows to generate machine-readable IGSN metadata from different states of digitisation (from templates to databases) and to automatically register IGSNs. Our use cases were selected to include the large variety of sample types from different sub-disciplines across the project partners (e.g. terrestrial and marine environments; rock, soil, vegetation and water samples) and represent all states of digitisation: from individual scientists collecting sample descriptions in their field books to digital sample management systems fed by an app used in the field.
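To illustrate the template-to-registration idea, the sketch below turns rows of a hypothetical spreadsheet template into minimal machine-readable sample descriptions. The column names and the output structure are assumptions for illustration, not the FAIR WISH schema or the official IGSN registration interface.

```python
import csv
import io
import json

# Hypothetical template export; column names are illustrative only.
TEMPLATE_CSV = """sample_name,sample_type,method,lat,lon,collector,date
HE-2023-001,water,Niskin bottle,54.18,7.89,J. Doe,2023-05-12
HE-2023-002,soil,hand auger,54.20,7.91,J. Doe,2023-05-13
"""

def row_to_sample_metadata(row: dict) -> dict:
    """Map one template row to a minimal, machine-readable sample description."""
    return {
        "igsn": None,                         # assigned after registration
        "name": row["sample_name"],
        "sampleType": row["sample_type"],     # e.g. rock, soil, water
        "collectionMethod": row["method"],
        "location": {
            "latitude": float(row["lat"]),
            "longitude": float(row["lon"]),
        },
        "collector": row["collector"],
        "collectionDate": row["date"],
    }

records = [row_to_sample_metadata(r) for r in csv.DictReader(io.StringIO(TEMPLATE_CSV))]
print(json.dumps(records, indent=2))
```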
Imaging the environment is an essential component of the spatial sciences, covering nearly everything from the exploration of the ocean floor to the investigation of planetary surfaces. In and between both domains, imaging is applied at various scales – from microscopy through ambient imaging to remote sensing – and provides rich information for science. Due to the increasing number of data acquisition technologies, advances in imaging capabilities, and the growing number of platforms that provide imagery and related research data, data volumes in the natural sciences, and thus also in ocean and planetary research, continue to grow at an exponential rate. Although many datasets have already been collected and analyzed, the systematic, comparable, and transferable description of research data through metadata is still a big challenge in and for both fields. These descriptive elements are crucial to enable efficient (re)use of valuable research data, to prepare the scientific domains for data-analytical tasks such as machine learning and big data analytics, and to improve interdisciplinary science by research groups not directly involved in the data collection.
In order to manage, interpret, reuse and publish imaging data more effectively and efficiently, we here present a project to develop interoperable metadata recommendations in the form of FAIR digital objects (FDOs) for 5D (i.e. x, y, z, time, spatial reference) imagery of Earth and other planets. An FDO is a human- and machine-readable file for an entire image set; it does not contain the actual image data but references it through persistent identifiers (FAIR marine images). In addition to these core metadata, further descriptive elements are required to describe and quantify the semantic content of imaging research data. Such semantic components are similarly domain-specific, but again synergies are expected between Earth and planetary research.
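A schematic of what such an FDO could contain, assuming hypothetical field names and placeholder identifiers; the recommendation developed in the project may structure this differently:

```python
import json

# Illustrative FAIR-digital-object-style description of a 5D image set.
# Field names and identifiers are placeholders, not the project's final schema.
image_set_fdo = {
    "id": "hdl:21.T11998/0000-EXAMPLE",          # placeholder persistent identifier
    "type": "ImageSet",
    "title": "Example seafloor photo transect",
    "acquisition": {
        "platform": "ROV",
        "sensor": "downward-facing camera",
        "time_range": ["2022-05-01T08:00:00Z", "2022-05-01T10:00:00Z"],
    },
    "spatial_reference": "EPSG:4326",
    "images": [
        # The FDO references images by PID instead of embedding the pixel data.
        {"pid": "hdl:21.T11998/0000-IMG-0001",
         "x": -27.1, "y": 38.6, "z": -850.0,
         "time": "2022-05-01T08:00:12Z"},
    ],
    "license": "CC-BY-4.0",
}

print(json.dumps(image_set_fdo, indent=2))
```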
Project Partners: AWI, GEOMAR
Biomolecular data, e.g. DNA and RNA sequences, provides insights into the structure and functioning of marine communities in space and time. The associated metadata has great internal diversity and complexity, and to date biomolecular (meta)data management is not well integrated and harmonised across environmentally focused Helmholtz Centers.
As part of the HMC Project HARMONise, we aim to develop sustainable solutions and digital cultures to enable high-quality, standards-compliant curation and management of marine biomolecular metadata at AWI and GEOMAR, to better embed biomolecular science into broader digital ecosystems and research domains. Our approach builds on a relational database that aligns metadata with community standards such as the Minimum Information about any (x) sequence (MIxS) supported by the International Nucleotide Sequence Database Collaboration (INSDC), and with associated ontology content (e.g. The Environment Ontology - ENVO).
At the same time, we ensure the harmonization of metadata with existing Helmholtz repositories (e.g. PANGAEA). A web-portal for metadata upload and harvest will enable sustainable data stewardship and support researchers in delivering high-quality metadata to national and global repositories, and improve accessibility of the metadata.
Metadata subsets will be harvested by the Marine Data Portal (https://marine-data.de), increasing findability across research domains and promoting reuse of biomolecular research data. Alignment of the recorded metadata with community standards and relevant data exchange formats will support Helmholtz-wide and global interoperability.
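As a simplified illustration of MIxS-aligned curation, the record below uses field names in the spirit of the MIxS checklists (abbreviated here) together with example ENVO annotations; the project's actual database schema is more extensive, and ontology terms are resolved against ENVO during curation.

```python
import json

# Minimal illustration of a MIxS-style metadata record for a marine water
# sample associated with a sequencing run. Keys and values are simplified
# examples, not records from the HARMONise database.
sample_record = {
    "investigation_type": "metagenome",
    "project_name": "Example marine time-series station",
    "collection_date": "2023-03-15",
    "geo_loc_name": "Baltic Sea",
    "lat_lon": "54.53 N 10.04 E",
    "depth": "10 m",
    "env_broad_scale": "marine biome [ENVO:00000447]",
    "env_local_scale": "coastal water body [ENVO]",   # placeholder, resolve against ENVO
    "env_medium": "sea water [ENVO:00002149]",
    "seq_meth": "Illumina MiSeq",
    "target_gene": "16S rRNA",
}

print(json.dumps(sample_record, indent=2))
```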
HELIPORT is a data management solution that aims at making the components and steps of the entire research experiment’s life cycle discoverable, accessible, interoperable and reusable according to the FAIR principles.
Among other information, HELIPORT integrates documentation, scientific workflows, and the final publication of the research results - all via already established solutions for proposal management, electronic lab notebooks, software development and devops tools, and other additional data sources. The integration is accomplished by presenting the researchers with a high-level overview to keep all aspects of the experiment in mind, and automatically exchanging relevant metadata between the experiment’s life cycle steps.
Computational agents can interact with HELIPORT via a REST API that allows access to all components, and landing pages that allow for export of digital objects in various standardized formats and schemas. An overall digital object graph combining the metadata harvested from all sources provides scientists with a visual representation of interactions and relations between their digital objects, as well as their existence in the first place. Through the integrated computational workflow systems, HELIPORT can automate calculations using the collected metadata.
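The exact endpoints are specific to each HELIPORT instance, so the following sketch only illustrates the general pattern of a computational agent querying a REST API for the digital objects of a project; the base URL, endpoint path and JSON fields are assumptions, not the actual HELIPORT API.

```python
import requests

# Hypothetical HELIPORT-style REST query; URL, endpoint and response fields
# are illustrative placeholders.
BASE_URL = "https://heliport.example.org/api"
TOKEN = "..."  # authentication token of the agent

headers = {"Authorization": f"Token {TOKEN}"}
resp = requests.get(f"{BASE_URL}/projects/42/digital-objects/", headers=headers)
resp.raise_for_status()

for obj in resp.json():
    # Each digital object is expected to carry a PID and harvested metadata.
    print(obj.get("pid"), obj.get("type"), obj.get("title"))
```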
By visualising all aspects of large-scale research experiments, HELIPORT enables deeper insights into a comprehensible data provenance with the chance of raising awareness for data management.
At most laser-plasma research laboratories, the type and format of experimental data and metadata are heterogeneous and complex. The data originate from several distinct sources and occur at various levels and at different times during an experimental campaign. Some metadata, such as project names and proposal IDs, usually appear quite early in an experimental project, while exact time stamps of experimental data or acquisition parameters of diagnostics are generated at runtime. Similarly, data can originate as early as target pre-characterization, during the experiment’s setup phase and diagnostic calibration runs, or ultimately from actual laser shots. Furthermore, the configuration and status of deployed diagnostics, and therefore the experimental arrangement, are often subject to change during a campaign, sometimes at very short notice and without previous planning, as a consequence of experimental results. Beyond that, there is a strong need for better data integration and enrichment in the field of high-intensity laser-plasma physics in an international context. This has become clear during several online events, e.g. the LPA Online Workshop on Machine Learning and Control Systems, the Laserlab-Europe – ELI – CASUS workshop or NFDI events. Setting out from this status quo and given their leading expertise in laser-driven experiments, HZDR, HI Jena and GSI will develop a metadata standard for the high-intensity laser-plasma community, with an initial emphasis on ion facilities during this project.
Proposed Work
- Glossary and Ontology: A significant effort is required to conceive a sensible, widely applicable dictionary of concepts for the data and metadata associated with the field. A reasonably close exchange with the worldwide community is planned via project observers and dedicated workshops. Lead: HI Jena
- Technical preparation: Prepare the openPMD standard and its API for custom hierarchies and datasets in general, and demonstrate interoperability between NeXus and openPMD in particular (see the sketch after this list). openPMD is a meta-standard originally developed as a data format for a high-performance simulation code and is now being adopted by other simulation codes, enabling interoperability and easing e.g. analysis efforts thanks to the existing software ecosystem around openPMD. There is a strong interest in increasing and facilitating the exchange between simulations and experiments within the laser-plasma community. NeXus, on the other hand, is a metadata standard for experiments in the photon and neutron science community. We plan to overcome the present boundaries between the two standards. Lead: HZDR
- Application of the new metadata standard to concrete cases at participating centers for fast examination:
  - GSI: There is no widely used metadata format yet. Extend the PHELIX database (PSDB) towards the new standard; generate data and metadata.
  - HI Jena: Conduct a pilot beamtime and generate experimental data with metadata.
  - HZDR: Apply the new standard to research data at HZDR and generate a data object on RODARE to demonstrate FAIR access.
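Both openPMD and NeXus build on hierarchical containers such as HDF5, which is what makes a mapping between them conceivable. The sketch below writes a small HDF5 file with a NeXus-flavoured group layout and openPMD-style attributes side by side; the group names and attributes are illustrative assumptions, not the mapping the project will define.

```python
import h5py
import numpy as np

# Illustrative only: a tiny HDF5 file combining a NeXus-style entry with
# openPMD-style attributes. Group/attribute names are assumptions.
with h5py.File("shot_0001.h5", "w") as f:
    entry = f.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"          # NeXus base-class annotation
    entry.attrs["openPMD"] = "1.1.0"             # openPMD version tag (illustrative)

    instrument = entry.create_group("instrument")
    instrument.attrs["NX_class"] = "NXinstrument"
    instrument.attrs["laser_energy_J"] = 25.0    # example campaign metadata

    detector = instrument.create_group("ion_spectrometer")
    detector.attrs["NX_class"] = "NXdetector"
    detector.create_dataset("spectrum", data=np.random.rand(256))
    detector["spectrum"].attrs["unitSI"] = 1.0   # openPMD-style unit attribute
```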
HERMES is an acronym for “HElmholtz Rich MEtadata Software publication”.
To satisfy the principles of FAIR research software, software sustainability and software citation, research software must be formally published. Publication repositories make this possible and provide published software versions with unique and persistent identifiers. However, software publication is still a tedious, mostly manual process, which impedes promoting software to first-class research citizenship.
To streamline software publication, this project develops automated workflows to publish research software with rich metadata. Our tooling utilizes continuous integration solutions to retrieve, collate, and process existing metadata in source repositories, and publish them on publication repositories, including checks against existing metadata requirements. To accompany the tooling and enable researchers to easily reuse it, the project also provides comprehensive documentation and templates for widely used CI solutions.
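A simplified sketch of the kind of step such a pipeline performs: collating metadata from a codemeta.json file and a CITATION.cff file in the source repository into one record and flagging gaps before deposition. The merge logic and field selection are illustrative assumptions, not the actual HERMES tooling.

```python
import json

# Illustrative inputs, as they might be parsed from codemeta.json and CITATION.cff.
codemeta = {
    "name": "example-tool",
    "version": "1.2.0",
    "license": "Apache-2.0",
    "codeRepository": "https://example.org/group/example-tool",
}
citation_cff = {
    "title": "example-tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}

def collate(codemeta: dict, cff: dict) -> dict:
    """Merge two metadata sources into one deposition record.
    Conflict handling here is deliberately naive (codemeta wins)."""
    record = {
        "title": codemeta.get("name") or cff.get("title"),
        "version": codemeta.get("version") or cff.get("version"),
        "license": codemeta.get("license") or cff.get("license"),
        "authors": cff.get("authors", []),
        "repository": codemeta.get("codeRepository"),
    }
    missing = [k for k, v in record.items() if not v]
    if missing:
        # A real pipeline would fail the CI job or flag this for curation.
        print("Warning: missing required metadata:", missing)
    return record

print(json.dumps(collate(codemeta, citation_cff), indent=2))
```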
Many, if not most, publication repositories cannot be labeled “research software ready” today (2022). In addition to the deposition workflows, this project therefore cooperates with the upstream InvenioRDM and Dataverse projects. We are working on the necessary bits to achieve full readiness for these platforms and to put a nice badge on them.
Photoelectron emission spectroscopy (PES) has matured into a versatile tool for characterizing the electronic properties of novel quantum materials. While historically PES was used for accessing the density of states of materials in one-dimensional energy scans, nowadays data sets provide detailed views of band dispersions and topologies, indispensable for all fields of modern materials-based sciences. The latest innovation in this field – photoelectron momentum microscopy (MM) – applies the principles of high-resolution imaging to record tomographic sections of the electronic structure in a high-dimensional parameter space. Despite the rapid worldwide adoption of MM as a universal materials characterization tool, currently no universal scheme exists to describe the highly diverse set of parameters – for example, but not limited to, the momentum vector, energy, electron spin, and light polarization states – that describe an MM experiment. Implementing findable, accessible, interoperable and reusable (FAIR) principles in momentum microscopy mandates new metadata schemes that describe the abundance of experimental parameters and thus link measured data voxels to the electronic properties of a material. The aim of M³eta is to establish such an extensible and sustainable metadata scheme for momentum microscopy that will be stored in a structured file together with the measured data voxels. This will be the basis for an automated and interactive tool-chain that interprets the stored metadata and uses this information to reconstruct views of the multi-dimensional electronic structure of a material.
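To make the idea of storing metadata in a structured file together with the measured data voxels concrete, here is a minimal h5py sketch; the group layout and parameter names (photon energy, polarization, spin channel, etc.) are placeholders and not the scheme M³eta will define.

```python
import h5py
import numpy as np

# Illustrative structured file for a momentum microscopy measurement:
# the data voxels plus experimental parameters needed to interpret them.
# Names and units are placeholders, not the M³eta metadata scheme.
with h5py.File("mm_scan.h5", "w") as f:
    data = f.create_dataset("data/intensity",
                            data=np.zeros((128, 128, 64), dtype="float32"))
    data.attrs["axis_names"] = "k_x,k_y,binding_energy"

    meta = f.create_group("metadata")
    meta.attrs["photon_energy_eV"] = 21.2
    meta.attrs["light_polarization"] = "p"
    meta.attrs["spin_channel"] = "none"
    meta.attrs["sample_temperature_K"] = 30.0
    meta.attrs["extractor_voltage_kV"] = 6.0
```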
Manufacturing of composite parts involves multiple process steps, from the production of semi-finished materials to their processing and assembly. At each level of production, large datasets can be produced to trace back each state of the composite material and relate it to the final quality of the manufactured structure. With the help of the recently developed data management system shepard, the project MEMAS aims at storing and connecting these manufacturing data in a persistent way. It focuses particularly on the standardization, collection and annotation of metadata and on their automatic transfer into a simulation environment to estimate the actual structural performance. The consideration of potential defects resulting from the manufacturing techniques or induced by the surrounding environment will allow for the improvement of finite-element methods and of their accuracy. Furthermore, the developed tools will support the manufacturing field and highlight the consequences of manufacturing parameters on the structural behaviour, enabling adjustments of the process parameters after each produced part. Finally, the persistent and structured storage of research data and their metadata in the form of FAIR Digital Objects or with the help of DataCrates will support long-term data analysis and a deeper understanding of manufacturing techniques.
To this end, software solutions will be developed for two exemplary manufacturing processes, tape laying and additive manufacturing, and combined into a general toolchain. The potential of the developed methodology will be tested by performing mechanical tests on representative parts and by comparing the results with the numerically predicted behaviour. All acquired experimental, manufacturing and simulation data, metadata formats and scientific results will be shared with the HMC community via open-source solutions such as the Zenodo platform.
In toxicology and pharmacology, data from chemistry, biology, informatics, and human or ecosystem health science merge, and toxicological metadata need to become interoperable and compliant with the existing ontology-based data infrastructures of these fields.
A team from three Helmholtz programs (Earth and Environment, Information, and Health) will review existing metadata standards and ontologies across fields and generate an integrative, suitable ontology for the annotation of toxicological/pharmacological data and workflows from the experimental design to the data deposit in repositories.
We will establish a metadata framework for the FAIR description of exposure and experimental settings interlinked with chemical IDs and data processing workflows using ‘omics data, which will be implemented into the community-based “Galaxy” project. This will enable interoperability between disciplines to address the grand challenges of chemical pollution and human and ecosystem health.
The importance of metadata when sharing FAIR data cannot be overstated. The MetaCook project was initially envisioned as a cookbook-like set of instructions for developing high-quality vocabularies and subsequently metadata. However, we decided to turn the cookbook into an interactive software tool called VocPopuli, which supports the collaborative development of controlled vocabularies. In this way, instead of reading lists of instructions, users from any background and level of experience can easily navigate VocPopuli and receive interactive guidance. Using VocPopuli ensures that the developed vocabularies will be FAIR themselves.
VocPopuli offers the capability to immediately apply the vocabulary to datasets stored in electronic lab notebooks. As an example, an integration with Kadi4Mat is already established, and we are currently implementing the same for Herbie.
Functionally, VocPopuli has a few main features:
- GitLab login is used to associate user contributions with term versions and vocabularies
- Each term’s provenance can be tracked both via VocPopuli’s user interface and GitLab’s commit history and branches
- Every term contains: label, synonyms, translations, expected data type, broader terms, related internal/external terms, up/downvotes, a collaborative discussion board, and a visual history graph
- Each vocabulary contains: metadata about its contents, a hierarchical structure of terms, a set of allowed named relationships, and a git repository
- In the backend, VocPopuli’s data is stored in a graph database
- SKOS and PROV (in progress) are optional exports (see the sketch after this list)
- Any resource can be digitalized: Lab Procedure, Lab Specimen, ELN, ELN Export, Data Analysis
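As an illustration of what an SKOS export of such a vocabulary can look like, the following rdflib sketch serializes two terms with a broader relation; the namespace and term labels are invented for the example and do not come from an actual VocPopuli vocabulary.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

# Hypothetical vocabulary namespace for the example.
VOCAB = Namespace("https://example.org/vocab/membrane/")

g = Graph()
g.bind("skos", SKOS)

scheme = URIRef(VOCAB["scheme"])
g.add((scheme, RDF.type, SKOS.ConceptScheme))

casting = URIRef(VOCAB["solvent_casting"])
g.add((casting, RDF.type, SKOS.Concept))
g.add((casting, SKOS.prefLabel, Literal("solvent casting", lang="en")))
g.add((casting, SKOS.inScheme, scheme))

membrane_prep = URIRef(VOCAB["membrane_preparation"])
g.add((membrane_prep, RDF.type, SKOS.Concept))
g.add((membrane_prep, SKOS.prefLabel, Literal("membrane preparation", lang="en")))
g.add((casting, SKOS.broader, membrane_prep))

print(g.serialize(format="turtle"))
```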
FAIR vocabularies hold enough information to enable their transformation into ontologies. We verify this with a prototype of a second piece of software called OntoFAIRCook. At the time of writing, OntoFAIRCook is available as a command-line tool and is being transformed into an easy-to-use web interface.
This project brings together metadata from three centers (HMGU, UFZ, DLR) from three different domains (Health, Earth & Environment, and Aeronautics, Space & Transport). The environment plays an increasingly important role for human health, and efficient linkage with environmental and earth observation data is crucial to quantify human exposures. However, there are currently no harmonized metadata standards available for automatic mapping. Therefore, this project aims to facilitate the linkage of data from the different research fields by generating and enriching interoperable and machine-readable metadata for exemplary data of our three domains and by mapping these metadata so that they can be jointly queried, searched and integrated into HMC. We finalized the conceptualization phase by developing a joint mapping strategy, which identified a joint standard (ISO 19115) for our cross-domain metadata and spatial and temporal coverage as the main mapping criteria. In the ongoing implementation phase, we set up a test instance of the selected platform GeoNetwork, a catalog application which includes metadata editing, search functions, filtering and an interactive web map viewer. We have already uploaded our use-case metadata (HMGU: children cohorts GINI and LISA; UFZ: drought monitor; DLR: land cover) after converting and enriching them according to ISO 19115. We are currently testing the full functionality of the tool and uploading additional metadata. By the end of the project, we plan to release the platform to HMC and other researchers working in thematically related fields.
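Since spatial and temporal coverage are the main mapping criteria, the sketch below shows a minimal Python representation of the corresponding ISO 19115-style extent information and a naive overlap check between two records; element names are abbreviated for illustration rather than taken verbatim from the standard, and the extents are not the real use-case records.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Extent:
    """Simplified spatial/temporal extent, loosely following ISO 19115
    (geographic bounding box plus temporal extent)."""
    west: float
    east: float
    south: float
    north: float
    start: date
    end: date

def overlaps(a: Extent, b: Extent) -> bool:
    """Naive check whether two datasets overlap in space and time."""
    spatial = (a.west <= b.east and b.west <= a.east
               and a.south <= b.north and b.south <= a.north)
    temporal = a.start <= b.end and b.start <= a.end
    return spatial and temporal

# Illustrative extents only.
cohort = Extent(11.3, 11.8, 48.0, 48.3, date(1995, 1, 1), date(2010, 12, 31))
drought = Extent(5.8, 15.1, 47.2, 55.1, date(2000, 1, 1), date(2023, 12, 31))
print(overlaps(cohort, drought))   # True: candidates for joint querying
```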
This data project contains data and software with regard to Metamorphoses, a joint project in the framework of the Helmholtz Metadata Collaboration (HMC) with contributions from KIT-IMK and FZJ-IEK7.
Currently, the amount and diversity of high-quality satellite-based atmospheric observations is quickly increasing, and their synergetic use offers unprecedented opportunities for gaining knowledge. However, for this kind of interoperability and reusability of the remote sensing data, the storage-intensive averaging kernels and error covariances are needed for each individual observation.
This project will develop enhanced standards for storage-efficient decomposed arrays, thus enabling the advanced reuse of very large remote sensing datasets. The synergetic data merging will be further supported by Lagrange trajectory metadata. For this purpose, the project will develop tools for the automated generation of standardised trajectory data files.
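One common way to store such arrays in decomposed form is a truncated singular value decomposition, keeping only the leading components of each averaging kernel matrix. The sketch below illustrates the idea with NumPy; it is a generic illustration of decomposed storage, not the specific standard the project will define.

```python
import numpy as np

def decompose(kernel: np.ndarray, rank: int):
    """Truncated SVD of an averaging kernel matrix: store only the
    leading `rank` components instead of the full n x n matrix."""
    u, s, vt = np.linalg.svd(kernel, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]

def reconstruct(u, s, vt) -> np.ndarray:
    return (u * s) @ vt

n = 60
# Synthetic smooth averaging kernel (illustrative only).
kernel = np.exp(-0.5 * ((np.arange(n)[:, None] - np.arange(n)[None, :]) / 3.0) ** 2)

u, s, vt = decompose(kernel, rank=8)
approx = reconstruct(u, s, vt)

print(f"stored values: {u.size + s.size + vt.size} vs {kernel.size}",
      f"max error: {np.abs(kernel - approx).max():.2e}")
```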
A case study will demonstrate the impact on science of the multi-sensor atmospheric observational data generated with the support of Lagrange trajectory calculations. The project will actively support data merging activities in other research fields.
Modern science is to a vast extent based on simulation research. With the advances in high-performance computing (HPC) technology, the underlying mathematical models and numerical workflows are steadily growing in complexity.
This complexity gain offers a huge potential for science and society, but simultaneously constitutes a threat to the reproducibility of scientific results. A main challenge in this field is the acquisition and organization of the metadata describing the details of the numerical workflows, which are necessary to replicate numerical experiments and to explore and compare simulation results. In the recent past, various concepts and tools for metadata handling have been developed in specific scientific domains. It remains unclear to what extent these concepts are transferable to HPC-based simulation research, and how to ensure interoperability in the face of the diversity of simulation-based scientific applications.
This project aims at developing a generic, cross-domain metadata management framework to foster reproducibility of HPC-based simulation science, and to provide workflows and tools for an efficient organization, exploration and visualization of simulation data.
Within the project, we have so far reviewed existing approaches from different fields. A plethora of tools for metadata handling and workflows have been developed in recent years. We identified tools and formats such as odML that are useful for our work. The metadata management framework will address all components of simulation research and the corresponding metadata types, including model description, model implementation, data exploration, data analysis, and visualization. We have now developed a general concept to track, store and organize metadata. Next, the required tools within the concept will be developed such that they are applicable both in Computational Neuroscience and in Earth and Environmental Science.
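A minimal sketch of the kind of per-run metadata record such a framework needs to track, here as a plain Python/JSON structure (formats such as odML could serialize the same information); the fields are illustrative assumptions, not the project's schema.

```python
import json
import platform
import subprocess
from datetime import datetime, timezone

def collect_run_metadata(model_name: str, parameter_file: str) -> dict:
    """Collect a minimal, illustrative set of metadata describing one
    HPC simulation run."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"
    return {
        "model": model_name,
        "parameter_file": parameter_file,
        "code_version": commit,                      # provenance of the implementation
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        "mpi_ranks": 128,                            # example resource configuration
        "random_seed": 42,
    }

with open("run_0001.meta.json", "w") as f:
    json.dump(collect_run_metadata("example_model", "params.yaml"), f, indent=2)
```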
Reflection seismic data, 2D as well as 3D, and refraction data such as active OBS data are the paramount source of information on the deep subsurface structure, as they provide by far the highest resolution of any comparable geophysical technique. To date, they have been used for a large variety of academic and commercial purposes. For many decades, reflection and refraction seismic data were the largest datasets in the earth sciences, which created significant storage and archival problems. This fact and the lack of metadata standards hamper all new scientific projects that would like to use present-day and legacy data. However, GEOMAR has already initiated the implementation of the FAIR standards for 2D seismic data within the NFDI4Earth pilot “German Marine Seismic Data Access”, running until February 2023 in cooperation with the University of Hamburg and the University of Bremen.
Within MetaSeis, we will develop a unifying data infrastructure and prepare for the future archival of 3D reflection seismic data and active OBS data from recent and future research cruises. We aim to adopt and extend existing standards and interoperable vocabularies for seismic metadata, including metadata quality and validation checks. To ensure long-term archival according to the FAIR principles, a workflow for the integration of future and legacy datasets will be established, along with best practices developed within previous projects (Mehrtens and Springer, 2019).
With this initiative, HMC will serve Germany’s marine geophysics community, as represented by AGMAR of the Fachkollegium Physik der Erde of the DFG, and will also contribute to the efforts of NFDI4Earth/DAM/DataHUB and the involved Helmholtz centres to establish a distributed infrastructure for data curation with harmonized data workflows and connections to international data repositories such as MGDS (Marine Geoscience Data System), IEDA (Interdisciplinary Earth Data Alliance), SNAP (Seismic data Network Access Point) and SeaDataNet (Pan-European Infrastructure for Ocean and Marine Data Management). International cooperation will benefit from synergy with the industry seismic standard Open Subsurface Data Universe (OSDU) to ensure future cooperation between industry and academic research.
Capitalizing on advancements in Large Language Models (LLMs), MetaSupra aspires to expedite the process of metadata enrichment in FAIR-compliant repositories.
Specifically, MetaSupra will enhance SupraBank, a platform that provides machine-readable physicochemical parameters and metadata of intermolecular interactions in solution. By utilizing LLMs, we aim to develop data crawlers capable of extracting context-specific information from the chemical literature, simplifying data acquisition and the curation of FAIR repositories. This crawler software will be made accessible, inviting adoption by other Helmholtz centers. In addition, MetaSupra will illustrate how the utility of repositories for correlation studies, machine learning, and educational purposes can be substantially amplified through the integration of quantum-chemically computed molecular parameters, positioning it as a model for other chemical repositories and moving forward with its integration into IUPAC activities.
The MetaSurf project is a comprehensive initiative aimed at transforming how data is managed, shared, and utilized in the field of surface science. It seeks to implement the FAIR (Findable, Accessible, Interoperable, and Reusable) principles across a broad spectrum of experimental and simulation data. The project's central objectives include:
- Extension of Existing Infrastructure: Enhancing the Kadi4Mat platform by integrating advanced simulation and modeling workflows, GitLab, and JupyterLab. This extension aims to facilitate automated processing steps and streamline the data management process.
- Development of a Public Data Repository: Establishing a centralized repository for surface science data, accessible to the global research community. This repository will serve as a hub for data exchange, fostering collaboration and accelerating scientific discovery.
- Metadata-Driven Approach: Emphasizing the use of metadata, electronic lab notebooks, and data repositories to promote reproducibility and transparency in research. By developing tools, workflows, and templates that leverage metadata, the project intends to enable a more structured approach to data management, ensuring that data from diverse sources can be easily integrated and analyzed.
- Community Engagement and Standardization: Working closely with the surface science community to develop standards for data exchange and processing. The project aims to cultivate a culture of data sharing and collaboration, encouraging researchers to adopt these standards in their work.
- Innovation in Data Processing: Introducing new processing tools and techniques designed to handle the complexities of surface science data. These innovations will address the specific needs of the community, such as data visualization, analysis, and interpretation, enhancing the overall quality and impact of research in this field.
By achieving these goals, the MetaSurf project aspires to create a more cohesive, efficient, and innovative research environment in surface science, where data can be easily accessed, shared, and leveraged to drive new discoveries and advancements.
Currently, data collection is mostly done by the instrument teams on various ESA, NASA, JAXA and other agencies’ missions. Different data products, often even from the same satellite mission, use different formats and rarely follow the standard practices for metadata accepted in more coordinated communities such as the atmospheric and oceanic sciences. Moreover, data versions and attributes, instrument PIDs and workflows are not properly recorded, which makes a reproduction of the results practically impossible. As a consequence of this lack of standardization, both in data access and format, the accessibility and reusability of data provided by satellite missions with budgets of up to several hundred million euros is substantially limited. As an example, NASA’s flagship Van Allen Probes mission included a number of instruments, and each of the instrument teams utilized different metadata standards as well as different data formats. Reconstruction of the historical behavior of the radiation belts is even more complicated, as most of the historical data are written in binary codes, sometimes with little documentation.
Similarly, the quantification of precipitating fluxes from radiation measurements, needed as input for atmospheric models, is often difficult, as the properties relevant for estimating precipitation quantities are either not provided or difficult to obtain. The situation is somewhat similar for ionospheric observational data, which are growing exponentially. Numerous ionospheric measurements provided by GNSS satellites, science missions such as COSMIC I and COSMIC II, and now commercial fleets such as Spire amount to vast quantities of data described in a variety of metadata formats.
Initial efforts have been made to introduce standardization in radiation belt physics. The Committee on Space Research (COSPAR) Panel on Radiation Belt Environment Modeling (PRBEM) developed the “Standard file format guidelines for particle count rate” for data stored in CDF format. NASA’s Space Physics Data Facility (SPDF) makes use of these guidelines for several products but uses different formats for different communities of data providers and stakeholders. The format contains attributes that can hold metadata describing the data content, but it does not hold information about workflows, nor does it make use of persistent identifiers. For ionospheric sciences, DLR Neustrelitz pioneered the introduction of formats for the ionospheric community during its involvement in the CHAMP and GRACE satellite missions as an operator for the generation and distribution of ionospheric products. Later, DLR’s involvement in several national (SWACI, IMPC) and EU projects such as ESPAS and PITHIA-NRF led to the development of first preparatory standards for ionospheric products. The increasing use of data assimilation and machine learning, which require vast amounts of data from different sources, makes this project most timely.
The collection and usage of sensor data are crucial in science, enabling the evaluation of experiments and the validation of numerical simulations. This includes sensor maintenance metadata, e.g. calibration parameters and maintenance time windows. Enriched sensor data allow scientists to assess data accuracy, reliability, and consistency through Quality Assurance and Quality Control (QA/QC) processes. Today, maintenance metadata is often collected but not readily accessible due to its lack of digitalization. Such audit logs are commonly stored in analogue notebooks, which poses challenges regarding accessibility, efficiency, and potential transcription errors.
In MOIN4Herbie (Maintenance Ontology and audit log INtegration for Herbie), we will address the obvious lack of digitized maintenance metadata in Helmholtz’s research areas Information and Earth and Environment.
To this end, MOIN4Herbie will extend the electronic lab notebook Herbie – developed at the Hereon Institute of Metallic Biomaterials – with ontology-based forms to deliver digital records of sensor maintenance metadata for two pilot cases: for both the redeployed Boknis Eck underwater observatory and the already implemented Tesperhude Research Platform, we will establish a digital workflow from scratch.
This will lead to a unified and enhanced audit of sensor maintenance metadata, and thus more efficient data recording, the empowerment of technicians to collect important metadata for scientific purpose, and last but not least improvement and facilitation of the scientific evaluation and use of sensor data.
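A small sketch of the kind of digital maintenance record that such ontology-based forms could produce, here as a plain data structure; the field names and controlled terms are assumptions for illustration, not the ontology used in MOIN4Herbie.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional
import json

@dataclass
class MaintenanceEvent:
    """Illustrative digital audit-log entry for one sensor maintenance action."""
    sensor_id: str            # persistent identifier of the instrument
    action: str               # controlled term, e.g. "calibration", "cleaning"
    performed_by: str
    started: datetime
    finished: datetime
    calibration_offset: Optional[float] = None
    comment: str = ""

event = MaintenanceEvent(
    sensor_id="https://sensors.example.org/ctd-0042",   # placeholder PID
    action="calibration",
    performed_by="technician A",
    started=datetime(2024, 4, 3, 9, 0),
    finished=datetime(2024, 4, 3, 10, 30),
    calibration_offset=0.02,
    comment="post-deployment check",
)
print(json.dumps(asdict(event), default=str, indent=2))
```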
The main goal of PATOF is to make the data of a number of participating experiments fully publicly available and FAIR as far as possible. The work builds on first experience gained at the Mainz A4 nucleon structure experiment (1999-2012).
Here, the analysis, reorganisation, preparation and subsequent publication of the A4 data and A4 analysis environment according to FAIR principles shall be achieved. The lessons learned at A4 are then to be applied to other experiments at DESY (ALPS II, LUXE) and Mainz (PRIMA, P2), collectively called APPLe experiments. In the process, a general and living cookbook – or at least a first collection of recipes – on how to create metadata for PUNCH experiments and how to make their data FAIR is also aimed for.
The cookbook will capture the methodology for making individual experiment-specific metadata schemas FAIR. Another output is the “FAIR metadata factory”, i.e. a process to create a naturally evolved metadata schema for different experiments by extending the DataCite schema without discarding the original metadata concepts.
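To give an idea of the “factory” approach, the sketch below starts from a few core DataCite-style fields and attaches experiment-specific concepts in a separate extension block, so the original metadata concepts are preserved; the extension block and all values are hypothetical examples, not an actual A4 record.

```python
import json

# Illustrative record: a DataCite-like core extended with experiment-specific
# metadata. The "a4_extension" block and its keys are hypothetical examples.
record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.xxxx/example"},
    "creators": [{"name": "A4 Collaboration"}],
    "titles": [{"title": "A4 parity-violating asymmetry dataset (example)"}],
    "publisher": "Example repository",
    "publicationYear": 2024,
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    # Experiment-specific concepts kept alongside, not squeezed into core fields:
    "a4_extension": {
        "beam_energy_MeV": 855,
        "target": "liquid hydrogen",
        "detector": "PbF2 calorimeter",
        "run_period": "2004-2005",
    },
}
print(json.dumps(record, indent=2))
```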
The Sample Environment Communication Protocol (SECoP) provides a generalized way for controlling measurement equipment – with a special focus on sample environment (SE) equipment. In addition, SECoP holds the possibility to transport SE metadata in a well-defined way.
Within the project SECoP@HMC, we are developing and implementing metadata standards for typical SE equipment at large scale facilities (photons, neutrons, high magnetic fields). A second focus is the mapping of the SECoP metadata standards to a unified SE vocabulary for a standardized metadata storage. Thus, a complete standardized system for controlling SE equipment and collecting and saving SE metadata will be available and usable in the experimental control systems (ECS) of the participating facilities. This approach can be applied to other research areas as well.
The project SECoP@HMC is organised in four work packages:
- Standards for Sample Environment metadata in SECoP (WP1)
- Standards for storage of Sample Environment metadata (WP2)
- Implementation into experimental control systems (WP3)
- Outreach, Dissemination & Training (WP4)
The objectives of WP1 and WP2 are to standardize the provision and storage of metadata for SE equipment for FAIR-compatible reuse and interoperability of the data. WP3 establishes SECoP as a common standard for SE communication at the involved centers by integrating the protocol into experiment control systems, easing the integration of new and user-built SE equipment into experiments while providing sufficient SE metadata. In WP4 we reach out to the metadata and experimental controls communities, e.g. by organising workshops and presenting SECoP@HMC at conferences (see e.g. figure 1).
Some background on SECoP:
SECoP is developed in cooperation with the International Society for Sample Environment (ISSE) as an international standard for the communication between SE equipment and ECS. It is intended to ease the integration of sample environment equipment supplied by external research groups and by industrial manufacturers.
SECoP is designed to be:
- simple
- inclusive
- self-explaining
- providing metadata
Inclusive means that different facilities can use this protocol and don't have to change their workflows, e.g. rewrite drivers completely or organize and handle hardware in a specific way to fulfil SECoP requirements. Simple means it should be easy to integrate and to use – even for the non-expert programmer. Self-explaining means that SECoP provides a complete human- and machine-readable description of the whole experimental equipment, including how to control it and what the equipment represents. With respect to metadata, SECoP greatly facilitates and structures the provision of metadata associated with SE equipment.
Several implementations of SECoP have been developed and support the design of SECoP-compatible sample environment control software. The complete specifications of SECoP are available on GitHub.
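To illustrate the “self-explaining” aspect, the following is a schematic and simplified Python rendering of the kind of structured description an SE node can report about its modules and parameters; it is not the literal SECoP wire format or key naming, which are defined in the specifications on GitHub.

```python
import json

# Schematic, simplified equipment self-description in the spirit of SECoP:
# a node lists its modules, and each module lists its parameters with
# machine-readable data information. Keys are illustrative, not the spec.
node_description = {
    "equipment_id": "example_cryostat",
    "description": "Example cryostat with temperature control",
    "modules": {
        "temperature": {
            "description": "sample temperature",
            "interface_class": "Drivable",
            "parameters": {
                "value": {"datainfo": {"type": "double", "unit": "K"},
                          "readonly": True},
                "target": {"datainfo": {"type": "double", "unit": "K",
                                        "min": 1.5, "max": 300.0},
                           "readonly": False},
            },
        }
    },
}
print(json.dumps(node_description, indent=2))
```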
Research software should be published in repositories that assign persistent identifiers and make metadata accessible. Metadata must be correct and rich to support the FAIR4RS principles. Their curation safeguards quality and compliance with institutional software policies. Furthermore, software metadata can be enriched with usage and development metadata for evaluation and academic reporting. Metadata curation, publication approval and evaluation processes require human interaction and should be supported by graphical user interfaces.
We create "Software CaRD" (Software Curation and Reporting Dashboard), an open source application that presents software publication metadata for curation. Preprocessed metadata from automated pipelines are made accessible in a structured graphical view, with highlighted issues and conflicts. Software CaRD also assesses metadata for compliance with configurable policies, and lets users track and visualize relevant metadata for evaluation and reporting.
The HMC-funded STAMPLATE project aims to implement and establish the SensorThings API (STA) of the Open Geospatial Consortium (OGC) as a consistent, modern and lightweight data interface for time series data. Using representative use-cases from all seven research centers of the Helmholtz Research Field Earth & Environment, we ensure transferability and applicability of our solutions for a wide range of measurement systems. Our project is, hence, making a decisive contribution towards a digital ecosystem and an interlinked, consistent, and FAIR research data infrastructure tailored towards time-series data from environmental sciences.
Time-series data are crucial sources of reference information in all environmental sciences. Beyond research applications, the consistent and timely publication of such data is increasingly important for monitoring and issuing warnings, especially as climatic extreme events become more frequent. In this context, the seven Centers from the Helmholtz Research Field Earth and Environment (E&E) operate some of the largest environmental measurement infrastructures worldwide. These infrastructures range from terrestrial observation systems in the TERENO observatories and ship-borne sensors to airborne and space-based systems, such as those integrated into the IAGOS infrastructures.
In order to streamline and standardize the usage of the huge amount of data from these infrastructures, the seven Centers have jointly initiated the STAMPLATE project. This initiative aims to adopt the Open Geospatial Consortium (OGC) SensorThings API (STA) as a consistent and modern interface tailored for time-series data. We evaluate STA for representative use cases from environmental sciences and enhance the core data model with additional crucial metadata such as data quality, data provenance and extended sensor metadata. We further integrate STA as a central data interface into community-based tools for, e.g., data visualization, data access, QA/QC or the management of observation systems. By connecting the different STA endpoints of the participating research Centers, we establish an interlinked research data infrastructure (RDI) and a digital ecosystem around the OGC SensorThings API tailored towards environmental time-series data.
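For illustration, the snippet below queries a hypothetical SensorThings API endpoint for the most recent observations of one datastream using the standard OGC STA resource paths and query options; the base URL and entity IDs are placeholders, not one of the project's actual endpoints.

```python
import requests

# Hypothetical STA endpoint; resource paths and query options follow the
# OGC SensorThings API, but the URL and IDs are placeholders.
BASE = "https://sta.example.org/v1.1"

params = {
    "$orderby": "phenomenonTime desc",
    "$top": 5,
    "$select": "phenomenonTime,result",
}
resp = requests.get(f"{BASE}/Datastreams(42)/Observations", params=params)
resp.raise_for_status()

for obs in resp.json().get("value", []):
    print(obs["phenomenonTime"], obs["result"])
```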
With our project, we further want to promote STA for similar applications and communities beyond our research field. Ultimately, our goal is to provide an important building block towards fostering a more open, FAIR (Findable, Accessible, Interoperable, and Reusable), and harmonized research data landscape in the field of environmental sciences.