Provenance Services


Provenance is an important aspect of Open Science and FAIR data. The provenance of research outputs and resources provides contextual information on their origin and how they have been processed, and so allows their value and reusability to be assessed.

More specifically, we can distinguish between resource provenance and metadata provenance. Resource provenance is concerned with the history of a digital object/artefact/resource and metadata provenance with the history of the metadata itself during the curation process. 

Persistent identifiers (PIDs) can contribute to the provenance of research artefacts through clear statements about the artefacts’ origin and their connections to other entities with PIDs. Furthermore, PIDs and their associated metadata can have provenance of their own.

FREYA has created a central service provision focusing on PID metadata, and has also addressed resource and metadata provenance through the work of the individual disciplinary partners.

Central Service Provision – Metadata Provenance for DataCite DOIs

The main use case for the activities API is to track every change to DataCite DOI metadata, giving full transparency over how a record has changed over time. The existing API (https://api.datacite.org/) was adapted to support all three concepts of the PROV conceptual model. It already had two of the types, entity (e.g. a dataset) and agent (e.g. a member), so the concept of activities was introduced. DataCite also introduced audit logging to track changes to DOI metadata requested by DataCite clients.


Provenance 01

Example of provenance information in a DataCite API response. DataCite describes this API in more detail in a blog post (https://doi.org/10.5438/wy92-xj57), and further information is available on the DataCite support pages (https://support.datacite.org/docs/tracking-provenance).
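To make the idea of an activity trail concrete, the following minimal sketch summarises a JSON response of the kind described above. It is an illustration under stated assumptions, not DataCite's implementation: the per-DOI path `/dois/{doi}/activities` and the attribute names (`prov:generatedAtTime`, `prov:wasAttributedTo`, `action`, `changes`) are assumptions that should be checked against the DataCite support pages, and `SAMPLE` is a hand-written, hypothetical payload rather than real API output.

```python
from __future__ import annotations

from urllib.parse import quote

API_ROOT = "https://api.datacite.org"  # base URL given in the text


def activities_url(doi: str) -> str:
    """Build the per-DOI activities URL (the path is an assumption)."""
    return f"{API_ROOT}/dois/{quote(doi, safe='/')}/activities"


def summarize_activities(payload: dict) -> list[dict]:
    """Return one summary row per activity, oldest first.

    The attribute names read here are assumed, not confirmed by the text.
    """
    rows = []
    for item in payload.get("data", []):
        attrs = item.get("attributes", {})
        rows.append(
            {
                "when": attrs.get("prov:generatedAtTime"),
                "action": attrs.get("action"),
                "agent": attrs.get("prov:wasAttributedTo"),
                # Only the names of the changed fields, sorted for stable output.
                "changed": sorted(attrs.get("changes", {})),
            }
        )
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(rows, key=lambda row: row["when"] or "")


# Hypothetical payload, written by hand for illustration only.
SAMPLE = {
    "data": [
        {
            "attributes": {
                "prov:generatedAtTime": "2019-09-02T10:00:00Z",
                "prov:wasAttributedTo": "example-client",
                "action": "update",
                "changes": {"titles": [], "publisher": []},
            }
        },
        {
            "attributes": {
                "prov:generatedAtTime": "2019-08-01T09:00:00Z",
                "prov:wasAttributedTo": "example-client",
                "action": "create",
                "changes": {"url": None, "doi": None},
            }
        },
    ]
}

if __name__ == "__main__":
    print(activities_url("10.5438/wy92-xj57"))
    for row in summarize_activities(SAMPLE):
        print(row["when"], row["action"], row["agent"], row["changed"])
```

In practice the payload would come from an HTTP GET on the URL built by `activities_url`; the sketch keeps parsing separate from fetching so that the summary logic can be checked without network access.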


A selection of disciplinary examples of provenance in workflows


CERN – CERN Open Data Portal


CERN’s main provenance use case is about resource provenance, particularly detailed information about how a dataset was generated (all the methodology and processing steps, software used, other related datasets, etc.). 

The CERN Open Data Portal uses customised, manually curated metadata, which provide rich provenance information for all published outputs on the portal. Information on provenance creates trust in the resources and facilitates reuse by helping users understand the full context of a published resource. The more contextual information there is, the more useful a resource is to the community or, in the case of public-facing services, to external users.


Provenance 02


Provenance 03


Provenance 04

Parts of CERN Open Data bibliographic records showing the processing steps for the generation of a specific dataset, related datasets and selection steps.


EMBL-EBI – EuropePMC


EMBL-EBI considers the provenance of a dataset to be the information defining the “source” of the data. Thus, biological sample information is considered to be part of the provenance, as is information about the equipment or methods by which the data were generated, and by whom. Europe PMC’s primary entry type is peer-reviewed journal articles. At Europe PMC, journal article content is extensively linked to data, ORCID iDs, citations, and funding information. This rich, interconnected knowledge base is provided to users via programmatic access (APIs and FTP) and search tools on a user interface.


In life science research articles, data are often cited by mentioning accession numbers, but explicit URL-based links are frequently not provided. To enrich the provenance information provided in an article, Europe PMC offers the following:

  • Additional information about the data resources used, including linked external data resources that point to the Europe PMC article, and text-mined accession numbers (i.e. persistent identifiers of datasets).

  • Links to grants, where the information is provided by authors or is present within the grants database of Europe PMC funders.

  • External links: links to other relevant external information provided by third-party data miners.

  • Search functionality to identify publications that have Data Availability Statements.

Data provenance information is collected and provided to users to assure them of the rigor of the dataset (the context in which the data were collected and the relationship to other datasets) and to enable reproducibility.


Provenance 05

Provenance information for a term text-mined in Europe PMC for the article with DOI 10.1038/s41598-018-36552-4. Text mining by Europe PMC after publication picks up three accession numbers, shown highlighted in purple. Hovering over a highlighted term displays its provenance information: in this case, the accession number is AY604039, the source of the text mining is Europe PMC, and the accession number has been verified by the ENA as an accession number for a nucleotide sequence.
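The provenance shown for a text-mined term can be thought of as a small record: the term itself, who mined it, and where it resolves. The sketch below extracts such records from an annotation payload. It is only an illustration: the field names (`annotations`, `type`, `exact`, `provider`, `tags`, `uri`) and the type label `Accession Numbers` are assumptions about how an annotation service might structure its response, `MinedTerm` and `mined_terms` are hypothetical names, and `SAMPLE` is hand-written apart from the accession number AY604039 quoted in the caption above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MinedTerm:
    """Provenance for one text-mined term (hypothetical record type)."""

    term: str                 # the exact text found in the article
    kind: str                 # annotation type, e.g. accession number
    mined_by: str             # who performed the text mining
    resolves_to: str | None   # where the term was resolved, if anywhere


def mined_terms(payload: list[dict], kind: str = "Accession Numbers") -> list[MinedTerm]:
    """Collect annotations of one type together with their provenance."""
    found = []
    for article in payload:
        for ann in article.get("annotations", []):
            if ann.get("type") != kind:
                continue
            tags = ann.get("tags") or [{}]
            found.append(
                MinedTerm(
                    term=ann.get("exact", ""),
                    kind=kind,
                    mined_by=ann.get("provider", "unknown"),
                    resolves_to=tags[0].get("uri"),
                )
            )
    return found


# Hypothetical payload; field names and the URI are illustrative assumptions.
SAMPLE = [
    {
        "annotations": [
            {
                "type": "Accession Numbers",
                "exact": "AY604039",
                "provider": "Europe PMC",
                "tags": [{"name": "AY604039", "uri": "https://www.ebi.ac.uk/ena/browser/view/AY604039"}],
            },
            {
                "type": "Gene_Proteins",
                "exact": "example-gene",
                "provider": "Europe PMC",
                "tags": [],
            },
        ]
    }
]

if __name__ == "__main__":
    for record in mined_terms(SAMPLE):
        print(record.term, "| mined by", record.mined_by, "| resolves to", record.resolves_to)
```

Keeping the miner and the resolving resource as separate fields mirrors the distinction in the caption between the source of the text mining and the database that verified the accession number.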


More information on provenance and the PID Graph?


A detailed description of the FREYA work on provenance in the PID Graph is available in the FREYA Deliverables D2.2 PID Metadata Provenance and D4.2 Using the PID Graph: Provenance in Disciplinary Systems. As well as the examples above, the deliverables cover these other disciplinary examples: 

  • British Library’s Shared Research Repository

  • NARCIS, a national portal by DANS

  • PANGAEA

  • STFC’s ISIS Data Repository