Data provenance

The provenance of a dataset is a record, preferably in machine-readable form, of how the data was produced and transformed. See also scientific workflow management and provenance in knowledge representation.
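
As a rough illustration of what a machine-readable provenance record might contain, the sketch below logs each transformation applied to a dataset, along with content hashes of its input and output. The names here (`ProvenanceLog`, `ProvenanceStep`, `strip_blank_lines`, `uppercase`) are hypothetical and not taken from any of the projects listed below. Real systems typically record much more, such as software versions, timestamps, and the executing agent.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


def content_hash(data: bytes) -> str:
    """Identify a dataset by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceStep:
    """One transformation in the history of a dataset (hypothetical schema)."""
    operation: str    # label for the transformation
    params: dict      # parameters passed to the transformation
    input_hash: str   # hash of the data before the step
    output_hash: str  # hash of the data after the step


@dataclass
class ProvenanceLog:
    """Ordered list of steps describing how a dataset was produced."""
    steps: list = field(default_factory=list)

    def apply(self, data: bytes, operation: str, func, **params) -> bytes:
        """Run func on data and record what was done."""
        result = func(data, **params)
        self.steps.append(
            ProvenanceStep(operation, params, content_hash(data), content_hash(result))
        )
        return result

    def to_json(self) -> str:
        """Serialize the log so other tools can read it."""
        return json.dumps([asdict(step) for step in self.steps], indent=2)


# Two toy transformations, assumed purely for illustration.
def strip_blank_lines(data: bytes) -> bytes:
    return b"\n".join(line for line in data.split(b"\n") if line.strip())


def uppercase(data: bytes) -> bytes:
    return data.upper()


raw = b"a,b\n\n1,2\n"
log = ProvenanceLog()
cleaned = log.apply(raw, "strip_blank_lines", strip_blank_lines)
final = log.apply(cleaned, "uppercase", uppercase)
print(log.to_json())
```

Because each step stores the hash of its input and output, consecutive steps can be chained and checked: the output hash of one step should equal the input hash of the next.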

Projects

  • VisTrails
    • Node-based graphical interface à la SPSS Modeler or LabVIEW
    • Main use case is visualization, but it also includes nodes for scikit-learn
    • Emphasis on provenance: tracks both the execution of a workflow and its evolution over time
  • CodaLab (GitHub)
    • By Percy Liang and his students and collaborators
    • Two elements: worksheets and competitions
    • Worksheets are overlays on an immutable execution graph
  • Apache projects

Literature

  • Simmhan, Plale, Gannon, 2005: A survey of data provenance in e-science (doi, tech report)