
Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data

Description

Data curation is a critical step in scientific data digitization, sharing, integration and use. The considerable resources allocated to digitization of natural science collections in the U.S. and globally require a focus on both digitization efficiencies and the utility of the generated data. One way to address both issues is to employ workflow software to automate and streamline data curation processes.

We are developing Kurator, a suite of biodiversity data quality tools aimed at three audiences: collection management specialists with little or no programming experience, database administrators and researchers with some scripting language experience, and developers. One of the tools is Kurator-Akka, which can be used as either a command-line or a web-based data quality application. Kurator-Akka is designed to be accessible to data curators through a web interface, to more advanced users through editable configuration files, and to programmers for extending functionality or developing new modules/actors. Behind the scenes, and typically invisible to users of the web interface, Kurator-Akka runs workflows defined in YAML. Workflows can invoke actors written in Java or Python, with the Kurator-Akka framework managing the dataflow between actors.

One of our goals is to allow users to develop data quality workflows in a drag-and-drop user interface, which behind the scenes builds YAML configuration files that can be executed through the web interface or downloaded and edited for local execution by users with some scripting language experience. Another goal is to enable others to write new actors (e.g., in Python) that interoperate easily with the actors we provide; we further plan to provide means for sharing these actors and example curation workflows with the community.
The Kurator software is open source and available here: https://github.com/kurator-org/
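
The idea of a declaratively configured workflow that passes records from one actor to the next can be illustrated with a small Python sketch. This is not the Kurator-Akka API or its YAML schema: the actor functions, the `ACTORS` registry, the `WORKFLOW` list and `run_workflow` are all hypothetical names chosen for illustration, and the workflow is written as a plain Python list standing in for what a configuration file might declare. The field names `scientificName` and `country` are assumed for the sample records.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# A record is one row of collection data; an actor takes a record and
# returns a (possibly modified) record, or None to drop it.
Record = Dict[str, str]
Actor = Callable[[Record], Optional[Record]]


def trim_whitespace(record: Record) -> Optional[Record]:
    """Hypothetical actor: strip surrounding whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


def require_scientific_name(record: Record) -> Optional[Record]:
    """Hypothetical actor: drop records that have no scientific name."""
    return record if record.get("scientificName") else None


def flag_missing_country(record: Record) -> Optional[Record]:
    """Hypothetical actor: annotate the record instead of rejecting it."""
    flagged = dict(record)
    flagged["countryFlag"] = "ok" if record.get("country") else "missing"
    return flagged


# Registry mapping names used in a workflow declaration to actor functions.
ACTORS: Dict[str, Actor] = {
    "trim_whitespace": trim_whitespace,
    "require_scientific_name": require_scientific_name,
    "flag_missing_country": flag_missing_country,
}

# Stand-in for a declarative workflow definition (assumed structure):
# an ordered list of actor names.
WORKFLOW: List[str] = [
    "trim_whitespace",
    "require_scientific_name",
    "flag_missing_country",
]


def run_workflow(step_names: List[str], records: Iterable[Record]) -> Iterator[Record]:
    """Pass each record through the named actors in order."""
    actors = [ACTORS[name] for name in step_names]
    for record in records:
        current: Optional[Record] = record
        for actor in actors:
            current = actor(current)
            if current is None:  # an actor dropped the record
                break
        if current is not None:
            yield current


if __name__ == "__main__":
    sample = [
        {"scientificName": " Puma concolor ", "country": "Canada "},
        {"scientificName": "  ", "country": "USA"},
        {"scientificName": "Quercus alba", "country": ""},
    ]
    for cleaned in run_workflow(WORKFLOW, sample):
        print(cleaned)
```

In this sketch the order of steps lives in data rather than in code, so a workflow can be rearranged by editing the list alone, and a new actor only needs to follow the same record-in, record-out convention to interoperate with the existing ones.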

Read more at the project site.

Project PI(s)

PI: Bertram Ludäscher; co-PI: James Macklin (Agriculture and Agri-Food Canada); PI: James Hanken (Director, Museum of Comparative Zoology, Harvard)
No project contact.
Funded by: NSF
Grant number: DBI-1356751

Research Area(s)

Digital Collections and Curation
CIRSS projects in this sector focus on how to build, represent, and make accessible research collections, with a particular focus on the challenges and opportunities associated with the curation and f…

E-Science
Given the ever growing universe of information resources, informatics tools, and scholarly communication options that need to be understood, assessed, and coordinated, the e-Science initiatives at CIR…

Project Team

Bertram Ludäscher (PI)
Timothy McPhillips (Researcher)
Qian Zhang (Researcher)

Publications

McPhillips, T., Bowers, S., Belhajjame, K., & Ludäscher, B. (2015, July). Retrospective provenance without a runtime provenance recorder. In Proceedings of the 7th USENIX Conference on Theory and Practice of Provenance (pp. 1-1). USENIX Association.

McPhillips, T., Song, T., Kolisnik, T., Aulenbach, S., Belhajjame, K., ... & Ludäscher, B. (2015, February). YesWorkflow: A User-Oriented Language-Independent Tool for Recovering Workflow Structure, Provenance, and Semantics from Scripts. Paper presented at the 10th International Digital Curation Conference, London, UK.