
Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data

Description

Data curation is a critical step in scientific data digitization, sharing, integration, and use. The considerable resources allocated to digitizing natural science collections in the U.S. and globally require a focus on both the efficiency of digitization and the utility of the resulting data. One way to address both issues is to employ workflow software to automate and streamline data curation processes. We are developing Kurator, a suite of biodiversity data quality tools aimed at three audiences: collection management specialists with little or no programming experience, database administrators and researchers with some scripting language experience, and developers. One of the tools is Kurator-Akka, which can be used as either a command-line or a web-based data quality application. Kurator-Akka is designed to be accessible to data curators through a web interface, to more advanced users through editable configuration files, and to programmers who want to extend its functionality or develop new modules (actors). Behind the scenes, and typically invisible to users of the web interface, Kurator-Akka runs workflows defined in YAML. Workflows can invoke actors written in Java or Python, with the Kurator-Akka framework managing the dataflow between actors. One of our goals is to let users develop data quality workflows in a drag-and-drop user interface that builds YAML configuration files behind the scenes; these files can be executed through the web interface, or downloaded and edited for local execution by users with some scripting experience. Another goal is to enable others to write new actors (e.g., in Python) that interoperate easily with the actors we provide; we further plan to provide means for sharing these actors and example curation workflows with the community.
The Kurator software is open source and available here: https://github.com/kurator-org/
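
The idea of a declarative workflow that chains actors, with a framework passing records from one actor to the next, can be illustrated with a minimal Python sketch. This is not the Kurator-Akka API: the actor names, the WORKFLOW dictionary (a stand-in for a YAML workflow definition), the run_workflow helper, and the sample field names are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Actor = Callable[[Iterable[Record]], Iterator[Record]]


def trim_whitespace(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical actor: strip leading/trailing whitespace from every field."""
    for record in records:
        yield {key: value.strip() for key, value in record.items()}


def flag_missing_name(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical actor: add a flag telling whether the name field is empty."""
    for record in records:
        flagged = dict(record)
        flagged["name_missing"] = "yes" if not record.get("scientificName") else "no"
        yield flagged


# Registry mapping actor names (as they might appear in a config file)
# to the Python callables that implement them. Assumed for this sketch.
ACTORS: Dict[str, Actor] = {
    "trim_whitespace": trim_whitespace,
    "flag_missing_name": flag_missing_name,
}

# Stand-in for a workflow definition; a real workflow would be written
# in YAML, whose schema is not reproduced here.
WORKFLOW = {
    "name": "basic_name_check",
    "actors": ["trim_whitespace", "flag_missing_name"],
}


def run_workflow(workflow: dict, records: Iterable[Record]) -> List[Record]:
    """Pipe records through each named actor in order and collect the output."""
    stream: Iterable[Record] = iter(records)
    for actor_name in workflow["actors"]:
        stream = ACTORS[actor_name](stream)
    return list(stream)


if __name__ == "__main__":
    sample = [
        {"scientificName": "  Quercus alba ", "catalogNumber": "A1"},
        {"scientificName": "   ", "catalogNumber": "A2 "},
    ]
    for row in run_workflow(WORKFLOW, sample):
        print(row)
```

In this sketch each actor only consumes and yields records, so a new actor can be added by writing one function and registering its name; the list in the workflow definition decides the order of execution.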

Read more at the project site.

Project PI(s)

PI: Bertram Ludäscher; co-PI: James Macklin (Agriculture and Agri-Food Canada); PI: James Hanken (Director, Museum of Comparative Zoology, Harvard)
No project contact.
Funded by: NSF
Grant number: DBI-1356751

Research Area(s)

Data Curation
Research and education initiatives focused on challenges associated with the curation and federation of digital collections for long-term distributed use.  Work in this area relates to all parts …

Data-driven Science
Activities in this domain aim to improve information transfer and integration, technology development and sustainability, and collaboration in the practice of science.  Several current cyberinfra…

Project Team

Publications

Franz, N. M., Musher, L. J., Brown, J. W., Yu, S., & Ludäscher, B. (2019). Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion. PLoS Computational Biology, 15(2), e1006493.

McPhillips, T., Bowers, S., Belhajjame, K., & Ludäscher, B. (2015, July). Retrospective provenance without a runtime provenance recorder. In Proceedings of the 7th USENIX Conference on Theory and Practice of Provenance (pp. 1-1). USENIX Association.

McPhillips, T., Song, T., Kolisnik, T., Aulenbach, S., Belhajjame, K., ... & Ludäscher, B. (2015, February). YesWorkflow: A User-Oriented Language-Independent Tool for Recovering Workflow Structure, Provenance, and Semantics from Scripts. Paper presented at the 10th International Digital Curation Conference, London, UK.