From documents to datasets: A MediaWiki-based method of annotating and extracting species observations

Full APA Reference

Thomer, A., Vaidya, G., Guralnick, R., Bloom, D., & Russell, L. (2012). From documents to datasets: A MediaWiki-based method of annotating and extracting species observations. ZooKeys, 2012(209), 235–253. doi: 10.3897/zookeys.209.3247

Publication Abstract

Biological field notebooks are records of lives spent outside in nature. Part diary, part scientific record, field notebooks often contain details required to understand the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a model workflow to generate structured outputs while also maintaining linkages to the original texts. The first step in this workflow was placing already digitized and transcribed field notebooks from the University of Colorado Museum of Natural History founder, Junius Henderson, on Wikisource, an open manuscript-editing platform. Next, we created Wikisource-specific templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and placed into Darwin Core compliant record sets. Finally, these record sets were vetted, specifically to provide valid taxon names, via a process we call “taxonomic referencing.” The result is identification and mobilization of 1076 observations from three of Henderson’s thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn.
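
The annotation-to-record step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the template names (`taxon`, `place`, `dated`), their `{{name|value|display text}}` syntax, the `sourcePage` key, and the sample sentence are all hypothetical and made up for illustration. Only `scientificName`, `eventDate`, and `locality` are actual Darwin Core terms. The sketch also works directly on wiki markup rather than on the XML conversion the abstract mentions, and it assumes each taxon annotation inherits the most recent date and place annotated before it on the page.

```python
import re

# Hypothetical template syntax: {{name|value}} or {{name|value|display text}}.
# The template names are assumptions, not the ones used on Wikisource.
TEMPLATE = re.compile(r"\{\{(taxon|place|dated)\|([^|}]+)(?:\|[^}]*)?\}\}")


def extract_records(page_title, wikitext):
    """Return one Darwin Core-style dict per taxon annotation on the page."""
    records = []
    current_date = None   # most recent {{dated|...}} value seen so far
    current_place = None  # most recent {{place|...}} value seen so far
    for match in TEMPLATE.finditer(wikitext):
        kind = match.group(1)
        value = match.group(2).strip()
        if kind == "dated":
            current_date = value
        elif kind == "place":
            current_place = value
        else:
            # A taxon annotation becomes one observation record.
            records.append({
                "scientificName": value,
                "eventDate": current_date,
                "locality": current_place,
                # Hypothetical key: keeps the link back to the narrative text.
                "sourcePage": page_title,
            })
    return records


# Invented sample text, not a quotation from Henderson's notebooks.
sample = (
    "{{dated|1905-07-04}} Walked up the canyon near "
    "{{place|Boulder, Colorado}}. Saw several "
    "{{taxon|Junco hyemalis|juncos}} and one {{taxon|Cyanocitta stelleri}}."
)
records = extract_records("Notebook 1, page 12", sample)
for record in records:
    print(record)
```

Storing the source page on every record is one simple way to preserve the link to the original text that the workflow emphasizes. The vetting of taxon names ("taxonomic referencing") would be a separate later step and is not shown here.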