Completeness, Coverage & Equivalence in Scientific Data Records

Full APA Reference

Thomer, A., Baker, K., Sacchi, S., & Dubin, D. (2012). Completeness, Coverage & Equivalence in Scientific Data Records. Poster presented at the ASIST 2012 Annual Meeting, Baltimore, MD.

Publication Abstract

Earlier we asked, "When is a record data and when is it a fish?" (Wickett et al., 2012a). In this work, we ask, "When and in what contexts are a record and a fish equivalent?" We describe and compare a collection of potentially equivalent records describing a Mola mola, or Ocean Sunfish, specimen. We calculate the Metadata Coverage Index (MCI) of each record and explore the use of the Systematic Assertion Model (SAM; Dubin, 2010) to support investigation of the assertions contained in these data records.

Natural history museum specimen records are increasingly provisioned and discovered online through cloud-hosted databases such as GBIF and VertNet. While increased use of standard vocabularies like Darwin Core means that these records are more easily aggregated and made interoperable (Wieczorek et al., 2012), the act of cross-walking legacy data and then transferring records from local to cloud-based databases with different representation formats, encodings, and harvesting protocols results in the proliferation of different versions of the "same" record. Depending on the vocabulary and/or schema used, these roughly equivalent records make different amounts and types of data available, and, thus, their fitness-for-purpose or analytic potential in different contexts varies (Hill et al., 2010; Palmer, Weber & Cramer, 2011). In prior work (Wickett et al., 2012a, 2012b) we considered a Mola mola species occurrence record pulled from a Darwin Core Archive (DwC-A) file available on the University of Kansas Biodiversity Institute Integrated Publishing Toolkit (KUBI IPT) installation to explore these issues (" - Darwin Core Archives"). Here, we compare five records downloaded from different data providers describing this same specimen, explore the metadata coverage and completeness of these records, and more fully discuss the nuances of determining their equivalence. SAM was used to begin comparing conflicting assertions between data sources when they arose.
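
As a rough illustration of the kind of comparison described above, the following Python sketch computes a coverage score for each record and lists the fields on which two records disagree. It assumes that the coverage index can be read simply as the percentage of a chosen reference field list that is populated in a record; the poster's exact calculation may differ. The reference field list (a handful of Darwin Core term names), the function names, and the two sample records are all hypothetical and contain no data from the records studied. The conflict check is a plain value comparison and is not an implementation of SAM.

```python
from typing import Any, List, Mapping, Sequence

# Hypothetical reference field list: a small subset of Darwin Core term names.
# The field list actually used for the poster is not given in the abstract.
REFERENCE_TERMS: List[str] = [
    "scientificName", "catalogNumber", "institutionCode", "eventDate",
    "decimalLatitude", "decimalLongitude", "recordedBy", "locality",
]


def is_populated(value: Any) -> bool:
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is not None and str(value).strip() != ""


def coverage_index(record: Mapping[str, Any],
                   terms: Sequence[str] = REFERENCE_TERMS) -> float:
    """Percentage of reference terms that are populated in the record."""
    if not terms:
        raise ValueError("reference term list must not be empty")
    filled = sum(1 for term in terms if is_populated(record.get(term)))
    return 100.0 * filled / len(terms)


def conflicting_terms(a: Mapping[str, Any], b: Mapping[str, Any],
                      terms: Sequence[str] = REFERENCE_TERMS) -> List[str]:
    """Terms populated in both records whose values differ after trimming."""
    return [
        term for term in terms
        if is_populated(a.get(term)) and is_populated(b.get(term))
        and str(a[term]).strip() != str(b[term]).strip()
    ]


# Made-up sample records standing in for two providers' versions of one specimen.
provider_a = {
    "scientificName": "Mola mola", "catalogNumber": "EX-0001",
    "institutionCode": "EX", "eventDate": "1900-01-01",
    "locality": "example locality", "recordedBy": "A. Collector",
}
provider_b = {
    "scientificName": "Mola mola", "catalogNumber": "EX-0001",
    "institutionCode": "EX", "eventDate": "1900-01-02",
    "locality": "  ", "recordedBy": None,
}

if __name__ == "__main__":
    for name, rec in (("provider_a", provider_a), ("provider_b", provider_b)):
        print(name, coverage_index(rec))
    print("conflicts:", conflicting_terms(provider_a, provider_b))
```

A score of this kind depends entirely on the reference field list chosen, so two versions of the "same" record are only comparable when scored against the same list.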