CIRSS Seminar - "Representing Identity and Equivalence for Scientific Data," from AGU 2012

Friday, November 30, 2012
4:00pm - 5:00pm

126 LIS

Session leader: Karen Wickett
Description: To be presented as part of a session on "Data Stewardship, Citation With Confidence, and Preparing Next Generation of Data Managers," at the upcoming American Geophysical Union (AGU) Fall Meeting, 3-7 December 2012 in San Francisco.

* Paper authors: Karen M Wickett, Simone Sacchi, David Dubin, Allen H Renear

* Abstract: Matters of equivalence and identity are central to the stewardship of scientific data. In order to properly prepare for and manage the curation, preservation and sharing of digitally-encoded data, data stewards must be able to characterize and assess the relationships holding between data-carrying digital resources. However, identity-related questions about resources and their information content may not be straightforward to answer: for example, what exactly does it mean to say that two files contain the same data, but in different formats? Information content is frequently distinguished from particular representations, but there is no adequately developed shared understanding of what this really means and how the relationship between content and its representations holds.

The Data Concepts group at the Center for Informatics Research in Science and Scholarship (CIRSS), University of Illinois at Urbana-Champaign, is developing a logic-based framework of fundamental concepts related to scientific data to support curation and integration. One project goal is to develop precise accounts of information resources carrying the same data. We present two complementary conceptual models for information representation: the Basic Representation Model (BRM) and the Systematic Assertion Model (SAM). We show how these models provide an analytical account of digitally-encoded scientific data and a precise understanding of identity and equivalence.

The Basic Representation Model identifies the core entities and relationships involved in representing information carried by digital objects. In BRM, digital objects are symbol structures that express propositional content, and stand in layered encoding relationships. For example, an RDF description may be serialized as either XML or N3, and those expressions in turn may be encoded as either UTF-8 or UTF-16 sequences. Defining this encoding stack reveals distinctions necessary for a precise account of identity and equivalence relationships.
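
The lowest layer of the encoding stack described above can be illustrated with a minimal Python sketch. It is not part of BRM itself: the one-statement description and its `example.org` URIs are hypothetical, and the variable names are chosen only for illustration. The same symbol structure is encoded once as UTF-8 and once as UTF-16, yielding different byte sequences that decode to identical symbols.

```python
# Hypothetical one-statement description in an N3-like syntax (illustrative only).
symbols = '<http://example.org/sample1> <http://example.org/temperature> "21.5" .'

# Two encodings of the same symbol structure.
utf8_bytes = symbols.encode("utf-8")
utf16_bytes = symbols.encode("utf-16")  # Python prepends a byte order mark

# At the byte layer the two objects differ...
bytes_identical = utf8_bytes == utf16_bytes

# ...but one layer up they are the same symbol structure.
symbols_identical = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16")

print("same bytes:  ", bytes_identical)
print("same symbols:", symbols_identical)
```

Asking whether the two byte sequences are "the same" therefore has different answers depending on which layer of the stack the question is posed at.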

The Systematic Assertion Model focuses on key provenance events through which propositional content and symbol structures acquire the status of data content and data, respectively. Attention is on events such as a selection of symbols to express propositional content, or an appeal to observational evidence to advance a claim. SAM explicitly identifies data as the primary form of expression, the one directly expressing content, for a systematic assertion: an assertion where claims are warranted by an observation or a computation event.

Under these models, equivalence relationships may hold between different data expressing the same content, or between different encodings of the same data. Equivalence relationships also hold among different data supporting the same claim and when contrasting claims are based on the same observations. SAM and BRM support a fine-grained characterization of scientific equivalence relationships that can be documented through ordinary data stewardship practices.
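
As a rough illustration of different data expressing the same content, the following Python sketch compares two toy serializations of the same set of statements. Everything in it is an assumption made for the example: the two formats, the helper functions `triples_from_lines` and `triples_from_json`, and the sample values are hypothetical stand-ins, not RDF parsers and not definitions taken from BRM or SAM.

```python
import json

def triples_from_lines(text):
    """Parse a toy line format: one 'subject predicate object' triple per line."""
    return frozenset(
        tuple(line.split(" ", 2)) for line in text.splitlines() if line.strip()
    )

def triples_from_json(text):
    """Parse a toy JSON format: a list of [subject, predicate, object] lists."""
    return frozenset(tuple(triple) for triple in json.loads(text))

# Two hypothetical serializations of the same two statements, in different order.
line_form = "sample1 temperature 21.5\nsample1 depth 10"
json_form = '[["sample1", "depth", "10"], ["sample1", "temperature", "21.5"]]'

# The symbol structures differ...
same_symbols = line_form == json_form

# ...while the sets of statements they express coincide.
same_content = triples_from_lines(line_form) == triples_from_json(json_form)

print("same symbols:", same_symbols)
print("same content:", same_content)
```

Comparing parsed statements as unordered sets is one simple design choice for a content-level check; it ignores statement order and surface syntax, which is the kind of distinction the abstract draws between data and the content they express.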

