CIRSS Speaker Series, Spring 2024: Trustworthy Computational Science

The CIRSS speaker series continues in Spring 2024 on the theme of “Trustworthy Computational Science: Transparency, Reproducibility and Reuse,” with speakers addressing the challenge of enabling researchers to build on prior computational research with confidence. We meet most Fridays, 11am-noon US Central Time, on Zoom. Our Spring series will be led by Timothy McPhillips, with co-hosts Bertram Ludäscher, Santiago Núñez-Corrales, and Matt Turk. This event is open to the public, and everyone is welcome to attend. The series is hosted by the Center for Informatics Research in Science and Scholarship (CIRSS) of the School of Information Sciences at the University of Illinois at Urbana-Champaign. If you have any questions, please contact Timothy McPhillips and Janet Eke.

Participate: To join a live session, follow the “Join Here” link for the current week below to access the iSchool event page for the talk. There, click the “PARTICIPATE online” button to join the live Zoom session. Recordings of past talks, where available, can be found via the “Recording” links below.

Follow: To receive weekly updates on upcoming talks, subscribe to our CIRSS Seminars mailing list at https://lists.ischool.illinois.edu/lists/info/cirss-seminars. You can also add upcoming talks to your calendar via Google Calendar or Outlook.

Spring 2024 Speakers

Lars Vilhuber, Cornell University
January 26, 2024, 11am-noon CT
Title: Trust and TRACE: Issues and solutions in reproducible social science

Abstract: How can we trust the integrity of results from research that relies on computations without repeating them? By certifying the successful original execution of a computational workflow that produced findings in situ. With certifications in hand, consumers of research can trust the transparency of results without necessarily repeating computations. I provide background, motivating examples that don’t quite get there, and possible next steps.

Bio: Lars Vilhuber holds a Ph.D. in Economics from Université de Montréal, Canada, and is currently on the faculty of the Cornell University Economics Department. He has interests in labor economics, statistical disclosure limitation and data dissemination, and reproducibility and replicability in the social sciences. He is the Data Editor of the American Economic Association, and Managing Editor of the Journal of Privacy and Confidentiality.

Cheryl Thompson, University of North Carolina
February 2, 2024, 11am-noon CT
Title: Curating for Transparency and Verification in Political Science

Abstract: Across the disciplines, journals are adopting policies requiring authors to share data, code, and other materials, sometimes subject to verification to ensure the transparency and reproducibility of results prior to publication. Since 2015, the UNC Odum Institute has been responsible for the verification of quantitative analyses for two political science journals, including a curation review and re-execution of computational steps to ensure transparency and accuracy of results. The verification audit is seen as essential to ensuring the transparency of published results, in support of future replications. However, the process is labor intensive and increases the time to publication, raising questions about the costs and feasibility of these and similar policies. In this talk, I will discuss the curation and verification workflows, types of verification errors, and challenges faced in implementing these audits.

Bio: Cheryl A. Thompson is a Research Data Archivist at the H.W. Odum Institute for Research in Social Science, University of North Carolina. As an archivist, she works on the journal data verification service and all aspects of the data repository. She is particularly interested in promoting the trustworthiness of science through better data sharing, transparency of practice, and responsive data education. Thompson received her Ph.D. from the School of Information Sciences, University of Illinois.

Robert Sisneros, National Center for Supercomputing Applications, UIUC
February 9, 2024, 11am-noon CT
Title: Browser-Based Analytics: The Practical Reality and Developing Robust, Reliable, and Reproducible Analyses in Spite of It

Abstract: Deploying a dashboard interface to a database is pretty straightforward: make a few tables that correspond to charts, and you’re done. But that’s only the first request. Creating a series of reliable, dynamic, interconnected charts for exploration and analysis is more than difficult with just a few tables, and in my experience even understanding the structure of the data can be challenging. In this talk I will introduce the Researchable Archives for Interactive Visualizations (RAIV) project. RAIV is a tool to capture and archive web visualizations, including their interactions, into self-contained objects with many potential uses. I will motivate the design by outlining some of the many pitfalls I encountered, first when delivering production analytics and then when attempting to shoehorn an analytics library into a system with a familiar, scientific visualization data model.

Bio: Robert Sisneros is a Senior Research Scientist at the National Center for Supercomputing Applications and specializes in scientific visualization and data analysis. Robert’s research interests in I/O and visualization are primarily aligned with issues of particular importance to high performance computing. These include in situ visualization, data models and representations, parallel analysis algorithms, I/O parameter optimization, and “big data” analytics. Robert earned a Bachelor of Science in Mathematics and Computer Science from Austin Peay State University, and a Master of Science and Doctor of Philosophy in Computer Science from the University of Tennessee, Knoxville.

George Alter, University of Michigan
February 16, 2024, 11am-noon CT
Title: SDTL and SDTH: Machine Actionable Descriptions of Data Transformations

Abstract: Realizing the promise of research transparency and the FAIR principles (Wilkinson et al., 2016) requires provenance metadata, i.e., documentation of the origins, contents, and meaning of data. SDTL (Structured Data Transformation Language) and SDTH (Structured Data Transformation History) provide machine-actionable metadata about scripts used to process and transform statistical data in languages like SPSS, SAS, Stata, R, and Python. Unlike most data provenance models, SDTL and SDTH document individual commands within these scripts, rather than treating scripts as ‘black boxes’ described only by their inputs and outputs. SDTL was created to work with metadata standards, like the Data Documentation Initiative (DDI) and Ecological Metadata Language (EML), so that descriptions of data transformations can be integrated into data production workflows. Since SDTL is structured in formats like JSON and XML, it can also serve as an intermediate language for translation between other languages. SDTH extends the W3C PROV model to facilitate basic queries about the origins and effects of variables, dataframes, and files in data transformation scripts.
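To make the idea of command-level metadata concrete, here is a minimal sketch, expressed as a Python dictionary, of what a structured description of a single Stata transformation command might contain. The property names are illustrative assumptions, not the official SDTL schema.

```python
import json

# Hypothetical, simplified sketch of command-level transformation metadata
# in the spirit of SDTL; field names are illustrative, not the real schema.
step = {
    "command": "Compute",                          # kind of operation performed
    "sourceText": "gen log_income = log(income)",  # original Stata source line
    "consumes": [{"variableName": "income"}],      # variables the command reads
    "produces": [{"variableName": "log_income"}],  # variables the command creates
    "expression": "log(income)",                   # the transformation applied
}
print(json.dumps(step, indent=2))
```

Because each command is captured at this granularity, a tool can chain the `consumes`/`produces` links to answer queries like “which raw variables does this derived variable depend on?”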

Bio: George Alter is Research Professor Emeritus in the Institute for Social Research at the University of Michigan. His research integrates theory and methods from demography, economics, and family history with historical sources to understand demographic behaviors in the past. From 2007 to 2016 Alter was Director of the Inter-university Consortium for Political and Social Research, the world’s largest archive of social science data. He has been active in international efforts to promote research transparency, data sharing, and secure access to confidential research data, and he has worked on new metadata standards that improve the reusability and interoperability of research data. His current projects aim to automate the capture of metadata from statistical analysis software, to compare fertility transitions in contemporary and historical populations, and to create a FAIR vocabulary of terms used in population research.

Michael Barton, Arizona State University
February 23, 2024, 11am-noon CT
Title: The Need for Transparency in Socioecological System Science

Abstract: We live in a world dominated by complexly coupled human and natural landscapes. Research into how these landscapes emerged and evolved is key to understanding today’s world, and an important goal of historical sciences like archaeology and paleoecology. Because we cannot directly observe the processes that created coupled human-natural landscapes, computational modeling is becoming an increasingly important tool for the science of long-term dynamics in socioecological systems. The wide range of data needed to generate and to validate increasingly sophisticated models of these complexly coupled systems is beyond the capacity of any individual or research project to generate, making FAIR data critical for advances in computational socioecological science. I draw on examples from research in the western Mediterranean to illustrate the important interplay between models, data, and open science. 

Bio: Michael Barton is a complex system scientist and Professor in the Schools of Complex Adaptive Systems and Human Evolution & Social Change at Arizona State University (USA). He is Executive Director of the Open Modeling Foundation, a global consortium of organizations to promote standards and best practices in computational modeling across the social and natural sciences. He also directs the Network for Computational Modeling in Social and Ecological Sciences (CoMSES.Net), an international scientific network to enable accessibility, open science, and best practices for computation in the socio-ecological sciences. Barton received his BA from the University of Kansas in Anthropology/Archaeology, and MA and PhD from the University of Arizona in Anthropology/Archaeology and Geosciences. His research centers around long-term human ecology, landscape dynamics, and the multi-dimensional interactions between social and biophysical systems, integrating computational modeling, geospatial technologies, and data science with geoarchaeological field studies. Barton has directed transdisciplinary research on hunter-gatherers and small-holder farmers in the Mediterranean and North America for over three decades, and directs research on human-environmental interactions in the modern world. He is a member of the open-source GRASS GIS Development Team and Project Steering Committee, dedicated to making advanced geospatial technologies openly accessible to the world. Web page and CV at: http://www.public.asu.edu/~cmbarton.

Jong S. Lee, National Center for Supercomputing Applications, UIUC
March 1, 2024, 11am-noon CT
Title: IN-CORE: Modelling Platform for Community Resilience

Abstract: The National Institute of Standards and Technology (NIST) funded the multi-university, five-year Center of Excellence for Risk-Based Community Resilience Planning (CoE), headquartered at Colorado State University, to develop the measurement science supporting community resilience assessment. This measurement science is implemented on a platform called the Interdependent Networked Community Resilience Modeling Environment (IN-CORE). On IN-CORE, users can run scientific analyses that model the impact of natural hazards on communities and the communities’ resilience against those impacts. The CoE identified the need for a modeling environment that researchers can easily use to develop their own models and data; to meet this need, IN-CORE was designed and developed with a web-enabled architecture. The platform is built on a Kubernetes cluster using Docker container technology, on which a customized JupyterHub, a Python library of scientific analyses, web services, and lightweight web applications are implemented. This talk will present the architecture and implementation of IN-CORE, as well as plans for its open-source community.
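For a flavor of how such analyses are invoked from the platform’s Python library (pyincore), here is an abbreviated sketch based on the library’s documented pattern; the dataset and hazard IDs are placeholders, and a real run requires additional inputs (such as fragility mappings) described in the pyincore documentation.

```python
# Abbreviated sketch of an IN-CORE analysis run via pyincore; the IDs below
# are placeholders, and a complete run needs further inputs per the docs.
from pyincore import IncoreClient
from pyincore.analyses.buildingdamage import BuildingDamage

client = IncoreClient()  # authenticate to the IN-CORE web services
bldg_dmg = BuildingDamage(client)
bldg_dmg.load_remote_input_dataset("buildings", "<building-inventory-id>")
bldg_dmg.set_parameter("hazard_type", "earthquake")
bldg_dmg.set_parameter("hazard_id", "<hazard-id>")
bldg_dmg.set_parameter("result_name", "building_damage_result")
bldg_dmg.run_analysis()  # produces a building damage-state dataset
```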

Bio: Dr. Jong S. Lee is Deputy Associate Director of the Software Directorate at the National Center for Supercomputing Applications (NCSA). His research interests focus on designing, developing, and operating end-to-end cyberenvironments supporting various research and education communities. He is particularly interested in the role of geographic information sciences and systems in research cyberenvironments. Current representative projects include the NIST-funded IN-CORE: Modelling Platform for Community Resilience, Ergo: Seismic Risk Assessment Systems, the Great Lakes to Gulf Virtual Observatory, the GeoStreaming Platform, and DataWolf: Scientific Workflow Systems.

Albert Kettner, University of Colorado Boulder
March 22, 2024, 11am-noon CT
Title: CSDMS: advancing research in the Geosciences by supporting finding, accessing, operating and coupling model integration tools for reproducible science

Abstract: The Geosciences modeling community has made significant progress. Prior to the turn of the century, it was rare for numerical models to be publicly available. Through collaboration with publishers and science foundations, as well as through technological advancements, this culture shifted to the extent that the FAIR principles (Findable, Accessible, Interoperable, and Reusable) are now being embraced for numerical models as well. After all, the rationale behind the FAIR principles for data applies equally well to numerical models. Tasks involved in working with numerical models include replicating previous findings, applying existing models to new problems or locations, integrating models and/or data operations in a sequential workflow, coupling models to investigate feedback loops, enhancing model code with new algorithms or features, and creating new models. Implementing FAIR principles can improve the efficiency of each of these tasks. The Community Surface Dynamics Modeling System (CSDMS), an NSF-supported cyberinfrastructure facility that facilitates research in earth-surface geoscience, employs various approaches to advocate for and promote FAIR principles in the development, application, and dissemination of numerical models, which I will highlight in this presentation. Additionally, the CSDMS community stands at another pivotal moment where efforts are being made to promote techniques for publishing reproducible and transparent modeling research. Reproducibility is a fundamental facet of scientific inquiry, yet it is not consistently achieved in individual numerical studies; reproducible and transparent studies remain more the exception than the rule. I will outline collaborative strategies for promoting reproducibility and transparency in numerical modeling within the geosciences community.
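One concrete way CSDMS makes models findable, interoperable, and reusable is its Basic Model Interface (BMI): a standard set of control and query functions a model exposes so that frameworks can initialize, advance, and interrogate it without knowing its internals. The toy diffusion model below sketches the pattern in Python; it is deliberately simplified, and the full BMI specification (e.g., the bmipy package) defines many more methods.

```python
class DiffusionModel:
    """Toy 1-D landscape-diffusion model exposing a BMI-style interface.

    Simplified sketch: the real CSDMS BMI defines many more methods
    (grid queries, variable units, set_value, etc.).
    """

    def initialize(self, config_file):
        # A real model would read parameters from config_file.
        self.dt, self.kappa = 1.0, 0.1
        self.z = [0.0, 1.0, 4.0, 1.0, 0.0]  # elevation profile
        self.time = 0.0

    def update(self):
        # Advance one step with an explicit finite-difference scheme;
        # endpoints are held fixed as boundary conditions.
        z = self.z
        self.z = [z[i] if i in (0, len(z) - 1)
                  else z[i] + self.kappa * self.dt * (z[i-1] - 2*z[i] + z[i+1])
                  for i in range(len(z))]
        self.time += self.dt

    def get_value(self, name):
        return self.z if name == "land_surface__elevation" else None

    def finalize(self):
        pass  # release resources in a real model

# Any BMI-aware framework can drive the model through the same calls:
model = DiffusionModel()
model.initialize("config.yaml")
for _ in range(10):
    model.update()
print(model.get_value("land_surface__elevation"))
```

Because every compliant model answers the same calls, coupling two models reduces to alternating their `update()` steps and exchanging values by standard variable names.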

Bio: Albert Kettner is an Associate Research Professor at the University of Colorado and Associate Director of the Institute of Arctic and Alpine Research (INSTAAR) at CU-Boulder. He also directs the DFO – Flood Observatory, an entity at the University of Colorado that observes surface water changes (like flood disasters) using satellite data. Kettner received a Ph.D. in Civil Engineering & Geosciences (2007) at Delft University of Technology, the Netherlands, where he studied local and global fluvial supply dynamics to the coastal zone. With numerical models he investigates the impact of long-term climate and sea-level controls on riverine water and sediment fluxes and how these fluxes change over time. On shorter timescales, Kettner focuses on anthropogenic changes (e.g., altering of land use and placement of reservoirs) and how these impact water discharge and sediment flux. Kettner has been intimately involved from the start (2006) in the Community Surface Dynamics Modeling System (CSDMS) at the University of Colorado, USA, the numerical modeling integration facility for the geosciences supported by the U.S. National Science Foundation. He is an active advocate of freely available, open-source code for numerical models of Earth surface processes.

Juliana Freire, NYU Tandon School of Engineering
March 29, 2024, 11am-noon CT
Title: Dataset Search for Data Discovery, Augmentation, and Explanation

Abstract: Recent years have seen an explosion in our ability to collect and catalog immense amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should in theory allow us to make progress on many of our most important scientific and societal questions. However, this opportunity is often missed due to a central technical barrier: it is currently nearly impossible for domain experts to weed through the vast amount of available information to discover datasets that are needed for their specific application. While search engines have addressed the discovery problem for Web documents, there are many new challenges involved in supporting the discovery of structured data—from crawling the Web in search of datasets, to the need for dataset-oriented queries and new strategies to rank and display results. I will discuss these challenges and present our recent work in this area. In particular, I will introduce a new class of data-relationship queries that, given a dataset, identifies related datasets; I will describe a collection of methods that efficiently support different kinds of relationships that can be used for data explanation and augmentation; and I will demonstrate Auctus, an open-source dataset search engine that we have developed at the NYU Visualization, Imaging, and Data Analysis (VIDA) Center.
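As a toy illustration of one relationship such a system must detect, the sketch below scores a candidate table for joinability with a query table by the overlap of their key columns. This is illustrative only, not Auctus’s actual indexing or ranking method.

```python
def join_score(query_col, candidate_col):
    """Fraction of the query's key values that appear in the candidate column.

    A high score suggests the candidate dataset could be joined to the
    query dataset on this column to augment it with new attributes.
    """
    q, c = set(query_col), set(candidate_col)
    return len(q & c) / len(q) if q else 0.0

# Hypothetical example: can weather-station data augment a taxi dataset?
taxi_zones = ["JFK", "LGA", "EWR", "SoHo"]
weather_station_zones = ["JFK", "LGA", "Midtown"]
print(join_score(taxi_zones, weather_station_zones))  # 0.5
```

A dataset search engine must compute relationships like this at web scale, which is why compact column sketches and indexes, rather than exhaustive pairwise comparison, are central to the problem.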

Bio: Juliana Freire is an Institute Professor at the Tandon School of Engineering and Professor of Computer Science and Engineering and Data Science at New York University. She served as the elected chair of ACM SIGMOD and as a council member of the Computing Community Consortium (CCC), and was the NYU lead investigator for the Moore-Sloan Data Science Environment, a grant awarded jointly to UW, NYU, and UC Berkeley. She develops methods and systems that enable a wide range of users to obtain trustworthy insights from data. This spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, as well as different application areas, including urban analytics, misinformation, predictive modeling, and computational reproducibility. She is an active member of the database and Web research communities, with over 250 technical papers (including 12 award-winning papers), several open-source systems, and 12 U.S. patents. According to Google Scholar, her h-index is 66 and her work has received over 19,000 citations. She is an ACM Fellow, a AAAS Fellow, and the recipient of an NSF CAREER award, two IBM Faculty awards, and a Google Faculty Research award. She was awarded the ACM SIGMOD Contributions Award in 2020. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo!, and IBM. She received M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook, and a B.S. degree in computer science from the Federal University of Ceará (Brazil).

Adelinde Uhrmacher, Universität Rostock
April 12, 2024, 11am-noon CT
Title: Adaptive simulation models for trustworthy computational science

Abstract: Transparency, reproducibility, and reuse in simulation rely on making the different artifacts of simulation studies and the context of their generation explicit and accessible. Here, we want to stress another aspect of trustworthiness in simulation, i.e., the need for simulation models to adapt to the modeled system, changing knowledge, data, and research questions. Adaptation may be an important part of the simulation model itself, depending on the system to be modeled. The simulation model is refined, revised, and extended depending on the knowledge gained and the data available, and is thus subject to frequent adaptations during a simulation study. The reuse of simulation models across simulation studies relies on a suitable adaptation of simulation models according to new research questions, system variations, knowledge, and data, forming a family of models. Last but not least, digital twins establish a close connection between the physical and the digital twin through frequent adaptation. Adaptive simulation models pose specific challenges for methodological support, including suitable domain-specific modeling languages, adaptive simulators, recording and exploitation of provenance, and automatically generating simulation experiments and models.

Bio: Adelinde Uhrmacher is a Professor at the University of Rostock and Head of the Modeling and Simulation Group at the Institute of Visual and Analytic Computing. Her research is aimed at methodological developments, particularly for stochastic, discrete-event, multi-level modeling and simulation, including domain-specific languages, simulation algorithms, and computational support for conducting simulation studies. Applications from demography, cell biology, and (socio-)ecology drive many of her methodological developments. She received the ACM SIGSIM Distinguished Contributions Award in 2018. She has been editor-in-chief of the ACM Transactions on Modeling and Computer Simulation and a member of the ACM Task Force on Data, Software, and Reproducibility in Publication.

Jingrui He, School of Information Sciences, UIUC
April 19, 2024, 11am-noon CT
Title: Graph Transfer Learning

Abstract: In transfer learning, the general goal is to leverage abundant label information from one or more source domains to build a high-performing predictive model in a target domain with limited or no label information. While many research efforts have focused on the IID setting, where the examples from both the source and target domains are considered independent and identically distributed within each domain, more recent work has been dedicated to the non-IID setting. In particular, many real applications have motivated the study of transferable graph learning, where the data from both the source and target domains are represented as graphs. In this talk, I will introduce our recent work in this direction using graph neural networks for both regression and classification. For regression, starting from the transferable Gaussian process for IID data, I will discuss a generic graph-structured Gaussian process framework for adaptively transferring knowledge across graphs under either homophily or heterophily assumptions. For classification, I will present a novel Graph Subtree Discrepancy to measure the graph distribution shift between source and target graphs, which leads to generalization error bounds for cross-network transfer learning, including both cross-network node classification and link prediction tasks. Towards the end, I will also discuss the trustworthy aspects of graph transfer learning.
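As a minimal illustration of the general recipe behind cross-network transfer (not the specific methods in the talk), the sketch below pretrains a graph neural network on a label-rich source graph and fine-tunes the same weights on a sparsely labeled target graph, assuming PyTorch Geometric data objects.

```python
# Minimal sketch of cross-network node-classification transfer, assuming
# PyTorch Geometric `Data` objects; illustrative recipe, not the speaker's method.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hid, n_cls):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid)
        self.conv2 = GCNConv(hid, n_cls)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

def fit(model, data, mask, epochs, lr=0.01):
    # Train on the nodes selected by `mask` only.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        out = model(data.x, data.edge_index)
        F.cross_entropy(out[mask], data.y[mask]).backward()
        opt.step()

# Pretrain on the label-rich source graph, then fine-tune the same weights
# on the sparsely labeled target graph (hypothetical `source`/`target` data):
# model = GCN(source.num_features, 64, n_classes)
# fit(model, source, source.train_mask, epochs=200)
# fit(model, target, target.train_mask, epochs=50, lr=0.001)
```

How well this recipe works depends on the distribution shift between the two graphs, which is exactly what discrepancy measures like the one in the talk aim to quantify.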

Bio: Jingrui He is currently a Professor at the School of Information Sciences, University of Illinois at Urbana-Champaign. She also holds a courtesy appointment with the Computer Science Department. Dr. He received her Ph.D. from Carnegie Mellon University in 2010. Her research focuses on heterogeneous machine learning, rare category analysis, active learning, and self-supervised learning, with applications in security, social network analysis, healthcare, agriculture, and manufacturing processes. Dr. He is the recipient of the 2016 NSF CAREER Award and the 2020 OAT Award, a three-time recipient of the IBM Faculty Award (2014, 2015, and 2018), and was selected for the IJCAI 2017 Early Career Spotlight. She has more than 160 publications at major conferences (e.g., ICLR, ICML, NeurIPS, IJCAI, AAAI, KDD) and in journals (e.g., TKDE, TKDD, DMKD), and is the author of two books. Her papers have received the Distinguished Paper Award at FAccT 2022, as well as Best of the Conference selections at ICDM 2016, ICDM 2010, and SDM 2010. Dr. He is a Distinguished Member of ACM, a Senior Member of AAAI, and a Senior Member of IEEE.

Marta Mattoso, COPPE-Federal University of Rio de Janeiro
April 26, 2024, 11am-noon CT
Title: Traceability for trust: applications and challenges

Abstract: Script-based applications, as in data science, span multiple systems that integrate legacy and newly developed software components to deliver value in models and scientific results. Traceability provides access to these end-to-end activities so that results can be trusted and reproduced. Hence, it becomes necessary to adopt techniques for tracking and correlating the relevant artifacts produced by script activities. Provenance data, as defined in W3C PROV, provides an abstraction that represents and correlates the artifacts to be tracked. In addition to representing metadata about those artifacts, traceability requires a derivation path, so that an artifact’s generation can be followed automatically. Provenance data has been added to frameworks that help execute scripts in data science, health, IoT, and other domains, aiming to provide security, trust, reproducibility, and explainability of script results. However, provenance support is often limited to metadata about the artifacts without access to their derivation paths, which limits trust and reproducibility. Using provenance representation for traceability in data science requires techniques to associate provenance data with a script execution without the cost and overhead of fully fledged data capture and process reengineering. Despite being around for many years, using and querying provenance data is still a challenge. This talk highlights different uses of provenance for trust, such as in data science, threat detection, and artifact authenticity. I will discuss current challenges in capturing provenance to trace back artifacts’ derivation paths, with examples of using provenance in machine learning scripts.
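As a minimal sketch of what recording a derivation path looks like in practice, the example below uses the Python prov package to express a hypothetical two-step script (cleaning, then training) as W3C PROV entities, activities, and relations; the artifact names are illustrative.

```python
# Minimal sketch of a W3C PROV derivation path for a two-step script,
# using the Python `prov` package; entity/activity names are illustrative.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

raw = doc.entity("ex:raw.csv")          # input artifact
clean = doc.entity("ex:clean.csv")      # intermediate artifact
model = doc.entity("ex:model.pkl")      # final artifact
cleaning = doc.activity("ex:clean_step")
training = doc.activity("ex:train_step")

doc.used(cleaning, raw)                 # clean_step read raw.csv
doc.wasGeneratedBy(clean, cleaning)     # ...and produced clean.csv
doc.used(training, clean)               # train_step read clean.csv
doc.wasGeneratedBy(model, training)     # ...and produced model.pkl
doc.wasDerivedFrom(model, raw)          # the end-to-end derivation path

print(doc.get_provn())                  # human-readable PROV-N serialization
```

With the `used`/`wasGeneratedBy` chain recorded, a query can walk backward from model.pkl to raw.csv, which is precisely the derivation-path access the abstract argues is often missing.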

Bio: Marta Mattoso is a Full Professor at COPPE-Federal University of Rio de Janeiro. Her subjects of interest in data science include aspects of large-scale data management. Among her interests is the use of provenance data to support humans in the loop during the parallel execution of many computing tasks in high-performance environments. She has supervised 90 graduate students. She is a CNPq level 1B research productivity fellow. Her research is applied to real problems, addressing scientific experiments in computational science workflows, including machine learning. She coordinates research projects financed by national and international agencies. She is a member of the specialists team of the WorkflowsRI project in the USA. She is a member of ACM and IEEE and a founding member of the Brazilian Computer Society. She serves on international conference program committees and is a member of the editorial boards of several international journals.