Reanalyses in the age of Machine Learning: Why Dataset Curation Matters Now More Than Ever

Mimi Rose Abel, Alexander J. Thompson, Ethan Gutmann, Kelly Mahoney, Rachel McCrary, Russ S. Schumacher, Laura Slivinski

Research output: Contribution to journal > Article > peer-review

Abstract

As machine learning becomes ever more prevalent within Earth and atmospheric science, clear and consistent descriptions of models, observations, and observations-based datasets, particularly reanalyses, are increasingly vital. Reanalyses remain foundational for climate and weather research, but advancements in data assimilation and model nudging methods, as well as increasingly complex physical parameterization options, mean that not all variables within reanalyses are equally constrained by observations. Because machine learning models are often trained and evaluated on such datasets, imprecise terminology and inadequate documentation can lead to a loss of information content and mislead users unfamiliar with data nuances.
This essay argues for more careful use of the term "reanalysis," emphasizing that it should be reserved for datasets that explicitly blend observations with models through data assimilation. It highlights the rise of "reanalysis-adjacent" datasets, as well as the growing disconnect between data producers and increasingly interdisciplinary users, particularly within the machine learning community. It offers guidance for dataset producers and users, alongside recommendations to enhance transparency, including renewed use of variable classification systems, better documentation of variable-specific uncertainties, and greater community-wide emphasis on data transparency. Without such efforts, Earth science datasets may be applied indiscriminately, regardless of fitness for purpose. Ensuring trustworthy and interpretable data is essential for maintaining the scientific integrity of Earth system modeling in the machine learning age.
Original language: American English
Journal: Bulletin of the American Meteorological Society (BAMS)
State: Submitted - 2025
