
Reanalyses in the Age of Machine Learning: Why Dataset Curation Matters Now More than Ever

  • National Oceanic and Atmospheric Administration
  • University of Colorado Boulder
  • Colorado State University

Research output: Contribution to journal › Article › peer-review

Abstract

As machine learning becomes ever more prevalent within Earth and atmospheric science, clear and consistent descriptions of models, observations, and observation-based datasets, particularly reanalyses, are increasingly vital. Reanalyses remain foundational for climate and weather research, but advancements in data assimilation and model nudging methods, as well as increasingly complex physical parameterization options, mean that not all variables within reanalyses are equally constrained by observations. Because machine learning models are often trained and evaluated on such datasets, imprecise terminology and inadequate documentation can lead to a loss of information content, mislead users unfamiliar with data nuances, lead to the training of flawed machine learning models, and ultimately result in model evaluations that do not realistically describe performance relative to observations. This essay argues for more careful use of the term “reanalysis,” emphasizing that it should be reserved for datasets that explicitly blend observations with models through data assimilation. It highlights the rise of “reanalysis-adjacent” datasets, as well as the growing disconnect between data producers and increasingly interdisciplinary users, particularly within the machine learning community. It offers guidance for dataset producers and users, alongside recommendations to enhance transparency, including renewed use of variable classification systems, better documentation of variable-specific uncertainties, and greater community-wide emphasis on data transparency. Without such efforts, Earth science datasets may be applied indiscriminately, regardless of fitness for purpose. Ensuring trustworthy and interpretable data is essential for maintaining the scientific integrity of Earth system modeling in the machine learning age.

Original language: English
Pages (from-to): E922-E931
Journal: Bulletin of the American Meteorological Society
Volume: 107
Issue number: 4
State: Published - Apr 2026

Keywords

  • Community
  • Data assimilation
  • Quality assurance/control
  • Reanalysis data
  • Uncertainty
