Abstract
As machine learning becomes ever more prevalent within Earth and atmospheric science, clear and consistent descriptions of models, observations, and observation-based datasets, particularly reanalyses, are increasingly vital. Reanalyses remain foundational for climate and weather research, but advancements in data assimilation and model nudging methods, as well as increasingly complex physical parameterization options, mean that not all variables within reanalyses are equally constrained by observations. Because machine learning models are often trained and evaluated on such datasets, imprecise terminology and inadequate documentation can cause a loss of information content, mislead users unfamiliar with data nuances, lead to the training of flawed machine learning models, and ultimately result in model evaluations that do not realistically describe performance relative to observations. This essay argues for more careful use of the term “reanalysis,” emphasizing that it should be reserved for datasets that explicitly blend observations with models through data assimilation. It highlights the rise of “reanalysis-adjacent” datasets, as well as the growing disconnect between data producers and increasingly interdisciplinary users, particularly within the machine learning community. It offers guidance for dataset producers and users, alongside recommendations to enhance transparency, including renewed use of variable classification systems, better documentation of variable-specific uncertainties, and greater community-wide emphasis on data transparency. Without such efforts, Earth science datasets may be applied indiscriminately, regardless of fitness for purpose. Ensuring trustworthy and interpretable data is essential for maintaining the scientific integrity of Earth system modeling in the machine learning age.
| Original language | English |
|---|---|
| Pages (from-to) | E922-E931 |
| Journal | Bulletin of the American Meteorological Society |
| Volume | 107 |
| Issue number | 4 |
| State | Published - Apr 2026 |
Keywords
- Community
- Data assimilation
- Quality assurance/control
- Reanalysis data
- Uncertainty
Fingerprint
Dive into the research topics of 'Reanalyses in the Age of Machine Learning: Why Dataset Curation Matters Now More than Ever'. Together they form a unique fingerprint.