When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to AI safety concerns about model reliability under distribution shift; highlights risks in training data curation practices that could cause deployed systems to behave unexpectedly in evolving real-world contexts.
Paper Details
Metadata
Abstract
Predictive models are typically trained on historical data to predict future outcomes. While it is commonly assumed that training on more historical data would improve model performance and robustness, data distribution shifts over time may undermine these benefits. This study examines how expanding historical data training windows under covariate shifts (changes in feature distributions) and concept shifts (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models. First, we perform a simulation study to explore scenarios with varying degrees of covariate and concept shifts in training data. Absent distribution shifts, we observe performance gains from longer training windows though they reach a plateau quickly; in the presence of concept shift, performance may actually decline. Covariate shifts alone do not significantly affect model performance, but may complicate the impact of concept shifts. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups; for intersectional groups, these effects are more complex and not simply additive. Second, we conduct an empirical case study of student retention prediction, a common machine learning application in education, using 12 years of student records from 23 minority-serving community colleges in the United States. We find concept shifts to be a key contributor to performance degradation when expanding the training window. Moreover, model fairness is compromised when marginalized populations have distinct data distribution shift patterns from their peers. Overall, our findings caution against conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to data distribution shifts, to improve model performance and fairness.
Summary
This paper investigates how adding historical training data can hurt model performance when temporal distribution shifts exist between past and present data. It challenges the common assumption that more training data is always better, showing that naive data expansion under temporal shifts can mislead models and degrade generalization to current distributions.
Key Points
- Adding older training data can harm model performance when the data distribution has shifted over time, contradicting the "more data is better" heuristic.
- Temporal distribution shift is a critical but underexplored failure mode in training data curation for ML systems.
- The paper pairs a controlled simulation study with an empirical case study of student retention prediction to isolate how covariate and concept shifts affect performance and fairness.
- Findings have implications for continual learning, model updates, and deployment in dynamic real-world environments.
- Results suggest practitioners should carefully evaluate data recency and relevance rather than defaulting to larger historical datasets.
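The interaction between training-window length and concept shift described above can be illustrated with a toy simulation. This sketch is my own construction, not the paper's actual simulation design: the slope relating a single feature to the outcome drifts linearly across "years" (a concept shift), covariates stay fixed, and a least-squares model is fit on windows of past years that expand backward in time. Test error on the newest year grows as older, shifted data is added:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_year(year, n=2000):
    """One 'year' of data; the true slope drifts over time (concept shift)."""
    x = rng.normal(0.0, 1.0, n)            # covariate distribution held fixed
    slope = 1.0 + 0.3 * year               # hypothetical linear concept drift
    y = slope * x + rng.normal(0.0, 0.5, n)
    return x, y

def window_mse(train_years, test_year):
    """Fit least-squares slope on a window of past years; MSE on the test year."""
    xs, ys = zip(*(make_year(t) for t in train_years))
    x_tr, y_tr = np.concatenate(xs), np.concatenate(ys)
    beta = (x_tr @ y_tr) / (x_tr @ x_tr)   # OLS slope, no intercept
    x_te, y_te = make_year(test_year)
    return float(np.mean((y_te - beta * x_te) ** 2))

# Expand the training window backward from year 9; always test on year 10.
for w in (1, 3, 6, 9):
    years = range(10 - w, 10)
    print(f"window={w}: test MSE = {window_mse(years, 10):.3f}")
```

Under this drift, the longest window fits an average of stale slopes, so its test error exceeds that of the most recent single year, mirroring the paper's qualitative finding that longer windows can hurt under concept shift.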
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Distributional Shift | Risk | 91.0 |
Cached Content Preview
[License: CC BY-NC-ND 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2509.01060v2 \[cs.CY\] 04 Sep 2025
# When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
Chengyuan Yao¹, Yunxuan Tang², Christopher Brooks², Rene F. Kizilcec³, Renzhe Yu¹
###### Abstract
Predictive models are typically trained on historical data to predict future outcomes. While it is commonly assumed that training on more historical data would improve model performance and robustness, data distribution shifts over time may undermine these benefits. This study examines how expanding historical data training windows under covariate shifts (changes in feature distributions) and concept shifts (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models.
First, we perform a simulation study to explore scenarios with varying degrees of covariate and concept shifts in training data. Absent distribution shifts, we observe performance gains from longer training windows though they reach a plateau quickly; in the presence of concept shift, performance may actually decline. Covariate shifts alone do not significantly affect model performance, but may complicate the impact of concept shifts. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups; for intersectional groups, these effects are more complex and not simply additive.
Second, we conduct an empirical case study of student retention prediction, a common machine learning application in education, using 12 years of student records from 23 minority-serving community colleges in the United States. We find concept shifts to be a key contributor to performance degradation when expanding the training window. Moreover, model fairness is compromised when marginalized populations have distinct data distribution shift patterns from their peers.
Overall, our findings caution against conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to data distribution shifts, to improve model performance and fairness.
Code — https://github.com/AEQUITAS-Lab/Distribution-Shift-AIES-2025
## Introduction
Machine learning applications have been widely deployed to facilitate decision making in social sectors such as healthcare and education (Dixon et al. [2024](https://arxiv.org/html/2509.01060v2#bib.bib11 ""); Broby [2022](https://arxiv.org/html/2509.01060v2#bib.bib7 ""); Sghir, Adadi, and Lahmer [2023](https://arxiv.org/html/2509.01060v2#bib.bib27 "")). In developing machine learning models, there is a common assumption that larger training d
... (truncated, 67 KB total)