When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to AI safety concerns about model reliability under distribution shift; highlights risks in training data curation practices that could cause deployed systems to behave unexpectedly in evolving real-world contexts.
Paper Details
Metadata
Abstract
Predictive models are typically trained on historical data to predict future outcomes. While it is commonly assumed that training on more historical data would improve model performance and robustness, data distribution shifts over time may undermine these benefits. This study examines how expanding historical data training windows under covariate shifts (changes in feature distributions) and concept shifts (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models. First, we perform a simulation study to explore scenarios with varying degrees of covariate and concept shifts in training data. Absent distribution shifts, we observe performance gains from longer training windows though they reach a plateau quickly; in the presence of concept shift, performance may actually decline. Covariate shifts alone do not significantly affect model performance, but may complicate the impact of concept shifts. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups; for intersectional groups, these effects are more complex and not simply additive. Second, we conduct an empirical case study of student retention prediction, a common machine learning application in education, using 12 years of student records from 23 minority-serving community colleges in the United States. We find concept shifts to be a key contributor to performance degradation when expanding the training window. Moreover, model fairness is compromised when marginalized populations have distinct data distribution shift patterns from their peers. Overall, our findings caution against conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to data distribution shifts, to improve model performance and fairness.
Summary
This paper investigates how adding historical training data can hurt model performance when temporal distribution shifts exist between past and present data. It challenges the common assumption that more training data is always better, showing that naive data expansion under temporal shifts can mislead models and degrade generalization to current distributions.
Key Points
- Adding older training data can harm model performance when the data distribution has shifted over time, contradicting the "more data is better" heuristic.
- Temporal distribution shift is a critical but underexplored failure mode in training data curation for ML systems.
- The paper pairs a controlled simulation study with an empirical case study of student retention prediction to isolate how covariate and concept shifts affect performance and fairness.
- Findings have implications for continual learning, model updates, and deployment in dynamic real-world environments.
- Results suggest practitioners should carefully evaluate data recency and relevance rather than defaulting to larger historical datasets.
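The interaction between training-window length and concept shift described above can be illustrated with a toy simulation. This sketch is my own construction, not the paper's actual simulation design: the slope relating a single feature to the outcome drifts linearly across "years" (a concept shift), covariates stay fixed, and a least-squares model is fit on windows of past years that expand backward in time. Test error on the newest year grows as older, shifted data is added:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_year(year, n=2000):
    """One 'year' of data; the true slope drifts over time (concept shift)."""
    x = rng.normal(0.0, 1.0, n)            # covariate distribution held fixed
    slope = 1.0 + 0.3 * year               # hypothetical linear concept drift
    y = slope * x + rng.normal(0.0, 0.5, n)
    return x, y

def window_mse(train_years, test_year):
    """Fit least-squares slope on a window of past years; MSE on the test year."""
    xs, ys = zip(*(make_year(t) for t in train_years))
    x_tr, y_tr = np.concatenate(xs), np.concatenate(ys)
    beta = (x_tr @ y_tr) / (x_tr @ x_tr)   # OLS slope, no intercept
    x_te, y_te = make_year(test_year)
    return float(np.mean((y_te - beta * x_te) ** 2))

# Expand the training window backward from year 9; always test on year 10.
for w in (1, 3, 6, 9):
    years = range(10 - w, 10)
    print(f"window={w}: test MSE = {window_mse(years, 10):.3f}")
```

Under this drift, the longest window fits an average of stale slopes, so its test error exceeds that of the most recent single year, mirroring the paper's qualitative finding that longer windows can hurt under concept shift.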
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Distributional Shift | Risk | 91.0 |
Cached Content Preview
[License: CC BY-NC-ND 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2509.01060v2 \[cs.CY\] 04 Sep 2025
# When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
Chengyuan Yao¹, Yunxuan Tang², Christopher Brooks², Rene F. Kizilcec³, Renzhe Yu¹
###### Abstract
Predictive models are typically trained on historical data to predict future outcomes. While it is commonly assumed that training on more historical data would improve model performance and robustness, data distribution shifts over time may undermine these benefits. This study examines how expanding historical data training windows under covariate shifts (changes in feature distributions) and concept shifts (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models.
First, we perform a simulation study to explore scenarios with varying degrees of covariate and concept shifts in training data. Absent distribution shifts, we observe performance gains from longer training windows though they reach a plateau quickly; in the presence of concept shift, performance may actually decline. Covariate shifts alone do not significantly affect model performance, but may complicate the impact of concept shifts. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups; for intersectional groups, these effects are more complex and not simply additive.
Second, we conduct an empirical case study of student retention prediction, a common machine learning application in education, using 12 years of student records from 23 minority-serving community colleges in the United States. We find concept shifts to be a key contributor to performance degradation when expanding the training window. Moreover, model fairness is compromised when marginalized populations have distinct data distribution shift patterns from their peers.
Overall, our findings caution against conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to data distribution shifts, to improve model performance and fairness.
Code — https://github.com/AEQUITAS-Lab/Distribution-Shift-AIES-2025
## Introduction
Machine learning applications have been widely deployed to facilitate decision making in social sectors such as healthcare and education (Dixon et al. [2024](https://arxiv.org/html/2509.01060v2#bib.bib11 ""); Broby [2022](https://arxiv.org/html/2509.01060v2#bib.bib7 ""); Sghir, Adadi, and Lahmer [2023](https://arxiv.org/html/2509.01060v2#bib.bib27 "")). In developing machine learning models, there is a common assumption that larger training d
... (truncated, 67 KB total)