Beyond the Bias-Variance Paradigm: Understanding Double Descent in Modern Machine Learning
Abstract
This article explores how classical statistical intuitions about the bias-variance trade-off and overfitting may fail to align with the realities of modern machine learning (ML), particularly in settings where generalization to new data is the priority. We examine how the shift from fixed to random design settings underlies these discrepancies, with significant implications for understanding phenomena such as double descent and benign overfitting.
We are seeing a seismic change in science and society, spurred by breakthroughs in machine learning, yet our fundamental understanding of the technology lags far behind. One of the core tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in modern machine-learning practice. The bias-variance trade-off suggests that a model should balance underfitting and overfitting: rich enough to express the underlying structure of the data, yet simple enough to avoid fitting spurious patterns. In modern practice, however, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data; classically, such models would be considered overfitted, and yet they often attain high accuracy on test data.
This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This “double-descent” curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent across a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine-learning models delineates the limits of classical analyses and has implications for both the theory and the practice of machine learning.
We also explore the evolving understanding of the bias-variance trade-off in machine learning, specifically in the context of “double descent.” This perspective reconciles traditional statistical theory with modern machine learning practice, which has shown that models that perfectly fit (or interpolate) the training data can still generalize well to unseen data, contradicting the classical U-shaped bias-variance trade-off curve.
Belkin et al. (2019), among other contributions, propose a unified framework, the double-descent curve, which extends the traditional U-shaped risk curve into a new regime that explains this phenomenon.
Machine learning has reshaped various domains with its predictive capabilities, yet its advances challenge long-standing statistical concepts. The traditional bias-variance trade-off, a cornerstone of statistical learning theory, seems at odds with the empirical success of highly overparameterized models such as neural networks. Such models often achieve zero training error yet still generalize effectively to new data. This paper explores the reconciliation of classical statistical concepts with modern ML practice through a unified framework. We discuss the emergence of the double-descent phenomenon, which extends the classical U-shaped risk curve, and how it accounts for the apparent contradictions posed by overparameterized models. Building on the works of Curth (2024) and Belkin et al. (2019), this paper provides a cohesive overview of these developments and their implications for both theoretical understanding and practical model selection in ML.
The bias-variance trade-off has long been a guiding principle for understanding the performance of predictive models in statistical learning. Traditionally, this trade-off suggests a balance between underfitting, where a model is too simple to capture the underlying structure of the data, and overfitting, where a model is overly complex, capturing noise along with the signal, thereby compromising its ability to generalize (Hastie et al., 2009). However, recent observations in ML, particularly with deep neural networks, challenge this classical notion, as models trained to interpolate, or perfectly fit the training data, often achieve impressive generalization performance on unseen data. This paradox has prompted an investigation into how classical statistical insights align with modern ML phenomena.
Introduction
Machine learning (ML) has transformed numerous fields by providing sophisticated methods for data analysis and predictive modeling. However, its theoretical frameworks, particularly the bias-variance trade-off, also offer a unique perspective on enduring debates within the research community. The bias-variance trade-off can illuminate the complexities of various debates, revealing how different positions relate to trade-offs between simplicity and adaptability. By applying this concept to debates surrounding learning theories, research methodologies, and pedagogical approaches, we gain fresh insights into how these discussions might be reconciled within a more unified framework, potentially guiding future research and practice in education. The rapid advancement of machine learning has provided not only powerful computational tools but also foundational principles that extend beyond data science. The bias-variance trade-off, a core concept in ML, addresses the balance between the systematic error of models too rigid to capture underlying structure (bias) and the instability of models so flexible that their predictions vary with the particular data on which they are trained (variance). While this trade-off has traditionally been applied to optimize predictive models, it can also inform various longstanding debates in education, providing a conceptual framework for understanding opposing viewpoints.
The Bias-Variance Trade-Off: A Conceptual Overview
In ML, the bias-variance trade-off involves a balancing act between underfitting and overfitting. High-bias models, which are often simpler, may miss important nuances in data, while high-variance models, which are more complex, might overfit to specific data patterns, leading to poor generalization. Achieving an optimal trade-off involves finding a model that is both accurate and flexible enough to generalize across new datasets.
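To make this balancing act concrete, here is a minimal sketch (our own illustration, not taken from the cited papers) that fits polynomials of increasing degree to noisy samples of an assumed sine target. A low degree tends to underfit (high bias), a very high degree tends to overfit (high variance), and an intermediate degree typically achieves the lowest test error, tracing out the classical U-shape.

```python
# Minimal sketch of the classical bias-variance trade-off (illustrative only).
# The sine target, noise level, and candidate degrees are assumptions made for
# this example; they do not come from the cited papers.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)        # assumed true signal

x_train = rng.uniform(0, 1, 30)
y_train = true_fn(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(0, 1, 1000)
y_test = true_fn(x_test) + rng.normal(0, 0.3, 1000)

for degree in (1, 4, 15):                        # underfit, balanced, overfit
    coeffs = P.polyfit(x_train, y_train, degree) # least-squares polynomial fit
    test_mse = np.mean((P.polyval(x_test, coeffs) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE = {test_mse:.3f}")
```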
When applied to education, the bias-variance trade-off suggests that different research paradigms and pedagogical strategies can be viewed along a continuum, with approaches that are high in bias offering clarity and structure but risking oversimplification, and those high in variance embracing complexity but risking inconsistency. This lens provides a way to interpret the strengths and limitations of each approach in educational debates.
The Classical Bias-Variance Trade-off:
Traditionally, the bias-variance trade-off suggests that increasing model capacity reduces bias but increases variance. The goal is to find a “sweet spot” where the model is complex enough to capture underlying patterns but simple enough to avoid overfitting to noise. This trade-off creates a U-shaped risk curve, where risk (or error) initially decreases with model complexity, reaches a minimum, and then increases again as the model starts to overfit. In practice, however, highly complex models, such as neural networks, are often trained to zero training error, which classical theory would label as overfitting. Yet these models frequently perform well on test data, prompting a re-evaluation of classical bias-variance intuitions. This paradox, where models that fit the training data perfectly can still generalize well, is addressed by the concept of double descent.

Double Descent:
The double-descent curve extends the U-shaped risk curve by showing how increasing model capacity beyond the interpolation threshold (where training error reaches zero) can lead to improved test performance. The curve initially follows the classical U-shape but then decreases again as model complexity continues to increase, hence the term “double descent.” Belkin et al. (2019) show that models that achieve zero training error do not necessarily exhibit high test error. Instead, as capacity continues to increase, test error can decrease due to inductive biases, such as smoothness or regularity, that align with the underlying data distribution.

Mechanism Behind Double Descent:
The interpolation threshold represents the point at which a model’s complexity is just enough to fit all training points exactly. At this point, risk is often at its peak. As model complexity grows past this threshold, the model can leverage more “regular” or “smooth” solutions within the function space that fit the data perfectly while also maintaining generalizability. This echoes the principle of Occam’s razor, where simpler solutions that fit the data are preferred. The phenomenon is observed across a variety of models, including neural networks, decision trees, and ensemble methods.

Empirical Evidence Across Models:
Random Fourier Features (RFF): Belkin et al. demonstrate double descent using RFF models, which can be viewed as simplified neural networks. The risk curve shows a peak at the interpolation threshold followed by a decrease as more features are added, which increases smoothness and reduces error.
Neural Networks: Double descent is observed in neural networks as well, with larger models showing improved test performance even after achieving zero training risk. This suggests that widely used training methods such as stochastic gradient descent (SGD) may implicitly favor smooth solutions.
Decision Trees and Ensembles: Double descent is also seen in models such as random forests and AdaBoost when trees are allowed to grow large enough to interpolate the data. Averaging over many such trees yields smoother solutions with lower test error.

Historical and Practical Considerations:
Belkin et al. speculate that double-descent behavior was historically overlooked because of a focus on fixed, small feature sets and the use of regularization techniques, which can prevent interpolation. In nonparametric settings, where double descent might be more observable, regularization often obscures the peak at the interpolation threshold. In addition, training in classical settings often stops once test risk stops improving, hiding the interpolation peak.

Implications for Model Selection and Learning Theory:
The double-descent framework helps explain why overparameterized models (those with more parameters than training samples) can still generalize well. It highlights the importance of understanding inductive biases, such as the smoothness favored by neural network training with algorithms like SGD. Double descent suggests that, in some cases, increasing model capacity can enhance performance without overfitting, challenging practitioners to reconsider standard practices around model selection, regularization, and stopping criteria. We propose further exploration of the computational, statistical, and mathematical properties that differentiate classical and modern learning regimes; understanding the double-descent curve better could guide improvements in both learning algorithms and their practical applications.

In conclusion, the double-descent curve unifies classical and modern views on the bias-variance trade-off by showing how increasing model capacity beyond the interpolation threshold can lead to improved test performance. This challenges the conventional wisdom that zero training error implies overfitting, providing a new lens through which to understand and develop machine learning models. The framework has broad implications for the design and optimization of learning algorithms, encouraging a shift towards larger, more complex models that embrace the principles of double descent for better generalization.
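As a concrete illustration of the double-descent curve described above, the following sketch (our own, loosely in the spirit of the random-features experiments rather than a reproduction of them) fits minimum-norm least-squares models on random ReLU features of increasing width. The dataset, target function, and feature counts are illustrative assumptions; test error typically peaks near the interpolation threshold, where the number of features roughly equals the number of training samples, and then falls again as the model grows.

```python
# Hedged sketch of a double-descent experiment in the spirit of the random-feature
# results described above (not the exact setup of Belkin et al., 2019): random
# ReLU features of increasing width, fit by minimum-norm least squares.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5

def make_data(n):
    # assumed data-generating process for illustration
    X = rng.normal(size=(n, d))
    y = np.sin(X @ np.ones(d)) + 0.1 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def relu_features(X, W):
    return np.maximum(X @ W, 0.0)              # fixed random first layer

for n_feat in (5, 20, 40, 80, 400, 2000):      # interpolation threshold near 40
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
    beta = np.linalg.pinv(Phi_tr) @ y_tr       # minimum-norm least-squares fit
    train_mse = np.mean((Phi_tr @ beta - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"features {n_feat:5d}: train MSE {train_mse:.4f}, test MSE {test_mse:.3f}")
```

The pseudoinverse is used here because, among all coefficient vectors that fit the training data exactly, it returns the one with the smallest norm, mirroring the implicit preference for smoother solutions discussed above.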
Apparent contradiction
Curth (2024) highlights that this apparent contradiction may stem from a historical focus on fixed design settings, where training and test data share the same inputs but differ in outcome noise. In contrast, modern ML predominantly evaluates models on generalization, or out-of-sample prediction, where both inputs and outputs at test time are newly sampled. Belkin et al. (2019) propose a framework that extends the bias-variance trade-off into a new regime known as the double-descent risk curve, which reconciles the observed behavior of overparameterized models with classical theory. This paper synthesizes these two perspectives, exploring their implications for ML theory and practice, and addresses a significant gap between classical statistical intuitions and the behaviors observed in modern ML. Concepts such as the bias-variance trade-off and overfitting, which are fundamental in classical statistics, appear to be contradicted by phenomena like double descent and benign overfitting. Curth suggests that one primary reason for this discrepancy is the shift from fixed to random design settings, which significantly impacts how these classical concepts are understood and applied. Curth highlights the difference between two key settings.

Fixed Design Setting: In traditional statistics, models are evaluated on in-sample prediction error. The test data consist of the same inputs as the training data, but with resampled, noisy outcomes. The observed inputs remain constant, so the model’s performance is assessed on its ability to predict new outcomes for these same points. The focus is on reducing variance due to noise in outcomes, as the model sees familiar inputs with new outputs during testing. Classical bias and variance can be accurately assessed because test points are identical to training points in terms of features.

Random Design Setting: In modern ML, interest shifts to generalization error, or out-of-sample prediction error. This requires models to make predictions on entirely new inputs (data points not seen during training), meaning both test inputs and outputs are newly sampled. This shift from fixed to random inputs at test time changes the dynamics of bias and variance. Bias no longer decreases consistently as model complexity increases, because there is no guarantee that new test points resemble any training points. As a result, the relationship between bias, variance, and model complexity can differ dramatically from classical intuitions, even when models are not overparameterized.
Bias-Variance Tradeoff and its Breakdown in Random Designs:
Classical statistics relies on the bias-variance trade-off as follows. In fixed designs: as model complexity increases, bias typically decreases (because the model better captures the training data), but variance increases due to overfitting to noisy outcomes. In random designs: Curth shows that this trade-off does not hold universally; even simple models like k-nearest neighbors (k-NN) reveal that increasing complexity does not always decrease bias. Using k-NN estimators, Curth demonstrates that in fixed design settings, k-NN models show a predictable decrease in bias and an increase in variance with higher complexity (lower values of k), whereas in random design settings, decreasing k can actually increase both bias and variance, depending on how well new test points match the distribution of training points. This breakdown occurs because out-of-sample bias incorporates discrepancies between training inputs and entirely new inputs at test time. In a random design setting, there is no guarantee that the most complex (or simplest) model will perform best, as the nearest neighbors in training may not correspond well to new test points.
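The contrast between the two designs can be simulated directly. The sketch below (our own construction with an assumed one-dimensional target and noise level, not Curth's exact example) averages the error of a k-NN regressor over many resampled outcomes, once evaluated on the original training inputs (fixed design) and once on freshly drawn inputs (random design); the two columns need not move in the same direction as k decreases.

```python
# Hedged sketch of the k-NN comparison above: in-sample (fixed-design) error vs
# out-of-sample (random-design) error as k varies. The one-dimensional inputs,
# target function, and noise level are illustrative assumptions, not Curth's setup.
import numpy as np

rng = np.random.default_rng(0)
n, noise = 50, 0.5
f = lambda x: np.sin(4 * x)                      # assumed target function

def knn_predict(x_query, x_train, y_train, k):
    # plain k-NN regression: average the k nearest training outcomes
    idx = np.argsort(np.abs(x_train[None, :] - x_query[:, None]), axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

x_tr = rng.uniform(0, 1, n)                      # inputs stay fixed across repetitions
errs = {k: [0.0, 0.0] for k in (1, 5, 25)}
reps = 200
for _ in range(reps):                            # average over resampled noise
    y_tr = f(x_tr) + rng.normal(0, noise, n)     # training outcomes
    y_fixed = f(x_tr) + rng.normal(0, noise, n)  # fixed design: same inputs, new outcomes
    x_new = rng.uniform(0, 1, n)                 # random design: fresh inputs
    y_new = f(x_new) + rng.normal(0, noise, n)
    for k in errs:
        errs[k][0] += np.mean((knn_predict(x_tr, x_tr, y_tr, k) - y_fixed) ** 2)
        errs[k][1] += np.mean((knn_predict(x_new, x_tr, y_tr, k) - y_new) ** 2)

for k, (fixed_sum, random_sum) in errs.items():
    print(f"k={k:2d}: fixed-design MSE {fixed_sum / reps:.3f}, "
          f"random-design MSE {random_sum / reps:.3f}")
```

With k = 1 the model interpolates the training outcomes, so its fixed-design error reflects only the resampled noise, whereas its random-design error also depends on how close new inputs land to the training points.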
The Emergence of Double Descent
The double-descent phenomenon, as articulated by Belkin et al. (2019), provides a framework for understanding how overparameterized models can still generalize effectively despite achieving zero training error. According to the double-descent theory, risk initially follows the classical U-shape as model complexity increases. However, once a model reaches the interpolation threshold — where it perfectly fits the training data — the risk does not continue to increase. Instead, a second descent occurs as model complexity further increases, which can lead to improved generalization performance.
This double-descent behavior is not limited to neural networks. Belkin et al. (2019) provide empirical evidence for double descent across a variety of models, including random Fourier features, decision trees, and ensemble methods like AdaBoost and random forests. In these cases, increasing model capacity beyond the interpolation threshold allows for the selection of interpolating solutions that align with inductive biases, such as smoothness, which is advantageous for generalization. This smoothness aligns with Occam’s razor, favoring simpler explanations that fit the data well.
Mechanisms Underlying Double Descent
The key to understanding double descent lies in the relationship between model capacity and inductive biases. Belkin et al. (2019) suggest that larger function classes increase the likelihood of finding an interpolating solution with desirable properties, such as lower complexity or smoother decision boundaries. In neural networks, for instance, this translates to an inductive bias towards smooth functions, which are more likely to generalize well. This behavior is also observed in other model classes where overparameterization does not necessarily equate to high test error but can lead to effective generalization.
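One way to make this "many interpolants, prefer a well-behaved one" argument tangible is with an overparameterized linear model, as in the sketch below (our own illustration; the dimensions, signal, and noise level are assumptions). Any two coefficient vectors that fit the training data exactly differ only by a component in the null space of the design matrix; the pseudoinverse, like gradient descent started from zero, returns the minimum-norm interpolant, and in this example that choice also generalizes noticeably better than an arbitrary interpolant.

```python
# Hedged sketch of the inductive-bias argument above (our own illustration; the
# problem sizes and noise level are assumptions): in an overparameterized linear
# model, many coefficient vectors interpolate the training data, but the
# pseudoinverse returns the minimum-norm one, and here that choice also
# generalizes better than another interpolant picked arbitrarily.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 200                                    # far more parameters than samples
beta_true = rng.normal(size=p) / np.sqrt(p)
X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y_tr = X_tr @ beta_true + 0.1 * rng.normal(size=n)
y_te = X_te @ beta_true + 0.1 * rng.normal(size=1000)

beta_min = np.linalg.pinv(X_tr) @ y_tr            # minimum-norm interpolant
_, _, Vt = np.linalg.svd(X_tr)                    # rows of Vt beyond the rank span the null space
null_dir = Vt[-1]                                 # direction with X_tr @ null_dir numerically zero
beta_other = beta_min + 5.0 * null_dir            # another exact interpolant, larger norm

for name, b in (("min-norm", beta_min), ("other interpolant", beta_other)):
    train_mse = np.mean((X_tr @ b - y_tr) ** 2)
    test_mse = np.mean((X_te @ b - y_te) ** 2)
    print(f"{name:17s}: norm {np.linalg.norm(b):6.2f}, "
          f"train MSE {train_mse:.2e}, test MSE {test_mse:.3f}")
```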
Curth (2024) further supports this by showing that fixed design settings may obscure such phenomena, as interpolation in these settings often leads to high variance and poor generalization. In random design settings, however, both bias and variance can exhibit non-monotonic behavior, allowing models to generalize well even after achieving zero training error. This highlights the role of inductive biases and the importance of the design setting in understanding double descent.
Double Descent and Overparameterization:
Modern ML often encounters the double-descent phenomenon. In the under-parameterized regime, test error follows the familiar U-shaped curve, reflecting the traditional bias-variance trade-off. Once the number of parameters grows beyond the number of training examples and the model enters the overparameterized regime (where it can perfectly interpolate the training data), generalization error decreases again, leading to a second descent. Curth explains that double descent is observed in random design settings but is absent in fixed design settings: in a fixed design, any model in the interpolation regime (where the model perfectly fits the training data) has zero bias and constant variance across training points, hence no second descent in error. Double descent therefore does not conflict with classical intuitions, because it arises in random designs, where the classical bias-variance trade-off assumptions do not hold.
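A short worked check (our own, under the standard noise model in which each outcome is the true function value plus independent noise of variance sigma squared) makes explicit why the fixed-design risk curve cannot descend a second time:

```latex
% Worked check (ours) of the fixed-design claim above, under the standard model
% y_i = f(x_i) + \varepsilon_i with independent noise of variance \sigma^2.
% An interpolating model predicts the observed outcome at each training input,
% \hat{y}_i = y_i, so against a resampled outcome y_i' = f(x_i) + \varepsilon_i'
% its expected squared error is
\[
  \mathbb{E}\big[(\hat{y}_i - y_i')^2\big]
    = \mathbb{E}\big[(\varepsilon_i - \varepsilon_i')^2\big]
    = 2\sigma^2 ,
\]
% a constant independent of model capacity: every interpolating model attains the
% same in-sample risk, so the fixed-design curve stays flat past the interpolation
% threshold and no second descent can appear.
```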
Benign Overfitting and “Benign Interpolation”:
Curth examines the term “benign overfitting” and argues that a more accurate term might be “benign interpolation,” since “overfitting” typically implies poor generalization. Benign interpolation describes models that fit the training data perfectly yet generalize well to new data. This can occur when models behave differently at test time, balancing complexity in a way that reduces both bias and variance for new inputs. In random designs, benign interpolation can occur because the model may produce “spiked” predictions at training points while adopting smoother, less complex behavior for new inputs, allowing it to generalize well despite fitting the training data exactly. This is seen in models like neural networks and random forests, which can interpolate training data but still perform well on unseen data because they modulate complexity based on the input.
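A toy construction (our own, not an example from Curth (2024); the target function, noise level, and kernel bandwidths are assumptions) illustrates this spiked-yet-benign behavior: take a smooth kernel fit and add extremely narrow spikes that absorb the residual at each training input. The combined predictor passes exactly through every training point, yet at almost all new inputs it is indistinguishable from the smooth fit.

```python
# Toy sketch of "benign interpolation" as described above (our own construction,
# not Curth's example): a smooth kernel fit plus very narrow spikes that absorb
# the residual at each training input. The predictor interpolates the training
# data exactly, yet at most new inputs it behaves like the smooth fit.
import numpy as np

rng = np.random.default_rng(0)
n = 40
x_tr = np.linspace(0, 1, n)                        # training inputs (illustrative)
y_tr = np.sin(4 * x_tr) + rng.normal(0, 0.3, n)    # assumed noisy target

def smooth_fit(x_query, h=0.15):
    # Nadaraya-Watson kernel smoother: the low-complexity, non-interpolating base fit
    w = np.exp(-((x_query[:, None] - x_tr[None, :]) / h) ** 2)
    return (w @ y_tr) / w.sum(axis=1)

def spiked_interpolant(x_query, h_spike=1e-4):
    # add spikes so narrow that they only matter in tiny neighborhoods of the
    # training inputs, forcing the fit through every (x_i, y_i) exactly
    resid = y_tr - smooth_fit(x_tr)
    spikes = np.exp(-((x_query[:, None] - x_tr[None, :]) / h_spike) ** 2) @ resid
    return smooth_fit(x_query) + spikes

x_new = rng.uniform(0, 1, 500)
print("max |error| at training points:",
      float(np.max(np.abs(spiked_interpolant(x_tr) - y_tr))))
print("mean squared gap to smooth fit at new inputs:",
      float(np.mean((spiked_interpolant(x_new) - smooth_fit(x_new)) ** 2)))
```

Interpolation here costs essentially nothing out of sample because the spikes occupy a vanishingly small fraction of the input space, which is the intuition behind calling such interpolation benign.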
Bias-Variance Trade-off in Fixed and Random Designs
Classical statistical theory has predominantly relied on the bias-variance trade-off to explain model behavior in fixed design settings, where in-sample prediction error is the primary focus. Here, the test inputs remain identical to those in training, and only the outcomes are resampled. Curth (2024) explains that in such settings, the bias typically decreases with increasing model complexity, while variance increases, creating the well-known U-shaped curve. This setup suggests a sweet spot in model complexity, where both underfitting and overfitting are minimized, resulting in optimal model performance.
In contrast, random design settings prioritize generalization error, where both the inputs and outputs at test time differ from those in the training set. This shift has profound implications for the bias-variance trade-off. Curth (2024) demonstrates that when moving from fixed to random designs, bias and variance behave unpredictably. She illustrates this using k-nearest neighbors (k-NN) models, showing that reducing model complexity does not always reduce bias in random designs. Instead, models can exhibit non-monotonic bias behavior, which challenges classical intuitions and sets the stage for phenomena like double descent.
Implications for Statistical Learning and Education:
The shift from fixed to random designs suggests a need to revisit foundational concepts in statistical learning. Curth suggests:
- Updating educational materials to distinguish between fixed and random design settings, making clear when bias-variance trade-offs and overfitting are of concern.
- Emphasizing that classical intuitions may not apply in settings where generalization to new data points is key, as in many ML applications.
- Developing a refined understanding of the settings in which different forms of overfitting or interpolation occur, so that students and practitioners can make informed decisions based on their specific application needs.
Curth concludes by affirming that classical statistical concepts like the bias-variance trade-off and overfitting are still relevant, but their interpretation depends on the design setting: in fixed designs, these concepts hold as expected, with a clear trade-off between bias and variance; in random designs, however, these trade-offs do not necessarily apply, which explains the emergence of double descent and benign interpolation in modern ML. Curth highlights that as ML continues to evolve, so too should our understanding and teaching of foundational statistical concepts. Her note is a call to adapt classical statistical frameworks to better align with the empirical realities of modern ML, particularly in random design settings that emphasize generalization over interpolation of fixed points.
Implications for Model Selection and Statistical Education
The implications of these findings extend beyond theoretical concerns to practical model selection in ML. The double-descent curve suggests that increasing model capacity should not always be avoided; under certain conditions, highly complex models can yield better performance than simpler counterparts traditionally favored under the classical bias-variance trade-off. For practitioners, this means embracing overparameterized models that align with the double-descent regime when generalization is of paramount importance.
Furthermore, these insights have educational implications. As Curth (2024) argues, the shift from fixed to random designs should be incorporated into statistical learning curricula, where the context-specific nature of the bias-variance trade-off can be emphasized. Future research and teaching should account for these settings, providing a nuanced view that aligns with the realities of modern ML.
Learning Theories: Balancing Structure and Context
In educational psychology, debates between cognitivist and situativist theories exemplify the tension between structure and context. Cognitivist approaches often emphasize structured mental processes and favor methodologies that are less context-dependent, which can be seen as a high-bias, low-variance approach. Situativist theories, in contrast, prioritize the social and contextual aspects of learning, aligning with a low-bias, high-variance stance that is adaptable to diverse settings but less likely to produce universally applicable principles.
The bias-variance framework offers a way to conceptualize these differing approaches. Cognitivist theories may provide clarity and generalizability, but they risk excluding essential contextual factors. Situativist theories offer rich insights into specific learning environments but may lack the broad applicability that some educational researchers seek. By considering these theories through the bias-variance lens, educators and researchers can better appreciate the trade-offs inherent in each approach, potentially guiding more balanced theoretical development.
Methodological Debates: Quantitative vs. Qualitative Approaches
Educational research methodologies often fall into quantitative or qualitative categories, each with distinct advantages and limitations. Quantitative methods, which favor structured data collection and statistical analysis, often align with high-bias, low-variance approaches. These methods are well-suited to identifying general patterns but may miss the nuanced, context-specific insights that qualitative methods can reveal.
Qualitative methods, on the other hand, align with a high-variance approach, capturing complex, context-rich details but often lacking the generalizability of quantitative methods. Through the bias-variance trade-off lens, this debate reflects a tension between achieving broad applicability and capturing in-depth, context-specific insights. Recognizing this trade-off allows researchers to select methodologies that best suit their specific research questions and to consider mixed-methods approaches as a potential path to balancing these competing demands.
Pedagogical Approaches: Direct Instruction and Discovery Learning
Debates around instructional strategies, such as direct instruction versus discovery learning, are also illuminated by the bias-variance trade-off. Direct instruction, with its structured and guided approach, represents a high-bias approach that minimizes variance by providing clear, consistent instruction. This method is effective for transmitting well-defined knowledge but may limit opportunities for exploratory and individualized learning.
In contrast, discovery learning embraces a low-bias, high-variance approach, encouraging students to explore and construct knowledge independently. While this method can foster deep, personalized learning, it also introduces variability in outcomes, as students may diverge in their learning paths and depth of understanding. The bias-variance framework helps us understand why direct instruction is effective for foundational skills and why discovery learning is beneficial for fostering creativity and critical thinking. It suggests that a hybrid approach, combining elements of both, may offer a balanced solution.
Conclusion
The double-descent framework offers a compelling reconciliation of classical statistical concepts with the realities of modern ML, where overparameterized models frequently challenge traditional wisdom. By extending the classical bias-variance trade-off into the overparameterized regime, double descent explains why models trained to interpolate data can still achieve high test accuracy. As ML continues to evolve, so too must our understanding of these foundational concepts, with implications for both theoretical development and practical applications in model selection.
Applying the bias-variance trade-off to academic debates highlights the complex interplay between structure and flexibility, generalizability and context-specificity. By framing these debates within this conceptual framework, researchers and academicians can move beyond polarized positions and towards more integrative approaches that recognize the strengths and limitations of both high-bias and high-variance strategies.
The insights offered by the bias-variance trade-off do not resolve these debates but instead provide a framework for understanding and navigating them. As machine learning continues to influence research, the potential for cross-disciplinary insights grows, encouraging a more nuanced and adaptive approach to education that leverages the best of both structured and flexible paradigms.
References
- Anderson, J. R., Reder, L. M., & Simon, H. A. (1996). Situated learning and education. Educational Researcher, 25(4), 5–11.
- Doroudi, S. (2020). The bias-variance tradeoff: How data science can inform educational debates. AERA Open, 6(1), 1–12.
- Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias-variance dilemma. Neural Computation, 4(1), 1–58.
- Greeno, J. G. (1997). On claims that answer the wrong questions. Educational Researcher, 26(1), 5–17.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media.
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854.
- Curth, A. (2024). Classical Statistical (In-Sample) Intuitions Don’t Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting, and Moving from Fixed to Random Designs. arXiv preprint arXiv:2409.18842.
Declaration:
This article is a scholarly piece drawing inspiration from various research papers in the field. I acknowledge their contributions and am thankful to them. I may have missed some references. Readers are advised to consult the papers in the field for a better understanding of the bias-variance trade-off in machine learning.