Reconciling modern machine learning and the bias-variance trade-off

It turns out that the classic view of generalization and overfitting is incomplete! If you add parameters beyond the number of points in your dataset, generalization performance might increase again due to the increased smoothness of overparameterized functions.
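A minimal sketch of that smoothness effect, assuming a toy one-dimensional regression task and a random Fourier feature map (the task, the random_fourier_features helper, and all constants are illustrative, not taken from the paper's experiments): once the feature count p exceeds the number of training points n, the pseudoinverse picks the minimum-norm interpolating fit, and the norm of that fit tends to shrink as p grows further past the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression task with label noise (illustrative only, not the paper's data).
n = 40
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

def random_fourier_features(x, p, scale=5.0, seed=1):
    """Map scalar inputs to p random Fourier features sqrt(2/p) * cos(w * x + b)."""
    r = np.random.default_rng(seed)
    w = r.normal(scale=scale, size=p)
    b = r.uniform(0.0, 2 * np.pi, size=p)
    return np.sqrt(2.0 / p) * np.cos(np.outer(x, w) + b)

# Every p below exceeds n, so each fit (essentially) interpolates the training
# data; the pseudoinverse returns the minimum-norm interpolating coefficients,
# whose norm typically spikes near p = n and then shrinks as p keeps growing.
for p in [45, 100, 400, 2000]:
    Phi = random_fourier_features(x, p)
    beta = np.linalg.pinv(Phi) @ y
    train_mse = np.mean((Phi @ beta - y) ** 2)
    print(f"p={p:5d}  train MSE={train_mse:.2e}  ||beta||={np.linalg.norm(beta):.3g}")
```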

Abstract:
The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology.
The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample.
However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data.
How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice?
In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation.
Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models.
The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.
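To sketch the double descent curve itself under the same toy assumptions (a synthetic sin target, random Fourier features, and a pseudoinverse fit, none of which are the paper's MNIST setup), sweeping the feature count p across the interpolation threshold p = n typically shows the familiar U-shape below n, a spike in test error near p = n, and a second descent beyond it. The exact numbers depend on the noise level and random seed, but the qualitative shape matches what the abstract describes.

```python
import numpy as np

def make_data(m, noise=0.3, seed=None):
    r = np.random.default_rng(seed)
    x = r.uniform(-1.0, 1.0, size=m)
    y = np.sin(2 * np.pi * x) + noise * r.normal(size=m)
    return x, y

def features(x, p, scale=5.0, seed=1):
    # Fixed random Fourier feature map (same seed) so train and test share it.
    r = np.random.default_rng(seed)
    w = r.normal(scale=scale, size=p)
    b = r.uniform(0.0, 2 * np.pi, size=p)
    return np.sqrt(2.0 / p) * np.cos(np.outer(x, w) + b)

n = 40
x_train, y_train = make_data(n, seed=2)
x_test, y_test = make_data(2000, seed=3)

# Sweep the number of features across the interpolation threshold p = n.
for p in [5, 10, 20, 35, 40, 45, 60, 100, 400, 2000]:
    Phi_tr = features(x_train, p)
    Phi_te = features(x_test, p)
    # Ordinary least squares below the threshold, minimum-norm interpolant above it.
    beta = np.linalg.pinv(Phi_tr) @ y_train
    tr = np.mean((Phi_tr @ beta - y_train) ** 2)
    te = np.mean((Phi_te @ beta - y_test) ** 2)
    print(f"p={p:5d}  train MSE={tr:.3g}  test MSE={te:.3g}")
```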

Authors: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

Comments

I struggled to understand this paper due to a lack of background knowledge (conceptually speaking), but after seeing your explanation, everything is clear. Thank you very much!

PeterJMPuyneers

Mind blown. Super cool! I have so many tests to rerun with a higher parameter count now.

AntonCode

You did a great job. This just left me speechless!!!

MLDawn

Fantastic video -- thank you! Fascinating...

danielbigham

Mind blown. Very interesting paper! Does this mean that if you are in the regime where the test loss has started to decrease again (as a function of the parameter count) and you add more training examples, your test accuracy will get worse, because it becomes harder for the optimizer to find a simple function that perfectly matches the training data? In theory, this could make it beneficial to reduce the number of training examples, but intuitively, that feels wrong.

kristoferkrus
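
One way to probe the question above empirically, again only a toy sketch with random Fourier features and a minimum-norm fit rather than anything from the paper: fix an overparameterized feature count p and grow the training set. Test error can indeed get worse as n approaches p, because adding examples moves the problem back toward the interpolation threshold, before improving again once n is well past p; this is the sample-wise counterpart of the double descent curve.

```python
import numpy as np

def make_data(m, noise=0.3, seed=None):
    r = np.random.default_rng(seed)
    x = r.uniform(-1.0, 1.0, size=m)
    y = np.sin(2 * np.pi * x) + noise * r.normal(size=m)
    return x, y

def features(x, p=200, scale=5.0, seed=1):
    # Fixed random Fourier feature map shared by train and test.
    r = np.random.default_rng(seed)
    w = r.normal(scale=scale, size=p)
    b = r.uniform(0.0, 2 * np.pi, size=p)
    return np.sqrt(2.0 / p) * np.cos(np.outer(x, w) + b)

p = 200  # fixed model size, heavily overparameterized for small n
x_test, y_test = make_data(2000, seed=3)
Phi_te = features(x_test, p)

# Grow the training set toward (and past) the fixed capacity p.
for n in [20, 50, 100, 150, 190, 200, 210, 400, 1000]:
    x_tr, y_tr = make_data(n, seed=4)
    beta = np.linalg.pinv(features(x_tr, p)) @ y_tr  # minimum-norm / least-squares fit
    te = np.mean((Phi_te @ beta - y_test) ** 2)
    print(f"n={n:5d}  test MSE={te:.3g}")
```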

This is such an amazing study. So many synergies with the Deep Double Descent paper.

sayakpaul

I started reading this paper over the last few days and I can confirm that it is really interesting! However, I have some doubts about the way they evaluate the MSE (how do they deal with the fact that the function h(x) is complex?) and the zero-one loss / norm of the coefficients (since it is a multi-class classification problem, they probably use one-hot encoding, but again, how do they deal with the complex h(x)? Moreover, if they use one-hot encoding, the regressor is a 2D matrix, so which norm are they plotting? The L2 norm for matrices?). Did you try to reproduce their plots with the MNIST database? Are these technical passages clear to you? Thank you again for the video!

Fede

A high-complexity solution be like "Braaaah! Brrraah!" 😂👍

DasGrosseFressen

This is an interesting paper; I wonder if it applies to boosting/bagging with models that don't have many tunable parameters, like multinomial naive Bayes. Would parameter optimization on ensemble models have the same effect when the base models inside are linear? An interesting option for some testing here.

DrAhdol

Can you elaborate on the Hilbert space thing? What does a Hilbert space have to do with neural networks?

herp_derpingson

Does the complexity of H mean the number of features here?

ujjwalkar