Learning at test time in LLMs

Jonas Hübotter from ETH presents SIFT (Select Informative data for Fine-Tuning), a breakthrough algorithm that dramatically improves language model performance through test-time adaptation. Using intelligent data selection, SIFT achieves state-of-the-art results with a 3.8B parameter model - 30x smaller than previous approaches. The system combines a parametric controller with non-parametric memory to optimize training example selection, showing impressive results across mathematics, coding, and legal domains. This novel approach points toward more efficient and adaptable AI systems that can continuously improve through interaction.
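
For readers who want a concrete feel for the idea, here is a minimal sketch (not the authors' implementation) of SIFT-style selection: candidates are chosen greedily to reduce uncertainty about the test prompt under a simple surrogate model, so near-duplicates stop adding value, and the model would then be briefly fine-tuned on the selection before answering. The names (`select_informative`, `prompt_emb`, `cand_embs`, `noise`) and the linear-kernel surrogate are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal, hypothetical sketch of SIFT-style test-time data selection.
# NOT the authors' implementation: the embeddings, the linear-kernel
# surrogate, and the fine-tuning step are placeholders for illustration.
import numpy as np

def select_informative(prompt_emb, cand_embs, k=5, noise=0.1):
    """Greedily pick k candidates that most reduce uncertainty about the
    prompt under a simple linear-kernel surrogate, so redundant picks give
    diminishing returns (unlike plain nearest-neighbor retrieval)."""
    selected = []
    for _ in range(k):
        best_i, best_var = None, np.inf
        for i in range(len(cand_embs)):
            if i in selected:
                continue
            X = cand_embs[selected + [i]]         # embeddings of tentative selection
            K = X @ X.T + noise * np.eye(len(X))  # kernel matrix with observation noise
            k_x = X @ prompt_emb                  # covariances with the prompt
            # Posterior variance at the prompt after conditioning on X.
            var = prompt_emb @ prompt_emb - k_x @ np.linalg.solve(K, k_x)
            if var < best_var:
                best_i, best_var = i, var
        selected.append(best_i)
    return selected

# Toy usage with random embeddings standing in for a real retrieval index.
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=16)
cand_embs = rng.normal(size=(100, 16))
print(select_informative(prompt_emb, cand_embs, k=5))
# In the talk's setting, one would now take a few gradient steps on the
# selected examples (e.g. a LoRA update) before generating the answer.
```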

SLIDES:

SPONSOR MESSAGE:
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.

Jonas Hübotter
Doctoral Researcher at ETH Zurich working on Active Fine-Tuning and Local Learning.

Test-Time Training on Nearest Neighbors for Large Language Models

TOC:
1. SIFT Algorithm Core Concepts
[00:00:00] 1.1 Introduction to Test-Time Adaptation and SIFT Algorithm
[00:02:45] 1.2 The Pile Benchmark and Parameter Efficiency
[00:07:00] 1.3 Local Learning Models and Vapnik's Principle
[00:12:33] 1.4 SIFT Performance and Domain-Specific Comparisons

2. Training and Data Selection Methods
[00:22:50] 2.1 Data Selection and Error Measurement Methods
[00:32:33] 2.2 Non-IID Training Experiments on MNIST

3. Scaling, Implementation, and Audience Q&A
[00:35:50] 3.1 Scaling Experiments to Larger Datasets and Models
[00:42:30] 3.2 Model Scaling and Performance Across Architectures
[00:44:25] 3.3 Exploration-Exploitation Trade-offs in Fine-tuning
[00:47:54] 3.4 Two-Stage Local Learning Architecture and SIFT Implementation

SHOWNOTES (transcript, references, best quotes, etc.):

REFS:
[0:00:25] 'Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs' - Paper introducing the SIFT algorithm for optimizing LLM performance through test-time fine-tuning (Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause)

[0:02:45] The Pile: An 800GB Dataset of Diverse Text for Language Modeling - A comprehensive dataset comprising 22 diverse high-quality subsets for training large-scale language models (Leo Gao et al.)

[0:03:20] Language Models are Unsupervised Multitask Learners - Technical report introducing GPT-2 (Alec Radford et al.)

[0:11:05] Vladimir Vapnik's principle from Statistical Learning Theory: 'When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need, but not a more general one.'

[0:22:05] 'The Linear Representation Hypothesis and the Geometry of Large Language Models' - Paper discussed at ICML (Kiho Park et al.)

[0:23:20] On Choosing and Bounding Probability Metrics - Paper discussing Total Variation (TV) distance and its applications in probability theory (Alison L. Gibbs and Francis Edward Su)

[0:33:25] MNIST dataset - Standard database of handwritten digits containing 60,000 training images and 10,000 test images of size 28x28 pixels (Yann LeCun, Corinna Cortes)

[0:35:50] CIFAR-100 dataset - A dataset of 32x32 color images in 100 classes, with 600 images per class (Alex Krizhevsky)

[0:36:00] ImageNet - Large-scale hierarchical image database with over 14 million images organized according to the WordNet hierarchy (Jia Deng et al.)

[0:42:55] Llama 2: Collection of foundation and fine-tuned chat models ranging from 7B to 70B parameters (Hugo Touvron et al.)

[0:43:35] Scaling Instruction-Finetuned Language Models - Paper introducing Flan-T5, showing performance improvements through instruction finetuning (Hyung Won Chung et al.)

[0:45:10] Active Few-Shot Fine-Tuning - Methodology paper discussing exploration-exploitation trade-offs in the context of fine-tuning neural networks (Jonas Hübotter et al.)
COMMENTS:

A very keen and interested young man who clearly expressed his passion for providing a solution. I really hope he continues to share his thoughts and beliefs with others and that he fulfills the potential he shows so fluently.

alexandermoody

😂 I was literally geeking out about this paper two weeks ago. Entropy-based everything is clearly the answer. Data selection, pretraining sampling, fine-tuning, inference… everything should be driven by entropy, because in real life there are no gold labels, only best guesses based on evidence and accepted axioms.

zandrrlife

Local and context-based learning is one area where more and more people should work. It may solve the huge energy-need problem of running LLMs and can have various applications, from query retrieval in DBMS to edge learning devices.

PhotoninDark

Oh man, my eyes lit up when I saw that oil and paint magnetic ferrofluid animation! I literally saved that exact video clip from YouTube a few months ago because it was just 4K, awesome!!!

smicha

Superb presentation, but it would be better to show the slides while the presenter is talking, maybe as a separate window.

PhotoninDark

Yesss, I was waiting for this. Specifically looked for it on the channel last night and realized the paper only came out a week ago.

steve_jabz

Warping the state space to experience, such as by K-means clustering, is how knowledge is applied efficaciously ("system 2" thinking). Interesting talk, thank you for sharing.

SapienSpace

This TTT is very exciting. Only a matter of time (or coincidence) until the OS community achieves as good a capability as internal OAI. Were you able to see the robotics they've been working on, the ANYmal bot? I like the technical video updates they put out, but I always wish there was an accompanying interview.

Charles-Darwin

This is connected to the power consumption of AGI. Scaling laws of progress will be based on electricity flowing through systems. Limited power is the main factor now.

superfliping

Wow, the time stamps with info are really helpful.

wryltxw

Where can I learn more about test-time training? I am not familiar with this concept.

kevalan

Good, we are getting to the Person of Interest level soon :)

SinanWP

I've tried some of the techniques he's talking about, even on very small data, and have seen some interesting results!

aiamfree

A nice complement to my recent active learning course :)

lestode

That quote from Vapnik sounds like engineering at best and hacking at worst.

wryltxw

Can this be used with the Entropix sampler?

AlexKen-zvmm

Is the answer to solving extrapolation and generalisation to interpolate better? Seems like this will get us to GenAI saturation faster, not take us beyond?

luke.perkin.online

Working on ARC, Francois gave us some hints about this approach on his university tour. Now he's left Google to work on ....? Something we learned in the ARC-AGI Challenge, we hope!!!

KevinKreger

Does that mean it can be unrelated and non-redundant data for local training? It's hard to see how that can help with inference. If not, then are you not imposing an inductive prior by your selection of data? And your bet effectively becomes the inductive prior.

wryltxw

When he says that the learned information becomes part of its beliefs, technically how does that work?
I’m really stuck on how providing new data “gets into” the model.

BeTheFeatureNotTheBug