Learning at test time in LLMs

Jonas Hübotter from ETH presents SIFT (Select Informative data for Fine-Tuning), a breakthrough algorithm that dramatically improves language model performance through test-time adaptation. Using intelligent data selection, SIFT achieves state-of-the-art results with a 3.8B parameter model - 30x smaller than previous approaches. The system combines a parametric controller with non-parametric memory to optimize training example selection, showing impressive results across mathematics, coding, and legal domains. This novel approach points toward more efficient and adaptable AI systems that can continuously improve through interaction.
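
For readers who want a concrete feel for the idea, here is a minimal sketch (not the authors' implementation) of SIFT-style selection: candidates are chosen greedily to reduce uncertainty about the test prompt under a simple surrogate model, so near-duplicates stop adding value, and the model would then be briefly fine-tuned on the selection before answering. The names (`select_informative`, `prompt_emb`, `cand_embs`, `noise`) and the linear-kernel surrogate are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal, hypothetical sketch of SIFT-style test-time data selection.
# NOT the authors' implementation: the embeddings, the linear-kernel
# surrogate, and the fine-tuning step are placeholders for illustration.
import numpy as np

def select_informative(prompt_emb, cand_embs, k=5, noise=0.1):
    """Greedily pick k candidates that most reduce uncertainty about the
    prompt under a simple linear-kernel surrogate, so redundant picks give
    diminishing returns (unlike plain nearest-neighbor retrieval)."""
    selected = []
    for _ in range(k):
        best_i, best_var = None, np.inf
        for i in range(len(cand_embs)):
            if i in selected:
                continue
            X = cand_embs[selected + [i]]         # embeddings of tentative selection
            K = X @ X.T + noise * np.eye(len(X))  # kernel matrix with observation noise
            k_x = X @ prompt_emb                  # covariances with the prompt
            # Posterior variance at the prompt after conditioning on X.
            var = prompt_emb @ prompt_emb - k_x @ np.linalg.solve(K, k_x)
            if var < best_var:
                best_i, best_var = i, var
        selected.append(best_i)
    return selected

# Toy usage with random embeddings standing in for a real retrieval index.
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=16)
cand_embs = rng.normal(size=(100, 16))
print(select_informative(prompt_emb, cand_embs, k=5))
# In the talk's setting, one would now take a few gradient steps on the
# selected examples (e.g. a LoRA update) before generating the answer.
```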

SLIDES:

SPONSOR MESSAGE:
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.

Jonas Hübotter
Doctoral Researcher at ETH Zurich working on Active Fine-Tuning and Local Learning.

Test-Time Training on Nearest Neighbors for Large Language Models

TOC:
1. SIFT Algorithm Core Concepts
[00:00:00] 1.1 Introduction to Test-Time Adaptation and SIFT Algorithm
[00:02:45] 1.2 The Pile Benchmark and Parameter Efficiency
[00:07:00] 1.3 Local Learning Models and Vapnik's Principle
[00:12:33] 1.4 SIFT Performance and Domain-Specific Comparisons

2. Training and Data Selection Methods
[00:22:50] 2.1 Data Selection and Error Measurement Methods
[00:32:33] 2.2 Non-IID Training Experiments on MNIST

3. Scaling, Implementation, and Audience Q&A
[00:35:50] 3.1 Scaling Experiments to Larger Datasets and Models
[00:42:30] 3.2 Model Scaling and Performance Across Architectures
[00:44:25] 3.3 Exploration-Exploitation Trade-offs in Fine-tuning
[00:47:54] 3.4 Two-Stage Local Learning Architecture and SIFT Implementation

SHOWNOTES (transcript, references, best quotes, etc.):

REFS:
[0:00:25] 'Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs' - Paper introducing the SIFT algorithm for optimizing LLM performance through test-time fine-tuning (Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause)

[0:02:45] The Pile: An 800GB Dataset of Diverse Text for Language Modeling - A comprehensive dataset comprising 22 diverse high-quality subsets for training large-scale language models (Leo Gao et al.)

[0:03:20] Language Models are Unsupervised Multitask Learners - Technical report introducing GPT-2 (Alec Radford et al.)

[0:11:05] Vladimir Vapnik's principle from Statistical Learning Theory: 'When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need, but not a more general one.'

[0:22:05] 'The Linear Representation Hypothesis and the Geometry of Large Language Models' - Paper discussed at ICML (Kiho Park et al.)

[0:23:20] On Choosing and Bounding Probability Metrics - Paper discussing Total Variation (TV) distance and its applications in probability theory (Alison L. Gibbs and Francis Edward Su)

[0:33:25] MNIST dataset - Standard database of handwritten digits containing 60,000 training images and 10,000 test images of size 28x28 pixels (Yann LeCun, Corinna Cortes)

[0:35:50] CIFAR-100 dataset - A dataset of 32x32 color images in 100 classes, with 600 images per class (Alex Krizhevsky)

[0:36:00] ImageNet - Large-scale hierarchical image database with over 14 million images organized according to the WordNet hierarchy (Jia Deng et al.)

[0:42:55] Llama 2: Collection of foundation and fine-tuned chat models ranging from 7B to 70B parameters (Hugo Touvron et al.)

[0:43:35] Scaling Instruction-Finetuned Language Models - Paper introducing Flan-T5, showing performance improvements through instruction finetuning (Hyung Won Chung et al.)

[0:45:10] Active Few-Shot Fine-Tuning - Methodology paper discussing exploration-exploitation trade-offs in the context of fine-tuning neural networks (Jonas Hübotter et al.)
COMMENTS:

A very keen and interested young man who clearly expressed his passion for providing a solution. I really hope he continues to share his thoughts and beliefs with others and that he fulfills the potential he shows so fluently.

alexandermoody

😂 I was literally geeking out about this paper two weeks ago. Entropy-based everything is clearly the answer. Data selection, pretraining sampling, fine-tuning, inference… everything should be driven by entropy, because in real life there are no gold labels, only best guesses based on evidence and accepted axioms.

zandrrlife

Local and context-based learning is one area where more and more people should work. It may solve the huge energy-need problem of running LLMs and can have various applications, from query retrieval in DBMS to edge learning devices.

PhotoninDark

Oh man, my eyes lit up when I saw that oil and paint magnetic ferrofluid animation! I literally saved that exact video clip from YouTube a few months ago because it was just 4K, awesome!!!

smicha

Superb presentation, but it would be better to show the slides while the presenter is talking, maybe as a separate window.

PhotoninDark

Yesss, I was waiting for this. Specifically looked for it on the channel last night and realized the paper only came out a week ago.

steve_jabz

Warping the state space to experience, such as by K-means clustering, is how knowledge is applied efficaciously ("system 2" thinking). Interesting talk, thank you for sharing.

SapienSpace

This TTT is very exciting. Only a matter of time (or coincidence) until the OS community achieves as good a capability as internal OAI. Were you able to see the robotics they've been working on, the ANYmal bot? I like the technical video updates they put out, but I always wish there was an accompanying interview.

Charles-Darwin

This is connected to the power consumption of AGI. Scaling laws of progress will be based on electricity flowing through systems. Limited power is the main factor now.

superfliping

Wow, the time stamps with info are really helpful.

wryltxw

Where can I learn more about test-time training? I am not familiar with this concept.

kevalan

Good, we are getting to the Person of Interest level soon :)

SinanWP

I've tried some of the techniques he's talking about, even on very small data, and have seen some interesting results!

aiamfree

A nice complement to my recent active learning course :)

lestode

That quote from Vapnik sounds like engineering at best and hacking at worst.

wryltxw

Can this be used with the Entropix sampler?

AlexKen-zvmm

Is the answer to solving extrapolation and generalisation to interpolate better? Seems like this will get us to GenAI saturation faster, not take us beyond?

luke.perkin.online

Working on ARC, Francois gave us some hints about this approach on his university tour. Now he's left Google to work on ....? Something we learned in the ARC-AGI Challenge, we hope!!!

KevinKreger

Does that mean it can be unrelated and non-redundant data for local training? It's hard to see how that can help with inference. If not, then are you not imposing an inductive prior by your selection of data? And your bet effectively becomes the inductive prior.

wryltxw

When he says that the learned information becomes part of its beliefs, technically how does that work?
I’m really stuck on how providing new data “gets into” the model.

BeTheFeatureNotTheBug