Learning at test time in LLMs
Jonas Hübotter from ETH Zurich presents SIFT (Select Informative data for Fine-Tuning), an algorithm that dramatically improves language model performance through test-time adaptation. Using intelligent data selection, SIFT achieves state-of-the-art results with a 3.8B-parameter model, roughly 30x smaller than previous approaches. The system combines a parametric controller with a non-parametric memory to optimize training example selection, showing strong results across mathematics, coding, and legal domains. This approach points toward more efficient and adaptable AI systems that can continuously improve through interaction.
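For readers who want something concrete before watching, the sketch below is a minimal, hypothetical Python illustration of the test-time loop described above: embed the incoming prompt, greedily pick a handful of candidate training examples that are relevant to it but not redundant with one another, then fine-tune briefly on that selection before answering. The relevance-minus-redundancy score is an MMR-style stand-in, not SIFT's actual uncertainty-reduction criterion, and the names, dimensions, and random embeddings are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def cosine(query_vec, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

def select_for_finetuning(query_emb, corpus_embs, k=8, redundancy_weight=0.5):
    """Greedily pick k examples that are relevant to the prompt but not
    redundant with each other (an MMR-style stand-in for SIFT's criterion)."""
    relevance = cosine(query_emb, corpus_embs)   # similarity to the test prompt
    selected = []
    for _ in range(k):
        best_idx, best_score = None, -np.inf
        for i in range(len(corpus_embs)):
            if i in selected:
                continue
            # Penalize candidates that look like examples we already picked.
            redundancy = cosine(corpus_embs[i], corpus_embs[selected]).max() if selected else 0.0
            score = relevance[i] - redundancy_weight * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

# Toy usage: random vectors stand in for real sentence embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))   # candidate fine-tuning examples
query = rng.normal(size=64)            # embedding of the test prompt
idx = select_for_finetuning(query, corpus, k=8)
# A test-time adaptation loop would now run a few gradient steps (e.g. with
# LoRA adapters) on the texts behind corpus[idx] before answering the query.
```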
SLIDES:
SPONSOR MESSAGE:
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.
Jonas Hübotter
Doctoral Researcher at ETH Zurich working on Active Fine-Tuning and Local Learning.
Test-Time Training on Nearest Neighbors for Large Language Models
TOC:
1. SIFT Algorithm Core Concepts
[00:00:00] 1.1 Introduction to Test-Time Adaptation and SIFT Algorithm
[00:02:45] 1.2 The Pile Benchmark and Parameter Efficiency
[00:07:00] 1.3 Local Learning Models and Vapnik's Principle
[00:12:33] 1.4 SIFT Performance and Domain-Specific Comparisons
2. Training and Data Selection Methods
[00:22:50] 2.1 Data Selection and Error Measurement Methods
[00:32:33] 2.2 Non-IID Training Experiments on MNIST
3. Scaling, Implementation, and Audience Q&A
[00:35:50] 3.1 Scaling Experiments to Larger Datasets and Models
[00:42:30] 3.2 Model Scaling and Performance Across Architectures
[00:44:25] 3.3 Exploration-Exploitation Trade-offs in Fine-tuning
[00:47:54] 3.4 Two-Stage Local Learning Architecture and SIFT Implementation
SHOWNOTES (transcript, references, best quotes, etc.):
REFS:
[0:00:25] Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs - Paper introducing the SIFT algorithm for optimizing LLM performance through test-time fine-tuning (Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause)
[0:02:45] The Pile: An 800GB Dataset of Diverse Text for Language Modeling - A comprehensive dataset comprising 22 diverse high-quality subsets for training large-scale language models (Leo Gao et al.)
[0:03:20] Language Models are Unsupervised Multitask Learners - The GPT-2 paper (Alec Radford et al.)
[0:11:05] Vladimir Vapnik's principle from Statistical Learning Theory: 'When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need, but not a more general one.'
[0:22:05] The Linear Representation Hypothesis and the Geometry of Large Language Models - Paper discussed at ICML (Kiho Park et al.)
[0:23:20] On choosing and bounding probability metrics - Paper discussing Total Variation (TV) distance and its applications in probability theory (Alison L. Gibbs and Francis Edward Su)
[0:33:25] MNIST dataset - Standard database of handwritten digits containing 60,000 training images and 10,000 test images of size 28x28 pixels (Yann LeCun, Corinna Cortes)
[0:35:50] CIFAR-100 dataset - A dataset of 32x32 color images in 100 classes, with 600 images per class (Alex Krizhevsky)
[0:36:00] ImageNet - Large-scale hierarchical image database with over 14 million images organized according to the WordNet hierarchy (Jia Deng et al.)
[0:42:55] Llama 2: Collection of foundation and fine-tuned chat models ranging from 7B to 70B parameters (Hugo Touvron et al.)
[0:43:35] Scaling Instruction-Finetuned Language Models - Paper introducing Flan-T5, showing performance improvements through instruction finetuning (Hyung Won Chung et al.)
[0:45:10] Active Few-Shot Fine-Tuning - Methodology paper discussing exploration-exploitation trade-offs in the context of fine-tuning neural networks (Jonas Hübotter et al.)