Fine-tuning an LLM judge to reduce hallucination
In this webinar, we explore how out-of-domain data can improve the fine-tuning of Mistral AI language models for detecting factual inconsistencies, also known as hallucinations.
Inspired by Eugene Yan’s article on bootstrapping hallucination detection, we use the Factual Inconsistency Benchmark (FIB) dataset and initially fine-tune a Mistral model solely on this dataset, achieving only limited success.
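To make the setup concrete, here is a minimal sketch of how labeled (document, summary) pairs could be turned into the chat-format JSONL that Mistral's fine-tuning API consumes. The prompt wording and field names are illustrative placeholders, not the webinar's exact code or the actual FIB schema.

```python
import json

# Hypothetical prompt template for the binary consistency judge.
PROMPT = (
    "Given the document and the summary below, answer 1 if the summary is "
    "factually consistent with the document and 0 otherwise.\n\n"
    "Document: {document}\n\nSummary: {summary}"
)

def to_chat_example(document: str, summary: str, consistent: bool) -> dict:
    """Wrap one labeled pair in the messages format used for chat fine-tuning."""
    return {
        "messages": [
            {"role": "user", "content": PROMPT.format(document=document, summary=summary)},
            {"role": "assistant", "content": "1" if consistent else "0"},
        ]
    }

# Tiny illustrative stand-in for the real FIB split.
rows = [
    {"document": "The cat sat on the mat.", "summary": "A cat sat on a mat.", "consistent": True},
    {"document": "The cat sat on the mat.", "summary": "The dog sat on the mat.", "consistent": False},
]

with open("fib_train.jsonl", "w") as f:
    for r in rows:
        f.write(json.dumps(to_chat_example(r["document"], r["summary"], r["consistent"])) + "\n")
```

Framing the judge as a single-token binary classification keeps labels cheap to collect and makes evaluation a simple exact-match comparison.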
We then employ pre-fine-tuning on Wikipedia summaries from the Unified Summarization Benchmark (USB) before applying task-specific fine-tuning on FIB. This two-step approach significantly improves performance.
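The two-step recipe could look roughly like the following, assuming the v1 mistralai Python client and that the API accepts a fine-tuned model id as the base model for a second job; method names, hyperparameters, and job statuses may differ across SDK versions.

```python
import os
import time
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def fine_tune(base_model: str, train_path: str):
    # Upload the training file, then launch a fine-tuning job on top of base_model.
    with open(train_path, "rb") as fh:
        f = client.files.upload(file={"file_name": train_path, "content": fh})
    return client.fine_tuning.jobs.create(
        model=base_model,
        training_files=[{"file_id": f.id, "weight": 1}],
        hyperparameters={"training_steps": 100, "learning_rate": 1e-4},
    )

def wait_for(job_id: str):
    # Poll until the job reaches a terminal state.
    while True:
        job = client.fine_tuning.jobs.get(job_id=job_id)
        if job.status in ("SUCCESS", "FAILED", "CANCELLED"):
            return job
        time.sleep(30)

# Step 1: pre-fine-tune on out-of-domain USB summaries.
usb_job = wait_for(fine_tune("open-mistral-7b", "usb_train.jsonl").id)

# Step 2: task-specific fine-tune on FIB, starting from the USB model.
fib_job = wait_for(fine_tune(usb_job.fine_tuned_model, "fib_train.jsonl").id)
print(fib_job.fine_tuned_model)
```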
Our methodology incorporates Weights & Biases Weave to automate model evaluation, demonstrating that pre-fine-tuning on related but out-of-domain data can effectively bootstrap the detection of factual inconsistencies, reducing the need for extensive task-specific data collection. This technique offers a promising strategy for improving the accuracy and applicability of natural language inference models in production environments.
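For the evaluation side, a minimal Weave pipeline might look like this; the dataset rows, the exact-match scorer, and the stubbed judge call are assumptions for illustration, not the webinar's actual evaluation code.

```python
import asyncio
import weave

weave.init("hallucination-judge")  # project name is a placeholder

def judge(document: str, summary: str) -> str:
    # Placeholder for a call to the fine-tuned Mistral judge; returns "0" or "1".
    return "1"

@weave.op()
def predict(document: str, summary: str) -> str:
    # Weave maps dataset row keys onto these parameters.
    return judge(document, summary)

@weave.op()
def exact_match(label: str, output: str) -> dict:
    # Recent Weave versions pass the model's return value as `output`.
    return {"correct": output.strip() == label}

dataset = [
    {"document": "Paris is the capital of France.", "summary": "Paris is France's capital.", "label": "1"},
    {"document": "Paris is the capital of France.", "summary": "Lyon is France's capital.", "label": "0"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(predict))
```

Running the same evaluation after each fine-tuning stage is what lets the out-of-domain pre-fine-tuning be compared directly against the FIB-only baseline.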
Chapters:
0:00 Webinar agenda and overview of Mistral AI
1:08 Fine-Tuning Services: Introduction to Mistral's fine-tuning API and services
2:46 Conversational AI Interface: Introduction to Le Chat, Mistral's conversational AI tool
3:34 Latest Model Releases: Newest Mistral models and their features
4:09 Fine-Tuning Process: Steps and benefits of fine-tuning models
5:31 Hackathon Winning Projects: Examples of innovative uses of fine-tuning
10:59 Hands-On Demo Introduction: Introduction to the practical demo segment
12:05 Setting Up the Demo: Instructions for setting up and running the demo notebook
16:43 Creating Initial Prompt: Steps to create and test an initial prompt
20:25 Evaluation Pipeline: Setting up and running an evaluation pipeline for model performance
24:44 Improving Model Performance: Strategies and techniques to enhance model accuracy
42:02 Fine-Tuning and Results: Creating and evaluating a fine-tuned model
51:01 Two-Step Fine-Tuning: Explanation and demonstration of the two-step fine-tuning process
57:00 Conclusion and final thoughts