Regression Testing | LangSmith Evaluations - Part 15

Evaluations can accelerate LLM app development, but it can be challenging to get started. We've kicked off a new video series focused on evaluations in LangSmith.

With the rapid pace of AI, developers face a paradox of choice: which prompt to use, and how to trade off LLM quality against cost. Evaluations accelerate development by providing a structured process for making these decisions, but we've heard that it can be challenging to get started. So we are launching a series of short videos explaining how to perform evaluations using LangSmith.

This video focuses on Regression Testing, which lets you highlight the examples in an evaluation set that improve or regress across a set of experiments.
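For context, a regression-testing workflow generally means running two or more experiments over the same dataset and then comparing them side by side in LangSmith. Below is a minimal sketch of that setup with the LangSmith Python SDK; the dataset name, target functions, and evaluator are placeholder assumptions, not code from the video.

```python
# Hedged sketch: run a baseline and a candidate experiment over the same
# LangSmith dataset so they can be compared for per-example regressions.
# Assumes LANGSMITH_API_KEY is set and a dataset named "my-eval-dataset"
# already exists with inputs {"question": ...} and outputs {"answer": ...}.
from langsmith.evaluation import evaluate

def baseline_app(inputs: dict) -> dict:
    # Placeholder for the current production chain or agent.
    return {"answer": f"baseline answer to: {inputs['question']}"}

def candidate_app(inputs: dict) -> dict:
    # Placeholder for the new prompt or model being tested.
    return {"answer": f"candidate answer to: {inputs['question']}"}

def exact_match(run, example) -> dict:
    # Toy evaluator: score 1 if the output matches the reference answer.
    predicted = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    return {"key": "exact_match", "score": int(predicted == expected)}

# Each call creates one experiment tied to the dataset; opening both in the
# LangSmith comparison view highlights which examples improved or regressed.
evaluate(baseline_app, data="my-eval-dataset",
         evaluators=[exact_match], experiment_prefix="baseline")
evaluate(candidate_app, data="my-eval-dataset",
         evaluators=[exact_match], experiment_prefix="candidate")
```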

Comments

This is extremely useful, especially for agent systems whose rules have been written to be over-fit to a particular LLM. I find crewai often has that problem: it works well with the LLM it was written for but produces nonsense with a different LLM.

MattJonesYT

An extension of this idea would be running regressions on the prompt system as a whole in an agent system to see how well it adapts to other LLMs. Make a matrix of how its prompts perform on the original LLM versus new, out-of-sample LLMs. If it immediately breaks on new LLMs, it is probably over-fit, and you can have AI rewrite those prompts to be simpler, producing a system that is more robust across different LLMs.

MattJonesYT
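A rough sketch of the matrix idea from the comment above: run the same unchanged prompt against several LLMs over one dataset, producing one experiment per model, then compare the experiments side by side. The model names, dataset name, and evaluator below are assumptions for illustration, not anything shown in the video.

```python
# Hedged sketch: one experiment per model over the same dataset, so the
# comparison view approximates a prompt-vs-model robustness matrix.
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate

def contains_answer(run, example) -> dict:
    # Toy evaluator: checks the reference answer appears in the model output.
    expected = example.outputs.get("answer", "")
    predicted = run.outputs.get("answer", "")
    return {"key": "contains_answer",
            "score": int(expected.lower() in predicted.lower())}

def make_target(llm):
    def target(inputs: dict) -> dict:
        # Same (unchanged) prompt for every model, to test how well it transfers.
        response = llm.invoke(f"Answer concisely: {inputs['question']}")
        return {"answer": response.content}
    return target

for model_name in ["gpt-4o-mini", "gpt-3.5-turbo"]:
    llm = ChatOpenAI(model=model_name, temperature=0)
    evaluate(make_target(llm), data="my-eval-dataset",
             evaluators=[contains_answer],
             experiment_prefix=f"prompt-v1-{model_name}")
```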

Where could we find the Jupyter Notebook files?

nachoeigu