Developing and Serving RAG-Based LLM Applications in Production

There are a lot of different moving pieces when it comes to developing and serving LLM applications. This talk will provide a comprehensive guide for developing retrieval augmented generation (RAG) based LLM applications — with a focus on scale (embed, index, serve, etc.), evaluation (component-wise and overall) and production workflows. We’ll also explore more advanced topics such as hybrid routing to close the gap between OSS and closed LLMs.

Takeaways:

• Evaluating RAG-based LLM applications is crucial for identifying and productionizing the best configuration.

• Developing your LLM application with scalable workloads involves minimal changes to existing code (see the sketch below).

• Mixture of Experts (MoE) routing allows you to close the gap between OSS and closed LLMs.
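A minimal sketch of the "minimal changes" point above: the embedding logic stays plain Python and Ray Data fans it out across workers. This assumes ray and sentence-transformers are installed; the model name, sample rows, and scaling arguments are illustrative, not taken from the talk.

```python
# Sketch: scaling chunk embedding with Ray Data. The per-batch logic is the
# same code you would run on a laptop; Ray distributes it across workers.
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # One model instance per Ray worker.
        self.model = SentenceTransformer("thenlper/gte-base")

    def __call__(self, batch):
        # `batch` arrives as a dict of column name -> numpy array.
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

chunks = ray.data.from_items([
    {"text": "Ray is a framework for scaling Python workloads."},
    {"text": "RAG retrieves relevant context before generation."},
])
# Exact scaling arguments vary a bit across Ray versions.
embedded = chunks.map_batches(Embedder, concurrency=2, batch_size=64)
print(embedded.take(1))
```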

About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

If you're interested in a managed Ray service, check out:

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
Comments

I would love to see an hour-long presentation on this!

jzziesing

Really enjoyed this talk - found a lot of value in it. Both speakers are clearly so knowledgeable, and I love the extra little details the chap in the blue hoodie gave throughout. Would love to connect & share!

TymonVideos

Great presentation. Just one question: What is relevance_score in this case? Is it an aggregation of grounding metrics for all reference examples?

ndamulelosbg

🎯 Key Takeaways for quick navigation:

00:05 🚀 Initial Motivation and Project Start
- Started building LLM applications to gain firsthand experience and improve user experience.
- Developed a RAG application, focusing on making it easier for users to work with products.
- Emphasized the importance of underlying documents and user questions in building such applications.
01:31 🌐 Community Engagement and Insights
- Encouraged sharing insights and experiences on building RAG-based applications.
- Acknowledged the community's early stage and the value of diverse perspectives.
- Welcomed external input to enrich the collective understanding of RAG applications.
03:07 🧩 Experimentation with Data Chunking
- Explored different strategies for efficient data chunking, moving beyond random chunking.
- Utilized HTML document sections for precise references and better understanding of content.
- Aimed for a generalizable template, potentially open-sourcing a solution for various HTML documents.
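As a rough illustration of the section-based chunking idea, here is a sketch that splits an HTML page on its section tags and keeps the anchor id as a precise reference. It assumes beautifulsoup4; the selectors depend on how the target docs are structured.

```python
# Sketch: section-aware chunking of an HTML page instead of fixed-size splits.
from bs4 import BeautifulSoup

def chunk_by_section(html: str, url: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    # Many docs sites wrap each section in a tag with an anchor id,
    # which doubles as an exact reference back to the source.
    for section in soup.find_all("section"):
        anchor = section.get("id", "")
        text = section.get_text(" ", strip=True)
        if text:
            chunks.append({"text": text, "source": f"{url}#{anchor}"})
    return chunks
```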
05:14 🗃️ Vector Database and Technology Choices
- Chose Postgres as the vector database, emphasizing familiarity and compatibility.
- Highlighted the growing number of specialized vector databases for LLM applications.
- Advised selecting a database based on team familiarity but exploring new options for specific features.
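For concreteness, the Postgres route usually means the pgvector extension. A minimal setup might look like the sketch below; the driver (psycopg), connection string, and table/column names are assumptions, not the speakers' schema.

```python
# Sketch: a minimal pgvector table for storing chunk embeddings.
import psycopg

with psycopg.connect("postgresql://localhost/rag") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS document_chunks (
            id        bigserial PRIMARY KEY,
            text      text NOT NULL,
            source    text NOT NULL,
            embedding vector(768)  -- gte-base embeddings are 768-dimensional
        );
    """)
```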
06:10 🔄 Retrieval Workflow and Database Query
- Described the retrieval process, including embedding queries and calculating distances.
- Discussed pros and cons of building a vector DB on Postgres versus using dedicated solutions.
- Addressed potential limitations based on document scale and the flexibility of different databases.
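A sketch of that retrieval step, continuing the pgvector setup above: embed the query with the same model used for the chunks, then order by cosine distance. The `<=>` operator is pgvector's cosine-distance operator; everything else (names, connection string) is illustrative.

```python
# Sketch: embed the user query and pull the k nearest chunks from Postgres.
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-base")

def retrieve(query: str, k: int = 5) -> list[tuple[str, str]]:
    query_embedding = model.encode(query)
    with psycopg.connect("postgresql://localhost/rag") as conn:
        register_vector(conn)  # lets psycopg pass numpy vectors as parameters
        rows = conn.execute(
            "SELECT text, source FROM document_chunks "
            "ORDER BY embedding <=> %s LIMIT %s",  # <=> = cosine distance
            (query_embedding, k),
        ).fetchall()
    return rows
```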
08:20 📏 Considerations for Context Size and Token Limits
- Acknowledged token limits in LLM context windows and model-specific variations.
- Encouraged experimenting with different chunk sizes, possibly using multiple embeddings for longer chunks.
- Highlighted the importance of adapting to the LLM's limitations and exploring diverse experimental setups.
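One way to handle the "multiple embeddings for longer chunks" idea is to split long sections by token count with some overlap, then embed each piece. This sketch assumes tiktoken; the token budget and overlap are arbitrary and depend on the embedding model and LLM in use.

```python
# Sketch: token-aware splitting so long sections become several embeddable
# pieces instead of being truncated.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def split_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    tokens = enc.encode(text)
    step = max_tokens - overlap
    return [enc.decode(tokens[start:start + max_tokens])
            for start in range(0, len(tokens), step)]
```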
09:29 🔍 Evaluation Metrics and Component-wise Assessment
- Introduced the two major components for evaluation: retrieval workflow and LLM response quality.
- Explained the evaluation process, including isolating each component for focused assessment.
- Shared insights into the challenges and considerations of scoring LLM responses.
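A simple way to score the retrieval component in isolation, roughly in the spirit described here, is to check how often an eval question's reference source shows up in the top-k retrieved chunks. The field names and the `retrieve` function (from the sketch above) are assumptions.

```python
# Sketch: component-wise retrieval score = fraction of eval questions whose
# reference source appears among the top-k retrieved chunks.
def retrieval_score(eval_set: list[dict], retrieve, k: int = 5) -> float:
    hits = 0
    for example in eval_set:
        retrieved_sources = {source for _, source in retrieve(example["question"], k)}
        if example["best_source"] in retrieved_sources:
            hits += 1
    return hits / len(eval_set)
```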
11:32 📊 Evaluator Selection and Quality Assessment
- Used GPT-4 as an evaluator based on empirical comparison and understanding of the application.
- Discussed the limitations of available LLMs and potential biases in self-evaluation.
- Advocated for iterative improvement and potential collaboration with external LLM development communities.
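A hedged sketch of the LLM-as-judge idea: ask GPT-4 to score a candidate answer against a reference. The prompt and the 1-5 scale are illustrative, not the speakers' exact rubric; it assumes the openai Python client.

```python
# Sketch: GPT-4 as an evaluator, returning a 1-5 quality score.
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Score the candidate answer from 1 (bad) to 5 (great) "
                        "against the reference answer. Reply with the number only."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Reference: {reference}\n"
                        f"Candidate: {candidate}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```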
15:13 📈 Iterative Evaluation and System Trust Building
- Illustrated the iterative evaluation process, starting with trusting an evaluator.
- Demonstrated the evaluation flow, using different configurations and trusting the chosen LLM's outputs.
- Emphasized the importance of building trust in each component before assessing the overall system.
17:04 ❄️ Cold Start Strategy and Bootstrapping
- Presented a cold start strategy using chunked data to generate initial questions.
- Addressed noise reduction by refining generated questions and encouraging creativity.
- Described the bootstrapping cycle from clean slate to using generated data for further annotations.
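A sketch of that cold-start bootstrapping loop: have an LLM write a question each chunk answers, giving (question, best_source) pairs to evaluate against before any real user data exists. Prompt wording and field names are illustrative; it assumes the openai client.

```python
# Sketch: generate synthetic eval questions from chunks (cold start).
from openai import OpenAI

client = OpenAI()

def generate_question(chunk_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Write one concrete question that the following "
                        "documentation passage answers. Reply with the question only."},
            {"role": "user", "content": chunk_text},
        ],
    )
    return response.choices[0].message.content.strip()

def build_eval_set(chunks: list[dict]) -> list[dict]:
    # Noisy generations can be filtered or lightly edited afterwards.
    return [{"question": generate_question(c["text"]), "best_source": c["source"]}
            for c in chunks]
```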
18:38 🔄 Continuous Learning and Evaluation Scaling
- Responded to questions about the number of examples for cold start and overall evaluation.
- Advocated for a balance of quantity and diversity in examples for comprehensive evaluations.
- Stressed the importance of continuous learning, adaptation, and leveraging automated pipelines for scaling evaluations.
19:49 📈 Chunk Size Impact on Retrieval and Quality
- Retrieval score increases with chunk size but starts tapering off.
- Quality continues to improve even as chunk sizes increase.
- Code snippets benefit from longer context or special chunking logic.
21:30 🧩 Number of Chunks and Context Size
- Increasing the number of chunks improves retrieval and quality scores.
- Larger context windows for LLMs show a positive trend.
- Experimentation with techniques like RoPE for extending context.
22:30 🛠️ Fixing Hyperparameters During Tuning
- Fixing hyperparameters sequentially: context size, chunk size, embedding models.
- Experimenting across a spread of values for each parameter and fixing it once optimized.
- Illustrates a pragmatic approach to hyperparameter tuning.
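That sequential approach boils down to a loop like the one below: sweep one knob at a time, keep the best value, then move on. `evaluate` is a placeholder for rerunning the retrieval/quality scores under a given configuration, and the candidate values are illustrative.

```python
# Sketch: sweep hyperparameters one at a time, fixing each at its best value.
def evaluate(config: dict) -> float:
    # Placeholder: rebuild the index under `config`, run the eval set through
    # retrieval + generation, and return an aggregate score.
    return 0.0

best_config = {"chunk_size": 512, "num_chunks": 5,
               "embedding_model": "thenlper/gte-base"}

for knob, candidates in [
    ("chunk_size", [128, 256, 512, 1024]),
    ("num_chunks", [1, 3, 5, 7]),
    ("embedding_model", ["thenlper/gte-base", "thenlper/gte-large"]),
]:
    scores = {value: evaluate({**best_config, knob: value}) for value in candidates}
    best_config[knob] = max(scores, key=scores.get)

print(best_config)
```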
23:12 🏆 Model Selection and Benchmarking
- The smaller gte-base embedding model outperformed larger embedding models on their use case.
- Emphasizes the importance of evaluating models based on specific use cases.
- Benchmarking against OpenAI's text embeddings and choosing a smaller, performant model.
23:56 💰 Cost Analysis and Hybrid LLM Routing
- Cost analysis comparing different LLMs.
- Introduction of a hybrid LLM routing approach for cost-effectiveness.
- Consideration of performance, cost, and hybrid routing for optimal results.
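For the cost side, a back-of-the-envelope per-query estimate is enough to compare models and to see why routing cheap queries to an OSS model pays off. The rates below are placeholders, not figures from the talk.

```python
# Sketch: rough cost per query for an API-served LLM.
def cost_per_query(prompt_tokens: int, completion_tokens: int,
                   prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    return (prompt_tokens / 1000) * prompt_price_per_1k \
        + (completion_tokens / 1000) * completion_price_per_1k

# e.g. a RAG prompt stuffed with retrieved chunks vs. a short answer
# (placeholder rates; substitute the provider's current pricing):
print(cost_per_query(prompt_tokens=4000, completion_tokens=300,
                     prompt_price_per_1k=0.01, completion_price_per_1k=0.03))
```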
25:10 🤖 Classifier vs. Language Model for Routing
- Classifier used for routing decisions due to speed considerations.
- Mention of training a classifier using a labeled dataset for routing.
- Potential transition to LLM-based routing as LLM inference speed improves.
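A minimal sketch of such a router: a small classifier over query embeddings decides whether the OSS model is likely good enough or the query should go to the closed model. It assumes scikit-learn and sentence-transformers; the training data here is a toy stand-in for labels derived from evaluation runs.

```python
# Sketch: classifier-based query routing between an OSS LLM and a closed LLM.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("thenlper/gte-base")

# Label = 1 if the OSS model's answer scored well enough on this query, else 0.
train_queries = ["How do I scale embeddings with Ray Data?",
                 "Explain the difference between Ray tasks and actors."]
train_labels = [1, 0]

router = LogisticRegression().fit(encoder.encode(train_queries), train_labels)

def route(query: str) -> str:
    use_oss = router.predict(encoder.encode([query]))[0] == 1
    return "oss-llm" if use_oss else "gpt-4"
```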
27:17 🔄 Future Developments and System Integration
- Integration of components into larger systems, citing Anyscale's doctor application.
- Anticipation of more developments and applications in the future.
- Acknowledgment of the importance of iteration in building robust systems.

Made with HARPA AI

junaidiqbal

How do you protect a company's information with this technology?

JavierTorres-stgt