Jan Čurn - How to feed LLMs with data from the web | WebExpo 2024


All major generative AI models have been trained using data scraped from the web. Applications of large language models (LLMs) often extract web data to provide up-to-date context using Retrieval Augmented Generation (RAG). Unfortunately, reliably collecting online data at scale is challenging due to issues like blocking, dynamic content rendering, and the sheer volume of data. In this talk, Jan will explain how you can establish an efficient web data extraction pipeline, clean the HTML to circumvent the “garbage in, garbage out” problem, and demonstrate how to use this in an LLM application.
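The HTML-cleaning step mentioned above can be illustrated with a minimal sketch. It is not the pipeline shown in the talk: it uses only Python's standard-library `html.parser`, and the names `SKIP_TAGS`, `TextExtractor`, and `clean_html` are hypothetical, chosen here for illustration. The idea is to drop markup and page chrome (scripts, styles, navigation, footers) so that only readable text reaches the LLM.

```python
from html.parser import HTMLParser

# Assumption: these tags rarely carry content useful to an LLM.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<body><nav>Menu</nav><h1>Title</h1><p>Some text.</p></body>"
    print(clean_html(page))
```

A real pipeline would typically also have to render JavaScript-driven pages and cope with blocking, which this sketch does not attempt.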

This talk was presented at the WebExpo Conference in Prague on May 30, 2024 🎤

Big thanks to the WebExpo team for allowing us to publish this recording 🤝🏻


#webscraping #webexpo


Comments

Why do you need to prefix a query to an LLM with "please"? I did not find any difference when trying prompts with or without formalities like that.

techwy
