How to scrape the web for LLMs in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai


(If you have trouble viewing the code, just fork and clone the repository. It's a current problem with the way GitHub renders Jupyter notebooks via nbconvert.)

Tools mentioned:

Comments

Dammit, stop telling everybody about Jina, my secret weapon. Just stop, it's my advantage. Everybody ignore it, it's horrible, I swear.

jarad

Thank you for introducing all the latest technology for web scraping!

kylelau

The YouTube algorithm is just insanely good at what it does. This is exactly the content I needed, and I think I have found what I want to dedicate my professional life to.
Thank you for the video. I will buy your course as soon as I have saved up the money.

alonsoalarconaguilar

The reader API tip is so clutch. Thank You!

matten_zero
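
For readers who have not seen the tip: Jina AI's Reader is used by putting `https://r.jina.ai/` in front of the page URL, and the response is an LLM-friendly text rendering of that page. The sketch below is a minimal illustration using only the Python standard library. The helper names `reader_url` and `fetch_readable`, the timeout, and the User-Agent string are assumptions made for this example and do not come from the video.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Prefix for Jina AI's Reader; the target page URL is appended after it.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Return the Reader URL for an absolute http(s) page URL."""
    parts = urlsplit(page_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {page_url!r}")
    return READER_PREFIX + page_url


def fetch_readable(page_url: str, timeout: float = 30.0) -> str:
    """Fetch a page through the Reader and return the response body as text.

    Performs a network request; the User-Agent value is a placeholder.
    """
    request = Request(reader_url(page_url), headers={"User-Agent": "reader-demo/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_readable("https://example.com/pricing")` would return the page as text, subject to whatever rate limits and terms the service applies.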

Just started wondering about web scraping and here you are.
Thank you.

antoniuskonovalov

How did I not get your content sooner? Love it!

ariG

Came at the perfect time. Very good video. Thx 😊

kuhltime

Thanks for adding a new project to my to do list!

forgotmyoldSN

🎯 Key Takeaways for quick navigation:

00:00 *🚀 Introduction to web scraping for LLMs in 2024*
- Overview of startups pivoting to web scraping.
- Mention of Mendable and its Firecrawl tool for scraping the web for use with large language models.
02:06 *🔍 Scraping competitors' pricing pages*
- The process of scraping competitors' pricing for market research.
- Introduction to tools used for scraping: Jina AI, Mendable, and Scrapegraph-ai.
03:01 *🧠 Understanding tiktoken and its application*
- Explanation of tokenization and encoding in web scraping.
- Discussion on the cost implications based on tokenization.
05:17 *🛠️ Setting up scrapers with Beautiful Soup and other tools*
- Description of different scraping tools and their setup.
- Comparisons among Beautiful Soup, Jina AI, and Mendable based on ease of use and output.
07:32 *📊 Running scrapers and analyzing outputs*
- Execution of web scraping and evaluation of the output from different tools.
- Analysis of readability and format of the scraped data.
09:37 *💰 Cost comparison and effectiveness of scraping tools*
- Comparison of costs associated with various scraping tools.
- Evaluation of which tool provides the most value for money.
12:53 *🤖 Extracting pricing information using OpenAI*
- Utilization of OpenAI for extracting specific data points.
- Challenges and strategies in obtaining clean and useful information.
17:20 *🌐 Overview of Scrapegraph for advanced web scraping*
- Introduction to Scrapegraph as an open-source project.
- Examples of complex data extraction and its accuracy.

Made with HARPA AI

roberthuff
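
The cost points in the summary above (03:01 and 09:37) rest on one idea: LLM input is billed per token, so a scraper that returns clean text instead of raw HTML sends fewer tokens. The sketch below illustrates this with the standard library only. It is not the notebook's code: the four-characters-per-token figure is a rough rule of thumb standing in for an exact tokenizer such as tiktoken, `html.parser` stands in for Beautiful Soup, and the price passed to `estimate_cost` is a hypothetical parameter.

```python
from html.parser import HTMLParser

# Rough rule of thumb for English text; an assumption, not an exact count.
CHARS_PER_TOKEN = 4


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def estimate_tokens(text: str) -> int:
    """Approximate the token count as the character count divided by 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a caller-supplied (hypothetical) price."""
    return tokens * usd_per_million_tokens / 1_000_000
```

Comparing `estimate_tokens(html)` with `estimate_tokens(visible_text(html))` for a saved pricing page gives a quick feel for how much of the bill is markup rather than content.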

@LLMs for Devs. I'm from Jina AI. Cool that you are using our Reader app. I like seeing the exact use cases people have for it. Very interesting.

florianhonicke

keep up the good work! - this is an awesome presentation!

markt

Such valuable video content! Many thanks for sharing!

shuaiwang

The transcript at 1:39 states that you are using large sandwich models. This must be a brand new type of model - mouth watering indeed. 😂

uwepleban

Great video! The open source tool looks great!

As an aside, I use instructor and pydantic classes to get the LLMs to provide the JSON as I expect it. In my limited experience, dspy wasn't as explicit as I wanted.

nickk
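
The idea in the comment above is to declare the expected shape of the data as a typed class and reject any model reply that does not match it. The commenter does this with instructor and pydantic. The sketch below is a dependency-free stand-in that uses `dataclasses` and `json` from the standard library instead, so it shows the validation step only and not either library's API. The `PricingPlan` class and its field names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingPlan:
    """Hypothetical schema for one plan extracted from a pricing page."""
    name: str
    monthly_price_usd: float


def parse_plans(llm_output: str) -> list[PricingPlan]:
    """Parse a model's JSON reply and raise ValueError if it is off-schema."""
    data = json.loads(llm_output)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of plans")
    plans = []
    for item in data:
        # Require exactly the declared fields, no more and no fewer.
        if not isinstance(item, dict) or set(item) != {"name", "monthly_price_usd"}:
            raise ValueError(f"unexpected plan shape: {item!r}")
        name, price = item["name"], item["monthly_price_usd"]
        if not isinstance(name, str):
            raise ValueError(f"name must be a string, got {name!r}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise ValueError(f"monthly_price_usd must be a number, got {price!r}")
        plans.append(PricingPlan(name=name, monthly_price_usd=float(price)))
    return plans
```

A failed parse is a natural point to retry the request or log the raw reply, rather than letting malformed data flow into later steps.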

Dropping bomb content! Meanwhile, I made a comment analyzer for highly detailed videos that have 100+ comments, because I didn't have time to go through them all. Man, sometimes you don't need to build an Iron Man suit to do simple stuff.

NikhilSwamiExperimental

16:05 Worth trying out GPT-4. I find it more accurate at following instructions.

terrytan

Can anyone speak to the architecture or other tools for avoiding detection when using Beautiful Soup, as he mentioned? What would be the best process to avoid detection, and which tools would help? I wish you had elaborated there, considering it's a large part of the video's subject.

augmentos

Is the LLM community really not aware of the 40-year-old natural language pre-processing methods developed for data mining and NLP?

jetlime

But the first problem that every crawler needs to face is how to avoid being blocked.

planplay

What are some good, easy-to-use tools that work with LangChain? An LLM is not very useful without such tools. It doesn't even know today's date.

stanTrX