Web Scraping for LLM in 2024: Jina AI Reader API, Mendable Firecrawl, and Crawl4AI and More

In this video, we look at various tools for web scraping, both free and paid. Learn how to scrape data from web pages and PDFs using Beautiful Soup, the Reader API from Jina AI, and Firecrawl from Mendable. We also discuss more advanced web scraping solutions such as ScrapeGraphAI and Crawl4AI. Ideal for anyone building LLM applications, this video provides practical examples and code demonstrations. Subscribe for more tutorials on building LLM applications and tools!
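
For a feel of what a baseline HTML-to-text step involves, here is a minimal sketch. It is not the code from the video: it swaps Beautiful Soup for Python's standard-library `html.parser` so that it runs with no installs, and the names `TextExtractor`, `html_to_text`, and `SAMPLE_HTML` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, used only to exercise the function.
SAMPLE_HTML = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Pricing</h1><script>var x = 1;</script><p>Free tier available.</p></body></html>
"""

print(html_to_text(SAMPLE_HTML))
```

A tag-stripping baseline like this keeps the text but drops structure such as headings, tables, and links.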

#webscraping #llm #parsing

TIMESTAMPS
00:00 Introduction to Data Scraping Series
00:21 Challenges of Web Data
01:32 Overview of Web Scraping Tools
01:59 Example Web Pages for Scraping
03:05 BeautifulSoup: The Baseline Approach
05:05 Reader API: JINA AI
08:21 FireCrawl: An Alternative Tool
10:42 Crawl4Ai and ScrapeGraphAI
12:13 Conclusion and Next Steps

Comments

Thanks for mentioning ScrapeGraphAI. I'm one of the co-founders. We have implemented new features, such as a code generator for scraping that minimizes the number of LLM calls on sites whose pages share a structure. We are preparing something big related to KG, stay tuned :))))

lurensss

Thanks for mentioning Crawl4AI! I'm adding some new features, such as extracting all media tags (video, image, audio), Breadth-First Search (BFS) crawling, and more. My aim is to generate quality data without relying on large language models (LLMs). I think firing up GPUs and a model with billions of parameters just to crawl data from a page is a bit over the top. Developers can use LLMs themselves once they have the right raw data from web sources.

unclecode
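
Breadth-first crawling, as mentioned in the comment above, visits pages level by level from a start page. The sketch below illustrates the general technique only; it is not Crawl4AI's implementation. The function `bfs_crawl`, its `get_links` callback, and the in-memory `SITE` map are all hypothetical stand-ins for a real fetch-and-parse step.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(start: str,
              get_links: Callable[[str], Iterable[str]],
              max_pages: int = 100) -> list[str]:
    """Return pages in breadth-first visit order, up to max_pages."""
    seen = {start}          # everything already queued or visited
    queue = deque([start])  # FIFO queue gives level-by-level order
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site map standing in for real HTTP requests.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/blog/post-1", "/docs"],
    "/docs/api": [],
    "/blog/post-1": ["/"],
}

print(bfs_crawl("/", lambda url: SITE.get(url, [])))
```

The `seen` set is what keeps the crawl from looping on pages that link back to each other.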

Yes PLEASE, do videos on Crawl4AI and ScrapeGraphAI, and thank you for everything you do and for your time 🙏

mjacfardk

I just use Selenium WebDriver and JavaScript or jQuery to interact with pages and get the parts I want. If they use Cloudflare or other bot blocking, you can run JS in the browser console, use the copy command, and then paste the result into a text file.

TimTruth

For Jina Reader, the API key is free for 1 million tokens, which was about 570 sites for me. After that you pay 10 for 500 million tokens, which is worth roughly 250k sites. That is totally insane; just pay the tiny amount for much better rate limits.

jarad
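
The arithmetic behind that estimate can be checked in a few lines. This sketch takes the commenter's figures at face value; they are not verified pricing, the currency is not stated in the comment, and the variable names are made up for illustration.

```python
# Figures as quoted in the comment above (the commenter's numbers,
# not verified pricing).
FREE_TOKENS = 1_000_000
SITES_ON_FREE_TOKENS = 570
PAID_TOKENS = 500_000_000
PAID_PRICE = 10  # currency is not stated in the comment

# Average tokens consumed per site, implied by the free-tier experience.
tokens_per_site = FREE_TOKENS / SITES_ON_FREE_TOKENS

# Sites covered by the paid allotment at the same average
# (integer arithmetic avoids floating-point rounding).
paid_sites = PAID_TOKENS * SITES_ON_FREE_TOKENS // FREE_TOKENS

price_per_1000_sites = PAID_PRICE / paid_sites * 1000

print(f"about {tokens_per_site:.0f} tokens per site")
print(f"about {paid_sites:,} sites for the paid allotment")
print(f"about {price_per_1000_sites:.3f} per 1,000 sites")
```

With these inputs the estimate comes to 285,000 sites, a little above the comment's rounded 250k.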

Nice comparison! Please continue work on scraping for AI applications. Hot topic!

beemerrox

Thank you so much for sharing this valuable information. It is absolutely helpful.

ahassan

Great review. Please do a review on ScrapeGraphAI. Maybe a comparison to Uncle Code's Crawl4AI? I like Crawl4AI and hope UC incorporates PDF options.

GetzAI

Scrapegraph is pretty amazing, highly recommended

jcksn

Can you make a detailed video on ScrapeGraphAI? It’s kinda buggy for me right now

AJ-lgzr

Thank you. It would be great if you could dive deeper into ScrapeGraph, specifically the knowledge graph feature.

SeeFoodDie

The android in the thumbnail looks like he's DJing. Like he's ready to drop a sick beat...NOW!

john_blues

I need these materials very much. Can you share the code and API, brother??

planetgamecommunity

Thank you so much for sharing this valuable information. It is absolutely helpful. But, as far as Jina AI is concerned, is it possible to specify in the code the number of pages I want to scrape? Sometimes the PDF file has more than 500 pages.

ahassan

Do any of these solutions work on sites you have to log in to? You can give them a url, but if the site requires you to log in, you will not be able to scrape further.

chuckcarlson

Probably a silly question, but in what way is this whole complicated process better than doing a simple copy-paste from the URL?

stefleur

Are there any scrapers available for LinkedIn and Instagram?

ppp

We must create order from the messiness! 😎🤖

thesimplicitylifestyle