How to scrape the web for LLM in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai


(If the code does not display properly, fork and clone the repository. This is a current problem with how GitHub renders Jupyter notebooks through nbconvert.)

Tools mentioned:

LLMs for Devs


Comments

Dammit stop telling everybody about Jina my secret weapon, just stop, it's my advantage, everybody ignore it it's horrible I swear

jarad

Thank you for introducing all the latest technology for web scraping!

kylelau

The YouTube algorithm is just insanely good at what it does. This is exactly the content I needed, and I think I have found what I want to dedicate my life to as a professional.
Thank you for the video, I will buy your course as fast as I collect the money.

alonsoalarconaguilar

The reader API tip is so clutch. Thank You!

matten_zero

Just started wondering about web scraping and here you are.
Thank you.

antoniuskonovalov

How did I not get your content sooner? Love it!

ariG

Came at the perfect time. Very good video. Thx 😊

kuhltime

Thanks for adding a new project to my to do list!

forgotmyoldSN

🎯 Key Takeaways for quick navigation:

00:00 *🚀 Introduction to web scraping for LLMs in 2024*
- Overview of startups pivoting to web scraping.
- Mention of Mendable and its Firecrawl tool for scraping the web using large language models.
02:06 *🔍 Scraping competitors' pricing pages*
- The process of scraping competitors' pricing for market research.
- Introduction to tools used for scraping: Jina AI, Mendable, and Scrapegraph-ai.
03:01 *🧠 Understanding tiktoken and its application*
- Explanation of tokenization and encoding in web scraping.
- Discussion on the cost implications based on tokenization.
05:17 *🛠️ Setting up scrapers with Beautiful Soup and other tools*
- Description of different scraping tools and their setup.
- Comparisons among Beautiful Soup, Jina AI, and Mendable based on ease of use and output.
07:32 *📊 Running scrapers and analyzing outputs*
- Execution of web scraping and evaluation of the output from different tools.
- Analysis of readability and format of the scraped data.
09:37 *💰 Cost comparison and effectiveness of scraping tools*
- Comparison of costs associated with various scraping tools.
- Evaluation of which tool provides the most value for money.
12:53 *🤖 Extracting pricing information using OpenAI*
- Utilization of OpenAI for extracting specific data points.
- Challenges and strategies in obtaining clean and useful information.
17:20 *🌐 Overview of Scrapegraph for advanced web scraping*
- Introduction to Scrapegraph as an open-source project.
- Examples of complex data extraction and its accuracy.

Made with HARPA AI
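
As a rough illustration of the Reader API and token-cost points in the takeaways above, here is a minimal sketch, not the code from the video. It assumes the Reader is called by prefixing the target URL with `https://r.jina.ai/`, that `tiktoken` may or may not be installed, and that the price per million tokens is a figure you supply yourself. The function names and the `User-Agent` string are made up for this example.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Jina Reader is used by prefixing the page URL with this address.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the Reader URL for a page; reject anything that is not absolute http(s)."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch a page through the Reader endpoint (makes a network call)."""
    request = Request(
        reader_url(target),
        headers={"User-Agent": "pricing-research-sketch"},  # hypothetical value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when it is usable, else use a rough estimate."""
    try:
        import tiktoken

        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:
        # Assumption: roughly 4 characters per token for English text.
        return len(text) // 4


def input_cost_usd(n_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate input cost; the price is whatever your model provider charges."""
    return n_tokens / 1_000_000 * usd_per_million_tokens
```

To compare scrapers on cost, pass the output of each tool through `count_tokens` and then `input_cost_usd` with the same price, so that the only variable is how compact each tool's output is.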

roberthuff

@LLMs for Devs. I'm from Jina AI. Cool that you are using our Reader app. I like seeing the exact use cases people have for it. Very interesting.

florianhonicke

keep up the good work! - this is an awesome presentation!

markt

Such valuable video content! Many thanks for sharing!

shuaiwang

The transcript at 1:39 states that you are using large sandwich models. This must be a brand new type of model - mouth watering indeed. 😂

uwepleban

Great video! The open source tool looks great!

As an aside, I use instructor and pydantic classes to get the LLMs to provide the JSON as I expect it. In my limited experience, dspy wasn't as explicit as I wanted.
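
The idea of forcing the model's JSON into an expected shape can be sketched without either library. The example below is a stand-in that uses only the standard library's `dataclasses` and `json`, not instructor or pydantic themselves. The `PricingTier` fields and the `parse_tiers` helper are hypothetical names chosen for a pricing-page use case.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingTier:
    """Hypothetical shape for one plan extracted from a pricing page."""
    name: str
    monthly_price_usd: float


def parse_tiers(raw_json: str) -> list[PricingTier]:
    """Parse LLM output and fail loudly if it does not match the expected shape."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of tiers")

    tiers = []
    for item in data:
        if not isinstance(item, dict):
            raise ValueError(f"expected an object, got {item!r}")
        name = item.get("name")
        price = item.get("monthly_price_usd")
        if not isinstance(name, str) or not name:
            raise ValueError(f"missing or invalid name in {item!r}")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise ValueError(f"missing or invalid price in {item!r}")
        tiers.append(PricingTier(name=name, monthly_price_usd=float(price)))
    return tiers
```

A validation error here is a signal to retry the prompt or flag the page for manual review, rather than letting a malformed record flow into the comparison.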

nickk

Dropping bomb content. Meanwhile, I made a comment analyzer for highly detailed videos that have 100+ comments, since I didn't have time to go through them all. Man, sometimes you don't need to build an Iron Man suit to do simple stuff.

NikhilSwamiExperimental

16:05 Worth trying out GPT-4. I find it more accurate at following instructions.

terrytan

Can anyone speak to the architecture or other tools for preventing detection when using Beautiful Soup, as he mentioned? What would be the best process and tools for avoiding detection? I wish you had elaborated on that, considering it's a large part of the video's subject.

augmentos

Is the LLM community really not aware of 40 year old Natural Language Pre-processing methods developed for data mining and NLP?

jetlime

But the first problem that all crawls need to face is how to avoid being blocked.

planplay

What are some good, easy-to-use tools for LangChain? An LLM is not very useful without such tools; it doesn't even know today's date.

stanTrX
