This Open Source Scraper CHANGES the Game!!!

Hello Everyone,

Here is the link to the whole code on my website:

My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning from GitHub justifying the suspension. I'm confused because similar AI scraper projects are on GitHub and none of them got suspended.

Also check out the 2.0 version here:

________ 👇 Links 👇 ________

________ 👇 Content 👇 ________
Comments

Hey Everyone,


My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning from GitHub justifying the suspension. I'm confused because similar AI scraper projects are on GitHub and none of them got suspended.

I opened a ticket and I'm waiting for their answer.
In the meantime, I've shared the code on my website with all the steps to reproduce the AI scraper.

redamarzouk

I once worked at a company where the data guy built his own web scraper to scrape pricing data off our competitors' websites. One thing they did to protect their site from scraping was user-agent filtering; the way he overcame that limitation was to keep a very long list of different user-agents and rotate them while scraping. I think that would be a good addition to your app. A small but useful change.

RoughSubset
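The rotation idea above can be sketched in a few lines. The user-agent strings here are illustrative only; a real pool would be much longer and refreshed periodically, since a stale list is itself a fingerprint.

```python
import itertools

# Illustrative examples only -- not a maintained list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]

def rotating_headers(pool=USER_AGENTS):
    """Yield a fresh headers dict, cycling through the user-agent pool."""
    for ua in itertools.cycle(pool):
        yield {"User-Agent": ua}

headers = rotating_headers()
# Call next(headers) before each request,
# e.g. requests.get(url, headers=next(headers))
```
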

Amazing work! It works great, but it doesn't handle cases where the results are divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are no more left.

Another great feature—although it might make the tool more expensive, so it could be offered as an optional, selectable feature in the UI—would be for the scraper to open each item's page and scrape data from there. As you know, the initial page often only displays limited information about the product.

thisisfabiop
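The page-by-page suggestion can be sketched as a loop that follows the "next" link until none remains. `fetch` and `find_next_href` are injected stand-ins (assumptions, not the project's actual functions) for the real download and HTML-parsing steps:

```python
from urllib.parse import urljoin

def paginate(start_url, fetch, find_next_href, max_pages=100):
    """Yield (url, html) per page until the 'next' link disappears."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        nxt = find_next_href(html)  # e.g. href of <a rel="next">, if any
        url = urljoin(url, nxt) if nxt else None
```

Each yielded page could then feed a second, optional loop that opens every item's detail URL, which is where the extra cost mentioned above would come in.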

Pretty cool.

Let me point out, though, that the main complexity in scraping is that the relevant content is often hidden: getting to it may require clicking various UI elements.

So to _really_ crack scraping with AI, we'll need to go agentic: the solution will need to figure out what to click in order to reveal the information of interest.

SergeyNumerov
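A toy sketch of that "figure out what to click" step: rank clickable elements by how likely their label is to reveal hidden content. In a truly agentic scraper an LLM would make this choice; the keyword heuristic below is only a stand-in to show the shape of the loop's decision function.

```python
# Labels that commonly hide more content behind a click (an assumption,
# not an exhaustive list).
REVEAL_HINTS = ("show more", "load more", "expand", "details", "next")

def rank_clickables(labels):
    """Return labels most likely to reveal content, best match first."""
    def score(label):
        low = label.lower()
        return max((len(h) for h in REVEAL_HINTS if h in low), default=0)
    return sorted((l for l in labels if score(l)), key=score, reverse=True)
```

The driver loop would click the top-ranked element, re-extract the page, and repeat until nothing promising remains.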

Most of the "traditional" enterprise-grade scraping tech companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions or billions of pages, every 100th of a cent matters, so they take a composite AI approach: use ML models to get the majority of the standard data points for a general schema cheaply, then let LLMs do the thing they do best, extracting data from unstructured text, to extend that schema. That way you get the cost efficiency with the flexibility of LLMs when needed.

The real benefit of the LLM approach for bigger teams/projects is actually that it abstracts away from hard-coding selectors into your spiders, so they are far more robust and unlikely to break in 3 months when the website changes its HTML, reducing your maintenance burden/debt. That's my ten cents anyway.

I personally love what your project does for the everyday person, though: getting small/medium crawls done where price per request isn't so important, and where you have time/space for more rigorous custom QA. I especially love it for content generation purposes, data journalism, chart porn and the like. Great work!

danielcave
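The selector-free idea described above boils down to handing the model raw page text plus a field schema instead of CSS paths. A minimal sketch of such a prompt builder (the schema shape and wording are assumptions, not the project's actual prompt):

```python
def build_extraction_prompt(page_text: str, fields: list[str]) -> str:
    """Ask the model for a JSON array of objects with exactly these keys."""
    schema = ", ".join(f'"{name}": "string"' for name in fields)
    return (
        "Extract every item on the page below as a JSON array of objects, "
        f"each shaped like {{{schema}}}. Return only the JSON.\n\n"
        f"PAGE:\n{page_text}"
    )
```

When the site's HTML changes, nothing here breaks: only the page text fed in changes, which is exactly the robustness argument made in the comment.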

You earned a new subscriber. Algerian brother here.

moiguess

Definitely going to use this; I think it's awesome. As a suggestion for future options, it would be great to have pagination support and depth levels. A lot of my scraping is location-based, for instance states > cities > locations, and the data I usually want is within the locations, which may be only a few.

justjosh

This, and the V2 with Llama, are very interesting concepts, and I believe could be tremendously valuable.
The shortcoming is that it is limited to just the single page at the URL location.

To be truly valuable, it needs to also be a crawler (as you mention).
Think of the use case of scraping ecommerce sites for product details: any "real" ecommerce site is going to have many categories and pages of categorized product listings.
While you can set up traditional scrapers and manually configure the navigation, this is where AI should really shine. It should be able to figure out the navigation and automatically navigate/scrape the site.

rgsiiiya

Hello Reda, you should use Polars instead of Pandas; in a lot of cases it's much faster than Pandas.
Also, adding a ("--headless") option might be useful?

minissoft

The use case I have for a script like this one is to scrape my own open source project code history to convert several versions of config files that contain lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need to output consistent structured data. I look forward to learning more about the development of this new way of scraping and applying it to my own situation. Cheers!

ScottLahteine

The dependency on OpenAI and the API key is a bummer.

It would be better if we could plug in our own open-source AI engine and models.

MoneylessWorld

You're a genius! I am on a Mac, so I just had to change the driver call, but everything else is working well. Pagination or a series of URLs would be cool. I love how you have it load in the Chrome browser; this really changes how I think about cross-platform apps. I wonder if we can scrape Instagram now. Or what about downloading images? Maybe a simple "copy table" button, since I just copy and paste into Google Docs.

shawnsmith

Great video as always. The only downside is that it addresses people who work with code and are experienced in data scraping. For no-code (or very-little-code) people like me, I think the best way is to use computer vision models (VLMs). ChatGPT already has this in its API, and we also have 2 new open-source models that just came out this week: Qwen2-VL and Microsoft Phi-3.5-vision.

TheLionsaba

One of my ideas is to use an AI scraper only for the first scrape test. If it works, you output something like a JSON that holds the id or class of each scraped element, then you give that JSON to your conventional, non-AI scraper to scrape the website for free and faster, without needing AI afterwards.

ginocote
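That two-phase flow can be sketched with only the standard library. Phase 1 (one paid LLM call) is faked here as a JSON string of class names; phase 2 reuses it for free on every later crawl. The class names and fields are invented for illustration:

```python
import json
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class matches one selector."""
    def __init__(self, cls):
        super().__init__()
        self.cls, self.depth, self.hits = cls, 0, []
    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                       # nested tag inside a match
        elif dict(attrs).get("class") == self.cls:
            self.depth = 1                        # entered a matching element
    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth and data.strip():
            self.hits.append(data.strip())

# Phase 1: pretend this JSON came back from the one-time AI run.
selectors = json.loads('{"name": "prod-name", "price": "prod-price"}')

# Phase 2: plain, free parsing driven by the saved selectors.
def extract(html, selectors):
    out = {}
    for field, cls in selectors.items():
        parser = ClassTextExtractor(cls)
        parser.feed(html)
        out[field] = parser.hits
    return out
```

A production version would use BeautifulSoup or lxml with full CSS selectors, but the division of labor (AI once, cheap parser forever after) is the same.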

0:36 - dude got possessed by ChatGPT and his eyes went bananas.

JordanCrawfordSF

Does a Disallow statement in the robots.txt, like "User-agent: GPTBot / Disallow: /", stop it from working?

CicadaMania
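Short answer: robots.txt is advisory, so a Selenium-driven scraper ignores it unless you add a check; a GPTBot rule only binds clients that identify as GPTBot. Python's standard library makes an opt-in check easy (the rules string below mirrors the question's example):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, path):
    """Return True if robots.txt permits user_agent to fetch path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

RULES = """User-agent: GPTBot
Disallow: /
"""
# A GPTBot-only rule blocks GPTBot but not a browser-like user agent.
```

In practice you would fetch `https://example.com/robots.txt` with `RobotFileParser(url).read()` and call `can_fetch` before each request.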

Thanks for the video! What mic are you using?

aleksandars

Thanks for the simple tutorial and code.

Can you add an example of using this scraper with a local Ollama and Llama 3.1 instead of OpenAI, to make it totally free?

SamirDamle
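A hedged sketch of pointing the extraction step at a local Ollama server instead of OpenAI. The `/api/chat` endpoint and the `stream`/`format` fields follow Ollama's documented REST API; the model name and prompt wording are assumptions you would adjust, and it requires `ollama serve` running locally:

```python
import json
import urllib.request

def build_payload(page_text, model="llama3.1"):
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": "Extract the items on this page as JSON:\n"
                                 + page_text}],
        "stream": False,    # one complete response instead of chunks
        "format": "json",   # ask Ollama to constrain output to JSON
    }

def ollama_extract(page_text, host="http://localhost:11434", **kw):
    req = urllib.request.Request(
        host + "/api/chat",
        data=json.dumps(build_payload(page_text, **kw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # needs `ollama serve`
        return json.load(resp)["message"]["content"]
```

The same payload shape works for Groq's OpenAI-compatible endpoint by swapping the host and adding an Authorization header.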

I want to use a Groq API key (because it's free to use) or a local LLM. Please modify this code if possible.

snehasissnehasis-cosn

Thanks for the great video. Idea for next videos: could you extend the code with crawling, for example getting results from search engines or following a specific path to get more structured data?

mzahran
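The crawling extension asked for here can be sketched as a breadth-first walk over same-domain links with a depth limit. `get_links` is injected (an assumption, not the project's API) so the traversal logic stays independent of how pages are fetched:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start, get_links, max_depth=2):
    """Return every same-domain URL within max_depth hops of start."""
    domain = urlparse(start).netloc
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue                     # don't expand past the limit
        for link in get_links(url):
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Each discovered URL would then go through the existing single-page extraction step, which is exactly the "follow a specific path" idea in the comment.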