This Open Source Scraper CHANGES the Game!!!

Hello Everyone,

Here is the link to the whole code on my website:

My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning from GitHub justifying the suspension. I'm confused because similar AI scraper projects are on GitHub and none of them got suspended.

Also check out the 2.0 version here:

________ 👇 Links 👇 ________

________ 👇 Content 👇 ________
Comments

Hey Everyone,


My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning from GitHub justifying the suspension. I'm confused because similar AI scraper projects are on GitHub and none of them got suspended.

I opened a ticket and I'm waiting for their answer.
In the meantime, I've shared the code on my website with all the steps to reproduce the AI scraper.

redamarzouk

I once worked at a company where the data guy built his own web scraper to scrape pricing data off our competitors' websites. One thing they did to protect their site from scraping was user-agent filtering; the way he overcame that limitation was to keep a very long list of different user-agents and rotate them while scraping. I think that would be a good addition to your app. A small but useful change.

RoughSubset
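The rotation idea above can be sketched in a few lines. The user-agent strings here are illustrative only; a real pool would be much longer and refreshed periodically, since a stale list is itself a fingerprint.

```python
import itertools

# Illustrative examples only -- not a maintained list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]

def rotating_headers(pool=USER_AGENTS):
    """Yield a fresh headers dict, cycling through the user-agent pool."""
    for ua in itertools.cycle(pool):
        yield {"User-Agent": ua}

headers = rotating_headers()
# Call next(headers) before each request,
# e.g. requests.get(url, headers=next(headers))
```
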

Amazing work! It works great, but it doesn't handle cases where the results are divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are no more left.

Another great feature—although it might make the tool more expensive, so it could be offered as an optional, selectable feature in the UI—would be for the scraper to open each item's page and scrape data from there. As you know, the initial page often only displays limited information about the product.

thisisfabiop
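The page-by-page suggestion can be sketched as a loop that follows the "next" link until none remains. `fetch` and `find_next_href` are injected stand-ins (assumptions, not the project's actual functions) for the real download and HTML-parsing steps:

```python
from urllib.parse import urljoin

def paginate(start_url, fetch, find_next_href, max_pages=100):
    """Yield (url, html) per page until the 'next' link disappears."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        nxt = find_next_href(html)  # e.g. href of <a rel="next">, if any
        url = urljoin(url, nxt) if nxt else None
```

Each yielded page could then feed a second, optional loop that opens every item's detail URL, which is where the extra cost mentioned above would come in.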

Pretty cool.

Let me point out, though, that the main complexity in scraping is that the relevant content is often hidden: getting to it may require clicking various UI elements.

So to _really_ crack scraping with AI, we'll need to go agentic: the solution will need to figure out what to click in order to reveal the information of interest.

SergeyNumerov
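A toy sketch of that "figure out what to click" step: rank clickable elements by how likely their label is to reveal hidden content. In a truly agentic scraper an LLM would make this choice; the keyword heuristic below is only a stand-in to show the shape of the loop's decision function.

```python
# Labels that commonly hide more content behind a click (an assumption,
# not an exhaustive list).
REVEAL_HINTS = ("show more", "load more", "expand", "details", "next")

def rank_clickables(labels):
    """Return labels most likely to reveal content, best match first."""
    def score(label):
        low = label.lower()
        return max((len(h) for h in REVEAL_HINTS if h in low), default=0)
    return sorted((l for l in labels if score(l)), key=score, reverse=True)
```

The driver loop would click the top-ranked element, re-extract the page, and repeat until nothing promising remains.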

Most of the "traditional" enterprise-grade scraping tech companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions or billions of pages, every 100th of a cent matters, so they take a composite AI approach: use ML models to get the majority of the standard data points for a general schema cheaply, then let LLMs do the thing they do best, extracting data from unstructured text, to extend that schema. That way you get the cost efficiency with the flexibility of LLMs when needed.

The real benefit of the LLM approach for bigger teams/projects is actually that it abstracts away from hard-coding selectors into your spiders, so they are far more robust and unlikely to break in 3 months when the website changes its HTML, reducing your maintenance burden/debt. That's my ten cents anyway.

I personally love what your project does for the everyday person, though: getting small/medium crawls done where price per request isn't so important, and where you have time/space for more rigorous custom QA. I especially love it for content generation purposes, data journalism, chart porn and the like. Great work!

danielcave
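The selector-free idea described above boils down to handing the model raw page text plus a field schema instead of CSS paths. A minimal sketch of such a prompt builder (the schema shape and wording are assumptions, not the project's actual prompt):

```python
def build_extraction_prompt(page_text: str, fields: list[str]) -> str:
    """Ask the model for a JSON array of objects with exactly these keys."""
    schema = ", ".join(f'"{name}": "string"' for name in fields)
    return (
        "Extract every item on the page below as a JSON array of objects, "
        f"each shaped like {{{schema}}}. Return only the JSON.\n\n"
        f"PAGE:\n{page_text}"
    )
```

When the site's HTML changes, nothing here breaks: only the page text fed in changes, which is exactly the robustness argument made in the comment.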

You earned a new subscriber. Algerian brother here.

moiguess

Definitely going to use this; I think it's awesome. As a suggestion for future options, it would be great to have pagination support and depth levels. A lot of my scraping is location-based, for instance states > cities > locations, and the data I usually want is within the locations, which may be only a few.

justjosh

This, and the V2 with Llama, are very interesting concepts, and I believe could be tremendously valuable.
The shortcoming is that it is limited to just the single page at the URL location.

To be truly valuable, it needs to also be a crawler (as you mention).
Think of the use case of scraping ecommerce sites for product details: any "real" ecommerce site is going to have many categories and pages of categorized product listings.
While you can set up traditional scrapers and manually configure the navigation, this is where AI should really shine. It should be able to figure out the navigation and automatically navigate/scrape the site.

rgsiiiya

Hello Reda, you should use Polars instead of Pandas; in a lot of cases it's much faster than Pandas.
Also, adding a ("--headless") option might be useful?

minissoft

The use case I have for a script like this one is to scrape my own open source project code history to convert several versions of config files that contain lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need to output consistent structured data. I look forward to learning more about the development of this new way of scraping and applying it to my own situation. Cheers!

ScottLahteine

The dependency on OpenAI and the API key is a bummer.

It would be better if we could plug in our own open-source AI engine and models.

MoneylessWorld

You're a genius! I am on a Mac, so I just had to change the driver call, but everything else is working well. Pagination or a series of URLs would be cool. I love how you have it load in the Chrome browser; this really changes how I think about cross-platform apps. I wonder if we can scrape Instagram now. Or what about downloading images? Maybe a simple "copy table" button, since I just copy and paste into Google Docs.

shawnsmith

Great video as always. The only downside is that it addresses people who work with code and are experienced in data scraping. For no-code (or very-little-code) people like me, I think the best way is to use computer vision models (VLMs). ChatGPT already has this in its API, and we also have 2 new open-source models that just came out this week: Qwen2-VL and Microsoft Phi-3.5-vision.

TheLionsaba

One of my ideas is to use an AI scraper only for the first scrape test. If it works, you output something like a JSON that holds the id or class of each scraped element, then you give that JSON to your conventional, non-AI scraper to scrape the website for free and faster, without needing AI afterwards.

ginocote
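That two-phase flow can be sketched with only the standard library. Phase 1 (one paid LLM call) is faked here as a JSON string of class names; phase 2 reuses it for free on every later crawl. The class names and fields are invented for illustration:

```python
import json
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class matches one selector."""
    def __init__(self, cls):
        super().__init__()
        self.cls, self.depth, self.hits = cls, 0, []
    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                       # nested tag inside a match
        elif dict(attrs).get("class") == self.cls:
            self.depth = 1                        # entered a matching element
    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth and data.strip():
            self.hits.append(data.strip())

# Phase 1: pretend this JSON came back from the one-time AI run.
selectors = json.loads('{"name": "prod-name", "price": "prod-price"}')

# Phase 2: plain, free parsing driven by the saved selectors.
def extract(html, selectors):
    out = {}
    for field, cls in selectors.items():
        parser = ClassTextExtractor(cls)
        parser.feed(html)
        out[field] = parser.hits
    return out
```

A production version would use BeautifulSoup or lxml with full CSS selectors, but the division of labor (AI once, cheap parser forever after) is the same.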

0:36 - dude got possessed by ChatGPT and his eyes went bananas.

JordanCrawfordSF

Does a Disallow statement in the robots.txt, like "User-agent: GPTBot / Disallow: /", stop it from working?

CicadaMania
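Short answer: robots.txt is advisory, so a Selenium-driven scraper ignores it unless you add a check; a GPTBot rule only binds clients that identify as GPTBot. Python's standard library makes an opt-in check easy (the rules string below mirrors the question's example):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, path):
    """Return True if robots.txt permits user_agent to fetch path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

RULES = """User-agent: GPTBot
Disallow: /
"""
# A GPTBot-only rule blocks GPTBot but not a browser-like user agent.
```

In practice you would fetch `https://example.com/robots.txt` with `RobotFileParser(url).read()` and call `can_fetch` before each request.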

Thanks for the video! What mic are you using?

aleksandars

Thanks for the simple tutorial and code.

Can you add an example of using this scraper with a local Ollama and Llama 3.1 instead of OpenAI, to make it totally free?

SamirDamle
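A hedged sketch of pointing the extraction step at a local Ollama server instead of OpenAI. The `/api/chat` endpoint and the `stream`/`format` fields follow Ollama's documented REST API; the model name and prompt wording are assumptions you would adjust, and it requires `ollama serve` running locally:

```python
import json
import urllib.request

def build_payload(page_text, model="llama3.1"):
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": "Extract the items on this page as JSON:\n"
                                 + page_text}],
        "stream": False,    # one complete response instead of chunks
        "format": "json",   # ask Ollama to constrain output to JSON
    }

def ollama_extract(page_text, host="http://localhost:11434", **kw):
    req = urllib.request.Request(
        host + "/api/chat",
        data=json.dumps(build_payload(page_text, **kw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # needs `ollama serve`
        return json.load(resp)["message"]["content"]
```

The same payload shape works for Groq's OpenAI-compatible endpoint by swapping the host and adding an Authorization header.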

I want to use a Groq API key (because it's free to use) or a local LLM. Please modify this code if possible.

snehasissnehasis-cosn

Thanks for the great video. Idea for next videos: could you extend the code with crawling, for example getting results from search engines or following a specific path to get more structured data?

mzahran
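The crawling extension asked for here can be sketched as a breadth-first walk over same-domain links with a depth limit. `get_links` is injected (an assumption, not the project's API) so the traversal logic stays independent of how pages are fetched:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start, get_links, max_depth=2):
    """Return every same-domain URL within max_depth hops of start."""
    domain = urlparse(start).netloc
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue                     # don't expand past the limit
        for link in get_links(url):
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Each discovered URL would then go through the existing single-page extraction step, which is exactly the "follow a specific path" idea in the comment.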