This AI Agent can Scrape ANY WEBSITE!!!

In this video, we'll create a Python script together that can scrape any website with only minor modifications.

________ 👇 Links 👇 ________

________ 👇 Content 👇 ________

Introduction to Web Scraping with AI - 0:00
Advantages Over Traditional Methods - 0:36
Overview of FireCrawl Library - 1:13
Setting Up FireCrawl Account and API Key - 1:24
Scraping with FireCrawl: Example and Explanation - 1:36
Universal Web Scraping Agent Workflow - 2:33
Setting Up the Project in VS Code - 3:52
Writing the Scrape Data Function - 5:41
Formatting and Saving Data - 6:58
Running the Code: First Example - 10:14
Handling Large Data and Foreign Languages - 13:17
Conclusion and Recap - 17:21
Comments

Hey everyone! 😊 I'm curious about your thoughts—was the explanation and flow of the video too fast, or was it clear and to the point?

redamarzouk

Web scraping as it is right now is here to stay, and AI will not replace it (it can just enhance it in certain scenarios).

First of all, the term "scraping" is tossed around everywhere and used vaguely. When you "scrape", all you do is move information from one place to another, for example getting a website's HTML into your computer's memory.

Then comes "parsing", which is extracting different entities from that information. For example extracting product price and title, from the HTML we "scraped".

These are separate actions: they are not interchangeable, neither is more important than the other, and one can't work without the other. Both come with their own challenges.
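
A minimal Python sketch of those two steps, assuming the requests and lxml libraries and a hypothetical product page with made-up XPath selectors (an illustration, not code from the video):

import requests
from lxml import html

# "Scraping": move the information (a page's HTML) from the website to our machine.
response = requests.get("https://example.com/product/123", timeout=30)
raw_html = response.text

# "Parsing": extract specific entities (price and title) from that information.
tree = html.fromstring(raw_html)
title = tree.xpath("string(//h1)")                 # hypothetical selector for the title
price = tree.xpath("string(//*[@class='price'])")  # hypothetical selector for the price
print({"title": title.strip(), "price": price.strip()})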

What these kinds of videos promise to fix is the "parsing" part. It doesn't matter how advanced AI gets, there is only ONE way to "scrape" information, and that is to make a connection to the place the information is stored (whether it's an HTTP request, browser navigation, an RSS feed request, an FTP download or a stream of data). It's just semi-automated in the background.

Now that we have the fundamentals, let me clearly state this: for the vast majority (99%) of cases, "web scraping with AI" is a waste of time, money and resources, and a burden on our environment.

Time: it's deceiving. AI promises to extract information with a "simple prompt", but you'll need to iterate on that prompt quite a few times to get a somewhat reliable data-parsing solution. In that time you could have built a simple Python script to extract the required data. More complicated scenarios affect both the AI route and the traditional one.

Money: you either use third-party services for LLM inference or you self-host an LLM. In the long term, both are orders of magnitude more expensive than a traditional Python script.

Resources: a lot of people don't realize this, but running an LLM for cases where an LLM is not needed is extremely wasteful. I've run scrapers on old computers, Raspberry Pis and serverless functions; those hardware requirements are a speck of dust compared to running an LLM on an industrial-grade machine with powerful GPU(s).

Environment: following from the resources needed, this affects our environment greatly, as new and more powerful hardware has to be designed, manufactured and run. For those who don't know, AI inference machines (whether self-hosted or third-party) are power hogs, so a lot of watt-hours are wasted, fossil fuels burnt, etc.

Reliability: "Parsing" information with AI is quite unreliable, manly because of the nature of how LLMs work, but also because a lot more points of failure are introduced(information has to travel multiple times between services, LLM models change, you hit usage and/or budget limits, LLMs experience high loads and inference speed sucks or it fails all together, etc.)

Finally: most "AI extraction" is just marketing BS that lets you believe you'll achieve, with just "a simple prompt", something that requires a human brain and workforce.

I've been doing web automation and data extraction for a living for more than a decade. I've also started incorporating AI in some rare cases where traditional methods just don't cut it.

All that being said, for the last 1% of cases where it does make sense to use AI for data parsing, here's what I typically do (after the information is already scraped):

1. First I remove the vast majority of the HTML. If you need an article from a website, it's not going to be in the <script>, <style>, <head> or <footer> tags (you get the idea), so using a Python library (I love lxml) I remove all these tags along with their content. Since we are just looking for an article, I also remove ALL of the HTML attributes, like classes (big one), ids, and so on. After that I remove all the parent/sibling cases where it looks like a useless staircase of tags. I've tried converting to markdown and parsing, and I've tried parsing with a screenshot, but this method is vastly superior because the important HTML elements are still present and LLMs have good general knowledge of HTML. This step makes each request at least 10 times cheaper and lets us use models with smaller context sizes (see the sketch after this list).

2. I then manually copy the article content that I need and put it, along with the cleaned HTML string from step 1, into a JSON object plus prompts to extract an article from the given HTML. I do this at least 15 times. This is the step where the training data is created.

3. Then I fine-tune a GPT-3.5 Turbo model with that JSON data.
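
A rough sketch of what step 1 above could look like with lxml; the tag list and the staircase-collapse rule are assumptions about the idea described, not the commenter's exact pipeline:

from lxml import etree, html

NOISE_TAGS = ("script", "style", "head", "footer", "nav", "noscript", "iframe", "svg")

def clean_html(raw_html: str) -> str:
    tree = html.fromstring(raw_html)

    # Remove noise tags together with their content.
    etree.strip_elements(tree, *NOISE_TAGS, with_tail=False)

    # Strip ALL attributes (class, id, style, data-*, ...) from the remaining elements.
    for el in tree.iter():
        if isinstance(el.tag, str):  # skip comments / processing instructions
            el.attrib.clear()

    # Collapse "staircases": wrapper elements with exactly one child and no text of their own.
    collapsed = True
    while collapsed:
        collapsed = False
        for el in tree.iter():
            parent = el.getparent()
            children = list(el)
            if (parent is not None and isinstance(el.tag, str)
                    and len(children) == 1 and not (el.text or "").strip()):
                children[0].tail = el.tail  # keep any trailing text
                parent.replace(el, children[0])
                collapsed = True
                break

    return etree.tostring(tree, encoding="unicode", pretty_print=True)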

After 10-ish minutes of fine-tuning and around $5-10, I have an "article extraction" fine-tuned model that will outperform any agentic solution in every area (price, speed, accuracy, reliability).

Then I just feed the model a new (unseen) piece of HTML that has passed step 1 above, and it reliably spits out the article for a fraction of a cent in a single step (no agents needed).
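
A hedged sketch of how steps 2-3 and this inference call might look with the OpenAI Python SDK; the prompts, file name and fine-tuned model id are hypothetical placeholders, not the commenter's actual setup:

import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Step 2: each training example is one JSONL line in the chat fine-tuning format.
example = {
    "messages": [
        {"role": "system", "content": "Extract the article text from the given HTML."},
        {"role": "user", "content": "<cleaned HTML from step 1>"},
        {"role": "assistant", "content": "<the article text copied manually>"},
    ]
}
with open("training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Step 3: upload the file and start a fine-tuning job (done once, after 15+ examples).
training_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")

# Inference: once the job finishes, call the resulting model with new, cleaned HTML.
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org::abc123",  # hypothetical fine-tuned model id
    messages=[
        {"role": "system", "content": "Extract the article text from the given HTML."},
        {"role": "user", "content": "<new cleaned HTML>"},
    ],
)
print(response.choices[0].message.content)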

I have a few of those running in production for clients (for different data points), and they do very well, but it's important that a human goes over the results every now and then.

Also, if there is an edge case where the fine-tune did not perform well, you just iterate, feed it more training data, and it works.

todordonev

Nice project! I worked with your code base for a while and used Groq's Mixtral instead, with multiple keys to get past the rate limits. Firecrawl is not automatic when it comes to pagination; you still need to add HTML code, which defeats the purpose. It's slow, but OK for something free, and I think I got around that. The next step is to use it in the front end. Zillow's API is only available to property developers, so scraping with manual inputs is the only way; however, working with live API functionality would be the best way forward. Nice job!

aimattant

It's easy to do with free Python libraries: read the HTML, convert it to markdown, and even convert it to vectors for free with a transformer, etc.
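
A minimal sketch of that route; the choice of html2text and sentence-transformers is an assumption, since the comment doesn't name specific packages:

import requests
import html2text
from sentence_transformers import SentenceTransformer

raw_html = requests.get("https://example.com/article", timeout=30).text  # hypothetical URL

# HTML -> markdown
converter = html2text.HTML2Text()
converter.ignore_links = True
markdown = converter.handle(raw_html)

# markdown -> embedding vector (runs locally, no paid API)
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode(markdown)
print(vector.shape)  # (384,) for this model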

ginocote

In the US, a “bedroom” is a room with a closet, a window, and a door that can be closed.

ConsultantJS

In my experience, function calling is way better at extracting consistent JSON than just prompting. Anyway, God bless this son of my country.

Yassine-tmtj

You said that sometimes the model returns the response with different key names, but if you pass a Pydantic model to the OpenAI model as a function, you can expect a consistent object with exactly the keys you need.
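
A minimal sketch of that idea using the OpenAI Python SDK's structured-output helper, with a Pydantic model fixing the keys; the model name, fields and example input are assumptions for illustration:

from openai import OpenAI
from pydantic import BaseModel

class Listing(BaseModel):
    address: str
    price: str
    bedrooms: int
    bathrooms: int

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the listing fields from the given text."},
        {"role": "user", "content": "123 Main St, San Francisco - $1,250,000 - 3 bd, 2 ba"},
    ],
    response_format=Listing,
)
listing = completion.choices[0].message.parsed  # a Listing instance with fixed keys
print(listing.model_dump())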

seqvpiq

Hey, where have you been, Reda? Scrape that data for me, cousin, and get those agents running for me 💪

talatala

Good work! Nice presentation, nice code! 😃 It will help me a lot. Thanks, Reda.

paulham.

Web scraping (getting data) and parsing (making sense of it) are two crucial steps for data extraction, often misunderstood as interchangeable. While AI promises a magic solution for parsing, it's expensive, unreliable, and environmentally unfriendly; it's better suited to rare cases where traditional methods struggle. There, data pre-processing, training data creation, and fine-tuning a specific AI model are the key to success. Overall, scraping and parsing remain essential, with AI as a valuable tool for specific situations.

TrejonEdmonds

It's not literally beds; rather, it's shorthand for bedrooms: 3 bedrooms.

MichaelStephenLau

I was thinking this was going to be similar to PandasAI, where you can write natural-language prompts and the LLM figures out how to convert them into code for you; that is further enhanced with UIs now, so you literally type a prompt and off you go. Once the environment is ready, there's hardly any coding required. This seems quite a bit more involved than that(?)

KhuramMalikme

Neat overview. I'm curious about the API costs associated with these demos. Try zooming in on your code for viewers.

bls

Very helpful. Great job and thanks for sharing

LucesLab

Hey bro, this is awesome. Being a no-code platform user, I can't fully follow your coding, though I understand the idea. Can you share the scripts you are using, please?

hrarung

Amazing video and great explanations. Many thanks.

iokinpardoitxaso

Thank you for this wonderful tutorial. As I am not a software programmer but like to use web scraping tools for business purposes, is there a way to get this as a simple installer package, or an image for Docker?

bjornschmitz

I have a question: the website you're using seems to show listings for a city like San Francisco, but the results you're getting only have around 10 entries scraped. Why aren't there more?

bestebahn

Very helpful. How do you work around the output limit of 4096 tokens?

titubhowmick

Many thanks for sharing. Quick question: how do I add code to go to page 2 and do the same thing, then page 3, and so on?

dldsijx