Yeah but can it RUN LOCALLY?

Hello Everyone,
A lot of you asked about adding a local Llama model to the universal scraper.
In this video we'll see how to use a local free model to scrape any website from the internet.

________ 👇 Links 👇 ________

Here is the link with the whole code.

________ 👇 Content 👇 ________
Comments
Author

Re: pagination, there would be a few ways to tackle this.

The simplest to implement would be to have the user specify the CSS selector of the "next" button. Then, the script could retrieve a page, wait a few seconds, trigger a click on that selector (e.g. via a JS function), and loop.
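
A minimal sketch of that click-and-loop approach, assuming a Selenium driver; the selector comes from the user, and the per-page extraction step is left out since it isn't shown here:

```python
# Sketch: user-supplied "next" selector, click-and-loop pagination (Selenium assumed).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def scrape_with_next_button(url, next_selector, max_pages=10, delay=3):
    driver = webdriver.Chrome()
    pages = []
    try:
        driver.get(url)
        for _ in range(max_pages):
            time.sleep(delay)                 # wait a few seconds for the page to settle
            pages.append(driver.page_source)  # hand this off to the usual extraction step
            try:
                next_btn = driver.find_element(By.CSS_SELECTOR, next_selector)
            except NoSuchElementException:
                break                         # no next button -> assume last page
            driver.execute_script("arguments[0].click();", next_btn)  # click via JS
    finally:
        driver.quit()
    return pages
```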

Now, if you wanted to make it automatic, I think the simplest way to tackle this would be to write a simple decision-tree algorithm that is triggered when the DOM of the first page is returned. The algorithm would go over the DOM and look for specific signs of pagination: a next button, numbers in a <ul> tag, links with a valid href attribute (not just a #) that contain a number, etc.
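
A rough sketch of that kind of detection with BeautifulSoup; the exact signals and their priority order below are assumptions layered on top of the signs listed above:

```python
# Sketch: scan the DOM for common pagination signals and return a best-guess
# pagination link, or None if nothing convincing is found. Purely heuristic.
import re
from bs4 import BeautifulSoup

def detect_pagination_link(html):
    soup = BeautifulSoup(html, "html.parser")

    # 1. Explicit rel="next" on an <a> or <link> tag
    tag = soup.find("a", rel="next") or soup.find("link", rel="next")
    if tag and tag.get("href", "").strip() not in ("", "#"):
        return tag["href"]

    # 2. Anchor whose text looks like a "next" button
    for a in soup.find_all("a", href=True):
        if a["href"].strip() in ("", "#"):
            continue
        if re.search(r"\b(next|older|more)\b|[>»]", a.get_text(strip=True), re.I):
            return a["href"]

    # 3. Numbered links inside a <ul> (classic pager)
    for ul in soup.find_all("ul"):
        numbered = [a for a in ul.find_all("a", href=True)
                    if a.get_text(strip=True).isdigit() and a["href"].strip() != "#"]
        if len(numbered) >= 2:
            return numbered[0]["href"]   # fall back to the first numbered pager link
    return None
```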

Or if you really want to cover all bases, you could combine the two: the algorithm would first attempt auto-detection, but there would be an input as a fallback in case it fails, or in case the user wants to modify what has been detected.

But I wouldn't necessarily use a model for this, as the cost and duration are going to skyrocket compared to running a simple JS or Python algorithm the way traditional scrapers do.

bluetheredpanda
Author

Check if the website has a sitemap; the links in there are usually the content-related ones. Plus, for SEO purposes, most business-related websites use meaningful keywords in the URLs, which you can regex to filter/sort/prioritize.

The issue you're trying to articulate is how you allow the user to specify which content is scraped and paged. In your example, if you're on a shop site, you obviously want shop-related links; you don't really care about the privacy policy or the returns policy. So, in the same way you provide tags for data extraction, you could also provide a limited set of content-type tags the user could select to guide which links are followed.

Use the AI to make a best guess about the nature of the site and then offer some helpful tags: the AI detects an online store and a blog, so do you want to scrape both, the shop only, images only, or the blog only? If the user selects shop data only, then you can get pretty far in finding the links to follow.
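
A small sketch of the sitemap idea, assuming the site exposes a standard /sitemap.xml; the tag-to-pattern mapping is only an example:

```python
# Sketch: pull sitemap.xml and keep only URLs whose path matches the
# content-type tags the user selected (regex patterns here are illustrative).
import re
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(base_url):
    resp = requests.get(base_url.rstrip("/") + "/sitemap.xml", timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)        # assumes the standard sitemap namespace
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]

def filter_by_tags(urls, tags):
    patterns = {
        "shop": r"/(shop|products?|collections?)/",
        "blog": r"/(blog|news|articles?)/",
    }
    wanted = [re.compile(patterns[t], re.I) for t in tags if t in patterns]
    return [u for u in urls if any(p.search(u) for p in wanted)]

# usage (hypothetical): filter_by_tags(sitemap_urls("https://example.com"), ["shop"])
```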

EmilioGagliardi
Author

From a performance standpoint, I think it's better to use the LLM to analyze the source page layout and have it write a Scrapy (or similar) scraper, and then use that to scrape the data. Using the LLM to process all the data is fine for one or two pages, but if you need to do a big scrape of thousands of pages, the performance is going to be very poor compared to writing a dedicated scraper with the LLM and using that.
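
A hedged sketch of that split: one model call derives CSS selectors from a sample page, and plain parsing handles the rest. The ask_llm_for_selectors wrapper is a placeholder, not the project's actual API:

```python
# Sketch: use the LLM once to map field names to CSS selectors, then reuse
# plain CSS extraction for every remaining page.
from bs4 import BeautifulSoup

def ask_llm_for_selectors(sample_html, fields):
    """Placeholder for a local-model call that returns something like
    {"listing": "div.product-card", "title": "h2", "price": "span.price"}."""
    raise NotImplementedError

def scrape_with_selectors(html, selectors):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(selectors["listing"]):   # repeating listing container
        row = {}
        for field, sel in selectors.items():
            if field == "listing":
                continue
            el = item.select_one(sel)
            row[field] = el.get_text(strip=True) if el else None
        rows.append(row)
    return rows

def scrape_many(pages_html, fields):
    selectors = ask_llm_for_selectors(pages_html[0], fields)   # single LLM call
    return [row for html in pages_html for row in scrape_with_selectors(html, selectors)]
```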

IanHobday
Author

Really looking forward to a follow-up video to this!

thisisfabiop
Author

Thx. Definitely following this project!

IdPreferNot
Author

Thank you Thank you Thank you. They can not stop you.

andretaylor
Author

Thanks, buddy. I'm sorry about your account suspension. I hope they lift it soon.

lowbudgetgamer
Author

Thanks, great project. Regarding the pagination: you could have the user specify a placeholder in the URL to identify the pagination parameter (e.g. page=), and also specify a start and end page number separately, then have it open the different URLs with the number inserted for that parameter.
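
A minimal sketch of that suggestion; the placeholder token and the page range come from user input (the names are made up for illustration):

```python
# Sketch: substitute a user-marked page-number placeholder into the URL and
# yield one URL per page in the requested range.
def paginated_urls(url_template, start_page, end_page, token="{page}"):
    # e.g. url_template = "https://example.com/search?q=shoes&page={page}"
    for n in range(start_page, end_page + 1):
        yield url_template.replace(token, str(n))

# usage (hypothetical): for page_url in paginated_urls(template, 1, 5): scrape(page_url)
```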

MrRossss
Author

Here is how I would implement automatic processing of paginated and/or nested data:

The first time the user runs the script, they should get the option of web scraping with pagination and/or nested data. We can prompt the LLM accordingly, depending on what they choose, and have it store the nested page URLs per listing in the appropriate object, as well as extract the selectors for the next/previous page elements (or at least one pagination link from the DOM, but that is probably tricky to implement).

The user could have the option at the start to decide whether to process all pages and, if not, how many pages they want starting from the page that was linked. The same goes for nested pages: the user can choose to attempt extraction of additional data found on the "detail page" of each listing.

The important part is that, for either, it must process the pages one step at a time while saving the progress. If anything goes wrong or some pages are missing, there should be a clear UI letting the user know, so they can try to scrape those pages again. We could also offer the user an input to give us the pagination selector if all else fails.
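
A sketch of the "one step at a time, save the progress" part: per-page results and failures are written to a JSON file after every page, so a failed run can be resumed. The file name and the fetch_and_extract hook are placeholders:

```python
# Sketch: process pages one at a time, record successes and failures, resume later.
import json
from pathlib import Path

PROGRESS_FILE = Path("scrape_progress.json")

def load_progress():
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    return {"done": {}, "failed": []}

def run(urls, fetch_and_extract):
    # fetch_and_extract(url) stands in for the existing per-page scraping logic
    progress = load_progress()
    for url in urls:
        if url in progress["done"]:
            continue                                  # already scraped in a previous run
        try:
            progress["done"][url] = fetch_and_extract(url)
        except Exception as exc:                      # surface these in the UI afterwards
            progress["failed"].append({"url": url, "error": str(exc)})
        PROGRESS_FILE.write_text(json.dumps(progress, indent=2))  # persist after every page
    return progress
```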

ld-yt.
Author

Great work!
I think there should also be an advanced spider, which checks the full site structure and then uses only the parts that are most needed.

wasserbesser
Author

Why did they suspend the GitHub account?

py_coder_fpv
Author

What about saving the data in a database and running it for a while? That would be amazing... that would be a really useful tool for so many people.
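
A tiny sketch of that, persisting results to a local SQLite file between runs; the table layout is illustrative:

```python
# Sketch: append scraped rows to a local SQLite database so data accumulates over time.
import json
import sqlite3
from datetime import datetime, timezone

def save_rows(db_path, rows, source_url):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS scraped_items (
                        scraped_at TEXT, source_url TEXT, data TEXT)""")
    conn.executemany(
        "INSERT INTO scraped_items VALUES (?, ?, ?)",
        [(datetime.now(timezone.utc).isoformat(), source_url, json.dumps(row)) for row in rows],
    )
    conn.commit()
    conn.close()
```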

adsgsd
Author

A tutorial on how to dockerize this and launch it on your own Linux server, so it can be accessed from any device and from anywhere, would be great. Thanks for the app and code!

unisol
Author

Hi Marzouk, first of all, thanks so much.
Can you show us which files I need to change to use this on Linux?
Are you planning to dockerize this app in the future?

solporcima
Author

A lot of modern websites don't use pagination but load content as you scroll. You have to be able to handle that.
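
A generic sketch for those lazy-loading pages, assuming a Selenium driver: scroll, wait, and stop once the page height stops growing:

```python
# Sketch: handle infinite scroll by scrolling until the document stops growing.
import time

def scroll_to_bottom(driver, pause=2, max_rounds=30):
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)                     # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:         # nothing new loaded -> done
            break
        last_height = new_height
    return driver.page_source
```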

schongut
Author

Perhaps you could allow users to input a URL pattern with [1-X] at the end, so your code can turn it into individual page URLs and run once per page.
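
A small sketch of expanding that kind of trailing [1-X] range into concrete page URLs (the helper name is made up):

```python
# Sketch: expand a trailing "[1-5]"-style range in a URL into one URL per page.
import re

def expand_url_range(pattern):
    # "https://example.com/items?page=[1-3]" -> ["...page=1", "...page=2", "...page=3"]
    match = re.search(r"\[(\d+)-(\d+)\]$", pattern)
    if not match:
        return [pattern]                              # no range -> single URL
    start, end = int(match.group(1)), int(match.group(2))
    prefix = pattern[:match.start()]
    return [f"{prefix}{n}" for n in range(start, end + 1)]
```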

rayhon
Author

An idea for the pagination: scrape the source code of the URL, send it to the LLM to recognize the structure, and apply the appropriate scraping code selected from a library of different approaches?
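
A hedged sketch of that "library of approaches" idea: a hypothetical classification call labels the pagination style, and a dispatch table maps the label to a conventional handler (all names below are placeholders):

```python
# Sketch: let a model (or a heuristic) label the pagination style, then dispatch
# to ordinary scraping code picked from a small library of approaches.
def classify_pagination(html):
    """Placeholder for an LLM call returning one of:
    "next_button", "numbered_pages", "infinite_scroll", "none"."""
    raise NotImplementedError

def handle_next_button(url, html): ...        # stubs standing in for the library
def handle_numbered_pages(url, html): ...
def handle_infinite_scroll(url, html): ...

HANDLERS = {
    "next_button": handle_next_button,
    "numbered_pages": handle_numbered_pages,
    "infinite_scroll": handle_infinite_scroll,
}

def scrape(url, fetch):
    html = fetch(url)
    style = classify_pagination(html)
    handler = HANDLERS.get(style)
    return handler(url, html) if handler else [html]   # "none" -> treat as a single page
```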

Cairthebest
Author

I love this content, but I'd like to see your take on good old regular scraping, like Scrapy with proxies.

guerra_dos_bichos
Author

Can you say exactly which Google API key to get? There are a lot of options out there, and it's getting confusing.

Alimehdimalpara
Author

They ban you because, unlike others I will not mention, you are giving value without the financial lure of a subscription fee. This is bad for the business models of many others, so they will always try to shut you down. Keep being a rebel and giving REAL value; this is the only way someone with nothing can ever have a chance. Trust me, I have used this to get a new income when I was at rock bottom, so thank you!!! Keep giving, IT WILL GIVE BACK :D

andyshaw-vp