Yeah but can it RUN LOCALLY?

Hello Everyone,
A lot of you asked about adding a local Llama model to the universal scraper.
In this video we'll see how to use a local free model to scrape any website from the internet.

________ 👇 Links 👇 ________

Here is the link with the whole code.

________ 👇 Content 👇 ________
Comments
Author

Re: pagination, there would be a few ways to tackle this.

The simplest to implement would be to have the user specify the CSS selector of the "next" button. Then, the script could retrieve a page, wait a few seconds, trigger a click on that selector (e.g. via a JS function), and loop.
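
A minimal sketch of that click-and-loop approach, assuming a Selenium driver; the selector comes from the user, and the per-page extraction step is left out since it isn't shown here:

```python
# Sketch: user-supplied "next" selector, click-and-loop pagination (Selenium assumed).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def scrape_with_next_button(url, next_selector, max_pages=10, delay=3):
    driver = webdriver.Chrome()
    pages = []
    try:
        driver.get(url)
        for _ in range(max_pages):
            time.sleep(delay)                 # wait a few seconds for the page to settle
            pages.append(driver.page_source)  # hand this off to the usual extraction step
            try:
                next_btn = driver.find_element(By.CSS_SELECTOR, next_selector)
            except NoSuchElementException:
                break                         # no next button -> assume last page
            driver.execute_script("arguments[0].click();", next_btn)  # click via JS
    finally:
        driver.quit()
    return pages
```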

Now, if you wanted to make it automatic, I think the simplest way to tackle this would be to write a simple decision-tree algorithm that is triggered when the DOM of the first page is returned. The algorithm would go over the DOM and look for specific signs of pagination: a next button, numbers in a <ul> tag, links with a valid href attribute (not just a #) that contain a number, etc.
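
A rough sketch of that kind of detection with BeautifulSoup; the exact signals and their priority order below are assumptions layered on top of the signs listed above:

```python
# Sketch: scan the DOM for common pagination signals and return a best-guess
# pagination link, or None if nothing convincing is found. Purely heuristic.
import re
from bs4 import BeautifulSoup

def detect_pagination_link(html):
    soup = BeautifulSoup(html, "html.parser")

    # 1. Explicit rel="next" on an <a> or <link> tag
    tag = soup.find("a", rel="next") or soup.find("link", rel="next")
    if tag and tag.get("href", "").strip() not in ("", "#"):
        return tag["href"]

    # 2. Anchor whose text looks like a "next" button
    for a in soup.find_all("a", href=True):
        if a["href"].strip() in ("", "#"):
            continue
        if re.search(r"\b(next|older|more)\b|[>»]", a.get_text(strip=True), re.I):
            return a["href"]

    # 3. Numbered links inside a <ul> (classic pager)
    for ul in soup.find_all("ul"):
        numbered = [a for a in ul.find_all("a", href=True)
                    if a.get_text(strip=True).isdigit() and a["href"].strip() != "#"]
        if len(numbered) >= 2:
            return numbered[0]["href"]   # fall back to the first numbered pager link
    return None
```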

Or if you really want to cover all bases, you could combine the two: the algorithm would first attempt auto-detection, but there would be an input as a fallback in case it fails, or in case the user wants to modify what has been detected.

But I wouldn't necessarily use a model for this, as the cost and duration are going to skyrocket compared to running a simple JS or Python algorithm the way traditional scrapers do.

bluetheredpanda
Author

Check if the website has a sitemap; the links in there are usually the content-related ones. Plus, for SEO purposes, most business-related websites use meaningful keywords in the URLs, which you can regex to filter/sort/prioritize.

The issue you're trying to articulate is how you allow the user to specify which content is scraped and paged. In your example, if you're on a shop site, you obviously want shop-related links; you don't really care about the privacy policy or the returns policy. So, in the same way you provide tags for data extraction, you could also provide a limited set of content-type tags the user could select to guide which links are followed.

Use the AI to make a best guess about the nature of the site and then offer some helpful tags: the AI detects an online store and a blog, so do you want to scrape both, the shop only, images only, or the blog only? If the user selects shop data only, then you can get pretty far in finding the links to follow.
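
A small sketch of the sitemap idea, assuming the site exposes a standard /sitemap.xml; the tag-to-pattern mapping is only an example:

```python
# Sketch: pull sitemap.xml and keep only URLs whose path matches the
# content-type tags the user selected (regex patterns here are illustrative).
import re
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(base_url):
    resp = requests.get(base_url.rstrip("/") + "/sitemap.xml", timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)        # assumes the standard sitemap namespace
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]

def filter_by_tags(urls, tags):
    patterns = {
        "shop": r"/(shop|products?|collections?)/",
        "blog": r"/(blog|news|articles?)/",
    }
    wanted = [re.compile(patterns[t], re.I) for t in tags if t in patterns]
    return [u for u in urls if any(p.search(u) for p in wanted)]

# usage (hypothetical): filter_by_tags(sitemap_urls("https://example.com"), ["shop"])
```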

EmilioGagliardi
Author

From a performance standpoint, I think it's better to use the LLM to analyze the source page layout and have it write a Scrapy (or similar) scraper, and then use that to scrape the data. Using the LLM to process all the data is fine for one or two pages, but if you need to do a big scrape of thousands of pages, the performance is going to be very poor compared to writing a dedicated scraper with the LLM and using that.
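
A hedged sketch of that split: one model call derives CSS selectors from a sample page, and plain parsing handles the rest. The ask_llm_for_selectors wrapper is a placeholder, not the project's actual API:

```python
# Sketch: use the LLM once to map field names to CSS selectors, then reuse
# plain CSS extraction for every remaining page.
from bs4 import BeautifulSoup

def ask_llm_for_selectors(sample_html, fields):
    """Placeholder for a local-model call that returns something like
    {"listing": "div.product-card", "title": "h2", "price": "span.price"}."""
    raise NotImplementedError

def scrape_with_selectors(html, selectors):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(selectors["listing"]):   # repeating listing container
        row = {}
        for field, sel in selectors.items():
            if field == "listing":
                continue
            el = item.select_one(sel)
            row[field] = el.get_text(strip=True) if el else None
        rows.append(row)
    return rows

def scrape_many(pages_html, fields):
    selectors = ask_llm_for_selectors(pages_html[0], fields)   # single LLM call
    return [row for html in pages_html for row in scrape_with_selectors(html, selectors)]
```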

IanHobday
Author

Really looking forward to a follow-up video to this!

thisisfabiop
Author

Thx. Definitely following this project!

IdPreferNot
Author

Thank you Thank you Thank you. They can not stop you.

andretaylor
Author

Thanks, buddy. I'm sorry about your account suspension. I hope they lift it soon.

lowbudgetgamer
Author

Thanks, great project. Regarding the pagination: you could have the user specify a placeholder in the URL to identify the pagination parameter (e.g. page=), and also specify a start and end page number separately, then have it open the different URLs with the number inserted for that parameter.
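
A minimal sketch of that suggestion; the placeholder token and the page range come from user input (the names are made up for illustration):

```python
# Sketch: substitute a user-marked page-number placeholder into the URL and
# yield one URL per page in the requested range.
def paginated_urls(url_template, start_page, end_page, token="{page}"):
    # e.g. url_template = "https://example.com/search?q=shoes&page={page}"
    for n in range(start_page, end_page + 1):
        yield url_template.replace(token, str(n))

# usage (hypothetical): for page_url in paginated_urls(template, 1, 5): scrape(page_url)
```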

MrRossss
Author

Here is how I would implement automatic processing of paginated and/or nested data:

The first time the user runs the script, they should get the option of web scraping with pagination and/or nested data. We can prompt the LLM accordingly, depending on what they choose, and have it store the nested page URLs per listing in the appropriate object, as well as extract the selectors for the next/previous page elements (or at least one pagination link from the DOM, but that is probably tricky to implement).

The user could have the option at the start to decide whether to process all pages and, if not, how many pages they want starting from the page that was linked. The same goes for nested pages: the user can choose to attempt extraction of additional data found on the "detail page" of each listing.

The important part is that, for either, it must process the pages one step at a time while saving the progress. If anything goes wrong or some pages are missing, there should be a clear UI letting the user know, so they can try to scrape those pages again. We could also offer the user an input to give us the pagination selector if all else fails.
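
A sketch of the "one step at a time, save the progress" part: per-page results and failures are written to a JSON file after every page, so a failed run can be resumed. The file name and the fetch_and_extract hook are placeholders:

```python
# Sketch: process pages one at a time, record successes and failures, resume later.
import json
from pathlib import Path

PROGRESS_FILE = Path("scrape_progress.json")

def load_progress():
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    return {"done": {}, "failed": []}

def run(urls, fetch_and_extract):
    # fetch_and_extract(url) stands in for the existing per-page scraping logic
    progress = load_progress()
    for url in urls:
        if url in progress["done"]:
            continue                                  # already scraped in a previous run
        try:
            progress["done"][url] = fetch_and_extract(url)
        except Exception as exc:                      # surface these in the UI afterwards
            progress["failed"].append({"url": url, "error": str(exc)})
        PROGRESS_FILE.write_text(json.dumps(progress, indent=2))  # persist after every page
    return progress
```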

ld-yt.
Author

Great work!
I think there should also be an advanced spider, which checks the full site structure and then uses only the parts that are most needed.

wasserbesser
Author

Why did they suspend the GitHub account?

py_coder_fpv
Author

What about saving the data in a database and running it for a while? That would be amazing... that would be a really useful tool for so many people.
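
A tiny sketch of that, persisting results to a local SQLite file between runs; the table layout is illustrative:

```python
# Sketch: append scraped rows to a local SQLite database so data accumulates over time.
import json
import sqlite3
from datetime import datetime, timezone

def save_rows(db_path, rows, source_url):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS scraped_items (
                        scraped_at TEXT, source_url TEXT, data TEXT)""")
    conn.executemany(
        "INSERT INTO scraped_items VALUES (?, ?, ?)",
        [(datetime.now(timezone.utc).isoformat(), source_url, json.dumps(row)) for row in rows],
    )
    conn.commit()
    conn.close()
```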

adsgsd
Author

A tutorial on how to dockerize this and launch it on your own Linux server, so it can be accessed from any device and from anywhere, would be great. Thanks for the app and code!

unisol
Author

Hi Marzouk, first of all, thanks so much.
Can you show us which files I need to change to use this on Linux?
Are you planning to dockerize this app in the future?

solporcima
Author

A lot of modern websites don't use pagination but load content as you scroll. You have to be able to handle that.
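
A generic sketch for those lazy-loading pages, assuming a Selenium driver: scroll, wait, and stop once the page height stops growing:

```python
# Sketch: handle infinite scroll by scrolling until the document stops growing.
import time

def scroll_to_bottom(driver, pause=2, max_rounds=30):
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)                     # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:         # nothing new loaded -> done
            break
        last_height = new_height
    return driver.page_source
```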

schongut
Author

Perhaps you could allow users to input a URL pattern with [1-X] at the end, so your code can turn it into individual page URLs and run once per page.
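
A small sketch of expanding that kind of trailing [1-X] range into concrete page URLs (the helper name is made up):

```python
# Sketch: expand a trailing "[1-5]"-style range in a URL into one URL per page.
import re

def expand_url_range(pattern):
    # "https://example.com/items?page=[1-3]" -> ["...page=1", "...page=2", "...page=3"]
    match = re.search(r"\[(\d+)-(\d+)\]$", pattern)
    if not match:
        return [pattern]                              # no range -> single URL
    start, end = int(match.group(1)), int(match.group(2))
    prefix = pattern[:match.start()]
    return [f"{prefix}{n}" for n in range(start, end + 1)]
```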

rayhon
Author

An idea for the pagination: scrape the source code of the URL, send it to the LLM to recognize the structure, and apply the appropriate scraping code selected from a library of different approaches?
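
A hedged sketch of that "library of approaches" idea: a hypothetical classification call labels the pagination style, and a dispatch table maps the label to a conventional handler (all names below are placeholders):

```python
# Sketch: let a model (or a heuristic) label the pagination style, then dispatch
# to ordinary scraping code picked from a small library of approaches.
def classify_pagination(html):
    """Placeholder for an LLM call returning one of:
    "next_button", "numbered_pages", "infinite_scroll", "none"."""
    raise NotImplementedError

def handle_next_button(url, html): ...        # stubs standing in for the library
def handle_numbered_pages(url, html): ...
def handle_infinite_scroll(url, html): ...

HANDLERS = {
    "next_button": handle_next_button,
    "numbered_pages": handle_numbered_pages,
    "infinite_scroll": handle_infinite_scroll,
}

def scrape(url, fetch):
    html = fetch(url)
    style = classify_pagination(html)
    handler = HANDLERS.get(style)
    return handler(url, html) if handler else [html]   # "none" -> treat as a single page
```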

Cairthebest
Author

I love this content, but I'd like to see your take on good old regular scraping, like Scrapy with proxies.

guerra_dos_bichos
Author

Can you say exactly which Google API key to get? There are a lot of options out there, and it's getting confusing.

Alimehdimalpara
Author

They ban you because, unlike others I will not mention, you are giving value without the financial lure of a subscription fee. This is bad for the business models of many others, so they will always try to shut you down. Keep being a rebel and giving REAL value; this is the only way someone with nothing can ever have a chance. Trust me, I have used this to get a new income when I was at rock bottom, so thank you!!! Keep giving, IT WILL GIVE BACK :D

andyshaw-vp