Recursion ➰ for Paginated Web Scraping

We figure out how to deal with the paginated search results in our web scrape. RECURSION is our tool - not as difficult as you might think!!
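
The core idea, as a minimal sketch (the URL, the `.result` and `a.next` selectors, and all names here are illustrative placeholders, not the exact code from the video):

```
const puppeteer = require('puppeteer');

// Recursively scrape one page, then call ourselves for the next page
// until no "next page" link can be found.
async function scrapePage(page, url, results = []) {
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Collect the text of every result item on the current page.
  const items = await page.$$eval('.result', els => els.map(el => el.textContent.trim()));
  results.push(...items);

  // Look for a link to the next page; $eval throws when nothing matches,
  // so a missing link becomes null, which is our base case.
  const nextUrl = await page.$eval('a.next', a => a.href).catch(() => null);
  if (!nextUrl) return results;

  // Recursive case: scrape the next page, carrying the results along.
  return scrapePage(page, nextUrl, results);
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const all = await scrapePage(page, 'https://example.com/search?page=1');
  console.log(`Scraped ${all.length} items`);
  await browser.close();
})();
```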

🗿 MILESTONES
⏯ 00:12 Fika 🍪
⏯ 13:10 Extracting the next page number with regex (see the sketch after this list)
⏯ 16:50 Encounter with prettier... 🌋
⏯ 18:39 ➰ Recap
⏯ 20:15 TIME FOR RECURSION 😎
⏯ 29:00 Quick Google rant 🌋
⏯ 29:23 ➰➰ Rerecap by Commenting the Code
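
For the 13:10 milestone, here is a sketch of pulling a page number out of a URL with a regex; the URL shape is invented for illustration:

```
// Grab the trailing page number from a paginated URL and build the next one.
const url = 'https://example.com/partners/page/3';
const match = url.match(/(\d+)$/); // capture the digits at the very end

if (match) {
  const currentPage = Number(match[1]);
  const nextUrl = url.replace(/\d+$/, String(currentPage + 1));
  console.log(nextUrl); // https://example.com/partners/page/4
}
```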

See the previous episode, where we explain Puppeteer and how to find the data to scrape

The code used in this video is on GitHub

Puppeteer - Node library that drives headless Chrome for scraping (instead of PhantomJS)

The editor is called Visual Studio Code and is free. Look for the Live Share extension to share your environment with friends.

DevTips is a weekly show for YOU who want to be inspired 👍 and learn 🖖 about programming. Hosted by David and MPJ - two notorious bug generators 💖 and teachers 🤗. Exploring code together and learning programming along the way - yay!

DevTips has a sister channel called Fun Fun Function, check it out!

#recursion #webscraping #nodejs

Comments

...and you just answered my question on the previous video! Thanks! I enjoyed these two on web scraping so much.

simoneicardi

These two web scraping vids are awesome! Would love to see one on building a crawler 🕸

naansequitur

I would have used the “next” button in the navigation and used its href to get the next page until there are no more next pages
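
A rough sketch of that approach, assuming placeholder 'a.next' and '.result' selectors (not the video's actual code):

```
// Keep following the "next" link's href until it disappears.
async function scrapeAllPages(page, startUrl) {
  const results = [];
  let url = startUrl;

  while (url) {
    await page.goto(url, { waitUntil: 'networkidle2' });
    results.push(...(await page.$$eval('.result', els => els.map(el => el.textContent.trim()))));

    // $eval throws when the selector matches nothing, so a missing
    // "next" button turns into null and ends the loop.
    url = await page.$eval('a.next', a => a.href).catch(() => null);
  }
  return results;
}
```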

justvashu

Love this video - learned so much and the guys are entertaining to listen to. Thanks

kasio

Thank you!!!
Excellent video that really helped when trying to figure out Puppeteer, and recursion on top of that!
I did find that the count in the recursion didn't like numbers over 9, so I added these two lines to account for pagination numbers of any length. (The first line was cut off here; the completion assumes the current page number lives in a variable like pageNumber.)
```
const digit = String(pageNumber).length; // assumed: how many digits the current page number has
const newStreet = street.slice(0, -digit); // strip that many characters (the old page number) off the end
```
thanks again for a well-timed video that saved the day :)

jolyonfavreau

Thank you so much David for this amazing scraping video.

g-you

I'm impressed that you didn't get an error saying 'browser is not defined'!

Joevanbo

Great tutorial, thank you so much for sharing! I am wondering how to design the function to stop once a certain number of found products has been reached (e.g. when 50 total partners are found, stop the recursion and proceed to other parts of the code)?
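
One way that could work, sketched with placeholder names ('.partner' and 'a.next' are assumptions): thread a limit through the recursion and make reaching it an extra base case.

```
// Stop recursing once `limit` items have been collected, even if more pages exist.
async function scrapeUpTo(page, url, limit, results = []) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  results.push(...(await page.$$eval('.partner', els => els.map(el => el.textContent.trim()))));

  // Base case 1: we have enough; trim any overshoot and stop.
  if (results.length >= limit) return results.slice(0, limit);

  // Base case 2: no next page left.
  const nextUrl = await page.$eval('a.next', a => a.href).catch(() => null);
  if (!nextUrl) return results;

  return scrapeUpTo(page, nextUrl, limit, results);
}

// e.g. const partners = await scrapeUpTo(page, startUrl, 50);
```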

avecho

David, great video. As for that h1 tag... they have a history of funny h1 tags on these landing pages. A little over a year ago, before the "360" rebranding changed their marketing site, I was looking at how they formatted their markup for SEO on one of their product pages. I noticed that the h1 tag was in the markup and said, for example, "Google Tag Manager...", but it was not visible to the user. If I remember correctly, on desktop the h1 tag had display:none attached to it. Then, once the hamburger menu breakpoint was crossed, it was still display:none until you opened the menu, at which point display:none was removed and the h1 tag was wrapped around an img element with an image of the stylized "Google Tag Manager..." The actual text "Google Tag Manager..." in the h1 tag was hidden with CSS and probably used as a fallback. After some research on Matt Cutts' blog I found out that this is semi-okay to do.

drewlomax

Why all the regex stuff over just passing the page number as an argument and creating the URL in the method?
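
Sketched out, the commenter's suggestion might look like this (the URL template is invented): keep the page number as a plain argument and build the URL from it on each call.

```
// Build the URL from a page-number argument instead of regexing it
// back out of the previous URL.
async function scrapeFrom(page, pageNumber = 1, results = []) {
  const url = `https://example.com/partners?page=${pageNumber}`;
  await page.goto(url, { waitUntil: 'networkidle2' });

  const items = await page.$$eval('.partner', els => els.map(el => el.textContent.trim()));
  if (items.length === 0) return results; // an empty page means we ran past the end

  results.push(...items);
  return scrapeFrom(page, pageNumber + 1, results);
}
```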

alexzanderflores

Great Vid! You guys should go over Docker next

charlyecastro

Hello, I have some basic Python web scrape code that saves to a CSV file; what code would we add here so we can save to a CSV file too, please?
Lisa, and thank you
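
In Node (rather than Python), a minimal dependency-free CSV dump could look like this; the { name, city } field names are just an example shape:

```
const fs = require('fs');

// Turn an array of scraped objects into CSV and write it to disk.
function saveAsCsv(rows, path) {
  const header = 'name,city';
  const lines = rows.map(r =>
    // Quote each field and double any embedded quotes, per CSV rules.
    [r.name, r.city].map(v => `"${String(v).replace(/"/g, '""')}"`).join(',')
  );
  fs.writeFileSync(path, [header, ...lines].join('\n'));
}

saveAsCsv([{ name: 'Acme', city: 'Oslo' }], 'partners.csv');
```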

pjmclenon

Thanks!
Hmm, silly-questions section here: the first rule of scraping is "be nice" (don't overload servers, etc.), so wouldn't it be nicer if we first copied all the result pages and scraped them locally? What's the general approach?
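
One polite pattern, as a sketch (paths and the '.result' selector are made up): fetch each page once, cache the rendered HTML to disk, and re-run the extraction against the local copies while iterating on selectors.

```
const fs = require('fs');
const path = require('path');

// First pass: download each page once and cache the rendered HTML.
async function cachePage(page, url, file) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  fs.writeFileSync(file, await page.content());
}

// Later passes: parse the cached file instead of hitting the server again.
async function scrapeCached(page, file) {
  await page.goto('file://' + path.resolve(file));
  return page.$$eval('.result', els => els.map(el => el.textContent.trim()));
}
```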

trendYou

Did the same thing with another website. Everything is the same, but sometimes it returns an empty array [], and sometimes it scrapes only 10 pages even though there are 14. Why is that? I am so tired.
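
Without seeing the site this is only a guess, but intermittent empty arrays often mean the extraction runs before the list renders. Waiting for the selector first (placeholder '.result' here) usually helps:

```
// Guard against scraping before the results have actually rendered.
async function scrapeWhenReady(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  await page.waitForSelector('.result', { timeout: 10000 }); // wait for render
  return page.$$eval('.result', els => els.map(el => el.textContent.trim()));
}
```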

kainarilyasov

Hi David! My pagination URL has no page parameter. Is there any way to scrape the AJAX response? The required content is loaded via AJAX on the client side.
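
Puppeteer can watch the network instead of the DOM; a sketch using page.waitForResponse (the '/api/items' fragment and 'button.load-more' selector are invented):

```
// Trigger the AJAX call and grab the JSON straight off the wire.
async function scrapeAjax(page) {
  const [response] = await Promise.all([
    page.waitForResponse(res => res.url().includes('/api/items') && res.status() === 200),
    page.click('button.load-more'),
  ]);
  return response.json(); // the parsed AJAX payload
}
```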

sridharnetha

Hello everyone, I think it does not work anymore. The class 'Compact' is no longer there. How do I fix that? I tried with 'Landscape' and it returns an empty array either way.

JohnnyMylot

I have a serious, and only slightly related, question. The truth is I am not a coder; I am renting software via ParseHub. I can use the software just fine, but the website I am scraping, despite having tens of thousands of desired results, has a page limit of 15. There is no way I can get the amount of information I need from such small scrapes. Is there any way to bypass this page limit and gain access to the totality of the actual results, as opposed to the pitiful amount I am actually able to see at this time?

logandarsee

I'd like for you to deploy this (maybe to Firebase Hosting, using a Firebase Cloud Function). You would probably run into an annoying CORS error, so I'd be interested to see how you resolve it. For myself, following the CORS tips in the Firebase Cloud Functions docs doesn't seem to help with web scraping with Puppeteer. :(
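
Untested against this exact setup, but the usual shape for CORS in a Firebase HTTP function wraps the handler in the cors middleware; whether Puppeteer itself runs happily inside the function (memory, --no-sandbox) is a separate question:

```
const functions = require('firebase-functions');
const cors = require('cors')({ origin: true });

exports.scrape = functions
  .runWith({ memory: '1GB' }) // headless Chrome wants extra memory
  .https.onRequest((req, res) => {
    cors(req, res, async () => {
      // ...launch Puppeteer with { args: ['--no-sandbox'] } and scrape here...
      res.json({ ok: true });
    });
  });
```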

JBuchmann

David, please bring back the music when you timelapse :) Interested to see where this project is going. Keep it up, always looking forward to the next episode of this series.

Soundtech

While the cat's away the fika comes out to play.

Laek