Pagination is Bad for Scrapy and How to Avoid it

It is a very common practice to create a new request for the next page to get the next page's data, but this produces inefficient Scrapy spiders. Understand WHY it is bad and how to overcome it (a minimal sketch of both approaches follows below). This video is part of my brand new Scrapy course. To get access to this course now,
⮕ Become a member and get access to all the courses on my site:
To get a taste of Scrapy,
⮕ Take the course on Scrapy Basics for $1 or free 😀
(Use coupon code *FREE* on the checkout page)
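
As a quick illustration, here is a minimal sketch of the two approaches the video contrasts, written against the quotes.toscrape.com practice site (its selectors are real; the spider names and the page count of 10 are just for illustration). The first spider only discovers page N+1 after page N has been downloaded, so pagination runs one request at a time; the second yields every page request up front, so Scrapy's scheduler can download them concurrently.

import scrapy


# Traditional chaining: the next-page request is only created after the
# previous response arrives, so listing pages are fetched one at a time.
class ChainedSpider(scrapy.Spider):
    name = "quotes_chained"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


# Upfront requests: when the page count is known (or can be read from page 1),
# yield all page requests at once and let the scheduler fetch them concurrently.
class UpfrontSpider(scrapy.Spider):
    name = "quotes_upfront"

    def start_requests(self):
        for page in range(1, 11):  # 10 pages, assumed for illustration
            yield scrapy.Request(f"https://quotes.toscrape.com/page/{page}/")

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}

With chaining, total runtime grows roughly as pages times round-trip time; with upfront requests, downloads overlap up to the CONCURRENT_REQUESTS limit.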

🔓 SOURCE CODE

📠 GEAR I USE and RECOMMEND
Note: These are affiliate links. I get a small commission if you click on them and buy, at no extra cost to you.
⮕ Røde PodMic
⮕ Audio Interface - Focusrite Scarlett 2i2

📕 CHAPTERS
00:00 Traditional Paging
00:37 What's Wrong with this Approach
01:56 Fixing a Simple Spider
05:15 Fixing Amazon Scraper
08:55 Why Proxies are Useful

#webscraping #python #codeRECODE #upendra #scrapy

-~-~~-~~~-~~-~-
Please watch: "Making Scrapy Playwright fast and reliable"
-~-~~-~~~-~~-~-
Comments

This video is from my course on Scrapy.
I edited it for YouTube, so watching this one video on its own still makes sense. Hope it's useful :-)

codeRECODE

Thank you sir, I had a task at work today about pagination and remembered to check this vid. The code worked perfectly and runtime was reduced significantly.

teodortodorov

Thanks for sharing this tip, it's very useful for utilizing the real async power of Scrapy.

aleksandarboshevski

To develop the right mindset for fully utilizing Scrapy's async capabilities, this method should probably be taught as the default from the beginning, because once you form certain coding/thinking habits it is much harder to change that state of mind later.

aleksandarboshevski

Thank you for making a video on this great topic!

subrinalazad

Nice way to look at it! Thanks for the video

carloscampos

Wow, I didn't know about that. I always thought the next_page approach was best practice. I didn't realize there was a better one.

DittoRahmat

I always use this approach when it's available... but I didn't know it actually speeds up performance 😂 Thanks for the explanation ❤️ Love your content.

jagdisho

Great idea, but what about CrawlSpider?

diegovargas
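
On the CrawlSpider question: a pagination Rule with a LinkExtractor works, but it still discovers each next-page link only after the previous page has been downloaded, so it behaves like the chained approach above. A minimal sketch against the same practice site (the rule and callback names are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    start_urls = ["https://quotes.toscrape.com/"]

    rules = (
        # Follow every pagination link and parse each page it leads to.
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}

If the total page count is known, you can still generate the numbered page URLs up front (as in the video) instead of relying on rules.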

What an interesting video. I have some questions:
1. I'm just a beginner at web scraping. Do you show how we can scrape in the cloud for free? I've seen some videos, but they are all paid.
2. Can you make a video on how to scrape from a Jupyter notebook?

hungduy
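
On question 2 above: a spider can be run from a plain script or a Jupyter notebook cell with CrawlerProcess, without the scrapy crawl command. A minimal sketch (the spider and the output file name are illustrative):

import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


# CrawlerProcess starts the Twisted reactor, which can only run once per
# Python process - in a notebook, restart the kernel before running again.
process = CrawlerProcess(settings={"FEEDS": {"items.json": {"format": "json"}}})
process.crawl(DemoSpider)
process.start()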

Hi! I'm having problems scraping a page. I extract a different number of items each time I run the spider. When I use pagination it loses fewer items. Do you know what the reasons for this could be? Thank you in advance!

HP-stff
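
A varying item count between runs usually means some requests are failing (timeouts, bans, rate limits) and those pages are simply skipped. Attaching an errback to each request makes the failures visible so the missing pages can be retried; a sketch with a placeholder URL, page range and selector:

import scrapy


class DiagnosticSpider(scrapy.Spider):
    name = "diagnostic"

    def start_requests(self):
        for page in range(1, 51):  # placeholder page range
            yield scrapy.Request(
                f"https://example.com/catalog?page={page}",  # placeholder URL
                callback=self.parse,
                errback=self.on_error,
            )

    def parse(self, response):
        for name in response.css("h2.product-name::text").getall():  # placeholder selector
            yield {"name": name}

    def on_error(self, failure):
        # Every request that ultimately failed ends up here; logging it shows
        # exactly which pages were lost in a given run.
        self.logger.error("Request failed: %r", failure)

The downloader/response_status_count entries in the end-of-crawl stats are also worth comparing between runs.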

Hi,
While scraping I am getting 429 Too Many Requests errors in Scrapy.
Can you please advise on how to solve this?
If possible, a video on it would be great.

syedghouse
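
A 429 means the site is rate-limiting the crawl, so the Scrapy-side fixes are to slow down and to retry rate-limited responses. A sketch of the relevant settings (the values are arbitrary starting points to tune):

# settings.py - illustrative values
CONCURRENT_REQUESTS_PER_DOMAIN = 2       # fewer parallel hits on the same site
DOWNLOAD_DELAY = 1.0                     # pause between requests to one domain
AUTOTHROTTLE_ENABLED = True              # back off automatically based on latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # make sure 429 responses are retried
RETRY_TIMES = 5

If that is not enough, rotating proxies (the 08:55 chapter above) are the usual next step.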

But even when the parse method uses recursion, the Scrapy scheduler works asynchronously. It's still nice to iterate over the pages.

DeepDeepEast
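
The scheduler is indeed asynchronous, but with chained pagination only one listing request is ever in flight, because page N+1's URL is not known until page N's response arrives. A rough back-of-the-envelope comparison for a spider that only fetches listing pages, assuming 100 pages, half a second per round trip, and Scrapy's default CONCURRENT_REQUESTS of 16:

pages = 100
round_trip = 0.5   # seconds per response (assumed)
concurrency = 16   # Scrapy's default CONCURRENT_REQUESTS

chained = pages * round_trip                # ~50 s: one page after another
upfront = pages * round_trip / concurrency  # ~3 s in the ideal case: 16 pages at a time
print(chained, upfront)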

Hi, thank you for the video. In the Amazon scraper the range goes from 2 to int(total_pages)+1; shouldn't it go from current_page to int(total_pages)+1? Thanks.

tirullow
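
On the range question above: in this pattern page 1 is already downloaded and parsed by the time the total page count is read from it, so only pages 2 through total_pages still need to be requested; starting at the current page would re-request page 1. A hedged reconstruction of the idea (the URL, selectors and variable names are assumptions, not the exact course code):

import scrapy


class ListingSpider(scrapy.Spider):
    name = "listing"
    start_urls = ["https://example.com/search?page=1"]  # placeholder URL

    def parse(self, response):
        # Page 1 is this response, so extract its items right here.
        yield from self.parse_items(response)

        # Read the total page count from page 1 (placeholder selector),
        # then fan out over the remaining pages in one go.
        total_pages = response.css("span.total-pages::text").get()
        for page in range(2, int(total_pages) + 1):
            yield scrapy.Request(
                f"https://example.com/search?page={page}",
                callback=self.parse_items,
            )

    def parse_items(self, response):
        for title in response.css("h2.item-title::text").getall():  # placeholder selector
            yield {"title": title}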

Hi, what if we don't know the page count? Would it make sense to scan until the next button is no longer present in the page HTML, and then paginate over the number of pages available?

vikasunnikkannan
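
When the site does not print a total anywhere, there are two common options: take the highest number shown in the pagination bar on page 1 and fan out over that, or fall back to chaining through the next button when there are no numbered links at all. A sketch of both, with placeholder URL and selectors:

import scrapy


class UnknownCountSpider(scrapy.Spider):
    name = "unknown_count"
    start_urls = ["https://example.com/products?page=1"]  # placeholder URL

    def parse(self, response):
        yield from self.parse_items(response)

        # The highest page number visible in the pagination bar of page 1
        # (placeholder selector) is normally the last page.
        numbers = response.css("ul.pagination a::text").re(r"\d+")
        if numbers:
            for page in range(2, max(int(n) for n in numbers) + 1):
                yield scrapy.Request(
                    f"https://example.com/products?page={page}",
                    callback=self.parse_items,
                )
        else:
            # No numbered links: fall back to following the next button.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    def parse_items(self, response):
        for name in response.css("h3.name::text").getall():  # placeholder selector
            yield {"name": name}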

Which method is better, this one or LinkExtractor?

ayoubsarab

Hello sir, I have a doubt: if the site has a growing list, then how do we avoid duplicates?

sobinchacko
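
Scrapy's duplicate filter already drops requests to the same URL within a run, but on a growing list the same item can show up on different pages, so the usual fix is to deduplicate items by a stable ID in a pipeline. A sketch, assuming each yielded item carries a unique 'id' field:

from scrapy.exceptions import DropItem


class DedupPipeline:
    """Drop items whose unique id has already been seen during this crawl."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item.get("id")  # assumes the spider yields an 'id' field
        if item_id in self.seen_ids:
            raise DropItem(f"duplicate item skipped: {item_id}")
        self.seen_ids.add(item_id)
        return item

Enable it via ITEM_PIPELINES in settings; to deduplicate across separate runs, persist the seen IDs to disk or a database instead of an in-memory set.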

I think I figured out my issue. How do you do this for crawl spiders? I'm new. Never mind; if you get a chance, could you cover pagination for crawlers on websites that use JavaScript?

Scuurpro
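
For JavaScript-rendered listings the same fan-out idea applies once pages can be rendered in a browser; one common route is the scrapy-playwright plugin (the topic of the "Making Scrapy Playwright fast and reliable" video mentioned above). Roughly, after installing and enabling the plugin as its docs describe, individual requests are marked for rendering via meta; the URL, page range and selector below are placeholders:

import scrapy


class JsPagedSpider(scrapy.Spider):
    name = "js_paged"

    def start_requests(self):
        for page in range(1, 11):  # placeholder page range
            yield scrapy.Request(
                f"https://example.com/listing?page={page}",  # placeholder URL
                meta={"playwright": True},  # ask scrapy-playwright to render this page
            )

    def parse(self, response):
        for title in response.css("h2.title::text").getall():  # placeholder selector
            yield {"title": title}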

Why am I unable to do this on this website?
Error: url is not defined

import scrapy

class ThriftSpider(scrapy.Spider):
    name = 'thrift'
    allowed_domains = ['www.thriftbooks.com']

    def start_requests(self):
        for i in range(1, 15):
            yield

    def parse(self, response):
        yield {
            'title' : title
        }

pythonically
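
The "url is not defined" error in the snippet above comes from yielding a request without first building the URL; defining the page URL inside the loop and extracting the title in parse fixes the structure. A possible completion is sketched below; the URL pattern and the title selector are guesses and would need to be checked against the live site:

import scrapy


class ThriftSpider(scrapy.Spider):
    name = 'thrift'
    allowed_domains = ['www.thriftbooks.com']

    def start_requests(self):
        for i in range(1, 15):
            # Build the page URL before yielding the request - this is the
            # 'url' name the error says is not defined.
            url = f"https://www.thriftbooks.com/browse/?page={i}"  # assumed URL pattern
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for title in response.css("a.book-title::text").getall():  # placeholder selector
            yield {'title': title.strip()}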