Pagination is Bad for Scrapy and How to Avoid it

It is a very common practice to create a new request for the next page to get the next page's data, but this produces inefficient Scrapy spiders. Understand WHY it is bad and how to overcome it (a minimal sketch of both approaches follows below). This video is part of my brand new Scrapy course. To get access to this course now,
⮕ Become a member and get access to all the courses on my site:
To get a taste of Scrapy,
⮕ Take the course on Scrapy Basics for $1 or free 😀
(Use coupon code *FREE* on the checkout page)
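
As a quick illustration, here is a minimal sketch of the two approaches the video contrasts, written against the quotes.toscrape.com practice site (its selectors are real; the spider names and the page count of 10 are just for illustration). The first spider only discovers page N+1 after page N has been downloaded, so pagination runs one request at a time; the second yields every page request up front, so Scrapy's scheduler can download them concurrently.

import scrapy


# Traditional chaining: the next-page request is only created after the
# previous response arrives, so listing pages are fetched one at a time.
class ChainedSpider(scrapy.Spider):
    name = "quotes_chained"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


# Upfront requests: when the page count is known (or can be read from page 1),
# yield all page requests at once and let the scheduler fetch them concurrently.
class UpfrontSpider(scrapy.Spider):
    name = "quotes_upfront"

    def start_requests(self):
        for page in range(1, 11):  # 10 pages, assumed for illustration
            yield scrapy.Request(f"https://quotes.toscrape.com/page/{page}/")

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}

With chaining, total runtime grows roughly as pages times round-trip time; with upfront requests, downloads overlap up to the CONCURRENT_REQUESTS limit.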

🔓 SOURCE CODE

📠 GEAR I USE and RECOMMEND
Note: These are affiliate links. I get a small commission if you click on them and buy, at no extra cost to you.
⮕ Røde PodMic
⮕ Audio Interface - Focusrite Scarlett 2i2

📕 CHAPTERS
00:00 Traditional Paging
00:37 What's Wrong with this Approach
01:56 Fixing a Simple Spider
05:15 Fixing Amazon Scraper
08:55 Why Proxies are Useful

#webscraping #python #codeRECODE #upendra #scrapy

-~-~~-~~~-~~-~-
Please watch: "Making Scrapy Playwright fast and reliable"
-~-~~-~~~-~~-~-
Comments

This video is from my course on Scrapy.
I edited it for YouTube, so watching this one video on its own still makes sense. Hope it's useful :-)

codeRECODE

Thank you sir, I had a task at work today about pagination and remembered to check this vid. The code worked perfectly and runtime was reduced significantly.

teodortodorov

Thanks for sharing this tip, it's very useful for utilizing the real async power of Scrapy.

aleksandarboshevski

To develop the right mindset for fully utilizing Scrapy's async capabilities, this method should probably be taught as the default from the beginning, because once you form certain coding/thinking habits it is much harder to change that state of mind later.

aleksandarboshevski

Thank you for making a video on this great topic!

subrinalazad

Nice way to look at it! Thanks for the video

carloscampos

Wow, I didn't know about that. I always thought the next_page approach was best practice. I didn't realize there was a better one.

DittoRahmat

I always use this approach when it's available... but I didn't know it actually speeds up performance 😂 Thanks for the explanation ❤️ Love your content.

jagdisho

Great idea, but what about CrawlSpider?

diegovargas
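
On the CrawlSpider question: a pagination Rule with a LinkExtractor works, but it still discovers each next-page link only after the previous page has been downloaded, so it behaves like the chained approach above. A minimal sketch against the same practice site (the rule and callback names are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    start_urls = ["https://quotes.toscrape.com/"]

    rules = (
        # Follow every pagination link and parse each page it leads to.
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}

If the total page count is known, you can still generate the numbered page URLs up front (as in the video) instead of relying on rules.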

What an interesting video. I have some questions:
1. I'm just a beginner at web scraping. Do you show how we can scrape in the cloud for free? I've seen some videos, but they are all paid.
2. Can you make a video on how to scrape from a Jupyter notebook?

hungduy
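
On question 2 above: a spider can be run from a plain script or a Jupyter notebook cell with CrawlerProcess, without the scrapy crawl command. A minimal sketch (the spider and the output file name are illustrative):

import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


# CrawlerProcess starts the Twisted reactor, which can only run once per
# Python process - in a notebook, restart the kernel before running again.
process = CrawlerProcess(settings={"FEEDS": {"items.json": {"format": "json"}}})
process.crawl(DemoSpider)
process.start()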

Hi! I'm having problems scraping a page. I extract a different number of items each time I run the spider. When I use pagination it loses fewer items. Do you know what the reasons for this could be? Thank you in advance!

HP-stff
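
A varying item count between runs usually means some requests are failing (timeouts, bans, rate limits) and those pages are simply skipped. Attaching an errback to each request makes the failures visible so the missing pages can be retried; a sketch with a placeholder URL, page range and selector:

import scrapy


class DiagnosticSpider(scrapy.Spider):
    name = "diagnostic"

    def start_requests(self):
        for page in range(1, 51):  # placeholder page range
            yield scrapy.Request(
                f"https://example.com/catalog?page={page}",  # placeholder URL
                callback=self.parse,
                errback=self.on_error,
            )

    def parse(self, response):
        for name in response.css("h2.product-name::text").getall():  # placeholder selector
            yield {"name": name}

    def on_error(self, failure):
        # Every request that ultimately failed ends up here; logging it shows
        # exactly which pages were lost in a given run.
        self.logger.error("Request failed: %r", failure)

The downloader/response_status_count entries in the end-of-crawl stats are also worth comparing between runs.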

Hi,
While scraping I am getting 429 Too Many Requests errors in Scrapy.
Can you please advise on how to solve this?
If possible, a video on it would be great.

syedghouse
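
A 429 means the site is rate-limiting the crawl, so the Scrapy-side fixes are to slow down and to retry rate-limited responses. A sketch of the relevant settings (the values are arbitrary starting points to tune):

# settings.py - illustrative values
CONCURRENT_REQUESTS_PER_DOMAIN = 2       # fewer parallel hits on the same site
DOWNLOAD_DELAY = 1.0                     # pause between requests to one domain
AUTOTHROTTLE_ENABLED = True              # back off automatically based on latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # make sure 429 responses are retried
RETRY_TIMES = 5

If that is not enough, rotating proxies (the 08:55 chapter above) are the usual next step.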

But even when the parse method uses recursion, the Scrapy scheduler works asynchronously. It's still nice to iterate over the pages.

DeepDeepEast
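
The scheduler is indeed asynchronous, but with chained pagination only one listing request is ever in flight, because page N+1's URL is not known until page N's response arrives. A rough back-of-the-envelope comparison for a spider that only fetches listing pages, assuming 100 pages, half a second per round trip, and Scrapy's default CONCURRENT_REQUESTS of 16:

pages = 100
round_trip = 0.5   # seconds per response (assumed)
concurrency = 16   # Scrapy's default CONCURRENT_REQUESTS

chained = pages * round_trip                # ~50 s: one page after another
upfront = pages * round_trip / concurrency  # ~3 s in the ideal case: 16 pages at a time
print(chained, upfront)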

Hi, thank you for the video. In the Amazon scraper the range goes from 2 to int(total_pages)+1; shouldn't it go from current_page to int(total_pages)+1? Thanks.

tirullow
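
On the range question above: in this pattern page 1 is already downloaded and parsed by the time the total page count is read from it, so only pages 2 through total_pages still need to be requested; starting at the current page would re-request page 1. A hedged reconstruction of the idea (the URL, selectors and variable names are assumptions, not the exact course code):

import scrapy


class ListingSpider(scrapy.Spider):
    name = "listing"
    start_urls = ["https://example.com/search?page=1"]  # placeholder URL

    def parse(self, response):
        # Page 1 is this response, so extract its items right here.
        yield from self.parse_items(response)

        # Read the total page count from page 1 (placeholder selector),
        # then fan out over the remaining pages in one go.
        total_pages = response.css("span.total-pages::text").get()
        for page in range(2, int(total_pages) + 1):
            yield scrapy.Request(
                f"https://example.com/search?page={page}",
                callback=self.parse_items,
            )

    def parse_items(self, response):
        for title in response.css("h2.item-title::text").getall():  # placeholder selector
            yield {"title": title}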

Hi, what if we don't know the page count? Would it make sense to scan until the next button is no longer present in the page HTML, and then paginate over the number of pages available?

vikasunnikkannan
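
When the site does not print a total anywhere, there are two common options: take the highest number shown in the pagination bar on page 1 and fan out over that, or fall back to chaining through the next button when there are no numbered links at all. A sketch of both, with placeholder URL and selectors:

import scrapy


class UnknownCountSpider(scrapy.Spider):
    name = "unknown_count"
    start_urls = ["https://example.com/products?page=1"]  # placeholder URL

    def parse(self, response):
        yield from self.parse_items(response)

        # The highest page number visible in the pagination bar of page 1
        # (placeholder selector) is normally the last page.
        numbers = response.css("ul.pagination a::text").re(r"\d+")
        if numbers:
            for page in range(2, max(int(n) for n in numbers) + 1):
                yield scrapy.Request(
                    f"https://example.com/products?page={page}",
                    callback=self.parse_items,
                )
        else:
            # No numbered links: fall back to following the next button.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    def parse_items(self, response):
        for name in response.css("h3.name::text").getall():  # placeholder selector
            yield {"name": name}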

Which method is better, this one or LinkExtractor?

ayoubsarab

Hello sir, I have a doubt: if the site has a growing list, then how do we avoid duplicates?

sobinchacko
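
Scrapy's duplicate filter already drops requests to the same URL within a run, but on a growing list the same item can show up on different pages, so the usual fix is to deduplicate items by a stable ID in a pipeline. A sketch, assuming each yielded item carries a unique 'id' field:

from scrapy.exceptions import DropItem


class DedupPipeline:
    """Drop items whose unique id has already been seen during this crawl."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item.get("id")  # assumes the spider yields an 'id' field
        if item_id in self.seen_ids:
            raise DropItem(f"duplicate item skipped: {item_id}")
        self.seen_ids.add(item_id)
        return item

Enable it via ITEM_PIPELINES in settings; to deduplicate across separate runs, persist the seen IDs to disk or a database instead of an in-memory set.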

I think I figured out my issue. How do you do this for crawl spiders? I'm new. Never mind; if you get a chance, could you cover pagination for crawlers on websites that use JavaScript?

Scuurpro
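
For JavaScript-rendered listings the same fan-out idea applies once pages can be rendered in a browser; one common route is the scrapy-playwright plugin (the topic of the "Making Scrapy Playwright fast and reliable" video mentioned above). Roughly, after installing and enabling the plugin as its docs describe, individual requests are marked for rendering via meta; the URL, page range and selector below are placeholders:

import scrapy


class JsPagedSpider(scrapy.Spider):
    name = "js_paged"

    def start_requests(self):
        for page in range(1, 11):  # placeholder page range
            yield scrapy.Request(
                f"https://example.com/listing?page={page}",  # placeholder URL
                meta={"playwright": True},  # ask scrapy-playwright to render this page
            )

    def parse(self, response):
        for title in response.css("h2.title::text").getall():  # placeholder selector
            yield {"title": title}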

Why am I unable to do this on this website?
Error: url is not defined

import scrapy

class ThriftSpider(scrapy.Spider):
    name = 'thrift'
    allowed_domains = ['www.thriftbooks.com']

    def start_requests(self):
        for i in range(1, 15):
            yield

    def parse(self, response):
        yield {
            'title' : title
        }

pythonically
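
The "url is not defined" error in the snippet above comes from yielding a request without first building the URL; defining the page URL inside the loop and extracting the title in parse fixes the structure. A possible completion is sketched below; the URL pattern and the title selector are guesses and would need to be checked against the live site:

import scrapy


class ThriftSpider(scrapy.Spider):
    name = 'thrift'
    allowed_domains = ['www.thriftbooks.com']

    def start_requests(self):
        for i in range(1, 15):
            # Build the page URL before yielding the request - this is the
            # 'url' name the error says is not defined.
            url = f"https://www.thriftbooks.com/browse/?page={i}"  # assumed URL pattern
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for title in response.css("a.book-title::text").getall():  # placeholder selector
            yield {'title': title.strip()}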