Fastest Python Web Scraper - Exploring Sessions, Multiprocessing, Multithreading, and Scrapy

In this video, we will build a fast web scraper. We will begin with BeautifulSoup.
🚀 The first script takes 128 seconds; after optimization, it takes as little as 2.5 seconds.
Finally, we will create a Scrapy spider without any optimization and see what kind of results we get.
We will use BeautifulSoup, Requests, Sessions, Multithreading, Multiprocessing, and Scrapy (a minimal sketch of the session + threading combination follows the chapter list below).
You can jump to the sections you like:
00:31 Scraper Objective
00:44 Creating Scraper with Requests+BS4
09:20 First Run
10:07 Sessions
13:58 Multiprocessing
17:22 Multithreading
22:36 Scrapy Without Optimization
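
To give a feel for where the speed-up comes from, here is a minimal sketch of the session + thread-pool combination discussed in the video. The target site, selector, and worker count are placeholders (books.toscrape.com is a public practice site), not the exact code from the screen:

```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# placeholder URL list -- swap in whatever pages you actually need
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]

# one Session reuses the underlying TCP connection instead of reconnecting per request
session = requests.Session()

def scrape(url):
    response = session.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # placeholder selector: every book title on a listing page
    return [a["title"] for a in soup.select("article.product_pod h3 a")]

if __name__ == "__main__":
    # threads overlap the network waits, which is where most of the 128 seconds went
    with ThreadPoolExecutor(max_workers=10) as executor:
        for titles in executor.map(scrape, urls):
            print(titles)
```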

----------------------------------------------
What is Web Scraping?
In a nutshell: Web Scraping = Getting Data from Websites with Code
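
As a tiny, hedged illustration (books.toscrape.com is a public practice site, used here only as an example target):

```python
import requests
from bs4 import BeautifulSoup

# download one page and pull a couple of values out of its HTML
response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))    # the page <title>
print(soup.select_one("h3 a")["title"])   # the first book title on the page
```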

What is Scrapy?
Scrapy is a Python library that makes web scraping very powerful, fast, and efficient.

There are other web scraping libraries too, such as BeautifulSoup. However, when it comes to true power and flexibility, Scrapy is the most capable.
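
For reference, here is a bare-bones spider in roughly the "no optimization" shape the video ends with; the start URL and selectors are placeholders, not the project built on screen:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # one item per book on the listing page
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # follow pagination; Scrapy schedules these requests concurrently on its own
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as a single file, it runs with: scrapy runspider books_spider.py -o books.json
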
Why Learn Scrapy?
- Most powerful library for scraping
- Easy to master
- Cross-platform: doesn't matter which OS you are using
- Cloud-ready: Can be run on the cloud with a free account

Most important: you would be able to earn money by taking up web scraping gigs as a freelancer.

#scrapy #fast #beautifulsoup #multiprocessing #multithreading

Please watch: "Making Scrapy Playwright fast and reliable"
Comments

Very well explained and structured video. I love the way you took us from the unoptimized version all the way to Scrapy. Thank you for this video, it was very helpful!

anamashraf

Hello everyone. This time the text is smaller than in my other videos. How is the readability? Is it okay, or would larger be better?
Looking forward to your comments.
PS: Please subscribe and like (or dislike) this video 🙂

codeRECODE

Hi Upendra, this is very useful, thanks a lot

ارمینمحمدجانی

When will you be uploading your new course, AI Agent Lecture? I am very excited and waiting eagerly for it.

danish

Thanks a lot for this video, it helped me solve a problem 💪🏿

ataimebenson

Keep up the good work, thanks for the video

bruce

Wow! Awesome video. Would you please let me know if it is possible to perform both multiprocessing and multithreading at the same time?

billygene
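
It can be combined; one hedged way is a process pool where each worker process runs its own thread pool. Everything below (URLs, chunk and worker counts, the scrape function) is a placeholder sketch, not code from the video:

```python
import requests
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def scrape(url):
    # placeholder "work": just return the page size
    return len(requests.get(url).text)

def scrape_chunk(urls):
    # runs inside one worker process; its threads overlap the network waits
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(scrape, urls))

if __name__ == "__main__":
    all_urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 21)]
    chunks = [all_urls[i::4] for i in range(4)]  # 4 chunks, one per process
    with ProcessPoolExecutor(max_workers=4) as pool:
        for sizes in pool.map(scrape_chunk, chunks):
            print(sizes)
```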

Hm, sadly Scrapy is single-threaded and Selenium is blocking if it's called within a Spider, so the Spiders will not execute concurrently then (if they use Selenium instead of Requests to resolve a URL). I wonder how it is possible to crawl that fast with Scrapy while also using Selenium for HTML rendering. Great video btw!

zone
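
One common workaround is to swap Selenium for an async-friendly renderer such as scrapy-playwright (the library behind the follow-up video mentioned above), so rendering does not block the Twisted reactor. A sketch of its standard setup, assuming the scrapy-playwright package is installed:

```python
# settings.py (sketch) -- hand page rendering to Playwright instead of a blocking Selenium call
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# in the spider, requests that need a rendered page are marked like this:
#   yield scrapy.Request(url, meta={"playwright": True})
```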

Hi Upendra,

Thanks for the tutorial.

Can concurrent futures be used to optimize a "while True" loop with an if-then-break at the end?

I saw your tutorial and also did some googling, but couldn't find any example.

Most of the examples are 'for' loops or 'while' loops with a predefined range.

DittoRahmat
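
There is no direct way to parallelise a truly open-ended "while True" loop, because the break condition is only known after a result comes back. A hedged pattern that often works is to fetch speculative batches and stop once a batch contains the end signal; the URL and the status-code stop condition below are placeholders:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(page):
    return requests.get(f"https://books.toscrape.com/catalogue/page-{page}.html")

results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    page = 1
    while True:
        # fetch the next 10 pages concurrently, accepting that we may overshoot the end
        batch = list(executor.map(fetch, range(page, page + 10)))
        results.extend(r for r in batch if r.status_code == 200)
        if any(r.status_code != 200 for r in batch):
            break  # a request ran past the last page, so we are done
        page += 10

print(len(results), "pages fetched")
```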

The video I currently need. Just curious, can you make Scrapy faster than that?

chadGPT

Hi Sir,
Awesome video!
I am getting:
"It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the setting for information on how to handle this deprecation.
return cls(crawler)"
Can you tell me about this? Please!

And also, please tell me, where will I get the output file?

vijay
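
On the output-file part of the question: Scrapy writes nothing by default; the file appears wherever you ask for it. A hedged sketch (the spider and file names are placeholders):

```python
# option 1: ask for a file on the command line; it is written in the directory
# you run the command from:
#   scrapy crawl books -o output.json
#
# option 2: declare it once in settings.py with the FEEDS setting
FEEDS = {
    "output.json": {"format": "json", "overwrite": True},
}
```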

I encountered a scenario. While using the scraper_helper library to run the spider directly from a script in VS Code, I get the error below:
"ImportError: attempted relative import with no known parent package"

I have to import the items file inside the spider, which is why it throws this error. Any solutions for this?

MohitAswani
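
A hedged sketch of the usual fix: launch the spider through CrawlerProcess from the project root and import the items module with an absolute path rather than a relative one ("myproject", "books", and "BookItem" are placeholder names):

```python
# run_spider.py, placed next to scrapy.cfg in the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# absolute import; inside the spider use "from myproject.items import BookItem"
# instead of "from ..items import BookItem"
from myproject.spiders.books import BooksSpider

process = CrawlerProcess(get_project_settings())
process.crawl(BooksSpider)
process.start()
```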

Sir, please make some videos on the server-side development part.

ashish

Do you have a video on how to implement multithreading in Scrapy?

ataimebenson
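
Scrapy does not use threads for its requests; it is asynchronous on top of Twisted, so "more parallelism" is usually a settings change rather than a threading change. A sketch with example values only:

```python
# settings.py -- concurrency knobs (values are examples, not recommendations)
CONCURRENT_REQUESTS = 32             # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # cap per site
DOWNLOAD_DELAY = 0                   # seconds to wait between requests to one site
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the rate to the server
```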

Hello friend, congratulations on such an excellent video.

Friend, I have a problem, and I don't know if it can be solved this way; I would appreciate your guidance.

I am creating a web service with FastAPI, which has 2 endpoints where I extract data from 2 websites:
.... /demo1
.... /demo2

When I make a request from Postman, for example, to demo1, the browser opens and everything is fine; it does the extraction and works perfectly.

Following the Postman example, if I make a request to demo1 and right away another to demo2, demo2 must wait for demo1 to finish before it opens the browser and does its extraction.

Can you please guide me on how to solve that?
I hope you can help me.
Greetings.

nelsongomez
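
A likely cause is that the blocking scrape runs inside an async def endpoint and therefore blocks the event loop. A hedged sketch of one fix, moving the blocking work onto a thread so /demo1 and /demo2 can run at the same time (run_scraper_1 and run_scraper_2 are placeholders for whatever opens the browser and extracts the data):

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()

def run_scraper_1():
    ...  # blocking browser/extraction work for site 1 (placeholder)

def run_scraper_2():
    ...  # blocking browser/extraction work for site 2 (placeholder)

@app.get("/demo1")
async def demo1():
    # to_thread keeps the event loop free, so the other endpoint can start meanwhile
    return await asyncio.to_thread(run_scraper_1)

@app.get("/demo2")
async def demo2():
    return await asyncio.to_thread(run_scraper_2)
```

Declaring the endpoints with plain def (no async) has a similar effect, since FastAPI then runs them in its own thread pool.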

Is it possible to automate the CLI using Scrapy?

hayathbasha

Python is not multithreaded unfortunately

CherifRahal