Web Scraping 5: Reading All Blogs on One Page with Scrapy & Python (Scrapy Series)

In this video, we will learn how to get all the blog posts from the archive page of a WordPress blog.

This is the fifth tutorial in a series that teaches web scraping with Scrapy, a powerful scraping framework for Python.

This is part of the Scrapy Crash Course. Take the full crash course for FREE:

Once you know the basics, learn how to Download all Files from any site with Scrapy.
Take the FREE course:

What is Web Scraping?
In a nutshell: Web Scraping = Getting Data from Websites with Code

What is Scrapy?
Scrapy is a Python framework that makes web scraping powerful, fast, and efficient.
Other web scraping libraries exist too, such as BeautifulSoup. However, when it comes to power and flexibility, Scrapy stands out.

Why Learn Scrapy?
- Most powerful library for scraping
- Easy to master
- Cross-platform: doesn't matter which OS you are using
- Cloud-ready: Can be run on the cloud with a free account

Most important: you could start earning right away by taking on web scraping gigs as a freelancer.

Comments

5:06 Update the code to use the correct selector
summary = ::text').getall())
Thanks to @Byte Riddler for pointing it out.

codeRECODE

Nice exercise. I extracted the paragraph using BeautifulSoup.

kenrosenberg

Please *SUBSCRIBE* and *Like* to make the YouTube algorithm happy!

Please leave a comment with your questions, suggestions, or just a word of appreciation.

codeRECODE

thank you very much, NICE course and explanation

mahrouch

Hello,
Thank you for sharing your knowledge. I have a couple of questions though.

I noticed that you quickly scrolled past the "Introduce Yourself" summary being null.
I have been using XPath instead of the CSS selector, and I managed to get a result for that summary, but it is only the first word. I assume this is because of the "em" tags inside the "p" tag. I have tried using multiple path selections in XPath with no luck (maybe I am not using them correctly?).

Question:
1. How would we go about getting the text in the p tag when it has multiple tags inside it?
(Note: I have omitted the leading bracket of the HTML tags.)
E.g.: p> some text i> IS /i> em> important /em> and some text sub>is /sub> not /p>

So far the only solution I can see is making multiple selections and joining the strings. Surely there is an easier or better way?

The resulting JSON file contains many Unicode entries (non-breaking spaces, in this case).
The quotes JSON also contains multiple Unicode entries for left and right double quotes.

Question:
2. How would we go about removing these, either before or after writing the JSON file?
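
One way to approach question 2, sketched with the standard library only: replace the unwanted characters before the item is yielded. The function name `clean_text` and the sample string are hypothetical. Separately, if the goal is only to stop the JSON feed from showing `\uXXXX` escapes, Scrapy's `FEED_EXPORT_ENCODING = "utf-8"` setting writes the characters as-is; the `json.dumps` calls below illustrate the same escaped-versus-readable difference outside Scrapy.

```python
import json

# Map the characters mentioned in the question to plain ASCII equivalents.
REPLACEMENTS = str.maketrans({
    "\u00a0": " ",   # non-breaking space
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def clean_text(value: str) -> str:
    """Replace the mapped characters, then collapse repeated whitespace."""
    return " ".join(value.translate(REPLACEMENTS).split())


# Hypothetical scraped value containing curly quotes and a non-breaking space.
raw = "\u201cThe\u00a0world as we have created it\u201d"
cleaned = clean_text(raw)

# json.dumps escapes non-ASCII characters by default;
# ensure_ascii=False keeps them readable instead.
escaped = json.dumps({"quote": raw})
readable = json.dumps({"quote": raw}, ensure_ascii=False)

print(cleaned)
print(escaped)
print(readable)
```

Cleaning before yielding keeps the exported file tidy from the start; post-processing the JSON file afterwards would also work but adds an extra step.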

Once again, thank you for sharing your knowledge and experience.

ByteRiddler

Sir, how were you able to format multiple boxes of sentences?
I tried Ctrl+Shift+Alt+PgDn,
but it selected only the length of the first box (title).

SaurabhKumar-ygfe

Hello sir, I tried following your steps, but when SelectorGadget runs for the title, the class name that I get is .entry-title, and it's not yielding any output.
When I changed it to .entry-title a, I got the output.
I don't understand why it's showing only .entry-title and not 'a' at the end.
Can you please check and confirm? Thanks a lot :)

virajpatel