Web Scraping with Python - Get URLs, Extract Data


This is the third video in the series of scraping data for beginners. We're going to add functionality to scrape from the actual product pages rather than just the search page. Adding in dataclasses will also help us handle our data.
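As a taste of the dataclass approach the video covers, here is a minimal sketch. The class and field names (`Product`, `title`, `price`, `sku`) are illustrative, not the video's exact schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class Product:
    # illustrative fields; the video's actual schema may differ
    title: str
    price: float
    sku: str

p = Product(title="Trail Jacket", price=89.99, sku="TJ-100")
print(asdict(p))  # {'title': 'Trail Jacket', 'price': 89.99, 'sku': 'TJ-100'}
```

`asdict()` turns each record into a plain dict, which makes it trivial to hand off to `csv.DictWriter` later in the pipeline.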

This is a series so make sure you subscribe to get the remaining episodes as they are released!

If you are new, welcome! I am John, a self-taught Python (and Go, kinda..) developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly content.

:: Links ::

:: Disclaimer ::
Some/all of the links above are affiliate links; I receive a small commission should you choose to purchase any services or items through them.
Comments

John, you've made me re-enjoy scraping. I gave up due to how frustrating most tutorials are and the lack of real-world application with all of those stupid scraping demo sites. Thanks for all you do man

x_nietoh

Excellent video, great learning experience

eduardop

Hi, thanks a lot, the video is super clear and rich. I'm about to apply it to a similar website to grab details on products.

charlottegauthier

Excellent video series, much appreciated. Thank you for posting.

daveys

Another great presentation! Neat use of kwargs. Also, a very relevant use of data classes.

thebuggser

thank you! we need more of this sh!t
and I hope for a series like this on BeautifulSoup too

Lorem

"parse_page(html)" from lesson 2 suddenly became "parse_search_page(html: HTMLParser):" in lesson 3 without any explanation. Anyway great tutorial as well as a whole series. Very good for beginners.

Mac_Edits

you are a genius, man, thank you very much

abdifatahabdi

This is very helpful! I appreciate it a lot.

milyastroc

If we combine Playwright with this, can we then basically scrape any dynamic site (e.g. social media websites)?
Thank you so much John, this series is very fulfilling.

AliceShisori

Hi, kindly make a video on Python with Selenium, because no updated ChromeDriver is available and I don't know how to run the script now.
Thanks

muhammadhaddid

Good series! Personally I think the yield is a nice touch but probably not needed here given the weight of the script (and the generator itself doesn't help iteration, which was described as the reason for its inclusion); the dataclass is overkill vs a dict (we end up converting it back to a dict anyway); and so is **kwargs vs a single keyword argument that defaults to something like False or None (**kwargs gives the impression there may be more than one option; it's easier to use a single argument that defaults to a value when not passed in). Got a subscribe from me, thank you :)

darylkell
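To illustrate the commenter's point about `**kwargs` versus a single defaulted keyword, a small sketch (function and parameter names are made up for illustration, not from the video):

```python
# **kwargs version: flexible, but the signature hides the real interface
def export_kwargs(data: dict, **kwargs):
    if kwargs.get("save"):
        return f"saved {len(data)} fields"
    return "not saved"

# single keyword with a default: the signature documents itself
def export_simple(data: dict, save: bool = False):
    if save:
        return f"saved {len(data)} fields"
    return "not saved"

print(export_kwargs({"a": 1}, save=True))  # saved 1 fields
print(export_simple({"a": 1}))             # not saved
```

Both behave identically at the call site; the second makes the one supported option visible to readers and to autocomplete.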

Man, your videos are great. Your videos on Playwright have really been helpful. I was able to follow your videos and then make my own Playwright script in my project, until I got stuck dealing with dynamic pop-ups that I am unable to get past. I am supposed to enter a piece of data in those pop-ups (not captcha stuff), but I just can't make it work. It would help if you could cover dealing with dynamic pop-ups. Thanks.

KushalSharmatheOne

From this video on it is not understandable for beginners, since you decided for some reason to change all the code

rovolqg

Great video! Question: How can I find the extension that provides you with the errors next to the code?

juampivitalevi

Also, kindly add a product URLs column for each product and make it clickable when writing to the CSV

jaswanth
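On the clickable-URL request above: CSV itself has no notion of links, but spreadsheet apps (Excel, LibreOffice) typically evaluate an `=HYPERLINK(...)` formula written into a cell. A sketch, with illustrative column names and data:

```python
import csv
import io

rows = [{"name": "Trail Jacket", "url": "https://example.com/p/1"}]

buf = io.StringIO()  # stand-in for open("products.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["name", "link"])
for r in rows:
    # Excel/LibreOffice render =HYPERLINK(url, label) as a clickable cell
    writer.writerow([r["name"], f'=HYPERLINK("{r["url"]}", "view product")'])

print(buf.getvalue())
```

If the file will only ever be read by other programs, a plain URL column is safer; the formula trick is purely for humans opening the CSV in a spreadsheet.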

Great video! You've got a subscriber. After trying out the code a couple of times, I came across ReadTimeout error. How do we fix that?

abhin.v
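On the ReadTimeout question above: it usually means the server was slow to respond, and the two common mitigations are raising the client-side timeout and retrying with backoff. A generic retry sketch (the helper name and parameters are my own, not from the video):

```python
import time

def fetch_with_retry(fetch, retries=3, backoff=1.0, exceptions=(Exception,)):
    """Call fetch(); on a listed exception, sleep and try again."""
    for attempt in range(retries):
        try:
            return fetch()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: let the error propagate
            time.sleep(backoff * (attempt + 1))  # linear backoff

# With httpx you would also raise the client-side timeout itself, e.g.:
#   resp = httpx.get(url, timeout=httpx.Timeout(30.0))
# and pass exceptions=(httpx.ReadTimeout,) to the helper above.
```

Being polite with a delay between retries also reduces the chance the site is throttling you in the first place.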

Based on one of your previous videos I figured out how to get nested objects from tricky divs. Thank you!
Could you please advise how, in the function below, I can get not only <p> elements but also <h2>, <pre>, and <ul><li> elements?
Should it be some sort of pipe-like syntax, "div.article-formatted-body > div > p | h2 | pre | ul | li |"?

def read_article(html):
    article_body = html.css("div.article-formatted-body > div > p")
    paragraphs = [i.text() for i in article_body]
    print(*paragraphs, sep='\n')

samoylov
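On the selector question above: CSS has no pipe operator; selector lists are comma-separated, so with selectolax something like `html.css("div.article-formatted-body > div > p, div.article-formatted-body > div > h2, ...")` should match all the wanted tags in document order (each alternative needs the full path repeated). As a dependency-free illustration of collecting text from several tag types in order, a stdlib sketch:

```python
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collect the text of a fixed set of tags, in document order."""
    WANTED = {"p", "h2", "pre", "li"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # are we inside a wanted tag?
        self.chunks = []  # one text chunk per wanted element

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.depth += 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag in self.WANTED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data

parser = BodyText()
parser.feed("<div><h2>Intro</h2><p>First</p><ul><li>one</li><li>two</li></ul></div>")
print(parser.chunks)  # ['Intro', 'First', 'one', 'two']
```

This is only a demonstration of the idea; for real scraping the comma-separated selector list with selectolax is the shorter route.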

Shouldn't the item number be an integer and the price a float?

atatekeli

Nice job! Is there a way to put this whole thing in a cron job or scheduler to run intermittently?

acharafranklyn
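On the scheduling question above: a crontab entry is the usual answer; for a pure-Python alternative, a minimal in-process runner (the helper and its parameters are my own sketch, not from the video):

```python
import time

def run_periodically(job, interval_seconds, max_runs=None):
    """Minimal in-process scheduler: run the job, sleep, repeat."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs

# For unattended runs, a crontab entry is usually simpler, e.g. every 6 hours:
#   0 */6 * * * /usr/bin/python3 /path/to/scraper.py
```

Cron survives reboots and doesn't keep a Python process alive between runs, so prefer it for anything long-lived; the loop is fine for quick experiments.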