I Don't Waste Time Parsing HTML (So I do THIS)

OK, so sometimes it can't be avoided, and there is nothing inherently BAD about parsing loads of HTML to scrape data, but the reason I keep banging on about this is that so often there is just a much better way to get all the data you need, and often more.

In this video I will walk you through a script I wrote to grab some data from a modern website, show you how and why I made the decisions I did in my code, and share some data that you can't get from the HTML.

There are some other useful tips along the way too, including more of my current favorite Python package, Pydantic.
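
A minimal sketch of the core idea, not the exact script from the video: find the JSON endpoint the page itself calls (via the browser's network tab), request it directly, and validate the response with Pydantic. The URL, field names, and response shape below are hypothetical placeholders.

```python
from typing import List, Optional

import requests
from pydantic import BaseModel


class Product(BaseModel):
    name: str
    price: float
    sku: Optional[str] = None  # fields the API sometimes omits


# Hypothetical endpoint, found by watching the network tab
API_URL = "https://example.com/api/products?page=1"

resp = requests.get(API_URL, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()

products: List[Product] = [Product(**item) for item in resp.json()["products"]]
for p in products:
    print(p.name, p.price)
```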

Comments

Thanks John, you are helping me so much with my final project through these videos, and at the same time they are very entertaining!

juanluidos

Nice, it's so great to see how someone who is dedicated can advance their skills and move on to better tools and knowledge.

kusunagi

I've been watching a lot of your awesome videos this week on Scrapy, SQLite, cron jobs, BeautifulSoup, Selenium, Insomnia, etc., and I'd love to see you make a video tying all of these tools together in one example!

brothermalcolm

Your videos are really informative.
Let me share the approach I use to scrape a site with the best possible optimization:
• first I check the network tab
• then the script tags in the page source (see the sketch below)
• if neither of those works, I scrape the elements directly.
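
A rough sketch of that second step, assuming the page embeds its state as JSON in a script tag. The id "__NEXT_DATA__" is just a common example from Next.js sites; adjust the selector to the real page.

```python
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/some-page").text
soup = BeautifulSoup(html, "html.parser")

# Many sites ship their page state as JSON inside a <script> tag
tag = soup.find("script", id="__NEXT_DATA__")
if tag is not None:
    data = json.loads(tag.string)  # the whole page state as a dict
    print(list(data.keys()))
else:
    print("No embedded JSON found; fall back to scraping elements.")
```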

MuhammadHassan-smbf

USPS is the United States Postal (parcel) Service. They are using that to pass on descriptors of the product for shipping requirements.

madusan

Super interesting video, thank you so much for including Pydantic and requests_html!
About the Optional checkbox you mention at 8:28 - it currently only applies to the last Pydantic model in the generator. (I think that's a bug.)
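
If the generator really does only mark the last model, the manual fix is straightforward: give every field that can be missing or null an Optional type with a default, on nested models too. A minimal illustration with made-up fields:

```python
from typing import Optional

from pydantic import BaseModel


class Seller(BaseModel):
    name: str
    rating: Optional[float] = None   # may be absent in the JSON


class Listing(BaseModel):
    title: str
    seller: Optional[Seller] = None  # nested models need the fix too


Listing(title="widget")  # validates even though "seller" is missing
```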

silkogelman

I am learning a lot from your vids, thank you and keep going! You do it better than 90% of the people demonstrating scraping techniques with Python. Nice, neat, extendable, Pythonic code. Love it!

pypypy

Plebs, I just curl articles and parse the HTML with my eyes

tasosm.

You provide very valuable information. I can't thank you enough 😊

ambroseoyamo

I was stuck on something once and gave up, but I just saw this video, searched for the API call, and found it. Love it. I was just watching for fun, but it's been really educational.

Yatin

Awesome. We scrape several websites daily and compare the stock with the previous day's. The number of sales is the difference in stock from the previous day. Multiply it by the price and you have insight into the revenue. This process is automated and the results are stored in the cloud (BigQuery). Our procurement department is very happy with this data.
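
A rough sketch of that revenue estimate, assuming the daily scrapes land in a table with one row per product per day; the column names are made up, and in their setup the data lives in BigQuery rather than a local DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["widget", "widget", "gadget", "gadget"],
    "date":    ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"],
    "stock":   [100, 93, 50, 48],
    "price":   [9.99, 9.99, 24.50, 24.50],
})

df = df.sort_values(["product", "date"])
# Units sold = drop in stock since the previous day's scrape
df["sold"] = (-df.groupby("product")["stock"].diff()).clip(lower=0)
df["revenue"] = df["sold"] * df["price"]
print(df.dropna(subset=["sold"]))
```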

putyah

I really enjoyed this. I'm a completely self-taught coder in HTML and JavaScript and wrote some in-game tools for a game I used to play. The game has a massive amount of data, and while they encourage people to write tools/addons for it, they have a very strict rule about not initiating any calls to the server, so I wouldn't have been able to use this approach as-is. However, the approach still has some very useful things that I wish I'd known about when I was writing my HTML scraping code. I may well revisit the game at some point to see if I can improve my code using what I learned here.

daveturnbull

that JSON to Pydantic page is awesome!

bn_ln

Big fan of these different parsing videos. I'm trying to figure out which approach is best for a site like baseballsavant (the leaderboard for exit velocity), as I couldn't find the JSON data through the network tab.
Also, a random content request: how different does the Python look when you are working inside a password-protected site? I assume you have to make the Python script log in to establish a cookie or something, but I'm not sure beyond that. Thanks!
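
On the password-protected question, the usual pattern (not from the video) is a requests.Session that posts the login form once, keeps whatever cookies come back, and reuses them on every later request. The endpoint and form field names below are guesses; check the real login request in the network tab.

```python
import requests

session = requests.Session()
session.post(
    "https://example.com/login",                    # hypothetical endpoint
    data={"username": "me", "password": "secret"},  # match the real form fields
)

# The session now sends the auth cookie automatically
resp = session.get("https://example.com/members/data.json")
print(resp.status_code)
```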

dfsed

Awesome content, John! Are there any sites you recommend for landing web scraping jobs? :)

alx

The main point you need to think about:
Websites can't put hundreds, if not thousands, of results of anything (products, pics, reviews, etc.) on a single page because it's too much data (it crashes the browser, slows the website, etc.), so most of the time they use an API to fetch the data based on a state. The state could be the position of a scrollbar, pressing next in a pagination system, etc. Find the API, reverse it, and work around the basic abuse protections.
Scenario 1: JavaScript required.
Use browser emulation, for example, to load the page once, trigger the state change that makes the API call appear, save the network log into an array, iterate over it with a regex for the known API, then proceed scraping with plain requests.
Scenario 2: no JS required, but a cookie or something similar.
Any sessioning module can be used to request the main page first, creating a valid authed session, then proceed with that session to scrape the rest.
Scenario 3: no protection at all.
Just go at it without anything special (sketched below); watch out for rate limiting and IP banning, so maybe test it first… most pages don't hard rate limit because it could hurt real users' experience.

The main skill needed, in my opinion: recognizing patterns and reversing.

This kind of info is worth a couple thousand dollars; some of the biggest scraping services use exactly this approach to provide their services. At the end of the day, a cheap proxy pool network makes the difference, since most sites like Google have anomaly detection and will ban IPs or rate limit (not exactly a ban, but they require human verification, and implementing 2captcha or any captcha bypass is too expensive for thousands of requests, so proxies are preferred).
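
A sketch of scenario 3 above, assuming a simple unprotected pagination API; the endpoint and parameters are hypothetical, and the delay is just basic politeness against rate limits.

```python
import time

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

page = 1
while True:
    resp = session.get("https://example.com/api/items", params={"page": page})
    resp.raise_for_status()
    items = resp.json()["items"]
    if not items:      # an empty page means we've reached the end
        break
    for item in items:
        print(item["id"])
    page += 1
    time.sleep(1)      # be gentle; watch for 429 responses
```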

capomodding

Hi John, love your videos.
I have some doubts I hope you can help with. I just saw the other video where you simulate the AJAX request and it returns an HTML file. That process gets you the HTML page, then you have to parse the HTML and finally scrape the desired info. Did I get that right? Or can you get the JSON that builds the HTML response from the AJAX call?
Thanks in advance!
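
It depends on the endpoint: some AJAX calls return a rendered HTML fragment you still have to parse, others return the raw JSON. A small check of the Content-Type header (with a placeholder URL) tells you which you got:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/ajax/listing?page=2")
ctype = resp.headers.get("Content-Type", "")

if "application/json" in ctype:
    data = resp.json()  # structured data, no HTML parsing needed
    print(data)
else:
    # An HTML fragment: parse it like a normal page
    soup = BeautifulSoup(resp.text, "html.parser")
    print(soup.get_text(strip=True)[:200])
```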

techlabingenium

Thank you for the great content, as always. Did I miss it, or is the storing of a cookie to subsequently use in a request not shown? I find the challenge with some sites is that they don't allow you to make a request unless you pass a valid cookie.

twiincentral

Hi John, where should I start watching your videos? I've learned some simple beginner HTML parsing before, and on your channel there seem to be a lot of different ways of scraping websites. I wonder if some of those ways are outdated. Is there any specific playlist I should watch first? Thanks!

wisjnujudho

Super interesting. My question is: how do you enable the CSS selector checker at minute 5:50?

LukenVidal