Python and Requests-HTML - Web Scraping Dynamic Content from JavaScript applications

In this video, we'll learn how to scrape content that is NOT present in initial page loads, but instead is loaded dynamically by JavaScript.

This is a common problem when scraping the modern web: the initial response contains minimal HTML plus a JavaScript single-page application (SPA) built with React, Vue, Angular, etc. The data that we want to scrape is therefore not present in that response; the SPA renders it later, after making its own API calls.
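To make the problem concrete, here is a minimal standard-library sketch that checks whether a page's mount point is still empty. The two HTML strings, the `root` id, and the names `RootInspector` and `is_empty_shell` are illustrative assumptions, not taken from the sample site.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RootInspector(HTMLParser):
    """Collects the text found inside the element with a given id."""

    def __init__(self, root_id="root"):
        super().__init__()
        self.root_id = root_id
        self.depth = 0        # nesting depth inside the root element (0 = outside)
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.root_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text_parts.append(data.strip())


def is_empty_shell(html, root_id="root"):
    """Return True if the root element contains no text, i.e. nothing is rendered yet."""
    inspector = RootInspector(root_id)
    inspector.feed(html)
    return not inspector.text_parts


# Hypothetical example of what a plain GET request returns for an SPA...
INITIAL_HTML = ('<html><body><div id="root"></div>'
                '<script src="/static/js/main.js"></script></body></html>')
# ...and of the same page after the JavaScript has run in a browser.
RENDERED_HTML = ('<html><body><div id="root"><article class="book">'
                 '<h2>Example Book</h2></article></div></body></html>')

print(is_empty_shell(INITIAL_HTML))
print(is_empty_shell(RENDERED_HTML))
```

If the check reports an empty shell for the HTML returned by a plain GET request, the data is most likely being filled in by JavaScript after the page loads.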

We will look at how we can use requests-html to solve this issue in Python when scraping such sites. We'll also look at using this with BeautifulSoup in order to find data on the page.
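The approach can be sketched roughly as follows, assuming `requests-html` and `beautifulsoup4` are installed. The function names and the optional `session` parameter are illustrative choices rather than the video's exact code, and the `article.book` / `h2` selectors are assumptions about the page markup.

```python
def fetch_rendered_html(url, session=None):
    """Return the page's HTML after its JavaScript has been executed."""
    if session is None:
        # Imported lazily so a stand-in session can be passed in for testing.
        from requests_html import HTMLSession
        session = HTMLSession()
    response = session.get(url)
    # render() loads the page in headless Chromium (downloaded on first use)
    # and updates response.html with the rendered markup.
    response.html.render()
    return response.html.html


def book_titles(html):
    """Extract the <h2> text of every <article class="book"> (assumed markup)."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for book in soup.find_all("article", class_="book"):
        heading = book.find("h2")
        if heading is not None:
            titles.append(heading.text)
    return titles


def main(url):
    """Fetch, render, and print the titles found on the page."""
    for title in book_titles(fetch_rendered_html(url)):
        print(title)
```

Calling `main()` with the URL of the page to scrape prints one title per line. In environments that already run an event loop, such as notebooks, the asynchronous `AsyncHTMLSession` with `arender()` is the usual alternative.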

This video makes use of the following sample website (a React application):

📌 𝗖𝗵𝗮𝗽𝘁𝗲𝗿𝘀:
00:00 Intro
02:15 Sending GET request using Python requests library
04:00 Finding objects with BeautifulSoup
05:15 Installing requests-html
06:38 Executing JavaScript on page using requests-html


#python #webscraping #datascience
Comments

Hi,

When I execute the response.html.render() function, I can see that the terminal has downloaded Chromium, but it also throws the error below:

File "C:\Users\sfida\OneDrive\Masaüstü\python\pythonn\venv\Lib\site-packages\pyppeteer\launcher.py", line 227, in get_ws_endpoint
    raise BrowserError('Browser closed unexpectedly:\n')
Browser closed unexpectedly:

Can you please help me out? There are no valid solutions on the web and I am using Windows at the moment.

Thanks,

serdarfidan

Thank you! Your tutorial really helped me in a web scraping project.

GuilhermeBreda

Thank you SO MUCH for your WONDERFUL explanations. You are really GREAT in communicating the ideas in a very clear and simple way.

ahassan

Thank you! This worked in colab:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

url = 'URL'

# Colab already runs an event loop, so use the async session and await directly
asession = AsyncHTMLSession()
response = await asession.get(url)
# Execute the page's JavaScript so the dynamic content is present
await response.html.arender()

print(response.status_code)

# Parse the rendered HTML and print the title of each book
soup = BeautifulSoup(response.html.html, 'html.parser')
books = soup.find_all('article', class_='book')

for book in books:
    print(book.find('h2').text)

willysnowman

I would love to watch a series on this! Talking about SPAs, how can we build an SPA with Django? A series with Django and React (or any SPA framework) would be great. 😃

valentino

I do not know how to thank you. If I could give a million likes, I would give them all to you... Thank you, thank you.

dobcs

Wow, your explanations are absolutely fantastic! I really appreciate how you make complex things so easy to understand. Your content is always top-notch and has helped me out a lot. This video actually solved a problem I was stuck on. Thank you so much for all your hard work!

slfgrlj

Great and brief material.
I got stuck after installing the package (which succeeded): when I execute the code, I get this error (I'm using Python 3.10 in VSC):

from requests_html import HTMLSession
NameError: name '_string' is not defined

I suppose I'll try selenium instead.

MrRedStream

Not sure if you still respond on this old video, but I have a question:

What if I have this "root", but it's a "tooltip-root" that doesn't get filled with HTML unless I hover over the component? Note that it works without an internet connection (if the page has just loaded and I then disconnect the Wi-Fi, I can still hover and see all the contents).

Would this library help me? (I went the Selenium route, but it's too cumbersome and slow, with issues over time.)

Anu_was_here

Thank you for the good content. Is there a possibility of a series on how to use tasks in Django (scheduling tasks to execute in the background), such as running a check every midnight?

rjhvzfw

Great video. Selenium would certainly be interesting, but for my use cases I reckon this would most likely already be enough.
BTW, how do you deal with packages that haven't seen updates in a few years?
The latest commit to requests-html is three years old at this point.

andreaszweili

How can I handle a button like "add to cart"?

sasanandeh

I'm having so many problems with libraries; I think many of them have changed in some way by now...

brianaragon

Did you make a video on how to use Selenium?

nnaemekacephas

Please add a website that uses a login page, JavaScript, and a CSRF token.

SugengWahyudi