Python and Requests-HTML - Web Scraping Dynamic Content from JavaScript applications

In this video, we'll learn how to scrape content that is NOT present in initial page loads, but instead is loaded dynamically by JavaScript.

This is a common problem when scraping the modern web: the initial response contains minimal HTML plus a JavaScript single-page application (React, Vue, Angular, etc.). The data we want to scrape is therefore not in that response; the SPA renders it later, after making API calls.
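
To make the problem concrete, here is a small sketch using only the standard library. `SPA_SHELL` is a hypothetical example of what the initial response from a React app might look like (an empty mount point plus a script bundle), and the `article` tag is an assumed container for the data we are after; neither comes from the sample site itself.

```python
from html.parser import HTMLParser

# Hypothetical initial response from a SPA: an empty mount point and a script bundle.
SPA_SHELL = """<!DOCTYPE html>
<html>
  <head><title>Books</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>"""


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML document."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html: str, tag: str) -> int:
    """Return the number of `tag` start tags found in `html`."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


print(count_tags(SPA_SHELL, "article"))  # 0: the data is not in the initial HTML
print(count_tags(SPA_SHELL, "script"))   # 1: the app that will fetch the data later
```

A plain GET request only ever sees this shell, which is why searching it for the data comes back empty until the JavaScript has been executed.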

We will look at how we can use requests-html to solve this issue in Python when scraping such sites. We'll also look at using this with BeautifulSoup in order to find data on the page.
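
The workflow can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: the function names `fetch_rendered_html` and `book_titles` are made up for illustration, and the `<article class="book">` / `<h2>` markup is an assumption borrowed from a viewer's comment, so the real selectors may differ. `HTMLSession`, `response.html.render()` and its `sleep` parameter come from requests-html; the first call to `render()` downloads Chromium.

```python
from typing import List


def fetch_rendered_html(url: str, wait_seconds: float = 1.0) -> str:
    """Load `url`, run its JavaScript in headless Chromium, and return the final HTML."""
    # Imported here so the parsing helper below can be used without requests-html.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        response.raise_for_status()
        # render() executes the page's scripts; `sleep` pauses after the initial
        # render to give the app's API calls time to finish.
        response.html.render(sleep=wait_seconds)
        return response.html.html
    finally:
        # Closes the headless browser (if one was started) and the HTTP session.
        session.close()


def book_titles(html: str) -> List[str]:
    """Return the <h2> text of every <article class="book"> (assumed markup)."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.find_all("article", class_="book"):
        heading = article.find("h2")
        if heading is not None:  # skip articles without a title
            titles.append(heading.get_text(strip=True))
    return titles
```

With both packages installed, `book_titles(fetch_rendered_html(url))` would return the titles for a page that uses this markup. Keeping the fetch and parse steps separate means the parsing can be checked against saved HTML without launching a browser.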

This video makes use of the following sample website (a React application):

📌 𝗖𝗵𝗮𝗽𝘁𝗲𝗿𝘀:
00:00 Intro
02:15 Sending GET request using Python requests library
04:00 Finding objects with BeautifulSoup
05:15 Installing requests-html
06:38 Executing JavaScript on page using requests-html


#python #webscraping #datascience

BugBytes

Comments

Hi,

When I execute the response.html.render() function, I can see in the terminal that Chromium has been downloaded, but it then throws the error below:

File "C:\Users\sfida\OneDrive\Masaüstü\python\pythonn\venv\Lib\site-packages\pyppeteer\launcher.py", line 227, in get_ws_endpoint
    raise BrowserError('Browser closed unexpectedly:\n')
Browser closed unexpectedly:

Can you please help me out? There are no valid solutions on the web and I am using Windows at the moment.

Thanks,

serdarfidan

Thank you! Your tutorial really helped me in a web scraping project.

GuilhermeBreda

Thank you SO MUCH for your WONDERFUL explanations. You are really GREAT in communicating the ideas in a very clear and simple way.

ahassan

Thank you! This worked in Colab:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

url = 'URL'

# Colab/Jupyter already runs an event loop, so use the async session
# with top-level await instead of HTMLSession().
asession = AsyncHTMLSession()
response = await asession.get(url)

# Execute the page's JavaScript in headless Chromium
await response.html.arender()

print(response.status_code)

# Parse the rendered HTML and print each book title
soup = BeautifulSoup(response.html.html, 'html.parser')
books = soup.find_all('article', class_='book')

for book in books:
    print(book.find('h2').text)

willysnowman

I would love to watch a series on this! Talking about SPAs, how can we build a SPA with Django? A series with Django and React (or any SPA framework) would be great. 😃

valentino

I do not know how to thank you. If I could give you a million likes, I would ... Thank you, thank you

dobcs

Wow, your explanations are absolutely fantastic! I really appreciate how you make complex things so easy to understand. Your content is always top-notch and has helped me out a lot. This video actually solved a problem I was stuck on. Thank you so much for all your hard work!

slfgrlj

Great and brief material.
I got stuck after installing the package (successfully): running the code raises an error (Python 3.10, using VS Code)

from requests_html import HTMLSession
NameError: name '_string' is not defined

I suppose I'll try Selenium instead.

MrRedStream

Not sure if you still respond on this old video, but I have a question:

What if I have this "root", but it's a "tooltip-root" that doesn't get filled with HTML unless I hover over the component? Note that it works without internet (if the page has just loaded and I then disconnect Wi-Fi, I can still hover and see all the contents).

Would this library help me? (I went the Selenium route, but it's too cumbersome and slow, with issues over time.)

Anu_was_here

Thank you for the good content. Is there a possibility of a series about how to use tasks in Django (scheduling tasks to execute in the background), like running a check every midnight?

rjhvzfw

Great video. Selenium would certainly be interesting, but for my use cases I reckon this would most likely already be enough.
BTW, how do you deal with packages that haven't seen updates in a few years?
The latest commit to requests-html is three years old at this point.

andreaszweili

How can I handle a button like "Add to cart"?

sasanandeh

I'm having so many problems with libraries. I think many of them have changed in some way by now...

brianaragon

Did you make a video on how to use Selenium?

nnaemekacephas

Please add a website that uses a login page, JavaScript, and a CSRF token.

SugengWahyudi
