Website to Dataset in an instant

1000 items in one API request... creating a dataset from a simple API call. I enjoyed this one; there will be a part 2 where I clean the data with Pandas.
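
As a rough illustration of that approach (not the site or code from the video; the URL, parameter names, and response keys below are placeholders), a single request with a large page size can be loaded straight into a Pandas DataFrame:

import requests
import pandas as pd

# Placeholder endpoint and parameter names; a real site's hidden API will differ.
API_URL = "https://example.com/api/products"
params = {"page": 1, "pageSize": 1000}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

items = response.json().get("items", [])   # assumed response key holding the product list
df = pd.json_normalize(items)              # flatten nested JSON into columns
df.to_csv("products.csv", index=False)     # raw dataset, ready for cleaning in part 2
print(df.shape)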

This is a Scrapy project using the SitemapSpider, saving the data to an SQLite database using an item pipeline.
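
To make that structure concrete, here is a minimal sketch of a SitemapSpider feeding an SQLite item pipeline; the sitemap URL, URL rule, CSS selectors, and table layout are placeholder assumptions, not the project's actual code:

import sqlite3
from scrapy.spiders import SitemapSpider

class ProductSpider(SitemapSpider):
    name = "products"
    sitemap_urls = ["https://example.com/sitemap.xml"]    # placeholder sitemap
    sitemap_rules = [("/products/", "parse_product")]     # only follow product URLs

    custom_settings = {
        # Register the pipeline below (this path assumes a single-file script).
        "ITEM_PIPELINES": {"__main__.SQLitePipeline": 300},
    }

    def parse_product(self, response):
        yield {
            "url": response.url,
            "name": response.css("h1::text").get(),        # placeholder selector
            "price": response.css(".price::text").get(),   # placeholder selector
        }

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("products.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (url TEXT, name TEXT, price TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO products VALUES (?, ?, ?)",
            (item["url"], item["name"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()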

If you are new, welcome! I am John, a self-taught Python developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly content.

:: Links ::

:: Disclaimer ::
Some/all of the links above are affiliate links. If you click on these links, I receive a small commission should you choose to purchase any services or items.
Comments

Super neat!! Also as a Swede I chuckled at "this is a pretty standard e-commerce site" when talking about Sweden's most valuable brand haha

stevenlomon

You are a bloody animal mate, love your work a ton!

theonlynicco

I never comment on YouTube videos but this has been so helpful. Thank you. Subscriber++

shubhammore

I bet you can't make a video on how to get past Cloudflare-protected websites, not a simple test Cloudflare site but proper ones where the Cloudflare detection actually works.

LuicMarin

I'm currently working on a project that involves scraping Amazon's data. I have tried a few methods that didn't work, which led me to your video. However, when I loaded Amazon and looked through the JSON responses, I couldn't find any that included the products. Why is that? What do you recommend I do?

RyanAI-kkkv

I use Polars instead of Pandas.
Anything rewritten in Rust tends to have better performance ;-)

TheJFMR

Thank you very much John, great series. I am a bit stuck between this video and the cleaning-with-Polars video on taking the JSON terminal output and converting it for use in Polars. Is there a function I can add to the code to output to CSV (or JSON)? I considered importing the csv and json libraries and writing a function to print the output, but I'm unsure about this step. Many thanks again

matthewschultz

Thanks! Another really useful video. What would be the best way to either remove unwanted columns or extract only the required columns, then output a JSON file containing only the required data? This and your 'hidden API' video have been so helpful.

mattrgee

Thank you so much for this! I always had issues trying to scrape data from sites whose paging is based on "Load More"

ying

Good stuff as always. I will try to use this with the fotmob website. 👍😉

graczew

How long have you been using Linux or the Arch Linux distro? Would you recommend it?

milesmofokeng

Kind of magic, thank you very much 😭😭😭
Can this be used for scraping multiple pages?

mohamedtekouk

Thanks for the video, as always. In my attempt, the website's response didn't include a 'metadata' key. Instead, the page restriction was specified under the 'parameter' key, as shown below. Despite setting 'pageSize' to 1000, I only received a maximum of 100 items, which suggests a preset server-side limit. I'm not sure how to get around this apparent 100-item restriction.

params = {
    ...
    'lang': 'en-CA',
    'page': '1',
    'pageSize': '1000',
    'path': '',
    'query': 'laptop',
    ...
}
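
One common workaround for a hard pageSize cap like this is to keep the capped size and step the page parameter until a short page comes back. A minimal sketch only, with a placeholder URL and an assumed 'items' response key:

import requests

API_URL = "https://example.com/api/search"   # placeholder endpoint
PAGE_SIZE = 100                              # the cap observed above

all_items = []
page = 1
while True:
    page_params = {"lang": "en-CA", "query": "laptop",
                   "page": str(page), "pageSize": str(PAGE_SIZE)}
    data = requests.get(API_URL, params=page_params, timeout=30).json()
    items = data.get("items", [])            # assumed key holding the results
    all_items.extend(items)
    if len(items) < PAGE_SIZE:               # a short page means the last page
        break
    page += 1

print(len(all_items))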

schoimosaic

I discovered this method three years ago 🙂

viratchoudhary