Web Scraping Tutorial | Complete Scrapy Project : Scraping Real Estate with Python

A complete and detailed project-based Scrapy tutorial: web scraping 3,000 real estate properties and saving the details to a CSV in a structured format.
It features some non-standard logic that uses yield to extract geo data from the detail page, then returns to the main listing to extract the price, title, href, date and 'hood'.
The 'next page' navigation was fairly 'normal', but I had to use yield in tandem with a 'parse_detail' method to extract the "lon" and "lat" coordinates.
Understanding 'yield' and having some familiarity with object-oriented programming are key to being able to use, modify and troubleshoot this project.
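To illustrate the pattern described above, here is a minimal sketch of the spider structure involved - the class name, start URL, field names and selectors are placeholders chosen for illustration, not the exact code from the video:

import scrapy

class RealEstateSpider(scrapy.Spider):
    # Placeholder spider name and start URL for illustration only
    name = "realestate"
    start_urls = ["https://example.craigslist.org/search/rea"]

    def parse(self, response):
        # Iterate over the ads (thumbnails/properties) on the main listing page
        for ad in response.xpath('//li[@class="result-row"]'):
            item = {
                'date': ad.xpath('.//time/@datetime').get(),
                'title': ad.xpath('.//a[contains(@class, "result-title")]/text()').get(),
                'price': ad.xpath('.//span[@class="result-price"]/text()').get(),
                'hood': ad.xpath('.//span[@class="result-hood"]/text()').get(),
                'link': ad.xpath('.//a[contains(@class, "result-title")]/@href').get(),
            }
            # Hand the partially-filled item to the detail page request;
            # parse_detail adds the geo data before yielding the finished item
            yield response.follow(item['link'], callback=self.parse_detail,
                                  cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # The coordinates live on the detail page, not on the main listing
        item['lat'] = response.xpath('//div[@id="map"]/@data-latitude').get()
        item['lon'] = response.xpath('//div[@id="map"]/@data-longitude').get()
        yield item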
Note: This is a complete tutorial, but it is aimed at anyone who already has some experience with Scrapy, as there are some modifications to the standard framework.
I also show a detailed run-through of how to identify the parts to use for the Scrapy selectors using "inspect element" in the browser, then write the XPath selectors and test them in the Scrapy shell.
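As a rough idea of what that testing looks like (the URL and selector are placeholders, not the exact ones from the video), a Scrapy shell session goes something like this - it also shows the Get vs Extract difference covered in the chapters below:

✸ scrapy shell "https://example.craigslist.org/search/rea"

>>> # First matching title, returned as a single string (or None)
>>> response.xpath('//a[contains(@class, "result-title")]/text()').get()
>>> # Every matching title, returned as a list of strings
>>> response.xpath('//a[contains(@class, "result-title")]/text()').extract()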
Chapter timings in the video:
0:00 Intro
1:42 Details of redandgreen website with documentation of project
4:14 Finding the source html we need
11:54 Scrapy shell
15:36 Testing XPATH selectors
19:56 Get Vs Extract
25:10 Getting the 'date' selector using 'inspect element'
28:36 Create the virtualenv [optional]
30:13 'scrapy startproject craigslistdemo'
34:33 Writing python code in Atom
35:21 Import the packages
41:24 Create the spider class
46:11 Checking the geo data (latitude & longitude)
1:00:43 Iterating through the 'ads' (thumbnail/properties on main listing)
1:07:18 'parse_detail'
1:39:14 Testing
1:40:03 Success
1:44:23 CSV check
► date
► title
► price
► hood
► link
► misc
► lon
► lat
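For reference, once the spider runs cleanly, a CSV with the columns above can be produced with Scrapy's built-in feed export (using the placeholder spider name from the sketch earlier):

✸ scrapy crawl realestate -o results.csv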
** Update - see my follow-up video as well, which revises the 'lon' and 'lat' part :
This tutorial covers the real-world challenges you face when web scraping, including non-standard listings, and it is a good example of using yield to move between iterations of items, details, and pages (i.e. the main listing page(s) and the detail pages).
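The 'next page' step itself follows the usual Scrapy pattern; a minimal sketch (with a placeholder selector) would sit at the end of parse():

# At the end of parse(), after iterating the ads on the current listing page
next_page = response.xpath('//a[contains(@class, "next")]/@href').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)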
You can download the code for this project from GitHub :
✸ pip install scrapy
✸ pip install virtualenv
It is also viewable here:
I use Atom editor :
✸ sudo snap install atom --classic
And have just installed these packages:
When Atom gave a 'permission denied' error, I used this to allow me to save the .py file :
✸ sudo chown -R username:sudo ~/Documents
Please leave comments/suggestions and if you like this video, don't forget to .......✅
Who would like to see me attempt to run this on a Raspberry Pi, and schedule the spider to run as a cron job?
⚠ Disclaimer : Any code provided in this tutorial is for educational use only, I am not responsible for what you do with it. ⚠