Web Scraping Tutorial | Complete Scrapy Project: Scraping Real Estate with Python

A complete, detailed, project-based Scrapy tutorial: web scraping 3,000 real estate properties and saving the details to a CSV in a structured format.

It features some non-standard logic to extract geo data from the detail page using yield, then returns to the main listing to extract the price, title, href, date and 'hood'.

The 'next page' navigation was fairly 'normal', but I had to use yield in tandem with a 'parse_detail' method to extract the 'lon' and 'lat' coordinates.
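The hand-off between the listing parser and the detail parser hinges on yield. As a minimal sketch in plain Python (no Scrapy dependency; `parse` and `parse_detail` here are simplified stand-ins for the spider's callback methods, and all the sample data is invented):

```python
# Simplified stand-in for the Scrapy callback pattern: the listing
# parser yields a partial item plus a detail URL per ad, and the
# detail parser finishes the item by adding the geo coordinates.

LISTING = [
    {"title": "2BR flat", "price": "$1,200", "detail_url": "/ad/1"},
    {"title": "Studio", "price": "$850", "detail_url": "/ad/2"},
]

# Pretend detail pages, keyed by URL (invented sample data).
DETAIL_PAGES = {
    "/ad/1": {"lat": "51.5074", "lon": "-0.1278"},
    "/ad/2": {"lat": "53.4808", "lon": "-2.2426"},
}

def parse(listing):
    """Like Spider.parse: yield one partial item per ad on the page."""
    for ad in listing:
        item = {"title": ad["title"], "price": ad["price"]}
        # In Scrapy this step would instead yield a Request, e.g.
        #   scrapy.Request(ad["detail_url"], callback=self.parse_detail,
        #                  meta={"item": item})
        yield ad["detail_url"], item

def parse_detail(url, item):
    """Like Spider.parse_detail: add lon/lat, then yield the finished item."""
    page = DETAIL_PAGES[url]
    item["lat"] = page["lat"]
    item["lon"] = page["lon"]
    yield item

def crawl():
    # Chain the two generators, mirroring how Scrapy schedules callbacks.
    for url, partial in parse(LISTING):
        yield from parse_detail(url, partial)

items = list(crawl())
```

The point of the sketch is the control flow: nothing runs until the generators are iterated, which is how Scrapy interleaves listing and detail requests.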

Understanding 'yield' and some familiarity with object-oriented programming are key to being able to use, modify and troubleshoot this project.

Note: this is a complete tutorial, but it is aimed at anyone who already has some experience with Scrapy, as there are some modifications to the standard framework.

I also give a detailed run-through of how to identify the parts of the page to use for the Scrapy selectors using "inspect element" in the browser, then write the XPath selectors and test them in Scrapy shell.
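The selector-testing step can be roughed out even without Scrapy shell. Scrapy supports full XPath via parsel/lxml; the stdlib `xml.etree` sketch below handles only a subset, and the markup is an invented stand-in for a listing fragment:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a fragment of a listing page.
html = """
<div>
  <p class="result-price">$1,200</p>
  <a class="result-title" href="/ad/1">2BR flat</a>
</div>
"""

root = ET.fromstring(html)

# Rough equivalent of testing selectors in Scrapy shell, e.g.:
#   response.xpath('//p[@class="result-price"]/text()').get()
price = root.find('.//p[@class="result-price"]').text
href = root.find('.//a[@class="result-title"]').get("href")
```

In the real project the same predicates (`//tag[@class="…"]`) are what you paste into `scrapy shell <url>` to verify against the live page before committing them to the spider.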

Chapter timings in the video:
0:00 Intro
1:42 Details of redandgreen website with documentation of project
4:14 Finding the source html we need
11:54 Scrapy shell
15:36 Testing XPATH selectors
19:56 Get Vs Extract
25:10 Getting the 'date' selector using 'inspect element'
28:36 Create the virtualenv [optional]
30:13 'scrapy startproject craigslistdemo'
34:33 Writing python code in Atom
35:21 Import the packages
41:24 Create the spider class
46:11 Checking the geo data (latitude & longitude)
1:00:43 Iterating through the 'ads' (thumbnail/properties on main listing)
1:07:18 'parse_detail'
1:39:14 Testing
1:40:03 Success
1:44:23 CSV check

► date
► title
► price
► hood
► link
► misc
► lon
► lat
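The eight fields above map directly onto the CSV columns. In newer Scrapy versions the export is declared with the FEEDS setting; a sketch, where the output file name is my own choice:

```python
# settings.py (or custom_settings on the spider): declare the CSV feed
# and pin the column order to the fields scraped above.
FEEDS = {
    "properties.csv": {
        "format": "csv",
        "fields": ["date", "title", "price", "hood",
                   "link", "misc", "lon", "lat"],
    },
}
```

Without the "fields" option Scrapy emits the columns in whatever order the item dict happens to have, so pinning them keeps the CSV stable between runs.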

** Update - see my updated video as well, which revises the 'lon' and 'lat' part:

This tutorial covers the real-world challenges you face when web scraping non-standard listings, and is a good example of using yield to move between iterations of items, details and pages (i.e. the main listing page(s) and the detail pages).

You can download the code for this project from GitHub:

✸ pip install scrapy
✸ pip install virtualenv

It is also viewable here:

I use the Atom editor:
✸ sudo snap install atom --classic

And I have just installed these packages:

When I hit 'permission denied' in Atom, I used this to allow saving the .py file:
✸ sudo chown -R username:sudo ~/Documents

Please leave comments/suggestions and if you like this video, don't forget to .......✅

Who would like to see me attempt to run this on a Raspberry Pi, and schedule the spider to run as a cron job?
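For the Raspberry Pi idea, scheduling the spider is a one-line crontab entry. A sketch only - the project path and the spider name 'craigslistdemo' are assumptions:

```shell
# crontab -e  ->  run the spider every day at 06:00, from inside the
# Scrapy project directory (path is an assumed example):
0 6 * * * cd /home/pi/craigslistdemo && scrapy crawl craigslistdemo -O properties.csv
```

(`-O` overwrites the output file each run; use `-o` instead to append.)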

⚠ Disclaimer: any code provided in this tutorial is for educational use only; I am not responsible for what you do with it. ⚠
Comments

Great tutorial. Solved one problem for me and introduced me to another.

I started getting status 403 from CL before I read about AutoThrottle. Even now I don't think I have AutoThrottle configured properly, as I keep getting blocked. But that's a new problem to solve.

Finally got an appreciation for xpath after struggling with css selectors in beautifulsoup. Thanks again for taking the time to do this.
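AutoThrottle is switched on in settings.py; a minimal configuration sketch (the values are illustrative defaults, not a guarantee against 403s - the site may still block by other means):

```python
# settings.py -- enable AutoThrottle and be polite to the server.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60            # cap on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # avg concurrent requests per domain
ROBOTSTXT_OBEY = True
```

Lowering the target concurrency and raising the start delay is usually the first lever to pull when a site starts returning 403s.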

thedavegtoo

this guy really put effort into the editing... hats off

pythusiast

Man, your new video structuring technique is absolutely fantastic. I felt like I was watching a Traversy Media tutorial on steroids - your video editing work is just awesome, and obviously I deeply respect the fact you're going your own way while using Scrapy.
Keep it up!

monkey_see_monkey_do

I learnt a lot from this tut. This guy is awesome and hard working.

pythusiast

Quality stuff!

Going to try to figure out the new FEEDS setting.

Please keep the videos coming.

SirAdaox

thank you, nice video - I always use CSS but am definitely going to try XPath now

stupidsoft

Hi, very nice video!
Is it possible to web-scrape some data and use that information to make a nice dashboard, tailored only to the data we are interested in?
Topic: real estate auctions

axel

Hi! Thank you for the very well explained video. I really appreciate it. I have a question: is this craigslist site a static or a dynamic website?

ellazova

Hi! Thanks for this great and complete walk through of your process! Greatly appreciated.
It was hilarious to see you looking for the bugs :)
Question:
How would you avoid getting blocked/banned? What are the steps? Could you incorporate this in a complete video like this as well? For example, scraping prices on Amazon.

RonZuidema

A question, sir... How can we remove &nbsp; using XPath?
A sample is attached here:
<span class="a-size-base a-color-price price a-text-bold">
₹&nbsp;999.00
</span>
I want to extract value 999.00
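One way to handle this (a stdlib sketch, not from the video): `&nbsp;` arrives in the extracted text as the Unicode character U+00A0, so get the text first and strip the character in Python:

```python
import xml.etree.ElementTree as ET

# The commenter's sample, with &nbsp; already decoded to U+00A0
# as it would be in scraped text.
html = ('<span class="a-size-base a-color-price price a-text-bold">'
        '\u20b9\xa0999.00</span>')
text = ET.fromstring(html).text

# Replace the non-breaking space with a normal one, then take the
# numeric part after the currency symbol.
price = text.replace("\xa0", " ").strip()
value = price.split()[-1]
```

Doing the cleanup after `.get()` is usually simpler than fighting it inside the XPath expression itself.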

MRINAL

Dr. Pi, lon and lat are not exported correctly to the CSV. The values of the records are wrong. Please comment on that.

pythusiast

Excellent video! Very useful. I learned a lot of new stuff.
Question:
I don't know if it's just me, but it looks like some latitude and longitude values repeat more than once in the whole dataset? Is that ok?

waltercerritos

Very well done, dude ✨ Is there a way to scrape an iframe?

noorrida