Craigslist Scraper with Python and Selenium: Part 2

In this video series, we write a Python script that uses web scraping modules such as Selenium, Beautiful Soup, and urllib to extract information from Craigslist. Specifically, the script forms a search query, i.e., a set of criteria such as the items to look for, a location, and a zip code. It then performs the search automatically and extracts two key pieces of information from the results: the title of each posting and the link to each post.
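
As a rough illustration of that flow, here is a minimal sketch. The search URL format and the "result-title" class are based on Craigslist's historical markup and may have changed; the location and query are placeholders.

```python
from selenium import webdriver
from bs4 import BeautifulSoup

location = "sfbay"     # example Craigslist region subdomain
query = "bicycle"      # example search term
url = f"https://{location}.craigslist.org/search/sss?query={query}"

driver = webdriver.Firefox()
driver.get(url)

# Pull the rendered HTML out of the browser and parse it.
soup = BeautifulSoup(driver.page_source, "html.parser")
for anchor in soup.find_all("a", class_="result-title"):
    print(anchor.text, anchor.get("href"))  # title and link of each post

driver.quit()
```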

This project is intentionally simple, with the hope that it serves as a springboard for you to build upon. For instance, perhaps you want to keep tabs on when a certain item is listed in your area; you could modify the script to email you automatically whenever an item of interest pops up (a sketch of that idea follows). The possibilities are vast, and I hope you use this to build something useful and cool. If you do, please share it!
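
As a hypothetical example of that extension, here is a sketch of an email notifier. The SMTP host, credentials, and addresses are placeholders you would fill in.

```python
import smtplib
from email.message import EmailMessage

def notify(titles):
    # Build a plain-text email listing the new post titles.
    msg = EmailMessage()
    msg["Subject"] = "New Craigslist listings"
    msg["From"] = "me@example.com"   # placeholder sender
    msg["To"] = "me@example.com"     # placeholder recipient
    msg.set_content("\n".join(titles))
    # Placeholder SMTP host and credentials.
    with smtplib.SMTP_SSL("smtp.example.com") as server:
        server.login("me@example.com", "app-password")
        server.send_message(msg)
```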

Related Links:

This video is part of a larger series on "Web Scraping and Automation". You can watch the other videos in this series here:

Further videos on Selenium:

Do you like the development environment I'm using in this video? It's a customized version of vim that's enhanced for Python development. If you want to see how I set up my vim, I have a series on this here:

If you've found this video helpful and want to stay up-to-date with the latest videos posted on this channel, please subscribe:

Comments

Loved it man... Very clean instructions.

rayhansardar

Good shit. Subscribed. You have a great style of teaching. I haven't seen the rest of the series yet, but I'm sure you know that it's not necessary to launch the browser when using urllib.request. I'm guessing you used those two different functions to showcase two different technologies. Launching the browser slows everything way down. If that's addressed later on, disregard :)

DocPosture
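
For reference, the browser-free fetch this commenter is describing might look like this (the URL is a placeholder):

```python
import urllib.request

url = "https://sfbay.craigslist.org/search/sss?query=bicycle"  # placeholder
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")  # raw HTML, no browser needed
```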

Will this work in IDLE? Part 1 worked okay but now I'm getting a bunch of errors (when I run at 7:58). Thanks

Zooooman

You can improve the BeautifulSoup technique by passing it the HTML content of the driver.

JadaKingdom
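
A minimal sketch of that suggestion, assuming Firefox and a placeholder URL: Selenium already holds the rendered HTML, so it can be handed to BeautifulSoup directly rather than downloaded a second time with urllib.

```python
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://sfbay.craigslist.org/search/sss?query=bicycle")  # placeholder

# Hand the HTML Selenium already rendered to BeautifulSoup,
# instead of downloading the same page again with urllib.
soup = BeautifulSoup(driver.page_source, "html.parser")
```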

I was able to springboard this to scrape all of the relevant data across multiple pages of a Craigslist search result. extract_post_titles gets all of the data from one page, then the next, and so on, but extract_post_urls is stuck getting the links of listings on the first page only. I put extract_post_titles inside load_craigslist_url so it collects the data each time we go to the next page, but extract_post_urls stays on the first page, since self.url is static and doesn't change when we move to a new page. Any recommendations for modifying extract_post_urls to get the links on every page, or for having self.url update each time we go to the next page?

uqyuiyryiq
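
One possible fix, sketched against the method names in the comment: parse driver.page_source, which follows the browser from page to page, instead of re-fetching a static self.url with urllib. The "result-title" class and the self.driver attribute are assumptions about the class used in the video.

```python
from bs4 import BeautifulSoup

def extract_post_urls(self):
    # self.driver has already navigated to the current results page,
    # so its page_source reflects whatever page we are on now.
    soup = BeautifulSoup(self.driver.page_source, "html.parser")
    return [a["href"] for a in soup.find_all("a", class_="result-title")]
```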

Can you explain why you chose the 'searchform' ID as the wait.until parameter? What's the benefit of setting that?

liuxu
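
For context, a sketch of what that wait is doing, assuming the video's `delay` variable, Firefox, and a placeholder URL: it blocks until the element with id="searchform" exists in the DOM, which guarantees the page has loaded enough to interact with before the script proceeds.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://sfbay.craigslist.org/search/sss?query=bicycle")  # placeholder

delay = 3
# Block until the element with id="searchform" exists in the DOM;
# raises TimeoutException if it hasn't appeared after `delay` seconds.
WebDriverWait(driver, delay).until(
    EC.presence_of_element_located((By.ID, "searchform"))
)
```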

Can’t you do the same thing with Selenium without having to load the page twice and parse it with bs?

snoopyjc
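
For comparison, the Selenium-only route this commenter is asking about might look like this; Firefox, the placeholder URL, and Craigslist's historical "result-title" class are assumptions.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://sfbay.craigslist.org/search/sss?query=bicycle")  # placeholder

# Titles and links straight from Selenium: no second fetch, no bs4.
for anchor in driver.find_elements(By.CLASS_NAME, "result-title"):
    print(anchor.text, anchor.get_attribute("href"))
```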

It appears that the `wait`/`delay` is unnecessary. Setting `delay = 0` does not throw any exceptions.
As in, regardless of the code, the page takes its time to load completely. Why is this?

simonj

It's giving me an error when I run extract_post_urls. It raises an HTTPError: Bad Request.

Any help?

iNotSoTall
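
Not from the video, but one common cause of an HTTPError: Bad Request with urllib is the site rejecting Python's default User-Agent. Here is a sketch of sending a browser-like header instead; whether this matches the commenter's actual issue is a guess, and the URL is a placeholder.

```python
import urllib.request

req = urllib.request.Request(
    "https://sfbay.craigslist.org/search/sss?query=bicycle",  # placeholder
    headers={"User-Agent": "Mozilla/5.0"},  # browser-like User-Agent
)
with urllib.request.urlopen(req) as response:
    html = response.read().decode("utf-8")
```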