Python Programming Tutorial - 26 - How to Build a Web Crawler (2/3)

Comments

4 am thenewboston you are awesome... thanks for the tutorials

Elduque

Wow, I'm impressed. I fell asleep while watching the 20th video (it wasn't boring, I just haven't slept much lately) and now you're already teaching some (almost) exciting stuff :D

motylanoga

For the first time I felt programming is fun. Thank you for making it possible to get interested in Python and all that it does.

missghani

I tried this using craigslist and got it working! Love the tutorials Bucky, keep them coming!

jasongodson

omfg I can't believe I actually made one by myself and it also takes the pictures and sub-titles thanks man

genosingh

"The good meat of the website" - Bucky Roberts 2014

zachariahwalston-leo

@thenewboston, I've tried to follow this tutorial but it's hard to make it work because most websites these days run scripts in the browser. I don't know if it was the same back when you made this video. I had to use Selenium to access the actual HTML you see in "inspect element". Selenium is a web driver that works with Chrome, Firefox and others. It works a bit differently than "requests" but I think it's more powerful. Maybe you should do a Python-Selenium tutorial?

ernestselman

Since his site is down, reddit works pretty well. I think they have a timeout if you run your crawler too often, but if you wait a bit it should work again. (It doesn't look like you can easily change the pages though, so you'll have to omit that part of the code unless you're smarter than me haha.)

AnderMalkus
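
A rough sketch of the "wait a bit and try again" idea from the comment above. It assumes the site signals rate limiting with HTTP status 429 (Too Many Requests), which may or may not be what any given site does. `fetch_with_retry`, `fake_fetch` and the example URL are hypothetical names used for illustration only; with `requests`, `fetch` could be a small wrapper returning `(response.status_code, response.text)`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, wait_seconds=2.0, sleep=time.sleep):
    """Call fetch(url); while it reports status 429, wait longer and try again."""
    status, body = None, None
    for attempt in range(retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        # Wait a little longer after each rate-limited attempt.
        sleep(wait_seconds * (attempt + 1))
    return status, body


# Stand-in for a real HTTP call: rate-limited twice, then succeeds.
calls = []


def fake_fetch(url):
    calls.append(url)
    return (429, "") if len(calls) < 3 else (200, "<html>ok</html>")


waits = []  # record the waits instead of actually sleeping
result = fetch_with_retry(fake_fetch, "https://example.com/", sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the sketch quick to try out; in a real crawler the default `time.sleep` would do the waiting.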

I was getting the HTML parser error, so I added "html.parser" after plain_text, but I am still getting the following error:

Traceback (most recent call last):
  File "file path", line 22, in <module>
    trade_spider(1)
  File "file path", line 13, in trade_spider
    soup = BeautifulSoup(plain_text, "html parser")

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html parser. Do you need to install a parser library?

Phoebusjosh
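
The traceback above shows the parser name written as "html parser" (with a space) rather than "html.parser" (with a dot). A minimal sketch of the difference, using a made-up HTML snippet in place of a downloaded page:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical one-line page used only for this illustration.
html = '<a class="item-name" href="/trade/1">First item</a>'

# "html parser" (space) is not a registered parser name, so bs4 raises.
try:
    BeautifulSoup(html, "html parser")
    misspelled_ok = True
except FeatureNotFound:
    misspelled_ok = False

# "html.parser" (dot) selects Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")
hrefs = [a.get("href") for a in soup.find_all("a")]
print(misspelled_ok, hrefs)
```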

This tutorial was great and easy to follow. It worked perfectly. Thanks Bucky!!

jeremykerrigan

It's great... I can finally crawl any website and get the required data from it! Thanks Bucky!!

facitoo

Complete solution with url:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_page):
    page = 1
    while page <= max_page:
        # url is not defined in this comment as posted; set it to the page to crawl first
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")

        # print the href of every matching link on the page
        for link in soup.find_all('a', {'class': 'title text-semibold'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)

Note: I used Bucky's GitHub page in the description and it worked.

ankitaroy

I like how you say string almost like "shtring" :D

Zwerggoldhamster

Now that's getting exciting!!!! All the other stuff was just normal, not fun, boring, but you need to know the basics to jump to bigger things. My head is kinda fked up after taking all that in, so I need to watch the video a couple more times. GJ buddy, keep it up, you are doing great!

SixtyNeptune

I am trying to get it to work with craigslist and it just starts printing out the word None a bunch until I hit stop, then about 5 errors pop up.

rydermcbride
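
One possible cause of the repeated None output described above (an assumption, since the commenter's code is not shown): `link.get('href')` returns None for an `<a>` tag without an `href`, and `link.string` returns None when a tag has more than one child. A small sketch with made-up HTML that skips or works around both cases:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain link, one anchor without href,
# and one link whose text is split across nested tags.
html = """
<a href="/post/1">Bike for sale</a>
<a name="top">Anchor without href</a>
<a href="/post/2"><span>Desk</span> <b>new</b></a>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for link in soup.find_all("a"):
    href = link.get("href")  # None when the tag has no href attribute
    if href is None:
        continue
    # .string is None for tags with several children; fall back to get_text().
    title = link.string or link.get_text(" ", strip=True)
    rows.append((href, title))

print(rows)
```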

Bucky, you're the man!
Thanks for the awesome tutorial!

malabikasen

In case you get stuck, try the following modifications
(it is working for me):

import os
import requests
from bs4 import BeautifulSoup

# uncomment the line below and set the user ID and password if working behind a college proxy
# (the proxy line itself is missing from this comment as posted)

def trade_spider(max_pages):
    page = 2
    while page <= max_pages:
        # url is not defined in this comment as posted; set it to the page to crawl first
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')

        # each entry title is an <h2> that wraps the link
        for link in soup.find_all('h2', {'class': 'entry-title'}):
            href = link.a['href']
            print(href)

        page += 1

trade_spider(2)

SunilKumar-iffd

If, around the 9:05 mark, you need to get the title out of a child element (direct child or not), BeautifulSoup lets you reach child elements as attributes of the parent, like so:

Instead of:
title = link.string

Use:
title = link.h3.find(string=True)

This assumes, of course, that the element that houses the title is an <h3> child of the element you hook the for loop onto. It is useful when the element holding the actual link to the entry differs from the element holding the title of the entry.

rayromanov
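
A short sketch of the tip above, using made-up HTML in which the link and the title sit in different child elements of the same container. The markup and the `entry` class name are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical entry: the href lives in <a>, the title in <h3>.
html = """
<div class="entry">
  <a href="/item/1">view</a>
  <h3>Red bike</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("div", {"class": "entry"})

direct = link.string               # None: the <div> has several children
title = link.h3.find(string=True)  # first text node inside the <h3>
href = link.a["href"]              # the <a> child, reached the same way

print(direct, title, href)
```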

Thank you very much, I designed a web crawler for my college website.

amoghkulkarni

Thank you for such a beautiful tutorial

devarshsanghvi