Python Web Crawler Tutorial - 15 - The First Spider

Comments

@thenewboston - I've been following you for years and I have to say thank you. You are quite possibly one of the best coding resources I've found on all of youtube. And you are one of the best resources for tutorials that I've found on the entire internet. You are great at explaining things in simple and easy to understand ways. Thank you.

kykel

I had a bug: the condition in the if-statement never evaluated to True because the content type comes back as *text/html; charset=utf-8*.

It was fixed by replacing the equality test in the condition with a containment test (*if 'text/html' in ...*).

Hopefully that helps someone.

SrWakka

If you have everything working but nothing ends up in your queue.txt file,
edit the if statement in [def gather_link] in your python.py file.
In this line of code: if == 'text/html'
change 'text/html' to 'text/html; charset=utf-8' or to 'text/html; charset=ISO-8859-1'.

It worked for me and now I have results in my queue.txt.

ericahellscythe
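
Both comments above come down to comparing the Content-Type value against 'text/html' when the server also sends a charset. A minimal sketch of a check that accepts any charset, assuming the value comes from something like `response.getheader('Content-Type')` on a urllib response; `is_html` is a hypothetical helper name, not part of the tutorial code:

```python
def is_html(content_type):
    """Return True when a Content-Type header value describes an HTML page."""
    if content_type is None:  # the header can be missing from a response
        return False
    # Keep only the media type: drop parameters such as "; charset=utf-8".
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'


for value in ('text/html', 'text/html; charset=utf-8',
              'text/html; charset=ISO-8859-1', 'application/json', None):
    print(repr(value), is_html(value))
```

Hard-coding one exact string such as 'text/html; charset=utf-8' only works for sites that send that exact value, which is why comparing the media type alone tends to be more robust.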

I just followed the code in the video, but there is a bug when I run the main file. The error message is:
class LinkFinder(HTMLParser):
TypeError: Error when calling the metaclass bases
module.__init__() takes at most 2 arguments (3 given)

I think there is some mistake in the LinkFinder?
Does anyone have the same problem?

loye

If you are getting an error try this:
in link_finder.py
change super.__init__() to super().__init__()

SuperCombatarms
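
The "metaclass bases" error usually means the class is inheriting from a module rather than from a class, which is what happens on Python 2 after `import HTMLParser`. A minimal Python 3 sketch of a LinkFinder that imports the class itself and calls `super().__init__()` with parentheses; the `base_url` argument, the `links` attribute and the example URLs are assumptions for illustration, not necessarily what the video uses:

```python
from html.parser import HTMLParser   # Python 3: import the class, not a module
from urllib.parse import urljoin


class LinkFinder(HTMLParser):

    def __init__(self, base_url):
        super().__init__()            # super() needs its own parentheses
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag as an absolute URL.
        if tag == 'a':
            for attribute, value in attrs:
                if attribute == 'href' and value:
                    self.links.add(urljoin(self.base_url, value))


finder = LinkFinder('https://example.com/')
finder.feed('<a href="/about">About</a> <a href="https://example.com/blog">Blog</a>')
print(sorted(finder.links))
```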

Why is there a bug in the spider.py file?


^
SyntaxError: invalid syntax

eyenstein

@thenewboston

When I run main.py, the first spider prints that it will begin crawling, places the home page in the queue, and then after a few seconds displays this error:

<urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>

When it finishes I am left with my project directory and 2 txt files, but queue.txt is empty and the crawled file contains the home page.

Basically the spider won't connect to your site. Is your site blocking me, or is it my internet?

I was on a work network with a firewall, but I ran it off my phone's hotspot and got the same issue.

Please help, Bucky!!

Thanks!

blakeashby
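
WinError 10054 means the remote end reset the connection. One possible cause, and this is only an assumption, is that the server rejects requests carrying the default urllib User-Agent. A hedged sketch that sends a browser-like User-Agent and returns an empty string instead of crashing when the request fails; `build_request`, `fetch_html` and the `USER_AGENT` value are hypothetical names chosen for this example:

```python
from urllib.request import Request, urlopen

# Hypothetical value; any descriptive User-Agent string can be used here.
USER_AGENT = 'Mozilla/5.0 (compatible; tutorial-crawler)'


def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the page body as text, or '' when the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode('utf-8', errors='replace')
    except OSError as error:
        # URLError and connection resets are both subclasses of OSError.
        print('Could not fetch', url, '-', error)
        return ''
```

If the connection is still reset with this in place, the site may be blocking crawlers in some other way, and testing against a different site would show whether the spider itself works.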

Ran the program and got the following error:

Traceback (most recent call last):
File Portfolio/Python/Web Crawler/main.py", line 14, in <module>
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
File Portfolio\Python\Web Crawler\spider.py", line 23, in __init__
self.boot()
TypeError: boot() missing 1 required positional argument: 'self'

Any suggestions?

Nitrofreez

if(page_url not in spider.crawled):
TypeError: argument of type 'NoneType' is not iterable

Someone please help me with this.

VivekKumar-bxle
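
"argument of type 'NoneType' is not iterable" says that the crawled set is None at the moment of the `in` test. A likely cause, stated here as an assumption, is that the function that loads the file into a set never returns it. A minimal sketch of a file_to_set that does return its result; the temporary file and the example URLs exist only for the demonstration:

```python
import os
import tempfile


def file_to_set(file_name):
    """Read a text file and return its lines as a set."""
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))
    return results   # without this line the function returns None


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'crawled.txt')
    with open(path, 'w') as f:
        f.write('https://example.com/\n')
    crawled = file_to_set(path)

# The membership test works because crawled is a real set, not None.
print('https://example.com/about' not in crawled)
```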

I keep getting the error:
File "/Users/seantilson/PycharmProjects/Web-Crawler-Project/main.py", line 15, in <module>
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
File "/Users/seantilson/PycharmProjects/Web-Crawler-Project/spider.py", line 23, in __init__
self.boot()
TypeError: boot() takes 0 positional arguments but 1 was given

But when I defined boot it took no arguments. Also, in the __init__ I have self.boot() alone on a line.

Thanks for your time.

seantilson
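
The two boot() errors in this thread are mirror images of each other. "takes 0 positional arguments but 1 was given" appears when boot is defined without self and without @staticmethod and is then called as self.boot(). "missing 1 required positional argument: 'self'" appears when boot is declared with a self parameter but is called in a way that passes no instance. A minimal sketch of a consistent definition, assuming boot is meant to be a static method; the `booted` flag is only there for the demonstration:

```python
class Spider:
    project_name = ''
    booted = False

    def __init__(self, project_name):
        Spider.project_name = project_name
        self.boot()          # works because boot is a static method

    @staticmethod            # the decorator stops Python from passing self
    def boot():              # so the parameter list stays empty
        Spider.booted = True
        print('Booting spider for', Spider.project_name)


Spider('thenewboston')
```

The decorator and the parameter list have to agree: either @staticmethod with no self, or no decorator with self as the first parameter.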

Whenever I launch the program, I get no output at all. I even checked the queue and crawled text files to see if anything happened, but nothing changed at all.

mrmafia

Hello, this series is very useful,
but now I have a problem.
I use the 3.5.2 interpreter, but when I try to install spider I get:
"Could not find a version that satisfies the requirement spider (from versions: )
No matching distribution found for spider"

danielkovacs

I'm stuck at 7:43 when I try to run it:
File "/home/tom/a/crawl/general.py", line 39, in file_to_set
with open(file_name, 'rt') as f:
IsADirectoryError: [Errno 21] Is a directory: 'thenewboston'

ducdknet
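
The IsADirectoryError above shows file_to_set receiving the project folder name ('thenewboston') instead of the path of a file inside it. A minimal sketch of building the file paths from the project name; `data_file_paths` is a hypothetical helper and the file name crawled.txt is an assumption:

```python
import os
import tempfile


def data_file_paths(project_name):
    """Return the paths of the two data files inside the project folder."""
    queue_file = os.path.join(project_name, 'queue.txt')
    crawled_file = os.path.join(project_name, 'crawled.txt')  # name assumed
    return queue_file, crawled_file


with tempfile.TemporaryDirectory() as parent:
    project = os.path.join(parent, 'thenewboston')
    os.makedirs(project)
    queue_file, crawled_file = data_file_paths(project)
    with open(queue_file, 'w') as f:
        f.write('https://example.com/\n')

    is_folder = os.path.isdir(project)     # True: open(project) would fail
    with open(queue_file, 'rt') as f:      # the file inside the folder opens fine
        first_line = f.readline().strip()

print(is_folder, first_line)
```

Checking which variable is passed to file_to_set in the spider, the folder name or the full file path, is the first thing to verify.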

When I run the program, it reports "type object 'Spider' has no attribute 'update_files'". Does anyone have any idea about it?

brucelee

I am getting this error. Any help? (I have made sure that init has two underscores.)
line 18, in <module>
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
TypeError: this constructor takes no arguments

onePunch

Can you show us a tutorial on how to develop referral websites in PHP?

Stevenbensonofficial

for some reason, it gives me

Process finished with exit code 0

Thechomania

I got an error in:
main - line 14, in <module>
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
and
spider - line 22, in __init__
self.boot()
TypeError: boot() missing 1 required positional argument: 'self'

I'm new to this so I don't know how to proceed. The rest of the code so far has worked fine; this is the first error I have run into. Can anyone advise me, please?

gmac

I get an error in the spider.py file within the crawl_page function. On the line "if page_url not in Spider.crawled:" I get an error that says: "TypeError: argument of type 'NoneType' is not iterable". I have been stuck here for hours. Please help.

kristoftorres

How does the threading queue get the queue content?

shirajkesaribaidya
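
On the last question: a thread queue knows nothing about queue.txt by itself, so some code has to read the links and put them on it. A minimal sketch of one common arrangement with `queue.Queue` and worker threads; the function names, the thread count and the in-memory `links` set standing in for the file contents are all assumptions, not necessarily how the series does it:

```python
import threading
from queue import Queue

NUMBER_OF_THREADS = 4
job_queue = Queue()
processed = []
processed_lock = threading.Lock()


def work():
    """Worker loop: take one link at a time off the thread queue."""
    while True:
        url = job_queue.get()
        with processed_lock:
            processed.append(url)   # a real spider would crawl the page here
        job_queue.task_done()


def create_workers():
    # Daemon threads stop automatically when the main program exits.
    for _ in range(NUMBER_OF_THREADS):
        threading.Thread(target=work, daemon=True).start()


def create_jobs(links):
    """Copy every waiting link onto the thread queue, then wait for the workers."""
    for link in links:
        job_queue.put(link)
    job_queue.join()


links = {'https://example.com/', 'https://example.com/about'}  # stand-in for queue.txt
create_workers()
create_jobs(links)
print(sorted(processed))
```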