Python Web Crawler Tutorial - 12 - Gathering Links

Comments

You're alive! I was wondering... thanks for all your work! Can't wait to see your new stuff :D

brilliant

I think it's better to check that 'text/html' is *in* the Content-Type header, since I've seen some sites whose header is 'text/html; charset=UTF-8'. I'd also suggest setting the user-agent string.
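A minimal sketch of the two suggestions above: a substring check on the Content-Type header (so 'text/html; charset=UTF-8' still matches) and a custom User-Agent sent via a Request object. The helper names and the agent string are illustrative, not part of the tutorial's code.

```python
from urllib.request import Request, urlopen

def is_html(content_type):
    # 'text/html; charset=UTF-8' still counts as HTML with a substring check
    return 'text/html' in (content_type or '')

def fetch_html(page_url):
    # Some servers block urllib's default agent; send a custom one instead.
    # The User-Agent value here is just a placeholder.
    req = Request(page_url, headers={'User-Agent': 'Mozilla/5.0 (example-crawler)'})
    response = urlopen(req)
    if is_html(response.getheader('Content-Type')):
        return response.read().decode('utf-8')
    return ''
```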

tappiera

My gather_links function in spider.py never succeeds. The problem is I don't get any errors either. Any suggestions on how to resolve this?

Thanks :)


@staticmethod
def gather_links(page_url):
    html_string = ''
    try:
        response = urlopen(page_url)
        if 'text/html' in response.getheader('Content-Type'):
            html_bytes = response.read()
            html_string = html_bytes.decode("utf-8")
        finder = LinkFinder(Spider.base_url, page_url)
        finder.feed(html_string)
        print('gathered_links!')
    except:
        print("Error: Unable to connect for some reason...")
        return set()
    return finder.page_links()
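One likely reason for "nothing happens and no error appears": the bare `except:` swallows every exception and prints only the generic connection message. Printing the traceback inside the handler reveals the real failure. A hypothetical sketch of that debugging pattern (the simulated error stands in for whatever actually fails in gather_links):

```python
import traceback

result = None
try:
    # Stand-in for the body of gather_links(); any exception lands below
    raise ValueError('simulated failure while fetching the page')
except Exception:
    traceback.print_exc()   # prints the real error and line number to stderr
    result = set()          # same empty-set fallback the original code returns
```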

jimmysoonius


^
SyntaxError: invalid syntax

Help?

ryanseideman

When I call finder.page_links(), the .page_links part doesn't appear automatically in my editor's autocomplete. Does that matter?

josephdevlin

Hi. Since we're writing a gather_links() function, why do we need a separate LinkFinder class? Could we merge that code into gather_links()? Also, I don't understand the finder.feed() function: how does it automatically extract links from the HTML content it reads?

amanmaheshwari