Leetcode explained - Web Crawler Multithreaded, implemented in Python 3 (leetcode 1242)

Walkthrough and explanation of Leetcode 1242 - multi-threaded web crawler. Given a startUrl and an object that can pull all links from a page, write a function that will process all links on this site efficiently (don't revisit pages, don't visit links not on this domain).
Solved with ThreadPoolExecutor and breadth-first search (BFS).
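For reference, a minimal sketch of the approach described above: a BFS whose queue holds futures submitted to a ThreadPoolExecutor. The `HtmlParser` interface with `getUrls` comes from the Leetcode problem statement; details such as the worker count are assumptions, not the video's exact code.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

class Solution:
    def crawl(self, startUrl, htmlParser):
        hostname = urlsplit(startUrl).netloc
        visited = {startUrl}
        with ThreadPoolExecutor(max_workers=8) as pool:
            # BFS queue of futures; each future resolves to a page's links
            dq = deque([pool.submit(htmlParser.getUrls, startUrl)])
            while dq:
                # Block on the oldest future (FIFO order, as in the video)
                for url in dq.popleft().result():
                    # Stay on the same domain and never revisit a page
                    if urlsplit(url).netloc == hostname and url not in visited:
                        visited.add(url)
                        dq.append(pool.submit(htmlParser.getUrls, url))
        return list(visited)
```

Note that only the main thread touches `visited` and `dq` here; the workers just fetch links, which sidesteps most shared-state races.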
Comments

This is super helpful! I'm wondering, can HasData handle the complexities of crawling multiple domains at once?

ThomasArnold-fg

Nice video! I wonder if HasData has features for scraping dynamically generated content from web pages?

RayByrd-ig

Good solution, though I found the statement "while it's crawling the page we could be doing something else that's already returned from crawling that specific page" highly confusing.

liambchops

Thanks for the video! Quick question, since you are doing dq.popleft().result(), it's always going to wait for the first future object in the queue to resolve, so in the event where the latter futures objects complete earlier, it's actually wasting time, because it can't enqueue new tasks asap, right?

felix

`as_completed(dq)` returns an iterator over the futures in the order they complete, so the first one it yields may not be the one `popleft()` would take from `dq`.

mollypan
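To illustrate the point above, a small standalone demo (not from the video) showing that `as_completed` yields futures in completion order rather than submission order; the task names and delays here are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def task(delay, label):
    # Simulate a page fetch that takes `delay` seconds
    time.sleep(delay)
    return label

with ThreadPoolExecutor(max_workers=2) as pool:
    # "slow" is submitted first but finishes last
    futures = [pool.submit(task, 0.6, "slow"), pool.submit(task, 0.1, "fast")]
    order = [f.result() for f in as_completed(futures)]

# order == ["fast", "slow"]: completion order, not submission order
```

One caveat for a crawler: `as_completed` iterates over a fixed snapshot of futures, so when workers keep adding new tasks you would re-invoke it (or use `concurrent.futures.wait` with `FIRST_COMPLETED`) each round.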

This has an issue: two threads could both check and see that a url is not in `visited`, and both then submit a new task to the ThreadPoolExecutor, so duplicate tasks end up parsing the same url. This can be avoided by making the check-and-add atomic. (Note that in Python `set.add` returns `None`, unlike Java's `Set.add`, so it can't signal whether the url was new; guard the membership check and the add with a lock instead.)

cricket

This tutorial is very useful. Has anyone tested whether HasData is good for implementing a multithreaded crawler?

StevenLewis-zn

Nice solution, thank you, but please stop mumbling!

MrPoncho

Thank you for the solution! A quick question about `dq.popleft().result()` on line 23: I don't think we can guarantee that the first item in the queue has finished, can we?

liuzijian