Python 3 Programming Tutorial - Parsing Websites with re and urllib

In this video, we use two of Python 3's standard library modules, re and urllib, to parse paragraph data from a website. As we saw, initially, when you use Python 3 and urllib to parse a website, you get all of the HTML data, like using "view source" on a web page. This HTML data is great if you are viewing via a browser, but is incredibly messy if you are viewing the raw source. For this reason, we need to build something that can sift through the mess and just pull the article data that we are interested in.
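A minimal sketch of the approach described above, assuming the article text sits in plain `<p>` tags. The function names `fetch_html` and `extract_paragraphs` and the sample HTML are illustrative and not taken from the video.

```python
import re
import urllib.request


def fetch_html(url):
    """Download a page and return its source as text (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        # Decode leniently; real pages may declare a different charset.
        return resp.read().decode("utf-8", errors="replace")


def extract_paragraphs(html):
    """Return the text found between each <p> and </p> pair."""
    # .*? is lazy, so each match stops at the first closing tag it reaches.
    return re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)


# Offline sample standing in for a downloaded page.
sample = "<html><body><h1>Title</h1><p>First.</p><p>Second.</p></body></html>"
paragraphs = extract_paragraphs(sample)
for paragraph in paragraphs:
    print(paragraph)
```

To run it against a live page, pass the result of `fetch_html(url)` to `extract_paragraphs` instead of `sample`.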

Bitcoin donations: 1GV7srgR4NJx4vrk7avCmmVQQrqmv87ty6
Comments

Thanks for your little message! :
"
Programming is a superpower.
Programming allows you to achieve and accomplish things that no ordinary human-being ever could, at amazing speed.
Programming enables you to increase your production and performance nearly infinitely and it can grow exponentially.
Programming allows you to extend your logic and your will out of your body and into the world in a way that only programming can achieve through the use of machines. It can become an extension of your self.
Programs work while you sleep, they work while you take a vacation. They continue working while you work on something else.
Programming has allowed me to start up and scale multiple businesses, almost completely without any direct help.
Programming has allowed me to work for myself. I choose when I work, what I work on, and what I learn about.
Programming has given me freedom. Others gave it to me.
I want to give it to you.
"

MrNicfeller

Amazing. I was doing this in such a complicated way with list comprehensions before. thanks so much :)

exoice

Been watching the entire series. No clue what's going on lol. Hope I can make my own tutorials one day.

jakeambrose

Amazing. Love you. You make parsing so easy to understand.

TXfoxie

Thank you for taking the time to make these videos... You are a great teacher

whistler

I think in this context the ? enforces lazy matching rather than meaning 0 or 1 repetitions. For example, if the string has more than one p tag, the () part will not simply capture everything from the first opening p tag to the last closing one, i.e. no p tags will be matched in the middle of the captured text.

Grassmpl
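A small sketch of the greedy versus lazy difference described in the comment above. The sample HTML string is made up for illustration.

```python
import re

html = "<p>one</p><div>skip</div><p>two</p>"

# Greedy: .* runs to the LAST </p>, swallowing the tags in between.
greedy = re.findall(r"<p>(.*)</p>", html)

# Lazy: .*? stops at the FIRST </p> after each opening tag.
lazy = re.findall(r"<p>(.*?)</p>", html)

print(greedy)  # ['one</p><div>skip</div><p>two']
print(lazy)    # ['one', 'two']
```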

awesome video, many thanks. Could you possibly do a series on building APIs ?

matthewmatthee

Hi,

Great video. Wonderful explanation.

I have a small question.

I need to copy the URL of the website currently open in a browser using Python code, instead of manually copying and pasting the URL.

Then assign it to the URL variable.

And then use the code you gave in this video.

Please help me with the code to copy the URL using Python.

Regards,
Jaideep.

jaideepbommidi

How would you take attributes into account? For example, a <p> tag could contain a style attribute, e.g. <p style="xxx:yyy;">. The point is you can have a pretty broad set of potential tag/attribute permutations.

EricZimerman
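One possible way to handle the attribute case raised above is to let the pattern skip anything between `<p` and the closing `>`. This is only a sketch: the pattern name `P_TAG` and the sample HTML are illustrative, and it assumes attribute values never contain a `>` character. For anything more varied, a real HTML parser such as the standard library's `html.parser` is likely the more robust choice.

```python
import re

# <p\b   : the tag name "p" followed by a word boundary, so <pre> is not matched
# [^>]*  : any attributes, up to the closing > of the opening tag
# (.*?)  : the paragraph body, matched lazily
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", flags=re.DOTALL | re.IGNORECASE)

html = '<p>plain</p><p style="color:red;">styled</p><pre>not a paragraph</pre>'
matches = P_TAG.findall(html)
print(matches)  # ['plain', 'styled']
```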

@sentdex Hi there!! Thanx for your great tutorial! I'm a newbie on python and programming in general and I have a problem right now that's kinda like what you show here. I've extracted a table from a website (using the api) and the results come in text (csv). I get around 20 different statistics (it's sports-related) and I only need 3 of them. So I would like to eliminate all the data that I don't need and just get those 3. Would you recommend the same Library modules (re and urllib) or another module to do that? As I said, it looks to be the same kinda thing you're showing here, the difference being that I need to basically remove stats instead of text when I scrape it and just get the one I need. Thanx again for your great tutorials!!

frenchyfred
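For CSV text like the one described above, the standard library's `csv` module is probably a better fit than `re`. A minimal sketch, assuming the response has a header row; the column names and values here are invented for illustration.

```python
import csv
import io

# Hypothetical API response; the column names are made up for illustration.
csv_text = (
    "player,team,goals,assists,minutes\n"
    "Ann,Red,3,1,270\n"
    "Bo,Blue,0,4,180\n"
)

# The three statistics to keep; everything else is dropped.
wanted = ["player", "goals", "assists"]

reader = csv.DictReader(io.StringIO(csv_text))
rows = [{name: row[name] for name in wanted} for row in reader]

for row in rows:
    print(row)
```

`csv.DictReader` yields every value as a string, so convert with `int()` or `float()` before doing arithmetic on the statistics.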

recommendable contribution, appreciate your effort to teach others

jagmohanyadav

Help me please: when I run the program it gives me this error: AttributeError: module 'urllib' has no attribute 'encode'

problem
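One likely cause of the error above, assuming the code called `urllib.encode`: no such function exists. In Python 3 the form-encoding function is `urllib.parse.urlencode`, and the bytes conversion is the string's own `.encode()` method. A short sketch with hypothetical form fields:

```python
import urllib.parse

values = {"s": "basics", "submit": "search"}  # hypothetical form fields

# Build the query string, then convert it to bytes for use as POST data.
query = urllib.parse.urlencode(values)
data = query.encode("utf-8")

print(query)  # s=basics&submit=search
print(data)   # b's=basics&submit=search'
```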

Could you explain how to parse HTML data that has two columns and sits behind a login authentication system?

Myview_Aravind

These are awesome tutorials! Thanks so much for sharing them. It would be great to have one on parsing data out of tables. Also, how do we get data from a table that is generated dynamically and therefore does not have HTML code produced? Thanks!

vtcruzr

Thanks for your video.
I have one question: instead of specifying the sample URL in the code, would it be possible to supply it via input?
What I mean is, I work with web-based tools that contain the same data fields with different values, like support tickets, let's say.
I want a script where I can paste my ticket URL and have it parsed for specific fields like ticket number, customer name, etc., and then populate an Excel table with the parsed data.
I have a lot of tickets to deal with sometimes, and opening all the URLs in separate tabs is just not an option, so I'm trying to consolidate everything in an Excel file (for now) to quickly see which ticket is in what state, when they are scheduled, etc.

alexlasareishvili
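Taking the URL from input, as asked above, could look roughly like this. The names `fetch_html` and `paragraphs_from_prompt` are hypothetical, and the `<p>` pattern is only a stand-in for whatever fields a real ticket page exposes.

```python
import re
import urllib.request


def fetch_html(url):
    """Download a page and return its source as text (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def paragraphs_from_prompt(ask=input, fetch=fetch_html):
    """Ask for a URL, download it, and return the <p> contents.

    `ask` and `fetch` are parameters so the function can be tried
    without a keyboard or a network connection.
    """
    url = ask("URL to parse: ").strip()
    html = fetch(url)
    return re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)


# Interactive use (needs network access):
# print(paragraphs_from_prompt())
```

From there, the parsed values could be written out with the standard `csv` module, which produces a file that Excel can open.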

Instead of importing urllib.request and urllib.parse individually, is it possible to just import urllib as a whole library?
In the same respect, since in the last vid you said you mostly only use re.findall(), can we just import re.findall instead of the whole re library module?

jamesjemima
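A short sketch of both import questions above. A bare `import urllib` imports the package but does not guarantee that submodules such as `urllib.request` are loaded, so importing the submodules explicitly is the safer habit. `from re import findall` does work, although Python still loads the whole `re` module behind the scenes.

```python
import urllib.parse    # submodules are imported explicitly
import urllib.request  # (a bare "import urllib" may leave them unloaded)
from re import findall  # binds only the name findall in this namespace

query = urllib.parse.urlencode({"q": "python"})
print(query)  # q=python

numbers = findall(r"\d+", "a1b22c333")
print(numbers)  # ['1', '22', '333']
```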

Thank you so much for your help! I have a question for you if you don't mind. Using your code I was finally able to get python to pull the info I needed from a html page. Now that I have it displaying a list using the 'Findall' function, I want to be able to use it to make decisions about what else to copy.

I currently have my 'FindAll' function set to a HTML tag that will always return a number from 0 to 200. My goal is to make it so any number it returns above 8 will also return the items name from a different, yet corresponding HTML tag.

I know how to set up the if/and/or clauses, but where I am getting stuck is how to tell Python to choose the right HTML tag that corresponds with the correct number (instead of just giving me the first on the list). This is because each item has its own version of the same HTML tag.

What is the best way to correlate HTML tags together? For example: each item has both <div class="name"> and <div class="quantity">. How can I make it so any of the "quantity" tags returned above 8 will also give me the "<div class="name">" of that specific item? I only want the names of the tags with a value above 8; the rest can be skipped.

Preferably, I'd like it to list out all items above 8 with their corresponding names.

I hope that makes sense. Thank you! Sorry for bugging you, If you cover this in any vids please let me know and I will gladly watch. Thanks for the guides, you are a great help man!

HelloImNoob
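One way to keep each name tied to its quantity, as asked above, is to capture both in a single pattern so that `re.findall` returns them as pairs. This sketch assumes the name `<div>` comes directly before the quantity `<div>` within each item; the sample HTML and the name `PAIR` are made up for illustration.

```python
import re

html = """
<div class="item"><div class="name">Bolt</div><div class="quantity">12</div></div>
<div class="item"><div class="name">Nut</div><div class="quantity">3</div></div>
<div class="item"><div class="name">Washer</div><div class="quantity">9</div></div>
"""

# Two capture groups in one pattern: findall returns (name, quantity) tuples.
PAIR = re.compile(
    r'<div class="name">(.*?)</div>\s*<div class="quantity">(\d+)</div>',
    flags=re.DOTALL,
)

# Keep only the items whose quantity is above 8.
above_8 = [(name, int(qty)) for name, qty in PAIR.findall(html) if int(qty) > 8]
print(above_8)  # [('Bolt', 12), ('Washer', 9)]
```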

Thank you for the Python video. Can you please add a video for SOAP/REST API calls and xml parsing?

vishvips

Hi,
Thanks for your explanation
I have a question:

When I wanted to print the respData like you did, I didn't get the results even though there was no error!

allaalzoya

With this website's API, I get a sessionid returned and I need it to post a new request. What's the best way to get and use that session ID in a new request?

playnationstation