str vs bytes in Python

strings vs. bytes, what's the diff?

CHAPTERS
---------------------------------------------------
0:00 Intro
0:20 str and bytes syntax
0:50 str and bytes functions
1:29 they don't mix
2:17 amazing sponsor
2:40 smiley
3:33 the meaning of bytes
4:53 encodings
6:07 dangers of not specifying encoding
7:21 warn default encoding
7:35 utf-8 mode
8:06 Outro and thanks
Comments

1:34 Fun fact: separating bytes from strings was the most important major breaking change between Python 2 and Python 3. Trying to keep strings as byte-encoded led to all kinds of unfortunate trouble in Python 2, which could not be fixed without sacrificing backward compatibility.

And they thought, while they were breaking things anyway, they might as well fix a few other things in a cleaner, non-backward-compatible way while they were at it.

lawrencedoliveiro
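
As a minimal sketch of the separation described in this comment (the helper name `try_concat` is made up for illustration), Python 3 refuses to combine `str` and `bytes` implicitly and requires an explicit `encode` or `decode` to cross between them:

```python
text = "h\u00e9llo"              # str: a sequence of Unicode code points ("héllo")
data = text.encode("utf-8")      # bytes: a sequence of integers in range(256)


def try_concat(a, b):
    """Return a + b, or the exception's type name if the types don't mix."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


mixed = try_concat(text, data)       # str + bytes is rejected
round_trip = data.decode("utf-8")    # an explicit decode gets the str back

print(len(text), len(data))  # 5 6  ("é" takes two bytes in UTF-8)
print(mixed)                 # TypeError
print(round_trip == text)    # True
```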

When I started learning Rust, this was something that actually came up quite a bit, since you can't iterate over a string object (you don't necessarily know its encoding at compile time). It was the first time I realized that the difference between ASCII, UTF, and others is actually really important!

BenjaminWheeler
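
A rough Python analogue of the point about iteration (this is Python behaviour, not a translation of any Rust code): iterating a `str` yields characters, while iterating its encoded `bytes` yields integers, and the two sequences need not have the same length.

```python
word = "na\u00efve"                       # "naïve"

chars = list(word)                        # one-character strings
byte_values = list(word.encode("utf-8"))  # integers, one per byte

print(chars)        # ['n', 'a', 'ï', 'v', 'e']
print(byte_values)  # [110, 97, 195, 175, 118, 101]
```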

Windows encodings are a real nightmare.
There are the OEM/MS-DOS codepages used by the console, which make it almost impossible to consistently write non-English characters from a .bat script.
Then there are the "ANSI" codepages, which are used by the Win32 functions accepting strings as char pointers (e.g. MessageBoxA). It is usually Windows-1252 in Western countries, which is a slightly incompatible variant of ISO 8859-1 (also known as "Latin1").
Then there are the "Unicode" strings/MBCS/wchar_t pointers, which are actually UTF-16 (even MS documentation wrongly states that "Unicode is a 16-bit character encoding"), meaning that emojis will probably work in some places and not in others (try calling MessageBoxW with an emoji...). Except not really, because in some cases it is UCS-2 instead of UTF-16 (another slightly incompatible variant). BTW, at least until recently you needed to add the BOM character to make programs like Notepad recognize a UTF-16 file.

Note that NONE of those encodings are UTF-8.

japedr
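
Two of the incompatibilities mentioned in this comment can be seen from Python's own codecs. This is only a sketch using Python's `cp1252`, `latin-1`, and `utf-16-le` codecs; it does not call any Win32 function.

```python
import codecs

# Windows-1252 vs. ISO 8859-1: the same byte decodes to different characters.
raw = b"\x80"
as_cp1252 = raw.decode("cp1252")    # the euro sign, U+20AC
as_latin1 = raw.decode("latin-1")   # a C1 control character, U+0080

# UTF-16 is not a fixed 16-bit encoding: characters outside the Basic
# Multilingual Plane, such as most emoji, need two 16-bit units.
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2

print(hex(ord(as_cp1252)), hex(ord(as_latin1)))  # 0x20ac 0x80
print(utf16_units)                               # 2
print(codecs.BOM_UTF16_LE)                       # b'\xff\xfe'
```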

Wow, really informative. I wrote most of a project on Windows, started using it in a Linux Google Cloud VM, but I realized some of my data in a CSV file was invalid.

In the interest of getting a proof of concept out quickly, I just wrote a script in the VM that opens the file as a pandas DataFrame, removes the invalid rows, and stores it as a CSV file again. Except when I went to open this new file before giving it to my ML algorithm, it kept telling me the file was corrupted. I couldn't understand it, I was at a total loss, and I ended up just writing another hacky solution in which, if I encountered an error loading one of the rows during the training process, I would just default to loading the first row instead.

This makes total sense that this could have been the problem. Thanks James!

timogden
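
Whether a default-encoding mismatch was really the cause here cannot be told from the comment. As a sketch of the failure mode, with the standard `csv` module standing in for pandas and made-up sample rows, a file written as UTF-8 reads back garbled when it is opened as Windows-1252, and reads back intact when the encoding is stated on both sides:

```python
import csv
import os
import tempfile

rows = [["name", "city"], ["Zo\u00eb", "M\u00fcnchen"]]  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")

    # Write with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

    # Reading with the same encoding round-trips the data.
    with open(path, encoding="utf-8", newline="") as f:
        good = list(csv.reader(f))

    # Reading with a different encoding silently produces mojibake.
    with open(path, encoding="cp1252", newline="") as f:
        garbled = list(csv.reader(f))

print(good == rows)   # True
print(garbled[1])     # ['ZoÃ«', 'MÃ¼nchen']
```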

When you're completely unsure what the encoding of any file that you're processing is, the chardet package is really helpful.

WalterVos

It's crazy how one day I am wondering about something and a week later you have a great video on it. Thanks for another great one!

kyleaustin

Looking forward to "from __future__ import default_encoding".

mrtnsnp

It's crazy how much you come across decoding/encoding issues in the wild. I sometimes work with large text datasets with mixed encodings, sometimes even in the same line! The worst is that if you try and decode with the wrong encoding it can raise a runtime error, so I ended up writing a short program with a bunch of try/excepts for the different possibilities (utf-8 first of course). I did the same thing when I worked in C and Tcl. Gotta be a better way...

cleverclover
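
The try/except approach described in this comment might look roughly like the following. The function name and the candidate list are assumptions, not the commenter's code. In Python the error raised by a failed decode is `UnicodeDecodeError`, and `latin-1` works as a last resort because it maps every byte value to a character.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; return (text, encoding_that_worked)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 never fails, but the result may not be what the author meant.
    return data.decode("latin-1"), "latin-1"


print(decode_with_fallback("caf\u00e9".encode("utf-8")))  # ('café', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))                   # ('café', 'cp1252')
```

A successful decode only proves the bytes are valid in that encoding, not that it is the right one, so the order of the candidates matters.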

As I was searching the interwebs as to what a type 'byte' was and how to convert it to a string, my YT refreshed and there was this video at the top of my subscription, 4 minutes old. This timing was apropos.

SeanCrites

Thank you! I've always struggled to understand the difference between str and bytes, and what encoding is. And now I finally understand! Thank you 😊

che_kavo

Decoded the mystery in a few minutes, thank you! ☺

finnthirud

Yet another informative and well put video. Thanks!

nitishvirtual

5:25 Not just the most popular, but some languages, including Python, have embraced Unicode to the point that identifiers can contain any Unicode characters that are classed as “letters”. So for example while “in” is a reserved word, “ın” is not, and can be used as an identifier.

lawrencedoliveiro
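
The identifier claim can be checked with the standard library. In this sketch the dotless i is written as the escape `\u0131` so the difference from an ordinary "i" stays visible; the variable names are made up for illustration.

```python
import keyword

dotless_in = "\u0131n"   # "ın": dotless i (U+0131) followed by "n"

print(keyword.iskeyword("in"))        # True
print(keyword.iskeyword(dotless_in))  # False
print(dotless_in.isidentifier())      # True

# Assign to a variable with that name by executing a tiny piece of source.
namespace = {}
exec(f"{dotless_in} = 42", namespace)
print(namespace[dotless_in])          # 42
```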

Hey James, I’d love to see a video showcasing how to use the Textual package. It’s really neat, and fits your style.

AngryArmadillo

That was the best message from a sponsor I've ever seen.

tiagomacedo

New subscriber here. Just wanted to say that I love your videos. Very informative and fun to watch! Keep up the good work.

eternlyytc

I absolutely love your videos. Regardless of my familiarity with a topic, every video seems to have some piece of information that I would not have discovered on my own. I never knew that files were encoded with the system encoding unless specified. It has never been an issue, but I know that one day it will be and without this knowledge, I would have really struggled to identify the issue. Future me really appreciates your hard work.

lethalantidote
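
To see which encoding a given machine would pick when none is specified, something like the following can be run. It is a sketch: the printed values depend on the platform, the locale, and whether UTF-8 mode is enabled, so no output is shown.

```python
import locale
import os
import sys
import tempfile

# The locale's preferred encoding, which open() has traditionally
# fallen back to when no encoding argument is given.
locale_encoding = locale.getpreferredencoding(False)

# True when Python runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    with open(path, "w") as f:                    # encoding left to the default
        implicit = f.encoding
    with open(path, "w", encoding="utf-8") as f:  # encoding stated explicitly
        explicit = f.encoding

print(locale_encoding, utf8_mode, implicit, explicit)
```

Running a script with `python -X warn_default_encoding` (Python 3.10 and later) emits an `EncodingWarning` at each call that relies on the default, which helps locate such calls in existing code.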

Interesting topic. Waiting for your next video :)

denyspisotskiy

Very good explanation, cleared my head :)


Excellent video, thank you.
I've had a couple of issues where I've had to use the IO and locale libraries to "fix" encoding shenanigans, but I think if I revisited those lines I'd now have an actual understanding of what was happening, how the changes worked and, most importantly, how /to do it better/.

jasonhenson
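
The comment does not say which `io` or `locale` calls were involved, so the following is only a guess at the general shape of such a fix: wrapping a binary stream in `io.TextIOWrapper` with the encoding stated explicitly, instead of relying on whatever the locale provides. The in-memory `BytesIO` stands in for a real file or pipe.

```python
import io

# A binary stream whose contents are known to be UTF-8.
raw = io.BytesIO("na\u00efve\n".encode("utf-8"))

# Decode it as text with an explicit encoding, independent of the locale.
reader = io.TextIOWrapper(raw, encoding="utf-8")
line = reader.readline()

print(repr(line))  # 'naïve\n'
```

For the standard streams, `sys.stdout.reconfigure(encoding="utf-8")` (Python 3.7 and later) serves a similar purpose without re-wrapping.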