Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more

Ever been bit by a Unicode bug? Maybe you weren't treating UTF-8 encoded data correctly, or tried to read it as ASCII? Maybe you mixed up UTF-8 vs UTF-16? Unicode and character encoding might seem like a tricky topic, but let's break them down and learn about them piece by piece, from ASCII to code points to graphemes to combining character modifiers and more.

00:00 Intro
00:12 All Data Is Stored As Bits
00:41 How To Store Characters Using ASCII
01:38 What About All The Other Languages?
02:14 Graphemes Map To One Or More Code Points
03:18 Code Points Map To One Or More Bytes
03:44 UTF-32 Is Wasteful
04:43 UTF-8 Is Better
05:40 Unicode is Western-Language-Centric
06:41 Mid-Video Recap
07:52 Live Python Demo
09:47 Unicode Rules Of Thumb
10:14 Looking At A Code Point
Comments

This alone is the best Unicode video explanation on all of YouTube, 100x better than the, maybe, second place from Computerphile.

edwincloudusa

Another fun fact about the way letters are laid out in ASCII: a capital letter's lowercase counterpart is exactly 32 values ahead. Because 32 is a power of 2, the two cases differ in a single bit, so you only need to flip the 6th bit (from the right) of a byte to change the case of a letter.
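A minimal Python 3 sketch of that bit trick, for anyone who wants to try it. The helper name `flip_case` is made up for illustration, and it only makes sense for the 52 ASCII letters:

```python
def flip_case(letter: str) -> str:
    """Toggle the case of one ASCII letter by flipping the bit worth 32."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    # 0b0010_0000 == 32: the 6th bit from the right
    return chr(ord(letter) ^ 0b0010_0000)


print(ord("a") - ord("A"))        # 32
print(format(ord("A"), "08b"))    # 01000001
print(format(ord("a"), "08b"))    # 01100001
print(flip_case("A"), flip_case("z"))  # a Z
```

Outside the ASCII letters the XOR gives unrelated characters, which is why the sketch rejects anything else.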

EmilMacko

FINALLY someone is able to explain this clearly. Most other videos complicate this so much. Thank you!

winstonmisha

7:50 Worth pointing out that Python 2 is way past its end of life now and in Python 3, all strings are Unicode aware and the u modifier does nothing (it's only there for backwards compatibility). Using Python 2 is a good way to showcase the difference between Unicode aware and unaware functions but seems like this could confuse some beginners trying to replicate what you're doing who will likely be using Python 3 and might not be aware of the difference.
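For anyone following along in Python 3, here is a small sketch of what that looks like. The variable names are just for illustration; the escape `\u00e9` is the precomposed "é":

```python
# Python 3: the u prefix is accepted but changes nothing.
text = "h\u00e9llo"
legacy = u"h\u00e9llo"

same_type = type(text) is type(legacy) is str
print(same_type)        # True
print(text == legacy)   # True

# The bytes/text split is explicit: encode() gives bytes, decode() gives str.
data = text.encode("utf-8")
print(len(text))   # 5 code points
print(len(data))   # 6 bytes, because the accented letter takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # True
```

So the Python 2 "Unicode-unaware string" from the demo corresponds roughly to `bytes` in Python 3, and the `u"..."` string corresponds to plain `str`.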

vader

This is the type of content that should be suggested by YouTube to everyone. Great explanation, thank you

sharbelokzan

One small correction: a grapheme is part of a particular writing system. Writing systems are always language-related. Unicode does not reflect any particular writing system. Unicode, and this is probably the smartest choice that could have been made, maps numbers (code points) to character descriptions or names. This way Unicode is detached from any font and therefore from any particular shape. As a result, a code point relates not to what we want to see but to what we mean, which makes it more abstract.

Good examples are <g> U+0067 and <ɡ> U+0261. In most writing systems based on the Latin script, they are allographs, variants of the <g> grapheme, so if Unicode contained graphemes, there would be only one character. But this is not the case. A writing system may prefer a particular glyph (a shape of a letter), and that glyph could be the main variant (allograph) of the grapheme in that system. In most Latin-based writing systems <g> is the main variant, but in the International Phonetic Alphabet it is <ɡ>, because of its correlation with other similar characters like <ŋ j>.
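The "code point maps to a name" idea can be checked with Python 3's standard `unicodedata` module. This is only a sketch; the dictionary name `names` is invented for the example:

```python
import unicodedata

# Look up the official character name for each of the two code points.
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in ("g", "\u0261")}

for code_point, name in names.items():
    print(code_point, name)
# U+0067 LATIN SMALL LETTER G
# U+0261 LATIN SMALL LETTER SCRIPT G
```

Two separate code points with two separate names, even though many fonts draw them almost identically.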

piotrrybka

I've listened to quite a few Unicode tutorials. This one blows the others out of the water. Clear. Concise. Good tempo. Thx

darianleduc

Great explanation, thank you! For anyone who may still not understand, UTF-8 CAN get up to 32 bits and be as big as UTF-32, but only if it has to. Otherwise, it just uses the minimum amount of space (8 bits) and expands as required, depending on the code point.
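A quick way to see this in Python 3. The four sample characters are arbitrary picks, one for each UTF-8 length, and `utf-32-le` is used so that no byte order mark gets added to the count:

```python
# One sample per UTF-8 length: A, e-acute, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
utf32_sizes = [len(ch.encode("utf-32-le")) for ch in samples]

print(utf8_sizes)   # [1, 2, 3, 4]
print(utf32_sizes)  # [4, 4, 4, 4]
```

UTF-8 only reaches the 4-byte size of UTF-32 for code points above U+FFFF, such as the emoji.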

gti

I spent a long time searching to understand Unicode (also UTF-8 and the others). I did, but some things were still ambiguous to me. This guy literally taught me the whole topic in only 10 minutes

MahmoudKhudairi

You earned a subscriber today.
I've worked in the past with non-English characters, scratching my head.
This alone sums up all the concepts in detail. Very good use of examples and video production 💌

PabitraPadhy

Wow being the messy programmer that I am I always got encoded and decoded mixed up... Much clearer now, thanks

aliceg.

Man this channel is golden. The most difficult topics here are explained so easily and in such a unique way

tanzimchowdhury

Traditionally major Chinese coding methods would just consider a Chinese grapheme as 2 characters though, because they would use 2-byte coding while the 128 ASCII code points only use 1-byte coding. And also, traditionally a Chinese grapheme would always take up double the width of an ASCII grapheme in a fixed-width console font. This kinda makes everything neatly aligned (the amount of storage bytes needed is the same as the amount of character printing space needed), but it basically fell apart when the Internet and UTF-8 became more popular.

And that's also basically the very reason there are double-width Latin letters in Unicode. Traditionally they're used to improve the readability of English words in vertical text arrangement for Chinese and Japanese, and they're called full-width letters, "full width" as in the full width of a Chinese character.
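Some of this can be poked at from Python 3. The sketch below assumes GBK as one example of a traditional 2-byte Chinese encoding (the comment above does not name a specific one) and uses the character 中 (U+4E2D) as an arbitrary sample:

```python
import unicodedata

han = "\u4e2d"          # the Chinese character for "middle"
fullwidth_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Storage: 2 bytes in GBK, 3 bytes in UTF-8, so bytes no longer match columns.
gbk_len = len(han.encode("gbk"))
utf8_len = len(han.encode("utf-8"))
print(gbk_len, utf8_len)  # 2 3

# Display width class: Na = narrow, W = wide, F = fullwidth.
widths = [unicodedata.east_asian_width(ch) for ch in ("A", han, fullwidth_a)]
print(widths)  # ['Na', 'W', 'F']

# Compatibility normalization folds the full-width letter back to plain ASCII.
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(folded)  # A
```

The width class is only a property in the Unicode database; how many columns a terminal actually uses is up to the terminal and font.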

FlameRat_YehLon

This video is like the sum of the most important things about Unicode and ASCII, very well done.

florianvanbondoc

I keep coming back to this video to refresh the concept, thank you :)

a_maxed_out_handle_of__chars

You have an amazing talent for teaching m8! Keep it up, this video like your others is so helpful. I love that you cover basic concepts of coding, not "how do I implement a server in node.js" but what really is the essence of becoming a good coder.

Mane

Thanks, a ton!!
Here is what I learnt from the video (I have added a few things that I knew earlier):

- UTF-8 (Unicode Transformation Format, 8-bit) is an encoding scheme for representing Unicode characters.

- In UTF-8, ASCII characters are represented using a single byte, which means that any valid ASCII text is also valid UTF-8 text.

- Therefore, UTF-8 is backward compatible with ASCII.

- In UTF-8, characters that can be represented using a single byte (i.e., ASCII characters) are represented as themselves.

- Characters that require more than one byte are encoded using a combination of multiple bytes.

- A code point refers to a numerical value assigned to each character or symbol in the Unicode standard.

- Code points are represented using hexadecimal notation and are typically prefixed with "U+" to distinguish them from other numerical values.

- For example, the character "é" can be stored as the single precomposed code point U+00E9 (Latin Small Letter E with Acute), which UTF-8 encodes as the bytes 0xC3 0xA9. It can also be stored as two code points, the base character "e" (U+0065) followed by the combining acute accent (U+0301), which UTF-8 encodes as the bytes 0x65 0xCC 0x81.

- A grapheme refers to a visual unit of a written language. It represents a single user-perceived character or a combination of characters that are displayed together.

- len() function returns the number of bytes, not the number of characters, in a Unicode-unaware string (a Python 2 byte string).

- len() function returns the number of code points in a Unicode-aware string, which can still differ from the number of graphemes.
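The points about "é" and len() can be checked with a short Python 3 sketch. The video's demo uses Python 2, so here `str` plays the role of the Unicode-aware string and `bytes` the Unicode-unaware one; the variable names are only for illustration:

```python
import unicodedata

composed = "\u00e9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: e + COMBINING ACUTE ACCENT

# Same grapheme on screen, different code point counts.
print(len(composed), len(decomposed))  # 1 2

# Different UTF-8 bytes as well.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
print(composed_bytes)    # b'\xc3\xa9'
print(decomposed_bytes)  # b'e\xcc\x81'

# len() on bytes counts bytes, not characters.
print(len(composed_bytes), len(decomposed_bytes))  # 2 3

# NFC normalization turns the two-code-point form into the precomposed one.
renormalized = unicodedata.normalize("NFC", decomposed)
print(renormalized == composed)  # True
```

Comparing the two forms directly with `==` gives False, which is a common source of Unicode bugs; normalizing both sides first avoids it.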

Mehraj_IITKGP

Where have you been all my life?! 🤩 What a great explanation! Thank you so very much! 🥰

learninggeekspeak

Trying to be better in IT and dreaming of being an awesome programmer! I have always just skipped learning Unicode and never cared, due to laziness. But now I realize its importance! Thank you so much for this video.

jslee

this is such an appealing video due to the detail, music, subtle animations, even the colour theme. thanks for the video!

lbibrzh