Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more


Ever been bit by a Unicode bug? Maybe you weren't treating UTF-8 encoded data correctly, or tried to read it as ASCII? Maybe you mixed up UTF-8 vs UTF-16? Unicode and character encoding might seem like a tricky topic, but let's break them down and learn about them piece by piece, from ASCII to code points to graphemes to combining character modifiers and more.

00:00 Intro
00:12 All Data Is Stored As Bits
00:41 How To Store Characters Using ASCII
01:38 What About All The Other Languages?
02:14 Graphemes Map To One Or More Code Points
03:18 Code Points Map To One Or More Bytes
03:44 UTF-32 Is Wasteful
04:43 UTF-8 Is Better
05:40 Unicode is Western-Language-Centric
06:41 Mid-Video Recap
07:52 Live Python Demo
09:47 Unicode Rules Of Thumb
10:14 Looking At A Code Point

Studying With Alex

Related recommendations

Comments

this alone is the best unicode video explanation in the entire youtube, 100x better than the, maybe, second place from Computerphile.

edwincloudusa

Another fun fact about the way letters are laid out in ASCII: A capital letter's corresponding lowercase counterpart is exactly 32 values ahead. This is because 32 is a power of 2, and makes it so that you only need to flip the 6th bit (from the right) of a byte to change the case of a letter.

EmilMacko
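
The bit trick from the comment above can be sketched in a few lines of Python 3. The helper name `flip_case` is made up for this illustration, and the sketch only handles ASCII letters; it is not how Python's own `str.upper()` or `str.lower()` work.

```python
def flip_case(ch: str) -> str:
    """Toggle the case of one ASCII letter by flipping the bit worth 32."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # digits, punctuation, and non-ASCII are left untouched
    # 0b0010_0000 == 32: the 6th bit from the right
    return chr(ord(ch) ^ 0b0010_0000)


print(format(ord("A"), "08b"))  # 01000001
print(format(ord("a"), "08b"))  # 01100001
print(flip_case("A"), flip_case("z"), flip_case("5"))  # a Z 5
```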

FINALLY someone is able to explain this clearly. Most other videos complicate this so much. Thank you!

winstonmisha

7:50 Worth pointing out that Python 2 is way past its end of life now and in Python 3, all strings are Unicode aware and the u modifier does nothing (it's only there for backwards compatibility). Using Python 2 is a good way to showcase the difference between Unicode aware and unaware functions but seems like this could confuse some beginners trying to replicate what you're doing who will likely be using Python 3 and might not be aware of the difference.

vader
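
For anyone following along in Python 3, here is a rough equivalent of the aware/unaware distinction, assuming `str` plays the role of the Unicode-aware string and `bytes` the unaware one. The sample text is an arbitrary choice for this sketch, not taken from the video.

```python
text = "h\u00e9llo"    # "héllo" as a Python 3 str (a sequence of code points)
legacy = u"h\u00e9llo"  # the u prefix is accepted but changes nothing

print(type(legacy) is str, text == legacy)  # True True

data = text.encode("utf-8")  # encoding turns a str into bytes
print(len(text))  # 5 code points
print(len(data))  # 6 bytes, because é takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # decoding turns bytes back into a str
```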

This is the type of content that should be suggested by YouTube to everyone. Great explanation, thank you

sharbelokzan

One small correction: a grapheme is part of a particular writing system. Writing systems are always language-related. Unicode does not reflect any particular writing system. Unicode, and this is probably the smartest choice that could ever be made, maps numbers (code points) to character descriptions or names. This way Unicode is detached from any font and therefore from any particular shapes. This results in a code point relating not to what we want to see, but to what we want, making it more abstract.

Good examples are <g> U+0067 and <ɡ> U+0261. In most writing systems based on the Latin script, they are allographs, variants of the <g> grapheme, so if Unicode were to contain graphemes, there should be only one character. But this is not the case. A writing system may prefer a particular glyph (a shape of a letter), and that glyph could be the main variant (allograph) of the grapheme in that system. In most Latin-based writing systems <g> is the main variant, but in the International Phonetic Alphabet it is <ɡ>, because of its correlation with other similar characters like <ŋ j>.

piotrrybka
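
The two characters from the comment above can be inspected with Python 3's standard `unicodedata` module. This is only a small sketch of the "code points map to names, not shapes" idea; the names printed come from the Unicode database bundled with Python.

```python
import unicodedata

# Two code points that many fonts draw almost identically
for ch in ("g", "\u0261"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# They are distinct characters as far as Unicode is concerned
print("g" == "\u0261")  # False
```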

I've listened to quite a few Unicode tutorials. This one blows the others out of the water. Clear. Concise. Good tempo. Thx

darianleduc

Great explanation, thank you! For anyone who may still not understand, UTF-8 CAN get up to 32 bits and be as big as UTF-32, but only if it has to. Otherwise, it just uses the minimum amount of space (8 bits) and expands as required, depending on the code point.

gti
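
A quick way to see this variable width is to encode a few characters both ways in Python 3. The four sample characters are arbitrary picks for this sketch, one for each UTF-8 length.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")  # the -be variant adds no byte order mark
    print(f"U+{ord(ch):04X}: {len(utf8)} byte(s) in UTF-8, {len(utf32)} in UTF-32")
```

UTF-8 grows from 1 to 4 bytes across these samples, while UTF-32 always spends 4.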

I spent a long time searching to understand Unicode (and UTF-8 and the others). I got there, but some things were still ambiguous to me. This guy literally taught me the whole topic in only 10 minutes

MahmoudKhudairi

You earned a subscriber today.
I've worked in the past with non-English characters, scratching my head.
This alone sums up all the concepts in details. Very good use of examples and video production 💌

PabitraPadhy

Wow being the messy programmer that I am I always got encoded and decoded mixed up... Much clearer now, thanks

aliceg.

Man this channel is golden. The most difficult topics here are explained so easily and in such a unique way

tanzimchowdhury

Traditionally, major Chinese coding methods would just consider a Chinese grapheme as 2 characters though, because they would use 2-byte coding while the 128 ASCII code points only use 1-byte coding. Also, traditionally a Chinese grapheme would always take up double the width of an ASCII grapheme in a fixed-width console font. This kind of makes everything neatly aligned (the number of storage bytes needed is the same as the amount of character printing space needed), but it basically fell apart once the Internet and UTF-8 became more popular.

That's also basically the very reason there are double-width Latin letters in Unicode. Traditionally they are used to improve the readability of English words in vertical text arrangement for Chinese and Japanese. They are called full-width letters: "full width" as in the full width of a Chinese character.

FlameRat_YehLon
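
The byte-count and width alignment described above can be checked in Python 3. The comment does not name a specific encoding, so GBK is used here purely as an assumed example of a legacy 2-byte Chinese encoding.

```python
import unicodedata

# ASCII "A", full-width "A" (U+FF21), and the Chinese character U+4E2D
for ch in ("A", "\uff21", "\u4e2d"):
    width = unicodedata.east_asian_width(ch)  # "Na" narrow, "F" full-width, "W" wide
    size = len(ch.encode("gbk"))              # bytes in the assumed legacy encoding
    print(f"U+{ord(ch):04X} width={width} gbk_bytes={size}")

# Compatibility normalization folds the full-width letter back to plain ASCII
print(unicodedata.normalize("NFKC", "\uff21") == "A")  # True
```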

This video is like the sum of the most important things about unicode and ascii, very well done.

florianvanbondoc

I keep coming back to this video to refresh the concept, thank you :)

a_maxed_out_handle_of__chars

You have an amazing talent for teaching m8! Keep it up, this video like your others is so helpful. I love that you cover basic concepts of coding, not "how do I implement a server in node.js" but what really is the essence of becoming a good coder.

Mane

Thanks, a ton!!
Here is what I learnt from the video (I have added a few things that I knew earlier):

- UTF-8 (Unicode Transformation Format, 8-bit) is an encoding scheme for representing Unicode characters.

- In UTF-8, ASCII characters are represented using a single byte, which means that any valid ASCII text is also valid UTF-8 text.

- Therefore, UTF-8 is backward compatible with ASCII.

- In UTF-8, characters that can be represented using a single byte (i.e., ASCII characters) are represented as themselves.

- Characters that require more than one byte are encoded using a combination of multiple bytes.

- A code point refers to a numerical value assigned to each character or symbol in the Unicode standard.

- Code points are represented using hexadecimal notation and are typically prefixed with "U+" to distinguish them from other numerical values.

- For example, "é" can be a single precomposed code point, U+00E9 (Latin Small Letter E with Acute), which UTF-8 encodes as the bytes 0xC3 0xA9. The same grapheme can also be written as two code points, the base character "e" (U+0065) followed by the combining acute accent (U+0301), which UTF-8 encodes as the bytes 0x65 0xCC 0x81.

- A grapheme refers to a visual unit of a written language. It represents a single user-perceived character or a combination of characters that are displayed together.

- In Python 2, the len() function returns the number of bytes, not the number of characters, for a Unicode-unaware string.

- For a Unicode-aware string, len() returns the number of code points.

Mehraj_IITKGP
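
Two of the points in the list above, ASCII compatibility and the two ways of writing "é", can be checked in Python 3. This is a minimal sketch; the variable names are chosen only for this illustration.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed.encode("utf-8").hex(" "))    # c3 a9
print(decomposed.encode("utf-8").hex(" "))  # 65 cc 81
print(composed == decomposed)               # False: different code points

# NFC normalization composes the pair into the single code point
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Pure ASCII text produces identical bytes under both encodings
print("plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8"))  # True
```

Both forms render as the same grapheme, which is why comparing or measuring text usually calls for normalizing it first.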

Where have you been all my life?! 🤩 What a great explanation! Thank you so very much! 🥰

learninggeekspeak

Trying to get better at IT and dreaming of being an awesome programmer! I have always just skipped learning Unicode and never cared, due to laziness. But now I realize how important it is! Thank you so much for this video.

jslee

this is such an appealing video due to the detail, music, subtle animations, even the colour theme. thanks for the video!

lbibrzh

ASCII, Unicode, UTF-8: Explained Simply

Understanding text for C Programmers (UTF-8, Unicode, ASCII)

Characters, Symbols and the Unicode Miracle - Computerphile

Understanding Text Encoding in Go: ASCII, Unicode, and utf-8 Explained

ASCII, Unicode, UTF-32, UTF-8 explained | Examples in Rust, Go, Python

Unicode 1: What means: ASCII, ANSI, Code Page

What are UTF-8 and UTF-16? Working with Unicode encodings

Unicode, UTF 8 and ASCII

this Unicode character is impossible to type 😱😱😱😱😱

⍼ - Why Nobody Knows What This One Unicode Character Means

Ep 020: Unicode Code Points and UTF-8 Encoding

ASCII, Extended ASCII and Unicode | 9618 | AS Level Computer Science

Code Pages, Character Encoding, Unicode, UTF-8 and the BOM - Computer Stuff They Didn't Teach Y...

What is Unicode? How does it work and how do you use it?

Unicode be like: Episode 1

What Is Unicode? And Why Do I Need To Use Unicode?

ASCII, UNICODE et UTF8 - Spé NSI - Première Informatique

Data Representation - ASCII vs Unicode

C# Tutorial - Basic - 031 - Character encoding & Unicode

Unicode in Rust - Illustrated by Kanji - Jenny Manning

ASCii and Unicode Questions

C# Programming Tutorial 15 - Char Data Type and ASCII Unicode

CppCon 2014: James McNellis 'Unicode in C++'