Unicode and Byte Order

In this computer science video you will learn about text files. Specifically, you will see how Unicode code points are encoded into binary, and why the byte order (that is, the endianness) of some Unicode Transformation Formats can be an important consideration if you're a programmer handling text data, or if you build websites.
The video demonstrates how Unicode code points are encoded in ASCII, UCS-2, UTF-16, UCS-4, UTF-32 and UTF-8, and it discusses some of the advantages and disadvantages of these encodings. The UTF-16 high surrogate and low surrogate format is explained, including its effect on the available range of code points. The UTF-8 bit patterns are also described in detail.
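As a rough illustration of the surrogate and UTF-8 bit patterns mentioned above, here is a minimal Python sketch. The function names `utf16_surrogates` and `utf8_bytes` are made up for this example, and the code skips validation that a real encoder would do (for instance, it does not reject surrogate code points or values above U+10FFFF in `utf8_bytes`).

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


def utf8_bytes(code_point: int) -> bytes:
    """Encode one code point with the 1-, 2-, 3- or 4-byte UTF-8 pattern."""
    if code_point < 0x80:                # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:             # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),      # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


high, low = utf16_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00
print(utf8_bytes(0x1F600).hex(" "))      # f0 9f 98 80
```

The results can be cross-checked against Python's built-in codecs, e.g. `"😀".encode("utf-8")`.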
When saving UTF-16 or UTF-32 text files, it is possible to specify the byte order, which can be either big endian or little endian. The need for a byte order mark (BOM) in a UTF-16 text file is demonstrated by examining its encoding as hexadecimal data. The so-called UTF-8 with BOM format is also discussed.
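The byte order and BOM behaviour can be reproduced without a hex editor. The sketch below uses Python's standard `codecs` constants; the helper `sniff_bom` and the `BOMS` table are hypothetical names for this example, and a file with no BOM simply comes back with `None` as its encoding.

```python
import codecs

text = "Hi"

# Explicit byte orders write no BOM; only the order within each 16-bit unit differs.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le.hex(" "))   # 48 00 69 00
print(be.hex(" "))   # 00 48 00 69

# UTF-32 BOMs are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name or None, data with any leading BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


encoding, payload = sniff_bom(codecs.BOM_UTF16_BE + be)
print(encoding, payload.decode(encoding))   # utf-16-be Hi
```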
In this computer science video, you will also see why it is important to include a charset meta tag in the head section of a web page, to specify the character encoding. Problems that might occur if, for example, a web page has been encoded with ISO-8859-1 or Windows-1252 are demonstrated.
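The kind of garbling that results from an encoding mismatch can be simulated in a few lines. This is only a sketch of one direction of the problem (UTF-8 bytes read as Windows-1252); the sample string is invented for the example and is not taken from the video.

```python
page = "café €5"

# The page is saved as UTF-8 ...
raw = page.encode("utf-8")

# ... but read as Windows-1252, so each byte of a multi-byte
# sequence is shown as a separate character.
wrong = raw.decode("cp1252")
print(wrong)    # cafÃ© â‚¬5

# Decoding with the encoding that was actually used gives the original text.
right = raw.decode("utf-8")
print(right)    # café €5
```

Declaring the encoding, for example with `<meta charset="utf-8">` in the head section, is what tells the browser which of these two decodings to use.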
Chapters:
00:00 Introduction
00:59 Unicode code points
01:45 ASCII
02:24 Universal Character Set UCS-2
03:50 Unicode Transformation Format UTF-16
09:26 Unicode Transformation Format UTF-32
10:48 Unicode Transformation Format UTF-8
14:03 Byte Order
15:07 Byte Order Mark BOM Demonstration
19:47 UTF-8 with BOM Demonstration
20:54 Web page character encoding
24:04 Summary

Comments

I must have watched half a dozen videos about encoding text over the years, but this one is easily the best one! Very easy to understand with all the examples. Now to hope I don't ever have to deal with anything else than utf-8...

RepertoireSix

I have been waiting for this clear and precise explanation for 20 years I believe. Thanks so much.

paulfontaine

I'm a CS graduate and I needed a refresher about this subject and this explanation had everything that I needed. Thanks.

mhn

Probably the clearest explanation I've seen on the topic. There's lots that cover the basics pretty well, but completely skip over the byte order aspect.

lostcarpark

This was super interesting and well done, thank you so much! I had an issue with BOM yesterday that broke my script because the shebang wasn't being recognized. I love discovering Hex Editor Neo, gave me a lot of peace of mind, actually seeing the "ef bb bf" in front of my code haha.

MoeTavern
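A minimal sketch of the problem described in the comment above, assuming the script was saved as UTF-8 with BOM: the three BOM bytes (EF BB BF) come before `#!`, so the file no longer starts with a shebang. The helper name `strip_utf8_bom` and the sample script are invented for this example.

```python
import codecs

# A shell script as it might look on disk after being saved as "UTF-8 with BOM".
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

print(script[:5].hex(" "))          # ef bb bf 23 21
print(script.startswith(b"#!"))     # False: the BOM hides the shebang


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


clean = strip_utf8_bom(script)
print(clean.startswith(b"#!"))      # True

# When decoding to text, the "utf-8-sig" codec drops a leading BOM as well.
print(script.decode("utf-8-sig").splitlines()[0])   # #!/bin/sh
```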

Seriously underrated video. Should have a LOT more views than this.

paulg

As a Japanese speaker, the fact that the first four glyphs in your thumbnail are "I❤日本" made me wonder if you're a fan of Japan! (日本 means Japan in Japanese)

smallerz

Watched all the videos in this playlist. Can't thank you enough.

booboo-ohvd

23:15 that's the kind of nonchalant joke I like lmao

TheKurama

Quite excellent. So many little places it is possible, even easy to trip up and give sub-par or confusing information on this topic, and you just tap dance right across that minefield. Beautiful.

jvsnyc

another good one keep doing the great work man

surman

One question at the end. There are 17 * 256 * 256 possible characters across the 17 planes, not including the various blackout or restricted sections like those used for surrogates...isn't that more like 1,114,112 than well over 2 million? They do require 21 bits, rather than 20, because they left all but 2048 of the original BMP to lie where they were, and added another 1024 * 1024 possible new ones on top of that, so it is (256 * 256) * 16 + (256 * 256) - 2048...either way, way more than we should ever need if we don't go nuts...as they point out, Unicode is for characters, not glyphs or fonts, so if there are 500 ways you want to write A, that is still just one code point...

jvsnyc
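The arithmetic in the comment above can be checked directly. This sketch only covers the code space itself (17 planes minus the surrogate range); it does not try to count assigned characters, noncharacters or private-use areas.

```python
PLANES = 17
PER_PLANE = 256 * 256                    # 65,536 code points per plane

code_points = PLANES * PER_PLANE         # U+0000 .. U+10FFFF
surrogates = 0xDFFF - 0xD800 + 1         # reserved for UTF-16 surrogate pairs
usable = code_points - surrogates        # everything except the surrogates

print(f"{code_points:,}")                # 1,114,112
print(f"{surrogates:,}")                 # 2,048
print(f"{usable:,}")                     # 1,112,064

# The highest code point needs 21 bits.
print(hex(code_points - 1), (code_points - 1).bit_length())   # 0x10ffff 21

# 1024 high surrogates * 1024 low surrogates reach exactly the 16 planes above the BMP.
print(1024 * 1024 == 16 * PER_PLANE)     # True
```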

thanks a lot for the awesome explanation!

sadBytes

Thanks for the great video.

What is the default Endianness for most Windows file types? Is it always Big-Endian unless otherwise specified?

MroAlio

So why is UTF16 the only standard not compatible with ASCII?

kaos

Oh, I thought the UTF-16LE byte order would apply across all 4 bytes, but it only applies within each 2-byte unit.
Thank you man ..

obeid_s
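The point in the comment above can be seen with a character outside the BMP, which takes two 16-bit code units in UTF-16. A small sketch using Python's built-in codecs (the emoji is just an arbitrary example character):

```python
ch = "😀"   # U+1F600, encoded in UTF-16 as the surrogate pair D83D DE00

# UTF-16: the bytes are swapped inside each 16-bit unit,
# but the high surrogate still comes first.
print(ch.encode("utf-16-be").hex(" "))   # d8 3d de 00
print(ch.encode("utf-16-le").hex(" "))   # 3d d8 00 de

# UTF-32: one 32-bit unit, so all four bytes are reversed.
print(ch.encode("utf-32-be").hex(" "))   # 00 01 f6 00
print(ch.encode("utf-32-le").hex(" "))   # 00 f6 01 00
```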

Great lecture!
I had a question: what are those weird characters we see in the Hex editor? Because Notepad recognizes the encoding used, while the Hex editor doesn't.

jefersonwillian

What website do I get these things from? You talk about it at 16:00, but where are they? How do I find them? What do I search for? This would make a good video.

charlesklein

May I know why this video got re-uploaded? If you've made any changes, it would be useful to be able to spot them.

kqvanity

Dear Sir,

I have tried to convert '€', in pure binary It is and in denary 128, to UTF-8 format. As It is 1 byte character but also starts with 1 . Hence I am confused how to convert it to UTF-8 format.

Please help.
Thanks in Advance.

NoName-tjdm
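One likely source of the confusion in the question above, assuming the value 128 comes from the Windows-1252 code page: 128 (0x80) is the byte for '€' in Windows-1252, but its Unicode code point is U+20AC, and UTF-8 is built from the code point. A short sketch:

```python
euro = "€"

print(euro.encode("cp1252").hex())      # 80   (the Windows-1252 byte, denary 128)
print(f"U+{ord(euro):04X}")             # U+20AC (the Unicode code point)

# U+20AC lies in the 3-byte UTF-8 range: 1110xxxx 10xxxxxx 10xxxxxx
cp = ord(euro)
print(f"{cp:016b}")                     # 0010000010101100
encoded = bytes([
    0xE0 | (cp >> 12),                  # 1110 + top 4 bits
    0x80 | ((cp >> 6) & 0x3F),          # 10 + middle 6 bits
    0x80 | (cp & 0x3F),                 # 10 + bottom 6 bits
])
print(" ".join(f"{b:08b}" for b in encoded))   # 11100010 10000010 10101100
print(encoded.hex(" "))                 # e2 82 ac

# Code point 128 itself (U+0080, a control character) takes two bytes in UTF-8.
print(chr(128).encode("utf-8").hex(" "))       # c2 80
```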
