What are UTF-8 and UTF-16? Working with Unicode encodings

UTF-8 and UTF-16 are the two most commonly used encodings for Unicode characters. Unicode defines a large character repertoire (about 1.1 million code points in theory, of which roughly 145k are assigned as of Unicode 14.0), which raises the question of how to encode all these characters. UTF-8 and UTF-16 are two of the encodings that Unicode defines, and the most popular ones today.

UTF-8 is a variable-length encoding that encodes each character in 1 to 4 bytes, with the standard ASCII repertoire encoded in 1 byte per character. This makes UTF-8 compact, but it is also a relatively complex encoding.
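As a rough illustration of the 1 to 4 byte range, the following Python sketch encodes four sample characters and reports how many bytes each one needs. The sample characters and the helper name `utf8_length` are chosen here for illustration and do not come from the video.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# Illustrative samples (an assumption, not taken from the video):
# an ASCII letter, an accented Latin letter, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the byte count, and the raw bytes in hex.
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {hex_bytes}")
```

The ASCII letter stays at a single byte, which is also why a pure ASCII file is already a valid UTF-8 file.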

UTF-16 is less complex and encodes most Unicode characters (including the large majority of those in everyday use) in 2 bytes, with the remaining ones, such as most emoji, encoded in 4 bytes. This means UTF-16 takes up more space for text that is mostly ASCII, but it is easier to encode and decode.
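To make the 2-byte versus 4-byte split concrete, here is a small Python sketch that lists the 16-bit code units UTF-16 uses for a character. The helper name `utf16_units` and the sample characters are assumptions made for this example; big-endian order without a byte order mark is used only to keep the output deterministic.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units of ch in UTF-16 (big-endian, no BOM)."""
    data = ch.encode("utf-16-be")
    # Group the encoded bytes into pairs, one pair per 16-bit code unit.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Illustrative samples (an assumption, not taken from the video).
for ch in ["A", "\u20ac", "\U0001F600"]:
    units = utf16_units(ch)
    shown = " ".join(f"{u:04X}" for u in units)
    print(f"U+{ord(ch):04X}: {len(units) * 2} bytes -> {shown}")
```

Characters above U+FFFF come out as two code units, a so-called surrogate pair, which is where the 4-byte case comes from.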

Since Unicode is very popular today, a lot of tooling has built-in support for Unicode and some of its encodings. In addition, there are standalone tools that can be used to inspect files and to convert them. We demonstrate two such tools, the Unix "od" and "iconv" commands, which let us take a close look at a demo file and convert it between the two encodings.
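The same two steps, dumping bytes and converting between encodings, can be approximated in Python. This is only a sketch of the idea and not the commands shown in the video: `hex_dump` and `convert` are hypothetical helper names, and the demo text is made up. On the command line, `od -t x1` prints a file byte by byte in hex and `iconv -f` / `-t` select the source and target encodings; whether the video uses exactly these options is an assumption.

```python
def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated hex, similar in spirit to a hex dump."""
    return " ".join(f"{b:02x}" for b in data)


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from encoding src to encoding dst."""
    return data.decode(src).encode(dst)


# Made-up demo content: two ASCII letters, a space, and the euro sign.
text = "Hi \u20ac"
utf8 = text.encode("utf-8")
# Big-endian without a byte order mark keeps the result platform-independent.
utf16 = convert(utf8, "utf-8", "utf-16-be")

print(hex_dump(utf8))   # 48 69 20 e2 82 ac
print(hex_dump(utf16))  # 00 48 00 69 00 20 20 ac
```

In the UTF-16 output every ASCII character is padded with a zero byte, which is an easy pattern to spot in a hex dump.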

Additional Resources:

00:00 Introduction
00:23 UTF-8 and UTF-16 are Text Encodings
00:55 Character Sets
01:26 Unicode as the universal character repertoire
02:27 UTF-8
03:25 UTF-16
04:09 Demo time: Starting with a demo file
04:50 od as a tool for dumping files
05:46 iconv for converting files
07:24 Summary
08:50 Wrap-up
Comments

Thank you for a great talk with useful visuals! I make a Unix program (durdraw) for drawing Unicode and other text art, and find myself working with different character encodings regularly. Perhaps I missed it, but UTF-8's backwards compatibility with ASCII is worth considering when choosing an encoding scheme. I also liked the useful "od" syntax. I rarely encounter UTF-16, but thanks to your video I will now be able to recognize it in a hex dump.

IndianaJoenz

That is exactly what I want to know, showing some great tools. Thank you for sharing, and have a great day:-)

lalpremi

Excellent visual explanation! Couldn't be clearer! I didn't know that it would choose the correct length for each character. I thought it always had a fixed length. I really would like to know more about this in general... Headers, BO, LE, etc. I also find it very interesting and very useful for working with ETL in data engineering. If you think of something else besides the links you already shared in the description, please let me know. Thank you for making this video.

nervocalm

If you have a CPU where every address is 16 bits wide, you may as well use UTF-16 as default. If memory is 8 bits wide, use UTF-8.
For 32 bit (or 64 bit) you can store multiple characters per RAM address, no matter what system you choose.

Soupie

Please, how do I find the encoding of my file?

akshardrashti

Hi Erik, congratulations on the video and thanks for sharing your knowledge. I am migrating an Oracle database on Solaris Sparc that is using UTF-16BE, while the destination uses UTF-8. In your opinion, what would be the best approach to converting the data source?

flaviomelo

Do you have a LinkedIn handle?
I find this very interesting.

zldmmgd

6:29 Please go into the details of the "byte order mark" in UTF-16.

parsifal

It seems that Windows has switched to UTF-8 as well, speaking of Win10 21H2 and later.

sabitkondakc

I wonder whether Microsoft Office still can't open files in UTF-8 😳

MrJloa

UTF-16 should be abandoned because it is so problematic.

Tapajara