What are UTF-8 and UTF-16? Working with Unicode encodings

UTF-8 and UTF-16 are the two most commonly used encodings for Unicode characters. Unicode defines a large character repertoire (about 1.1 million code points in theory, of which roughly 145k are assigned as of Unicode 14.0), which raises the question of how to encode all these characters. UTF-8 and UTF-16 are two of the encodings that Unicode defines, and the most popular ones today.
UTF-8 is a variable-length encoding that encodes each character in 1 to 4 bytes, with the standard ASCII repertoire encoded in 1 byte per character. This makes UTF-8 compact, but it is also a relatively complex encoding.
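To make the 1 to 4 byte range concrete, here is a minimal Python sketch that prints the UTF-8 byte sequence for a few characters. The sample characters and the helper name `utf8_length` are chosen for illustration and do not come from the video.

```python
# Sample characters picked for illustration (not taken from the video):
# an ASCII letter, an accented Latin letter, the euro sign, and an emoji.
samples = ["A", "é", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return how many bytes Python's UTF-8 codec uses for one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
```

The ASCII letter takes a single byte, while the other three characters take 2, 3, and 4 bytes respectively.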
UTF-16 is less complex: it encodes most Unicode characters (and nearly all of those in practical use today) in 2 bytes, with the remaining ones encoded in 4 bytes. This means UTF-16 takes up more space than UTF-8 in most cases, but it is easier to encode and decode.
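The 2 byte versus 4 byte split can be sketched in Python as follows. Characters up to U+FFFF fit in one 16-bit code unit; anything above is split into a pair of units (a surrogate pair). The function names are hypothetical, and the manual computation is only a sketch that is checked against Python's built-in `utf-16-be` codec.

```python
def utf16_units_manual(ch: str) -> list:
    """Compute the UTF-16 code units of one character by hand."""
    cp = ord(ch)
    if cp < 0x10000:
        # Fits into a single 16-bit code unit (2 bytes).
        return [cp]
    # Above U+FFFF: split the 20 remaining bits into a surrogate pair (4 bytes).
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return [high, low]


def utf16_units_codec(ch: str) -> list:
    """Same result via the built-in codec (big-endian, no byte order mark)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ["A", "€", "😀"]:
    units = utf16_units_manual(ch)
    assert units == utf16_units_codec(ch)
    print(f"U+{ord(ch):04X} -> {[f'{u:04X}' for u in units]}")
```

Note that the plain `utf-16` codec in Python prepends a byte order mark, which is why the sketch uses `utf-16-be` for the comparison.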
Since Unicode is very popular today, a lot of tooling has built-in support for Unicode and some of its encodings. In addition, there are standalone tools for inspecting files and converting them. We demonstrate two such tools, the Unix "od" and "iconv" commands, which let us take a close look at a demo file and convert it between the two encodings.
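For readers without these commands at hand, the same two steps (dumping the bytes, then converting between encodings) can be approximated in Python. This is only a rough sketch, not the commands used in the video: the demo text and the function names `hex_dump` and `convert` are assumptions made for illustration. On a Unix system the closest equivalents would be along the lines of `od -t x1` for the dump and `iconv -f UTF-8 -t UTF-16LE` for the conversion.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an od-like dump: a hex offset followed by the bytes in hex."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:06x}  {chunk.hex(' ')}")
    return "\n".join(lines)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another, similar in spirit to iconv."""
    return data.decode(source).encode(target)


# Hypothetical demo content; the file used in the video is not known here.
demo_text = "Grüße"
utf8_bytes = demo_text.encode("utf-8")
utf16_bytes = convert(utf8_bytes, "utf-8", "utf-16-le")

print("UTF-8:")
print(hex_dump(utf8_bytes))
print("UTF-16 (little-endian, no byte order mark):")
print(hex_dump(utf16_bytes))
```

For this sample, the UTF-8 form is 7 bytes long (the two non-ASCII letters take 2 bytes each), while the UTF-16 form is 10 bytes (2 bytes for every character).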
Additional Resources:
00:00 Introduction
00:23 UTF-8 and UTF-16 are Text Encodings
00:55 Character Sets
01:26 Unicode as the universal character repertoire
02:27 UTF-8
03:25 UTF-16
04:09 Demo time: Starting with a demo file
04:50 od as a tool for dumping files
05:46 iconv for converting files
07:24 Summary
08:50 Wrap-up