Unicode in Rust - Illustrated by Kanji - Jenny Manning

Have you ever wondered why you can’t look up a character in a string by its index? Or why the length of a string can be wildly different from the number of characters in it? In this talk, we’ll dive into Unicode by looking at how Kanji are represented in Rust. You’ll learn about things like Han unification, the origins of CJK writing in oracle bone script, and why Rust handles strings differently than we expect.
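As a rough illustration of the indexing question raised in the description, here is a minimal sketch. It is not taken from the talk: the helper name `nth_char` and the sample string are assumptions chosen for illustration. It shows that a `&str` cannot be indexed by an integer, and that byte ranges are only valid on character boundaries.

```rust
// Hypothetical helper (not from the talk): fetch the n-th Unicode scalar
// value by walking the string, since direct integer indexing is not allowed.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let s = "日本語"; // sample text; each of these characters is 3 bytes in UTF-8

    // let c = s[1]; // does not compile: `str` cannot be indexed by an integer

    // Walking the characters works, but it is O(n), not O(1).
    println!("{:?}", nth_char(s, 1)); // Some('本')

    // Byte ranges are accepted only when they fall on character boundaries.
    println!("{:?}", s.get(0..3)); // Some("日")
    println!("{:?}", s.get(0..1)); // None: byte 1 is in the middle of '日'
    println!("{}", s.is_char_boundary(1)); // false
}
```

Using `get` instead of `&s[0..1]` avoids a panic when the range splits a character.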
Comments

I feel like this talk painted a fairly unfavourable picture of Han unification. Although the limited space in Unicode could perhaps be seen as one of the driving forces behind it, in reality what was unified and what wasn't was decided mostly on the basis of what existing East Asian encodings already did. And those encodings based their unification rules on fairly sound principles. You can't encode every minor glyph variation separately, or else you end up with a mess of a system where text is encoded incredibly inconsistently because of the large number of duplicates. Even if Unicode were designed from scratch today without any space limits, Han unification would end up very similar to how it is now.

noname

TBH, the most pragmatically useful content is at the beginning and the end. IMO, the content in the middle, while nice background information, is a little light on utility.

@ ~1:20 Ms. Manning points out the difference between string.len() and string.chars().count()

@ ~15:40 Ms. Manning goes over UTF-8, UTF-16, and UTF-32

Chris-onbt
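
For readers following the timestamps above, here is a small sketch of the len() versus chars().count() difference and of how the same text measures up in UTF-8, UTF-16, and UTF-32. It is not code from the talk: the helper name `sizes` and the sample strings are assumptions made for illustration.

```rust
// Hypothetical helper (not from the talk): measure a string three ways.
// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values).
fn sizes(s: &str) -> (usize, usize, usize) {
    let utf8_bytes = s.len(); // len() counts bytes, not characters
    let utf16_units = s.encode_utf16().count(); // 16-bit code units
    let scalars = s.chars().count(); // one UTF-32 code unit per scalar value
    (utf8_bytes, utf16_units, scalars)
}

fn main() {
    // Sample strings chosen for illustration only.
    for s in ["abc", "漢字", "𠮷"] {
        let (bytes, units, scalars) = sizes(s);
        println!(
            "{s}: {bytes} bytes in UTF-8, {units} UTF-16 code units, {scalars} chars ({} bytes in UTF-32)",
            scalars * 4
        );
    }
}
```

Under these assumptions, "漢字" is 6 bytes but 2 chars, and "𠮷" (outside the Basic Multilingual Plane) is 4 bytes in UTF-8, 2 code units in UTF-16, and 1 char. Note that chars() counts Unicode scalar values, which can still differ from what a reader perceives as a single character.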