Plain Text - Dylan Beattie - NDC Copenhagen 2022

Software is complicated. Machine learning, microservice architectures, message queues... every few months there's another revolutionary idea to consider, another framework to learn. And underneath so many of these amazing ideas and abstractions is text.

When you work in software, you spend your life working with text. Some of those text files are source code, some are configuration files, some of them are documentation. Editors, revision control systems, programming languages - everything from C# and HTML to Git and VS Code is based on the idea of "plain text files". But... what if I told you there's no such thing? When we say something is a "plain text file", we're relying on a huge number of assumptions - about operating systems, editors, file formats, language, culture, history... and, most of the time, that's OK. But when it goes wrong, "plain text" can lead to some of the weirdest bugs you've ever seen... why is there Chinese in the event logs? Why is the city of Aarhus in the wrong place? And why does Magnus Mårtensson always have trouble getting into the USA? Join Dylan Beattie for a fascinating look into the hidden world of text files - from the history of mechanical teletypes to encodings, collations and code pages. We'll look at some memorable bugs, some golden rules for working with plain text - and we'll even find out the story behind the mysterious phrase "pike matchbox" and what it has to do with driving in Belarus.

Comments

I was a little bit skeptical: how can anyone give a one-hour talk speaking just about 'plain text'? But I have to admit: it was simply AMAZING! Well done!!!

jonnilazzerini

23:30 That is the most beautiful thing about human beings that I've heard in a long, long while. God bless that postman who really cared for his job and even was smart enough to figure out that problem. This will make me happy the rest of the day :D

f.d.

One of my favourite sorting rules is that for Scottish surnames "Mac" and "Mc", both with and without following space, are considered the same letter that comes after L but before M

malcolmhutchison

This ASCII issue is also a cause of cultural tension in (Republic of) Ireland and (Northern) Ireland, where birth registrations at some hospitals are refused or incorrectly assigned when a child's parents opt to use a Gaelic name, which often includes a bunch of non-ASCII chars. Hospital software is usually pretty archaic and predates a lot of the elegance of UTF.

Also. Amazing talk. Funny and interesting the whole way through. Dylan Beattie is a legend!

merthyr

At the risk of being one of those YouTube comments shown in your next talk, the diacritic you discuss at 29:18 is a diaeresis, not an umlaut. They look the same and are encoded with the same codepoints, but are pronounced differently. An umlaut changes the quality of the vowel, and can appear on lone vowels in any language that uses them. A diaeresis tells readers that the second of two vowels is not to be read as a diphthong, but as a separate vowel. That's why English has one on, for example, naïve (nigh-eve, not knave). Coöperation is co + op not co͞op.

NicholasShanks

The 7-bit encoding for SMS messages in GSM is the same as ASCII for most characters, but many of the control characters have been replaced by text characters that were missing from ASCII. In particular it does not have NUL: 0 encodes the '@' character. So, as one of my colleagues at Ericsson found out the hard way, you cannot use C NUL-terminated strings to process SMS messages.

chascuk
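
The GSM point above can be illustrated with a small sketch. It assumes unpacked septets (one per byte) and a deliberately partial, hypothetical lookup table; `gsm7_encode` and `c_strlen` are illustrative names, not part of any real SMS library.

```python
# Partial GSM 7-bit default alphabet: only the entries needed for this demo.
# Every other character is assumed to share its ASCII code, which holds for
# letters, digits, space and '.', but NOT for the whole alphabet.
GSM7_OVERRIDES = {"@": 0x00, "$": 0x02}


def gsm7_encode(text):
    """Encode text as unpacked GSM 7-bit septets, one septet per byte."""
    return bytes(GSM7_OVERRIDES.get(ch, ord(ch)) for ch in text)


def c_strlen(buf):
    """Mimic C strlen(): count bytes up to the first 0x00."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


message = gsm7_encode("mail me at bob@example.com")
print(len(message))       # full message length in septets
print(c_strlen(message))  # where a NUL-terminated C string would stop
```

Because '@' is stored as 0x00, anything that treats the buffer as a C string silently drops everything from the '@' onward; carrying an explicit length alongside the buffer avoids that.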

This is one of the best presentations I've seen in a long time. Amazing content!

nsulikow

Recently, a student of mine opened a text file and it was all Chinese gibberish.
I remembered your talk and switched the encoding from UTF-8 to UTF-16 or vice versa, and there was a readable file again :)

notthedroidsyourelookingfo
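
The "Chinese gibberish" effect described above is easy to reproduce. This is a minimal sketch, assuming the file held plain ASCII letters saved as UTF-8 and was opened as little-endian UTF-16; the sample word is arbitrary.

```python
original = "eventlog"            # any even-length run of lowercase ASCII letters
raw = original.encode("utf-8")   # the bytes as they sit on disk

# Misread the bytes as UTF-16: each pair of ASCII bytes becomes one code unit,
# and pairs of lowercase letters land in the CJK Unified Ideographs block.
garbled = raw.decode("utf-16-le")
print(garbled)

# Undo the mistake by reversing the wrong decode, then decoding correctly.
repaired = garbled.encode("utf-16-le").decode("utf-8")
print(repaired)
```

The round trip only works because no bytes were lost on the way; once an editor has re-saved the garbled text, the original may not be recoverable.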

Absolutely one of the best presentations that I've seen and it was a total shock. I watched this because I'm a geek and I like Dylan Beattie. I never expected it to be this awesome!

drullo

Watching this for the second time (I watched the video referenced several times in this talk). Absolutely brilliant and I learned a lot

HasanSIM

Since Dylan does read comments, here's one of my favorite examples, in Polish: "Zrób mi łaskę" means do me a favor. Most of the characters can be turned to their ASCII lookalikes without any issue whatsoever. Except one. "Zrób mi laskę" is asking for a specific sexual act. Just turning ł into l changes the entire meaning of the whole sentence.

jandorniak
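
A common way to produce ASCII lookalikes is to decompose each character and drop the combining marks. The sketch below (with `strip_marks` as an illustrative helper name) shows that this handles ó and ę but leaves ł alone, because ł has no Unicode decomposition; any mapping of ł to l has to be added by hand, which is exactly where the meaning can change.

```python
import unicodedata


def strip_marks(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = strip_marks("Zrób mi łaskę")
print(folded)            # ó and ę lose their marks, ł is untouched
print(folded.isascii())  # still not pure ASCII
```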

Very good talk. Regarding ASCII and punchcards, it's unlikely they would ever meet in the first place. You do course correct a bit w/r/t the DEL character, but punch cards were originally in 6-bit BCDIC (binary-coded decimal interchange code). This was extended to 8 bits to become "Extended" BCDIC, or EBCDIC. The layout of the character set aligned w/ the rows of the punchcard, such that all alphabetic chars were x1 - x9, so in late variants 'A' is 0x11 and 'Z' is 0x39. To get 3 rows of 9 columns to line up, there's a "/" at the start of the last row, 0x31.

Interestingly, ASCII was created by Bob Bemer at IBM to solve interop problems between the BCDICs. However, IBM was in so deep w/ their card-based (E)BCDIC, they couldn't use it in any of their operating systems. Note also, EBCDIC is still very much in use.

Finally, Multics did not influence Unix, except to serve as a counter-example of design principles.

jeberle

44:14 Glad that my comment in the previous talk video was found helpful :)

NicolasChanCSY

The YouTube comment near the beginning of this updated version of his previous presentation illustrates the point of the talk powerfully. Dylan is always amazing, but this talk from him is perhaps uniquely important to everyone in the field! From 1st year associates to the most seasoned senior architect, plain text is always less than plain.

JeremyAndersonBoise

30:00 ij is a Dutch letter, not a typesetter's ligature! It's in the extra block at 19:50 left of Ö. Most fonts don't support it and ASCII led to it being written as 2 letters (i and j) because it was the only non-ASCII letter in Dutch, but all Dutch typewriters before PCs were popularized had a dedicated key for it. Fonts that turn it into a ligature often run into problems with words like minijack, Beijing and bijoux. It used to have the same problem as å, with some people turning it into a Y (most famously Cruijff) until it got standardized as I+J.

ayle

Yay I love this guy, I binged all his talks like a month ago

braveatnight

I spent a fair part of my career designing and implementing serial terminals and emulators of the same. For terminals from DEC starting with the VT100 (and other "ANSI" terminals), there was something called "code extension", along with character set designators, graphic sets, and shifts (both locking and single) that were used to mix text from multiple character sets on one screen/page using either 7 or 8 bits per character. This was fine on terminals and printers that had the same character sets available, but caused a lot of grief when a device receiving the text didn't support all of the character sets used. Also, very few editors at the time could handle storing such text.
It was a mess, but at least it was better than what it replaced, which was National Replacement Character Sets (NRCS), where it was 7-bit ASCII with the glyphs for some of the code points replaced. There was no way to tell which NRCS had been selected when the file was created, even with a hex editor.

filker

Basically before Greek language was fully supported, Greek people interacting with electronics came up with mappings between ASCII and Greek.
These mappings were unofficial and there are several variations.

Even after UTF-8 was implemented and got more and more adoption, lots of young people still utilized Greeklish in SMSs to send messages to each other because you'd get charged by the number of bytes you used (in groups of bytes) and not by the actual number of characters used.
This is also an issue in a lot of fields that have a byte limit instead of a character limit.

On a parallel note...

MrIkariaman
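
The difference between a byte limit and a character limit can be sketched as follows. UTF-8 is used here purely as an illustration of a multibyte encoding; the exact encoding and billing rules of any particular SMS network are not assumed, and `utf8_cost` is a hypothetical helper name.

```python
import unicodedata


def utf8_cost(text):
    """Return (character count, UTF-8 byte count) for NFC-normalized text."""
    text = unicodedata.normalize("NFC", text)
    return len(text), len(text.encode("utf-8"))


greek = "καλημέρα"      # "good morning" in Greek script
greeklish = "kalimera"  # the same word spelled with ASCII letters

print(utf8_cost(greek))      # same number of characters...
print(utf8_cost(greeklish))  # ...but half as many bytes
```

A field declared as "16 bytes" therefore holds 16 ASCII letters but only 8 Greek ones in this encoding.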

In the days of 7-bit ASCII, there were lots of workarounds in non-English speaking countries. For example, in order to be able to print umlauts, printers had special character sets that had umlauts where normally the characters {, [, ], }, \ and | were, because nobody needed them when writing a letter.
However, if a C or C++ programmer used such a printer, their code would look quite funny. In part that's the reason why some languages have special replacements for these characters, called digraphs and trigraphs. This all sounds like multiple layers of duct tape put on top of one another, but it kind of worked.

heinzk
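
To see how "funny" C code looked, here is a sketch that assumes the German national variant of ISO 646 (DIN 66003) as the printer's character set; other countries swapped in different letters at the same code points. `as_printed` is an illustrative name.

```python
# Code points where DIN 66003 differs from US-ASCII.
DIN_66003 = str.maketrans({
    "@": "§", "[": "Ä", "\\": "Ö", "]": "Ü",
    "{": "ä", "|": "ö", "}": "ü", "~": "ß",
})


def as_printed(source):
    """Show ASCII source text as a DIN 66003 device would render it."""
    return source.translate(DIN_66003)


c_line = "int a[3] = {1, 2, 3};"
print(as_printed(c_line))
```

C's trigraphs (for example `??(` for `[` and `??<` for `{`) let programmers type the affected characters using only symbols that every national variant kept.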

The reason why you get smiley faces when DOS crashes is not because there is something trying to generate the stop character. The reason is that often it starts executing random garbage or tries to print a message that became random garbage due to memory corruption. In a piece of program data the values 1 and 2 would be quite common if you have some counters that did not fit into your registers, and maybe they encode some common x86 instruction as well. The string terminator in the common OS interface for printing strings on DOS was the dollar sign rather than NUL. The dollar sign is much less common than NUL and smiley faces in random garbage, so you will likely get some smiley faces printed.

Note also that 'plain text' is just a binary format (or more precisely a family of binary formats with ASCII, EBCDIC, various code pages, JIS, BIG5, GB 18030, UCS-2, UTF-7, UTF-8, big endian and little endian UTF-16/UTF-32, ...) for which there happens to be a lot of editors and viewers. In the end it's all binary bits. One specific property that 'plain text' has over many other binary formats is that it has very little structure and can still be of some use when some bits are flipped or bytes are missing, as opposed to, say, a compressed JPEG image, with the caveat that the multibyte encodings are much more fragile.

vincentvega
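
The dollar-terminated printing described in the first paragraph above can be simulated roughly like this. It is a simplified sketch, not an emulation of DOS: it ignores how the console handles bell, backspace, tab and newlines, and `dos_print_string` and `CONTROL_GLYPHS` are hypothetical names.

```python
# Glyphs that code page 437 displays for a few low byte values.
CONTROL_GLYPHS = {0x00: " ", 0x01: "\u263a", 0x02: "\u263b"}


def dos_print_string(memory, offset=0):
    """Render bytes from offset until a '$' (0x24) terminator is reached."""
    out = []
    for byte in memory[offset:]:
        if byte == 0x24:  # '$' ends the string; 0x00 does not
            break
        out.append(CONTROL_GLYPHS.get(byte, bytes([byte]).decode("cp437")))
    return "".join(out)


# A corrupted "message": some text followed by small counter-like values.
garbage = bytes([0x45, 0x72, 0x72, 0x00, 0x01, 0x02, 0x01, 0x24, 0x41])
print(dos_print_string(garbage))
```

Since the loop runs straight through 0x00 and only stops at '$', any 1s and 2s lying in memory before the next dollar sign show up as smiley faces.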