Code Pages and Kohuepts: The Chaos of 8-Bit Extended ASCII

"But it's plain text! What do you mean it looks weird?"

Friends, let's take a walk around the wonderful world of code pages, the cause of more encoding headaches, bizarre punctuation and inventive workarounds than just about anything else in IT history - and along the way, we'll meet some other things which aren't really encoding standards, but which have cast their long shadow on the way folks interact with digital data in the age of the World Wide Web.
Comments

Neat fact about the keyboard layout thing. In the 2002 movie "The Bourne Identity" the protagonist assumes the fake identity of a Russian citizen named "Foma Kiniaev". He gets a fake Russian passport, but the name in Cyrillic reads "Ащьф Лштшфум". The prop department just set their keyboard to Russian and typed "Foma Kiniaev" as if on a QWERTY keyboard.

Turns out it was actually quite realistic. A few years back a guy tried to present a fake Israeli passport in Barbados under the name "Assulin Hormoz". But instead of "Hormoz", his surname in the passport in Hebrew was also typed as if it was Latin, so it became "יםרצםז", which was further mangled by being rendered backwards as "זםצרםי" (bidirectional text is something you haven't covered and it's a whole other can of worms). There were also several other Hebrew mistakes in the passport, such as text rendered upside down or similar-looking letters being mixed up.
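The prop-department mistake is mechanical enough to reproduce. Here is a minimal Python sketch; the layout table below is an assumption based on the standard Windows Russian ЙЦУКЕН layout:

```python
# Map each US QWERTY key to the character produced at the same physical
# position on the standard Russian ЙЦУКЕН layout.
QWERTY = "qwertyuiopasdfghjklzxcvbnm"
JCUKEN = "йцукенгшщзфывапролдячсмить"
AS_RUSSIAN = str.maketrans(QWERTY + QWERTY.upper(),
                           JCUKEN + JCUKEN.upper())

# Type the Latin name with the keyboard switched to Russian:
print("Foma Kiniaev".translate(AS_RUSSIAN))  # Ащьф Лштшфум
```

Running the passport name through the table reproduces the movie prop exactly.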

mrmimeisfunny

Dylan is easily one of the best presenters I've ever seen. Fantastic work.

vektracaslermd

Excellent. Can't wait to see what you say about UTF-8.

TeVolt

In the early 1980s, during one of my unemployment phases, I read various books in the nearby university science library - thus I happened to read about the committee processes for establishing ASCII. No idea now what the exact book was. Some of the contentions were whether and what to have for things like logical NOT and OR (as AND was obviously covered by &). I have a more vague memory that collation order was also much debated. Like some other commenters here, I've also written about character encoding history and issues, see my bio for links.

GeraldWaters

These little anecdotes from the world of computer history are super entertaining!

MusicEngineeer

Well, you've been ASCIIing for it

mrJety

This brought back memories of typing Romanized Sanskrit letters into my dissertation. I had Wordperfect macros typing the letter, then backspacing one position and typing the diacritical mark.

clasqm

Growing up in Zimbabwe was not a piece of Dylan Beattie Lore I expected to learn today

DragoniteSpam

Of all your stories, that Harry Potter one is one of the very best 😂

MeriaDuck

It's not just forgetting to change keyboards: sometimes it doesn't switch, sometimes you try to switch but it was already on the right language, sometimes the operating system gives you an extra keyboard or two for fun...

eliavrad

The octothorpe (#) was used in the late 19th and early 20th C for the pound avoirdupois (weight) in the US, especially for goods sold by the pound... So while it may not have been a Pound sign in the UK, in the US, as a postfix, it indicated weight pounds (not to be confused with pounds force, pounds thrust aka poundals, pounds mass, nor pounds sterling aka £), and when prefixed, it starts a numeric sequence.
I encountered this use a lot in late 19th and early 20th C US federal records from the then Territory of Alaska (now a US state) and Hawaii (also now a US state), especially for the pounds of supplies ordered and delivered. It is still used in the US to indicate either the numericity of the following characters, or a weight in pounds of the preceding digits.
As for Cyrillic, it is used in Alaska for Russian, and some dialects of Yupic and Inungan... (most Alaska Natives have now switched to using accented Latin...).

WilliamHostman

That WordStar screenshot is such a goldmine of nostalgia. I used a lot of different CP/M and DOS machines back in the olden days and they all had their differences and "killer apps"... but WordStar was the ONE constant. At college we had realised that the CP/M text editor was the ninth circle of hell, and some bright spark realised you could use WordStar in "non document mode" as quite a decent text editor, and so, until Microsoft put a cut-down version of QBASIC in DOS 5 and called it `EDIT`, WordStar followed me around for many years.

edgeeffect

There was also ISO 646, which to us Norwegians meant that we could find words like bl}b{rsyltet|y.
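For anyone who hasn't met the national ISO 646 variants: the Norwegian version reassigned the ASCII code points of [ \ ] { | } to Æ Ø Å æ ø å, so Norwegian text displayed on a US-ASCII device came out as punctuation. A minimal sketch of the substitution:

```python
# Norwegian ISO 646 reuses the ASCII code points of [ \ ] { | } for
# Æ Ø Å æ ø å; a US-ASCII terminal shows the punctuation instead.
NO_TO_US = str.maketrans("ÆØÅæøå", "[\\]{|}")

# "blåbærsyltetøy" is Norwegian for blueberry jam:
print("blåbærsyltetøy".translate(NO_TO_US))  # bl}b{rsyltet|y
```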

bauckrob

Ah, encodings... I work in integration, and let me tell you, the chaos is still with us to this day. I have written lengthy articles about various encoding problems, but here I will just touch on a single issue, the "MS Office character replacement problem".

First, a bit of history. ISO-8859-1 is a single-byte text encoding scheme which extends the 7-bit ASCII character set with most of the characters used in western European languages. When Microsoft made early versions of Windows, they based their default "ANSI" code page on this encoding; it eventually became known as Windows-1252. Ever since then, this has been the default encoding on most Windows machines in the western world. Since Windows was for many years only a workstation OS (there was no server version of Windows initially), this led to a situation where a lot of text data was being produced on Windows machines but would eventually have to be processed by other operating systems. Since Windows-1252 was not an official international standard encoding, other operating systems did not have support for it. However, since Windows-1252 was initially identical to ISO-8859-1, which other operating systems _did_ support, it became common for data written in Windows-1252 to be marked as ISO-8859-1. This allowed other OSes to read Windows-1252 data with no problems, and it seemed like a good idea at the time...

Now, ISO-8859-1 has a gap in its printable character definitions. The byte values 80-9F (32 in all) have no printable characters assigned; ISO-8859-1 reserves them for the little-used C1 control codes. When Microsoft developed Windows 2.0, someone had the thought that it would be great to have a few more characters available, and wouldn't you know it, here were a bunch of character codes that weren't used for anything. So they added a few more character definitions in the space unused by ISO-8859-1. Then they did it again for Windows 3.1 and a final time for Windows 98, so that today all but 5 of those 32 byte values have character definitions in Windows-1252. As a result, text data written in Windows-1252 can now potentially contain quite a lot of byte values which have no printable meaning in ISO-8859-1. So what are these extra characters? Well, they are mostly typographical characters, by which I mean characters meant to make text "prettier" than the standard characters in ISO-8859-1 allow for. These are things like left and right hand versions of both single and double quotes, two dash characters of different lengths, a bullet point and an ellipsis (three dots). Recall that Windows-1252 encoded data has historically often been intentionally mislabeled as being ISO-8859-1 data, and we begin to see how this could potentially lead to problems.
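The mislabelling problem described above is easy to reproduce. A minimal Python sketch (the sample string is my own):

```python
# CP1252 puts printable characters (curly quotes, en dash, ellipsis, ...)
# in the 0x80-0x9F range that ISO-8859-1 leaves to C1 control codes.
pretty = "\u201cHello\u201d \u2013 it\u2019s here\u2026"   # “Hello” – it’s here…
raw = pretty.encode("cp1252")

# Decoding the same bytes as ISO-8859-1 "succeeds", but every typographic
# character silently becomes an invisible C1 control character:
mislabelled = raw.decode("iso-8859-1")
print([hex(ord(c)) for c in mislabelled if 0x80 <= ord(c) <= 0x9F])
# ['0x93', '0x94', '0x96', '0x92', '0x85']
```

No exception is raised anywhere, which is exactly why the mistake travels so far down the pipeline before anyone notices.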

Then one glorious day, someone at Microsoft had the brilliant idea of helping end users write prettier text. How, you ask? By having all the MS Office programs (Word, Excel, Outlook, etc.) automatically replace some of the characters the users were typing _as they typed them_ with the "prettier" versions added to Windows-1252. And not as a function people had to switch on if they decided they _wanted_ this to happen to their text; no, it was switched on by default when Office was installed, and you had to manually find the setting and switch it off if you _didn't_ want it. Unsurprisingly, this aggravated the problem enormously, since so much text data was produced using MS Office. Instead of there being a mere _possibility_ that Windows-1252 encoded data might be decoded as ISO-8859-1 _and_ might contain characters not present in that encoding, it now became _highly probable_ that this would happen. And it did. A lot. And still does. All the time. And then I'm the one who has to fix it 😞

There is _a lot_ more I could say on this subject, but I think this is probably enough for a YouTube comment 😀

Wishbone

The Russian postal worker who translated that code page mistake was doing the Lord's work.

edwardallenthree

Lovely historical info trove. In the 1960s I grew up on Dartmouth Time-Sharing BASIC on a Teletype ASR 33, so I got to know ASCII pretty well. Fast forward to 2004, trying to maintain a Spanish website on JDK 1.4, which didn't support UTF-8 in property files. Had to copy and paste UTF-8 from Word documents into an app that converted UTF-8 to Unicode backslash escape characters. You've nicely covered quite an historical odyssey from Baudot to ASCII to EBCDIC to Code Pages and finally Unicode. Thank you, sir!
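For readers who never met it: the JDK shipped a tool called native2ascii for exactly this job, because .properties files were read as ISO-8859-1 until Java 9. Its core transformation can be sketched in a few lines (my own rough equivalent, handling only characters in the Basic Multilingual Plane):

```python
def to_unicode_escapes(text: str) -> str:
    """Replace each non-ASCII character with a Java-style \\uXXXX escape.
    (Characters outside the BMP would need surrogate pairs; omitted here.)"""
    return "".join(ch if ord(ch) < 0x80 else "\\u%04x" % ord(ch)
                   for ch in text)

print(to_unicode_escapes("sitio está listo"))  # sitio est\u00e1 listo
```

The escaped form survives any ASCII-safe transport, which is why the Java tooling leaned on it for so long.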

ArduinoRR

"Kohuept" reminds me of people calling "Moscow" as "Mockba".

pleappleappleap

It is nearly as hilarious as working with dates and timestamps.

kupferdrachevideosfurdich

I had never encountered “Kohuept” until I heard of it in Tom Scott’s “Lateral”. Now I can’t _unsee it_, and it seems to pop up somewhere at least once a month.

euromicelli

I'm just happy I never have to do a latin-1 to utf-8 database conversion again.

I'm also happy I never have to fix utf-8 stored with latin-1 connection to utf-8 stored with utf-8 connection again.
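For anyone still facing that second mess: the classic repair for UTF-8 that was decoded as Latin-1 somewhere along the way is to reverse the mix-up explicitly. A Python sketch (this only works while the data hasn't been mangled a second time):

```python
# "été" encoded as UTF-8 (b'\xc3\xa9t\xc3\xa9') but decoded as Latin-1
# comes out as the classic mojibake below. Undo it by re-encoding with
# the wrong charset, then decoding with the right one.
mojibake = "Ã©tÃ©"
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)  # été
```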

Merrinen