CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

preview_player
Показать описание


UTF-8 is taking on an increasingly important role in text processing. Many applications require the conversion of UTF-8 to UTF-16 or UTF-32, but typical conversion algorithms are sub-optimal. This talk will describe a fast, correct, DFA-based approach to UTF-8 conversion that requires only three simple lookup tables and a small amount of straightforward C++ code.

We'll begin with a quick review UTF-8 and its relation to UTF-16 and UTF-32, as well as the concept of code units and code points. Next, we'll look at the layout of bits within a UTF-8 byte sequence, and from that, show a simple algorithm for converting from UTF-8 to UTF-32. Along the way will be a definition of overlong and invalid byte sequences. Following that will be a discussion of how to construct a DFA to perform the same operations as the simple algorithm. We'll then look at code for the DFA traversal underlying the basic conversion algorithm, and how to gain an additional performance boost by using SSE intrinsics.

Finally, we'll compare the performance of this approach to several commonly-available implementations on Windows and Linux, and show how it's possible to do significantly faster conversions.

Bob Steagall, KEWB Computing
CppCon Poster Chair

I've been working in C++ since discovering the second edition of The C++ Programming Language in a college bookstore in 1992. The majority of my career has been spent in medical imaging, where I led teams building applications for functional MRI and CT-based cardiac visualization. After a brief detour through the world of DNS and analytics, I'm now working in the area of distributed stream processing. I'm a relatively new member of the C++ Standardization Committee, and launched a blog earlier this year to write about C++ and related topics. I hold BS and MS degrees in Physics, and I'm an avid cyclist when weather permits.


*-----*
*-----*
Рекомендации по теме
Комментарии
Автор

As dry as the topic is, this talk was amazing. It was clearly structured, most topics built on the previos ones and the wording was easy to understand. I watched it at a far too late hour, but understood most of it, although struggeling at the DFA explanation.
If someone asks me about utf-8 and parsing, I will recommend this talk.

OperationDarkside
Автор


Thank you for the great presentation!

gabrielaubut-lussier
Автор

Skip to somewhere around 14:00 if you already know how UTF-8 and friends work to get straight to talking about the code.

YSoreil
Автор

mm_set and mm_set1 intrinsics compile to RAM references. You should use _mm_setzero_si128 to zero registers.

soonts
Автор

I would've liked to see some kind of analysis of why his code is so much more efficient -- especially compared to the other DFA implementation. I suspect his performance advantage may evaporate once he handles the cases that the other implementations deal with.

ZiggyGrok
Автор

Want to have strlen() results to feel the speed.

mrmodtube
Автор

I actually tested many 'optimized' UTF-8 decoders the other day, unfortunately, many could not correctly handle overlong or incorrect codepoints (even if they claimed they could), report error positions for invalid/corrupted bytes accurately, or beat a naive UTF-8 decoder when decoding mostly ASCII (Due to branch mispredicts). I like the presentation of this topic, but unfortunately, most of these 'optimized' decoders just aren't practical for production software.

yyny
Автор

Extremely interesting presentation. IMHO it's C (but really high C quality) and not C++, but this doesn't remove anything to the quality of the talk.

robinmoussu
Автор

"DFAs can recognize simple regular expressions" thats kinda backwards, regular expressions were first defined as a way to describe regular languages, which are, by definition, accepted by DFAs. The problem is that Perl then implemented regexes via backtracking and started adding features that are not regular, but easy to implement when you have a backtracking solver...

dragdu
Автор

I'm surprised that Microsoft had the best results among the competition.

llothar
Автор

Energy consumption should also be compared.

bernadettetreual