CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

Показать описание

—
—
UTF-8 is taking on an increasingly important role in text processing. Many applications require the conversion of UTF-8 to UTF-16 or UTF-32, but typical conversion algorithms are sub-optimal. This talk will describe a fast, correct, DFA-based approach to UTF-8 conversion that requires only three simple lookup tables and a small amount of straightforward C++ code.

We'll begin with a quick review UTF-8 and its relation to UTF-16 and UTF-32, as well as the concept of code units and code points. Next, we'll look at the layout of bits within a UTF-8 byte sequence, and from that, show a simple algorithm for converting from UTF-8 to UTF-32. Along the way will be a definition of overlong and invalid byte sequences. Following that will be a discussion of how to construct a DFA to perform the same operations as the simple algorithm. We'll then look at code for the DFA traversal underlying the basic conversion algorithm, and how to gain an additional performance boost by using SSE intrinsics.

Finally, we'll compare the performance of this approach to several commonly-available implementations on Windows and Linux, and show how it's possible to do significantly faster conversions.
—
Bob Steagall, KEWB Computing
CppCon Poster Chair

I've been working in C++ since discovering the second edition of The C++ Programming Language in a college bookstore in 1992. The majority of my career has been spent in medical imaging, where I led teams building applications for functional MRI and CT-based cardiac visualization. After a brief detour through the world of DNS and analytics, I'm now working in the area of distributed stream processing. I'm a relatively new member of the C++ Standardization Committee, and launched a blog earlier this year to write about C++ and related topics. I hold BS and MS degrees in Physics, and I'm an avid cyclist when weather permits.
—

*-----*
*-----*

Рекомендации по теме

Комментарии

As dry as the topic is, this talk was amazing. It was clearly structured, most topics built on the previos ones and the wording was easy to understand. I watched it at a far too late hour, but understood most of it, although struggeling at the DFA explanation.
If someone asks me about utf-8 and parsing, I will recommend this talk.

OperationDarkside

Thank you for the great presentation!

gabrielaubut-lussier

Skip to somewhere around 14:00 if you already know how UTF-8 and friends work to get straight to talking about the code.

YSoreil

mm_set and mm_set1 intrinsics compile to RAM references. You should use _mm_setzero_si128 to zero registers.

soonts

I would've liked to see some kind of analysis of why his code is so much more efficient -- especially compared to the other DFA implementation. I suspect his performance advantage may evaporate once he handles the cases that the other implementations deal with.

ZiggyGrok

Want to have strlen() results to feel the speed.

mrmodtube

I actually tested many 'optimized' UTF-8 decoders the other day, unfortunately, many could not correctly handle overlong or incorrect codepoints (even if they claimed they could), report error positions for invalid/corrupted bytes accurately, or beat a naive UTF-8 decoder when decoding mostly ASCII (Due to branch mispredicts). I like the presentation of this topic, but unfortunately, most of these 'optimized' decoders just aren't practical for production software.

yyny

Extremely interesting presentation. IMHO it's C (but really high C quality) and not C++, but this doesn't remove anything to the quality of the talk.

robinmoussu

"DFAs can recognize simple regular expressions" thats kinda backwards, regular expressions were first defined as a way to describe regular languages, which are, by definition, accepted by DFAs. The problem is that Perl then implemented regexes via backtracking and started adding features that are not regular, but easy to implement when you have a backtracking solver...

dragdu

I'm surprised that Microsoft had the best results among the competition.

llothar

Energy consumption should also be compared.

bernadettetreual

CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

CppCon 2018: Bob Steagall “Fancy Pointers for Fun and Profit”

C++Now 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

C++Now 2018: Bob Steagall “If I Had My 'Druthers: A Proposal for Improving the Containers in C+...

C++Now 2018:Bob Steagall “Fancy Pointers for Fun and Profit”

Back to Basics: Classic STL - Bob Steagall - CppCon 2021

Back To Basics: Debugging Techniques - Bob Steagall - CppCon 2021

Adventures in SIMD-Thinking (part 1 of 2) - Bob Steagall - CppCon 2020

Back to Basics: Templates (part 1 of 2) - Bob Steagall - CppCon 2021

CppCon 2018: Fabian Renn-Giles “A Semi Compile/Run-time Map with (Nearly) Zero Overhead Lookup”

Back to Basics: The Structure of a Program - Bob Steagall - CppCon 2020

The Business Value of a Good API - Bob Steagall - CppCon 2019

CppCon 2018: Juan Manuel Martinez Caamaño “Easy::Jit: A Just-in-Time compilation library for C++”...

Back to Basics: Templates (part 2 of 2) - Bob Steagall - CppCon 2021

Adventures in SIMD-Thinking (part 2 of 2) - Bob Steagall - CppCon 2020

CppCon 2017: Bob Steagall “How to Write a Custom Allocator”

CppCon 2018: Robert Ramey “Safe Numerics”

C++Now 2019: Bob Steagall “Linear Algebra for the Standard C++ Library”

A Multi-threaded, Transaction-Based Locking Strategy for Containers - Bob Steagall - CppCon 2020

CppCon 2018: Rishi Wani “Datum: A Compact Bitwise Copyable Variant Type”

CppCon 2018: “Latest and Greatest in the Visual Studio Family for C++ Developers 2018”

CppCast Episode 166: CppCon Poster Program and Interface Design with Bob Steagall

CppCon 2016: Bob Steagall “Lightweight Object Persistence With Relocatable Heaps in Modern C++'...

CppCon 2018: “Secure Coding Best Practices: Your First Line Is The Last Line Of Defense (1 of 2)”...