Advanced Indexing: Json and Text

00:00:00 Welcome
00:01:31 Jackie Jiang Intro
00:02:28 Plug and Play with Apache Pinot Json Index
00:30:50 Siddharth Teotia Intro
00:32:38 Text Indexing in Pinot
01:07:30 Q&A

Talk 1: Plug and Play with Apache Pinot Json Index
-----------------------------------------------------

When ingesting data from an event stream (such as Kafka), the source events are often stored as nested or unstructured records. To ingest these records into a structured data store for further analysis, the fields must first be flattened and extracted. This is usually done by setting up a separate stream processing job (e.g. Flink) that pre-processes the stream and produces a new stream of structured records. Users then have to maintain a separate job and system, which is heavy overhead and a poor experience, especially when they want to try out a use case quickly.

With the Apache Pinot JSON index, these unstructured records can be consumed directly and stored as JSON strings. Pinot automatically flattens the records and builds an index on top of them to accelerate value lookups. Users get a plug-and-play experience with impressive performance and no longer need to maintain another system.
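To make the flatten-and-index idea concrete, here is a minimal conceptual sketch in Python. It is not Pinot's implementation: the helper names `flatten` and `build_json_index`, the `$.a.b` path notation, and the sample events are all assumptions chosen for illustration. It simplifies by collapsing array positions, so it cannot express several conditions on the same array element.

```python
import json
from collections import defaultdict


def flatten(value, path="$"):
    """Yield (json_path, scalar) pairs for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        for child in value:
            # Simplification: array positions are collapsed to [*].
            yield from flatten(child, f"{path}[*]")
    else:
        yield path, value


def build_json_index(json_strings):
    """Map each (path, value) pair to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, raw in enumerate(json_strings):
        for path, value in flatten(json.loads(raw)):
            index[(path, value)].add(doc_id)
    return index


# Hypothetical events, stored as raw JSON strings.
events = [
    '{"user": {"name": "adam", "tags": ["a", "b"]}}',
    '{"user": {"name": "bob", "tags": ["b"]}}',
]
index = build_json_index(events)

# A value lookup becomes a dictionary access instead of parsing every record.
matches = sorted(index.get(("$.user.name", "adam"), set()))
```

Here `matches` holds the ids of the records whose `user.name` is `adam`, found without re-parsing any JSON at query time.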

Presented by:
Jackie Jiang
Founding Engineer at StarTree, PPMC and Committer for Apache Pinot

Jackie received his bachelor's degree from Tsinghua University and his master's degree from Carnegie Mellon University. He then started his career at LinkedIn, where he spent 4 years and became a PPMC member and one of the top contributors to Apache Pinot. Jackie's goal is to make Apache Pinot the fastest online analytics platform on the market.

-----------------------------------------------------
Talk 2: Text Indexing in Pinot
-----------------------------------------------------

Pinot supports very fast query processing through its indexes on non-BLOB-like columns. Queries with exact-match filters on terms run efficiently through a combination of highly optimized native storage structures such as dictionary encoding, the inverted index and the sorted index. But what if the user wants to do arbitrary text search instead of exact matches? Pinot has supported this through the built-in function REGEXP_LIKE.

Unlike exact matches, however, a regex filter can’t be evaluated with indexes, so we resort to a full table scan, which is inefficient. For arbitrary text data that falls into BLOB/CLOB territory, we need more than exact matches: users want to run regex, phrase and fuzzy queries on BLOB-like textual data. To handle such queries efficiently, Pinot added support for text indexes on STRING columns, where each column value can be a blob of heterogeneous text. In this talk, we will go into the design and implementation of text index support, the challenges encountered, future work and performance numbers, along with insight into how we are using it at huge scale within LinkedIn.
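The difference between a regex scan and an index lookup can be sketched in a few lines of Python. This is only a conceptual illustration, not how Pinot implements its text index: the function names, the simple lowercase tokenizer and the sample rows are assumptions. It covers plain term queries only, not phrase or fuzzy matching.

```python
import re
from collections import defaultdict


def regex_scan(rows, pattern):
    """Full scan: every row is tested against the regex."""
    compiled = re.compile(pattern)
    return [i for i, text in enumerate(rows) if compiled.search(text)]


def tokenize(text):
    """Assumed tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_text_index(rows):
    """Inverted index: token -> set of row ids containing that token."""
    index = defaultdict(set)
    for doc_id, text in enumerate(rows):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def term_query(index, *terms):
    """Return ids of rows that contain all given terms (AND semantics)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical text column values.
rows = [
    "Java stack trace: NullPointerException in QueryExecutor",
    "Distributed systems and Java concurrency",
    "Machine learning with Python",
]
text_index = build_text_index(rows)

scan_hits = regex_scan(rows, r"[Jj]ava")                  # touches every row
index_hits = term_query(text_index, "java")               # one posting-list lookup
both_terms = term_query(text_index, "java", "concurrency")
```

The scan cost grows with the number of rows, while the index lookup only touches the posting lists of the queried terms; the trade-off is the extra storage and build time for the index.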

Presented by:
Siddharth Teotia
Senior Software Engineer @ LinkedIn, PPMC Apache Pinot, PMC Apache Arrow

Siddharth works at LinkedIn on the Pinot team, part of the Systems and Infrastructure group. Prior to LinkedIn, he worked at Oracle for 3.5 years in the Database kernel group on storage, indexing and in-memory columnar query processing. Prior to Oracle, Siddharth worked at Dremio for 2 years as one of the early engineers building out the distributed data lake query engine. He is also a PMC member for Apache Arrow and has previously given talks at multiple conferences and meetups.

-----------------------------------------------------
Resources
-----------------------------------------------------