ElasticSearch in Python #18 - Deep pagination: Search after VS From/Size


Hello everyone! In this video, I will show you the concept of pagination in Elasticsearch, specifically focusing on deep pagination.

Imagine you have 100,000 documents stored in an index. How do you efficiently retrieve the information you need? This is an important question, because attempting to fetch all documents at once can be both slow and inefficient.

Pagination comes into play here, allowing you to retrieve data in smaller chunks. This approach enhances the performance and efficiency of your system. Users benefit from a quicker search experience, and if you're utilizing Elasticsearch in the cloud, pagination proves to be cost-effective.

In Elasticsearch, there are two primary methods for implementing pagination: the "from/size" method, which is typically used for smaller indexes, and the "search_after" method, which is better suited for larger indexes.

1. The "from/size" method. With this method, the "from" parameter indicates where Elasticsearch should begin fetching documents, while the "size" parameter specifies how many documents to return. For instance, if "from" is set to 0 and "size" is set to 8, we will retrieve the first eight documents. If we change "from" to 5, Elasticsearch will skip the first five documents and return the next eight, starting from the sixth.
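To make the "from"/"size" arithmetic concrete, here is a minimal sketch. The helper names (from_size_body, local_page) are made up for illustration and are not part of the Elasticsearch client; the list slice only imitates what the two parameters select from already sorted results.

```python
from typing import Any, Dict, List


def from_size_body(from_: int, size: int) -> Dict[str, Any]:
    """Build a search body that skips `from_` hits and returns the next `size`."""
    if from_ < 0 or size < 0:
        raise ValueError("from_ and size must be non-negative")
    return {"from": from_, "size": size, "query": {"match_all": {}}}


def local_page(docs: List[Any], from_: int, size: int) -> List[Any]:
    """Imitate from/size on an in-memory, already sorted list of hits."""
    return docs[from_:from_ + size]


docs = list(range(100))                # stand-in for 100 sorted documents
first_eight = local_page(docs, 0, 8)   # from=0, size=8: documents 0..7
after_skip = local_page(docs, 5, 8)    # from=5, size=8: documents 5..12

# With the official Python client the request would look roughly like
#   es.search(index="my_index", from_=5, size=8, query={"match_all": {}})
# where "my_index" is a hypothetical index name.
```

Note that "from" counts documents, not pages: to get page N (starting at 0) with a fixed page size, you would pass from = N * size.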

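This method becomes impractical for deep pagination. By default, Elasticsearch rejects requests where "from" plus "size" exceeds 10,000 (the index.max_result_window setting). Even below that limit, every request has to collect and sort all the skipped documents before returning the page, so memory use and latency grow the deeper you go. That makes this approach unsuitable for large datasets.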
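The 10,000 limit applies to "from" plus "size" together (the default of the index.max_result_window setting). A small sketch of that check, using hypothetical helper names and the page size of 8 from the example above:

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window


def fits_result_window(from_: int, size: int,
                       max_window: int = DEFAULT_MAX_RESULT_WINDOW) -> bool:
    """True if a from/size request stays within the result window."""
    return from_ + size <= max_window


def reachable_full_pages(page_size: int,
                         max_window: int = DEFAULT_MAX_RESULT_WINDOW) -> int:
    """How many full pages of `page_size` hits from/size can reach."""
    return max_window // page_size


pages = reachable_full_pages(8)                       # 1250 pages of 8 hits
last_ok = fits_result_window(from_=9_992, size=8)     # 9,992 + 8 = 10,000
too_deep = fits_result_window(from_=10_000, size=8)   # 10,008 exceeds the window
```

Raising index.max_result_window is possible, but it only moves the limit; the cost of skipping documents stays.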

2. The "search_after" method. For this method, each document must include a sortable field, such as the document ID or a timestamp.

You begin by setting the "size" parameter (8 in this example) together with a sort order, and Elasticsearch returns the first eight documents along with their sort values. For the next request, you pass the sort values of the last document as "search_after", so Elasticsearch resumes right after that document instead of counting skipped ones. You repeat this with each page until a request returns no more documents.

The "search_after" method does not have the same 10,000 document limit and does not require the "from" parameter. The results returned by Elasticsearch must be sorted based on the sortable field. The method relies on a pointer derived from the sort values of the last document from the previous page.
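The "search_after" loop described above can be sketched as follows. This is not the code from the video: the field names ("timestamp", "doc_id") and all function names are assumptions, and make_fake_search is a tiny in-memory stand-in for the real search call, so the example runs without a cluster.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical sort: a timestamp plus a unique tiebreaker field.
SORT = [{"timestamp": "asc"}, {"doc_id": "asc"}]


def search_after_body(size: int, sort: List[Dict[str, str]],
                      search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build a request body; the first page has no search_after."""
    body: Dict[str, Any] = {"size": size, "sort": sort,
                            "query": {"match_all": {}}}
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iterate_hits(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                 size: int, sort: List[Dict[str, str]]) -> Iterator[Dict[str, Any]]:
    """Yield every hit, using the last hit's sort values as the pointer."""
    last_sort: Optional[List[Any]] = None
    while True:
        hits = search(search_after_body(size, sort, last_sort))["hits"]["hits"]
        if not hits:
            return  # an empty page means we passed the last document
        yield from hits
        last_sort = hits[-1]["sort"]


def make_fake_search(docs: List[Dict[str, Any]]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """In-memory stand-in that always sorts by (timestamp, doc_id) ascending."""
    ordered = sorted(docs, key=lambda d: (d["timestamp"], d["doc_id"]))

    def search(body: Dict[str, Any]) -> Dict[str, Any]:
        after = body.get("search_after")
        remaining = [d for d in ordered
                     if after is None
                     or (d["timestamp"], d["doc_id"]) > tuple(after)]
        page = remaining[:body["size"]]
        return {"hits": {"hits": [
            {"_source": d, "sort": [d["timestamp"], d["doc_id"]]} for d in page
        ]}}

    return search


docs = [{"doc_id": f"doc-{i:02d}", "timestamp": 1000 + i // 2} for i in range(20)]
search = make_fake_search(docs)
ids = [hit["_source"]["doc_id"] for hit in iterate_hits(search, size=8, sort=SORT)]
```

With a real cluster, the search argument could be something like lambda body: es.search(index="my_index", **body), assuming an 8.x Python client and a hypothetical index name. The second sort field is there because several documents can share a timestamp; without a unique tiebreaker, documents with equal sort values could be skipped or repeated between pages.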

When benchmarking both methods, I found that "search_after" handles larger indexes much better. On small datasets, both methods work perfectly fine.

In this series, we focus on using the Python client to interact with Elasticsearch.

Here is the link to the GitHub repository:

Useful links:

Don't forget to like, subscribe, and leave a comment if you have any questions or feedback!

Support us at:

⭐️ Contents ⭐️
(00:00) Intro + slides
(06:40) Code time
(16:23) The end

#3_code_campers #ElasticSearch #ElasticSearchPython