Speed up your SQL queries using Aerospike secondary indexes
With Aerospike Server 6.0, we have enhanced secondary index queries to allow for querying by partition, throttling, etc. The Trino connector now leverages this functionality to accelerate your secondary key queries at scale.
What is a secondary index?
A secondary index is a data structure used to quickly locate all the records in a namespace, or a set within it, based on a bin value in the record. When a value is updated in the indexed record, the secondary index automatically updates. You can retrieve records whose indexed value matches specified criteria using a secondary index query.
A query in Trino against a non-primary key bin that does not have a secondary index would invariably trigger a set (table) scan leading to an inefficient query. You can imagine the performance penalty that you would incur when dealing with a billion record table. Creating a secondary index (sindex) on a high-cardinality bin can help significantly speed up your Trino queries.
As shown in Figure 1, Aerospike secondary indexes are stored in DRAM for fast look-up, built on every partition in each cluster node, and are co-located with the primary index. Each secondary index entry contains only references to records local to the node.
Figure 1: Secondary index query
How do I get started?
Creating a secondary index on a bin in Aerospike is easy. You can use an Aerospike tool, such as asadm, or the API to dynamically create and remove indexes based on bins and data types you want to index. For an indexed bin, updating the record to include the bin updates the index. See how to create a secondary index for more information.
The Trino connector automatically analyzes the query and applies a secondary index to it, if one is available. If you know the cardinality of the secondary indexes, you can declare which sindex should be used for your query using the `sindex_name` session property. The `__sindex` table is created for each schema and provides details on available secondary indexes. You can change its name using the `aerospike.index-table-name` configuration property.
Here is an example of secondary indexes that we’ll see in action later:
We recommend that you provide the sindex to use in the query. The statement below shows an example of declaring a sindex to use:
set session aerospike.sindex_name=idx1;
If you do not declare a sindex for your query when multiple sindexes are available, the connector will pick the sindex with the highest lexical order from the list of available secondary indexes.
Will this really improve my Trino SQL query performance?
Prior to supporting secondary indexes, each query that was not a primary key search ended up with a full table scan on the Aerospike side. To illustrate my point, I will use a large (~20M records) data set to compare query response times with and without a secondary index.
A query for a particular `pkup_datetime`:
With a string secondary index:
Without a secondary index:
Another example of a range query on the `id` column:
Using a numeric secondary index:
Without a secondary index:
The response time is in seconds and we can see that the performance gain is significant. The primary index (PI) query was ~80 times slower than the secondary index (SI) query.
Aerospike secondary indexes can make a huge performance difference to your SQL queries. Make sure you read the official documentation regarding the compatibility and installation.
And finally, process your data faster than ever with the new Aerospike Connect for Presto!