Accelerating SQL on Aerospike

Director - Developer Ecosystem

November 4, 2022|7 min read

Aerospike Database is deployed by large-scale real-time applications in a wide range of verticals. Businesses need “as it happens” visibility over these systems - sometimes in near-real time - via notifications, ad-hoc queries, dashboards, and reports.

SQL is broadly used as a data access language for analytics, and Trino provides a powerful engine for SQL access to multiple data sources. Aerospike Trino Connector enables SQL access to Aerospike data through Trino, and more broadly, allows Aerospike to be used to expand fast analytics data accessible from Trino.

For analytics use cases, you can synchronize transactional and operational data from globally distributed Aerospike clusters to the Aerospike system of record (SOR) or analytics store in near real-time using Aerospike’s XDR protocol.

Aerospike: Enabling Fast Queries At Scale

Aerospike provides fast access to large volumes of data with parallelism at multiple levels.

Large volume of fast storage

The Aerospike Database cluster architecture spans multiple nodes for scale, performance, and reliability. High density of fast storage is achieved through SSDs. Aerospike pioneered the use of SSDs in databases over a decade ago, and it has since been deployed by many web-scale, mission-critical applications across several industries. Hybrid memory architecture (HMA) stores indexes and data in DRAM, SSD, and other devices to provide cost-effective fast storage capacity.

Low latency access

Aerospike optimizes disk (SSD) reads and writes for latency and throughput.

Like most other databases, indexes play a key role in realizing fast access to data. Aerospike supports secondary indexes on integer, string, geospatial, map, and list columns.

Multiple levels of parallelism

The thread architecture on a cluster node is optimized to exploit parallelism of multi-core processors of modern hardware, and also to minimize conflict and maximize throughput. The data is distributed uniformly across all nodes to maximize parallelism and throughput. The client library connects directly to individual cluster nodes and processes a request in a single hop by distributing the request to nodes where it is processed in parallel, and assembling the results.

Fine grained sub-queries

Aerospike distributes records in 4096 uniform partitions, and allows separate sub-queries over them for maximum parallelism. In other words, a query can be split into independent parallel sub-queries over one or more partitions, for the needed parallelism to match the required throughput. Further, each data partition can be subdivided into N sub-partitions by adding the modulo filter expression digest % N == i for 0 <= i < N, where digest is the hashed key of the record. Since digest is held in memory with the other record metadata, the filter expression evaluation for a record’s membership in a sub-partition requires no access to data on the SSD. Therefore, a sub-partition query reads only the data in its sub-partition, minimizing the SSD reads across the multiple sub-partitions. This sub-partitioning scheme allows for an arbitrary number of parallel streams.

Using this scheme, a large number of parallel tasks in Trino worker nodes can uniformly split the data for processing via an equal number of mutually exclusive and collectively exhaustive splits or streams using partition queries in combination with the modulo filter expression. The appropriate data scale, throughput, and response time can be achieved by adjusting the Aerospike cluster size as well as the number of attached SSD devices per node.

Stream access

Query results can be retrieved and processed in a stream of smaller chunks by repeatedly asking for a specific number of remaining records.

Aerospike Trino Connector

The Aerospike Trino Connector enables access to Aerospike data through Trino for analytics use cases such as ad-hoc SQL queries, reports, and dashboards. Data in multiple Aerospike clusters can be queried together using Trino’s data federation, which also makes it possible to merge Aerospike data with data from other sources.

Additional details about the Trino Connector can be found in the blog posts Deploy Aerospike and Trino based analytics platform using Docker and Aerospike Trino Connector - Chapter Two.

Starburst is a SQL-based MPP query engine based on Trino that enables you to run Trino on a single machine, a cluster of machines, on-prem or in the cloud. The blog post Analyze Data with Aerospike and Starburst Anywhere describes how to use Starburst Enterprise. Recently released Aerospike SQL Powered by Starburst 1.1.0 supports Starburst Enterprise Platform (SEP), which is "a fully supported, enterprise-grade distribution of Trino that adds integrations, improves performance, provides security, and makes it easy to deploy, configure, and manage your clusters."

Accelerating SQL Queries

The Connector aligns Trino and Aerospike to provide fast query processing.

Access to large volumes of fast storage

Using Aerospike expands the data available for fast processing on the Trino platform, as Aerospike is built to provide fast access to large volumes of data stored on SSDs.

Parallelism

The Trino Connector can be configured to provide an arbitrary number of splits or streams over a data set to match the desired parallelism on the Trino cluster.

Query optimization

The Trino Connector works with Trino’s Cost Based Optimizer (CBO), which optimizes data access by pushing down predicates and leveraging secondary indexes when they are present.

Additional points

In order to extract best performance, here are a few things to keep in mind:

Data modeling: How data is organized in Aerospike plays a crucial part in query performance. Related objects should be aggregated in a single record to the extent possible. While Trino is capable of performing joins across tables, a join over large data can be slow and resource intensive.
Indexes: Appropriate indexes should be defined in Aerospike to avoid full scans on the server. The embedded objects may be flattened in order to define and leverage indexes on embedded object fields.

Also note that group-by aggregates and ordering operations are always processed on Trino. While some aggregates may be pre-computed and stored in Aerospike for best performance, an ad-hoc aggregate request is processed on Trino.

Other SQL Options

Applications can access the data through SQL in a few other ways.

Spark

You can use Spark SQL to manipulate Aerospike data on the Spark platform. Aerospike Spark Connector provides parallel access to the Aerospike cluster from Spark.

Spark SQL merges two abstractions: Replicated Distributed Datasets (RDDs) and relational tables, and is used to manipulate and process data in RDDs. Find examples of importing and storing Aerospike data to and from RDDs in the Aerospike Spark tutorials.

More details on the Spark Connector are available in the blog posts Using Aerospike Connect for Spark and Accelerate Spark queries with Predicate Pushdown using Aerospike.

JDBC

Application developers can use simple SQL to access Aerospike data with the community-contributed JDBC Connector. Please read more details in the blog post Introducing Aerospike JDBC Driver.

Aerospike API

Applications requiring the full functionality of the Aerospike API can leverage the SQL patterns available in the APIs to implement specific SQL CRUD operations easily.

Aerospike Database’s fast high-capacity storage and parallel processing aligns with Trino’s distributed SQL query engine to accelerate query processing over large data sets. Aerospike’s hybrid memory architecture (HMA) leverages SSDs along with DRAM to greatly expand fast storage capacity in its cluster. Further, Aerospike distributes data and processing over a large number of partitions, providing a high degree of parallelism. For Trino queries over data in Aerospike, the result is accelerated performance at scale.