Source: Photo by Alex wong on Unsplash Unsplash
SQL is broadly used as a data access language for analytics. Even if you are an application developer, chances are you have used it or at least are familiar with it.
Aerospike has broad support for SQL, enabling you to use SQL to access Aerospike data in multiple ways.
For analytics, you can access Aerospike data on Trino with the Aerospike Trino Connector.
Through Trino, analytics use cases such as ad-hoc SQL queries, reports, and dashboard have access to data in one or more Aerospike clusters, and they can also merge Aerospike data with data from other sources.
For more details of the Trino Connector, see the blog posts Deploy Aerospike and Trino based analytics platform using Docker and Aerospike Trino Connector - Chapter Two.
Starburst is a SQL-based MPP query engine based on Trino that enables you to run Trino on a single machine, a cluster of machines, on-prem or in the cloud. The blog post Analyze Data with Aerospike and Starburst Anywhere describes how to use Starburst Enterprise.
The data browser described in the blog post Aerospike Data Browser uses Trino with the Trino Connector underneath.
You can use Spark SQL to manipulate Aerospike data on the Spark platform. Aerospike Spark Connector provides parallel access to the Aerospike cluster from Spark.
Spark SQL merges two abstractions: Replicated Distributed Datasets (RDDs) and relational tables. Find examples of importing and storing Aerospike data to and from RDDs in these Aerospike Spark tutorials. You can use Spark SQL to manipulate and process data in RDDs.
More details on the Spark Connector are available in the blog posts Using Aerospike Connect for Spark and Accelerate Spark queries with Predicate Pushdown using Aerospike.
Application developers can use simple SQL with JDBC with the community-contributed JDBC Connector. Please read more details in the blog post Introducing Aerospike JDBC Driver.
While the various connectors allow broad SQL access for multiple purposes, the connectors may not be suitable for general applications as they do not provide the full Aerospike API functionality that a general application needs. For example, update capabilities are limited through SQL.
We recommend that you use the Aerospike API to access its full functionality and performance. Aerospike, a NoSQL database, does not directly support all SQL features. Inversely, Aerospike has many capabilities that cannot be expressed in SQL. This is to be expected because SQL is designed to provide physical data independence, which means the user need not worry about the physical details of the data such as the data distribution, size, selectivity, indexes, and so forth. The query optimizer deals with these details and selects the best execution plan. The goal of the Aerospike API is to provide full control to developers for optimal performance of their applications.
This article describes how a developer who is familiar with SQL can quickly implement specific SQL CRUD operations using the Aerospike API. The goal is not to discuss the many mechanisms to control optimal performance (although it points to some of them), but to provide a ramp for a developer who has some knowledge of SQL to map the basic CRUD queries into the Aerospike API. We encourage you to learn about the performance features using the pointers provided.
While Aerospike supports many languages, we have used the Java client API in our examples as it is most widely used. The functionality is similar across all client libraries, and you can find equivalent functions in each.
Mapping SQL to Aerospike
While there is no direct mapping of full SQL to Aerospike API, simple CRUD functionality can be easily mapped to Aerospike API as the underlying data models are similar: Aerospike’s set-record-bin organization matches the SQL’s table-record-column organization (see below).
We point out differences and unsupported constructs below. They need to be handled through alternative means such as specific features, libraries, and application code.
Aerospike has a record-based data model. An Aerospike Database holds multiple namespaces, which are equivalent to databases in the relational model. A namespace holds records (rows), organized in sets (tables) that are accessed using a unique key that serves as the record ID. A record can contain one or more bins (columns), and a bin can hold a value of different data types. Sets and records do not conform to any schema. The primary index provides fast access to a record by key, which is a unique record identifier, while secondary indexes defined on a bin are supported for content based access.
|SQL concept||Aerospike equivalent|
|Database or schema||Namespace|
|Index||Primary and Secondary indexes|
|Stored Procedures||User Defined Functions (UDFs)|
Aerospike is a NoSQL database, and its API has many differences from SQL databases. Following are some key differences:
- Set: A set is a tag on the record that gets created when the first record is created in the set. A set is schemaless and can hold records holding different bins.
- Record: A record is schemaless, and can hold any combination of bins.
- Bin: A bin is typeless, and can hold a value of any type.
- Index: Integrity constraints, such as uniqueness, cannot be specified on an index.
- Transactions: All single record requests are transactional. The transaction boundary does not span multiple records. For a detailed discussion, see the blog post Developers: Understanding Aerospike Transactions.
Constructs Not Directly Supported
Due to the differences in its data and execution models, Aerospike API does not directly support the following SQL constructs, however, they can be implemented using data modeling, alternative features, and application code. We will discuss them later.
- Aggregations (max, min, top, average, sum, etc.)
- Order By, Distinct, Union
- Constraints: NULL, Foreign Key, Default
- Built-in functions
For the purpose of our discussion, SQL queries can be organized in these categories:
- SELECT or read operations,
- CREATE, UPDATE, DELETE or write operations,
- Metadata operations, and
- Other functionality.
Use these interactive tutorials to work along with this text.
A Word on Key, Metadata, Policy, and API Variants
Before we dive in, it is useful to know record key, record metadata, operation policy, and API variants.
Each record is uniquely identified by a key or id, consisting of a triple: (namespace, set, user-key) where user-key is a user-specified id that is unique within the set. The key (also called the digest) is returned in all read APIs.
Each record has metadata associated with it: generation (or version) and expiration time (or time-to-live in seconds). This metadata is returned in all read operations. It is possible to retrieve only the metadata without the record's bins through the "getHeader" operation explained below.
Aerospike API calls take a policy parameter which includes many details of the how and what of the request. For example, timeout, retries, filter expression, and additional write semantics are specified in the policy object. We will specify significant policy info that is relevant to the operation semantics in each operation below.
Aerospike API is designed for control and simplicity. As such, a read and write operation that has one form in SQL has multiple variations in Aerospike API:
- By number of records involved: Single record, batch, and query
- By the processing mode: Sync, async, and background
In the following examples, only the synchronous APIs are shown when available, but you can easily discover the asynchronous variants in the documentation.
SQL SELECT and Equivalent Read Operations
Single-Record Read Operations
|SQL Query||Equivalent Aerospike API (Java)|
There is a variant of single record retrieval to check a record's existence.
It is possible to only obtain a record's header info or metadata, consisting of generation (or version) and expiration (time-to-live in seconds).
Batch Read Operations
A batch request operates on a list of records identified by the keys provided. It works similar to a single record retrieval, except multiple records are returned.
Batch requests are critical for high performance applications as they eliminate multiple client-server round trips, one for each record.
There is a variant of batch retrieval to check record existence.
It is possible to obtain header info or metadata consisting of generation (or version) and expiration time (time-to-live in seconds) for a specified set of records.
Composite Batch Read
A more general form of batch reads is also available that provides a union of simple batch results with different namespace, set, and bin specification. The
records argument takes the input record keys and populates record details on return.
Predicate-Based Read Operations
In predicate-based read operations (aka queries), records matching a general predicate or condition are retrieved. In SQL, the predicate is specified in the WHERE clause. Aerospike provides two ways of performing an SQL query:
- Using a secondary index based predicate, which can optionally be ANDed with an expression filter
- Using a scan (which uses the primary “key” index), which can optionally be ANDed with an expression filter
Secondary Index Query
While a query in SQL doesn’t require an index to exist, the query API in Aerospike requires that the corresponding secondary index exists.
The namespace, set, and secondary index based predicate is specified in the
statement argument. The expression filter is optionally specified in the
policy argument for additional conditions to be ANDed.
The scan operation takes a
callback object which is called for every record in the result (within the scope of the call which remains blocked until the operation completes).
The expression filter is optionally specified in the
SQL CREATE, UPDATE, DELETE and Equivalent Write Operations
Aerospike combines Create and Update in a single write operation. The following
record-exists-action options specified in the write-policy define the operation semantics if the record already exists:
- create-only: Create if record doesn't exist, fail otherwise.
- update: Create if record doesn't exist, update otherwise.
- update-only: Update if record exists, fail otherwise.
- replace: Create if record doesn't exist, replace otherwise.
- replace-only: Replace if record exists, fail otherwise.
SQL INSERT maps to
create-only and SQL UPDATE maps to
update-only options. SQL does not have a way to specify other options, such as replace, which removes an existing record.
Single-Record Write Operations
INSERT and UPDATE
The put operation handles Create (Insert) and Update.
Type-Specific Write Operations
Aerospike allows type-specific update operations. For integer and string types, they include the following. The
bins argument holds multiple bin objects, each with the bin name and the operand value.
Batch Write Operations
A batch write operates on multiple records specifically identified with a list of keys. There is a batch API for insertion, update, and deletion of multiple records. Two forms of batch writes are shown below. Other forms including one with a UDF (described below) and key-specific operations are described in the blog post Batch Operations.
The argument ops is a list of operations to be performed in the specified sequence on each record, and can include read as well as write operations. The argument
batchPolicy contains the specifics of how the batch is processed, whereas the arguments
deletePolicy have the specifics of how the respective individual record operation is performed.
Predicate-Based Write Operations
Predicate-based updates and deletes are possible by specifying the WHERE condition using the secondary index predicate (specified in a statement object) and expression filter (specified in the write policy) as explained earlier.
Predicate-based updates and deletes can involve a large number of records, and therefore are processed in background execution mode with the execute API. Sync and callback async modes are not available. Two forms of execute are possible:
- Using a list of bin updates and deletes: A multi-op request provides a list of bin operations. Multi-op requests are further described below.
execute performs in a background mode with no returned results, the operation list in the
statement object cannot have a read operation, only updates.
- User Defined Functions (UDFs): UDFs are equivalent to stored procedures, and are described further below. Record-oriented UDFs implement arbitrary logic in a Lua function that is registered with the server and invoked through an API call.
SQL Stored Procedures and Aerospike User Defined Functions (UDFs)
User Defined Functions (UDFs) are equivalent to stored procedures in SQL systems. A custom User Defined Function (UDF) is written in Lua, registered on the server, and invoked for a specified record(s). You can find further details in the documentation on User Defined Functions (UDFs).
In the following example, the UDF is specified using the arguments
functionName, and supplied a list of arguments it expects in
functionArgs. The API returns a generic
Object which can be anything like a single value or a map of key-value pairs.
A UDF can have arbitrary logic combining CRUD operations.
In Aerospike, aggregation functions such as MIN, MAX, AVERAGE, SUM, etc, over multiple records are implemented with Stream UDFs. This article does not cover the specifics of Stream UDFs; please refer to the tutorials on SQL: Aggregates.
Multiple single bin read and write operations are possible through the
operate API. It differs from the dedicated “single-op” requests, which allow just one operation. The operations in the argument
operations are executed atomically and in the order specified,
Unlike in SQL, read and write operations can be combined in the same request (for single-record and batch requests) as illustrated below.
Multi-op operate APIs are available for a single record, batch, and query operations. See this tutorial that illustrates multi-ops.
There is no API to create a namespace. A namespace is added through the config and requires a server restart.
The truncate API removes all records in a set or the entire namespace.
There is no API to delete a namespace. A namespace has one or more dedicated storage devices, and they must be wiped clean to delete the namespace.
There is no explicit operation to create a set. A set is created when the first record is inserted in the set.
A set is schemaless, and can hold records that have different schemas or bins. A bin has no type associated with it, and can hold values of any type. Therefore ALTER operation on a set to modify its schema is not needed.
All records in a set can be truncated using the truncate API:
There is no notion of deleting a set as a set is just a name that a record is tagged with. The namespace must be deleted to remove the set name.
An index is created on a bin for a specific value type. Integer, string, and GeoJSON types are currently supported for indexing.
serverPath below define the path to UDFfile on the client and server respectively.
Other SQL Capabilities
Most NoSQL databases do not have the Join operation as it is slow and complex. You can avoid Joins by storing the joined objects in aggregate form. Alternatively, the join can be performed in the application by retrieving the referenced object.
The policy parameter
max-records can be specified as a hint. Fewer objects may be returned as the limit gets divided among participating nodes.
Order By, Top, Union, Distinct
List and Expressions can be used to implement these operations. Alternatively, they can be performed in the application.
Aggregations involving Group-By, Having, and Aggregate Functions (such as Max, Min, Top, Average, Sum) can be implemented using Stream UDFs as shown in the tutorials SQL Aggregates - Part 1 and Part 2.
Integrity constraints such as NULL, Foreign Key, Default should be handled in the application logic. The uniqueness constraint can be enforced in a List or Map.
Many built-in functions like UPPER, TRIM, can be implemented with Expressions or UDFs.
Going Beyond SQL with Aerospike
In order to get the most out of Aerospike for speed-at-scale, thinking beyond SQL is necessary. The process starts with modeling your data for performance, scale, and other needs of the application. Please review the series Data Modeling for Speed At Scale.
- Collection Data Types (CDTs)
- Multi-op requests
- Batch requests
- Secondary indexes
- Set indexes
- Complex Data Types - Binary, HLL, GeoJSON
- User Defined Functions (UDFs)
You can use SQL to access Aerospike data through the Trino, Spark, and JDBC Connectors. While the connectors work quite well for the environment and intent they are built for, they do not provide the full Aerospike API functionality that the application may need. Therefore, use of the Aerospike API is recommended for full functionality and performance. The Aerospike API is designed with the goal of enabling developers of high performance applications who need to control performance specific details to make better decisions.
The article describes how a developer who is familiar with SQL can quickly implement specific SQL CRUD operations using the Aerospike API. Coming from a SQL background, it is important to remember that through NoSQL data modeling one should be able to avoid certain SQL features entirely such as the Join to maximize the benefit of using Aerospike for performance and scale. With the introduction provided in this article, you should be able to take the next step to learn the mechanisms in Aerospike API to optimize your application’s performance and scale.
- SQL Operations (interactive tutorials)
- Aerospike Documentation (documentation)
- Aerospike Tutorials (interactive tutorials)
- Aerospike Developer Hub
- Developers: Understanding Aerospike Transactions (blog post)
- Collection Data Types (CDTs) (documentation)
- Collection Data Types (CDTs) (interactive tutorials)
- Batch Operations (blog post)
- Introduction to Transactions (interactive tutorial)
- User Defined Functions (UDFs) (documentation)
- Data Modeling for Speed At Scale (blog post)
- Deploy Aerospike and Trino based analytics platform using Docker (blog post)
- Aerospike Trino Connector - Chapter Two (blog post)
- Analyze Data with Aerospike and Starburst Anywhere (blog post)
- Aerospike Data Browser (blog post)
- Aerospike Spark Tutorials (interactive tutorials)
- Using Aerospike Connect for Spark (blog post)
- Accelerate Spark queries with Predicate Pushdown using Aerospike (blog post)
- Introducing Aerospike JDBC Driver (blog post)