Extending Redis API with a Native Time Series Data Type
Defining time series data
Common examples of time series data include:
- Stock quote feed.
- Order history for a user in an online retailer.
- User activity in any application.
- Data gathered from IoT sensor devices.
In general, time series data has the following characteristics:
- Each data point in a time series (e.g. historical CPU stats, event log from a device, stock quote feed etc.) is a <timestamp, value> pair which denotes the observation recorded at a point in time.
- Generally, data arrives in increasing timestamp order and just needs to be appended to the end of the time series, but sometimes data points may arrive out of order and need insertion in the middle.
- Data is usually read by specifying a time window. For example, reading all data points collected for a metric between 5pm and 7pm today.
- There is often a time-to-live (TTL) associated with the data points, which depends on the timestamp of each data point. For example, for a given series, we may want to purge each data point once 7 days have elapsed since its timestamp.
Time series data in Redis Open Source
Today, time series data is modeled in Redis in two main ways:
- Using Sorted Sets
- Using plain key-values
Both of these are complex to model, and lead to poor performance, as explained below.
1. Modeling time series data with Redis Sorted Sets
One way to model the data in Redis Sorted Sets would be to use the score as the timestamp for the event and the value as the measurement at that time. For example:
ZADD cpu_usage 201708110501 "70%"
The problem with this approach is that a sorted set keeps each member unique. If a later measurement has the same value, 70%, ZADD only updates the score of the existing member, so the older data point is lost. A common workaround is to add the timestamp to the value field:
ZADD cpu_usage 201708110501 "201708110501:70%"
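The uniqueness problem and the workaround can be illustrated with a minimal Python sketch. It is not Redis code: the `zadd` helper here is a hypothetical stand-in that models only the member-to-score mapping of a sorted set with a plain dict.

```python
# Toy model of the member -> score mapping that a sorted set maintains.
# Members are unique, so re-adding a member only updates its score.
def zadd(zset, score, member):
    zset[member] = score

plain = {}
zadd(plain, 201708110501, "70%")
zadd(plain, 201708110502, "70%")  # same member: the 05:01 reading is lost

prefixed = {}
zadd(prefixed, 201708110501, "201708110501:70%")
zadd(prefixed, 201708110502, "201708110502:70%")  # distinct members: both kept

print(len(plain), len(prefixed))
```

With the timestamp prefix both readings survive, at the cost of storing each timestamp twice.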
This approach has the following drawbacks:
Higher memory and CPU consumption
We use extra memory space for the value (because the timestamp is repeated in the score and value) and also have some custom serialization and deserialization CPU overhead for the value.
Two maps for storage
Furthermore, Sorted Sets internally have the overhead of keeping two maps and updating both of them as the data is mutated:
- We need a map from the score to all associated values to support operations like ZRANGEBYSCORE and ZREMRANGEBYSCORE, since Sorted Sets are sorted based on the score.
- We need a map from the value to the score to efficiently support operations like ZSCORE, ZREM, ZRANK and ZADD.
For a time-series-like workload, we would use ZADD to add new data points (which writes to both maps), ZRANGEBYSCORE to get data for the range of timestamps that we are interested in (which incurs a lookup in a single map) and ZREMRANGEBYSCORE to remove old timestamps (which has to delete the entry from both maps).
Additionally, storing the value twice is inefficient, especially if the value is very large. Time series use cases do not need to look up the timestamp given the value.
As you can see, there is a significant overhead associated with writing and deleting data if we model time series data on Sorted Sets. Ideally for a write-heavy workload like time series, we would like these operations to be fast and low on resource consumption.
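To make the write overhead concrete, here is a toy Python model of the two structures described above. It is only an illustration under simplifying assumptions (integer scores, a dict plus a sorted list), not how Redis implements Sorted Sets; the class name and the `updates` counter are hypothetical.

```python
import bisect

class ToySortedSet:
    """Illustrative model only; integer scores are assumed."""

    def __init__(self):
        self.score_of = {}  # member -> score (serves ZSCORE, ZREM, ZADD)
        self.by_score = []  # sorted (score, member) pairs (serves range queries)
        self.updates = 0    # structure updates paid for by writes and deletes

    def zadd(self, score, member):
        if member in self.score_of:
            self.by_score.remove((self.score_of[member], member))
        self.score_of[member] = score
        bisect.insort(self.by_score, (score, member))
        self.updates += 2  # one update per structure

    def zrangebyscore(self, lo, hi):
        # Reads touch only the sorted structure.
        start = bisect.bisect_left(self.by_score, (lo,))
        end = bisect.bisect_left(self.by_score, (hi + 1,))
        return self.by_score[start:end]

    def zremrangebyscore(self, lo, hi):
        doomed = self.zrangebyscore(lo, hi)
        for _score, member in doomed:
            del self.score_of[member]
        start = bisect.bisect_left(self.by_score, (lo,))
        del self.by_score[start:start + len(doomed)]
        self.updates += 2 * len(doomed)  # each removal hits both structures
        return len(doomed)

zset = ToySortedSet()
zset.zadd(201708110501, "201708110501:70%")
zset.zadd(201708110502, "201708110502:70%")
removed = zset.zremrangebyscore(201708110501, 201708110501)
print(zset.updates, removed, zset.by_score)
```

In this model every insert and every delete costs two structure updates, while a range read touches only one structure.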
No native TTL support
Additionally, the lack of a finer-grained TTL (time-to-live) in Sorted Sets makes purging data tedious. Sorted Sets support a TTL only at the set level, which purges all the data together. For time series data, however, we would like each data point to live for the TTL amount from its insertion point. For example, if we set a TTL of 7 days on a Sorted Set, after 7 days all the data would be purged, including data that was inserted a few minutes ago! What we really want is to always keep the data for the last 7 days. As a result, we need application-driven purging logic to delete expired entries, which adds further complexity to the system.
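As a rough sketch of what such application-driven purging might look like, the Python helper below computes the cutoff score and builds the ZREMRANGEBYSCORE command the application would have to issue periodically. The function names and the 7-day default are assumptions for illustration, and the scores are assumed to use the yyyymmddhhmm encoding from the examples above.

```python
from datetime import datetime, timedelta

def purge_cutoff(now, retention_days=7):
    """Score (yyyymmddhhmm) of the oldest data point that should be kept."""
    cutoff = now - timedelta(days=retention_days)
    return int(cutoff.strftime("%Y%m%d%H%M"))

def purge_command(key, now, retention_days=7):
    """Command that removes every entry scored strictly below the cutoff."""
    # "(" marks an exclusive bound in Redis score ranges.
    return f"ZREMRANGEBYSCORE {key} -inf ({purge_cutoff(now, retention_days)}"

now = datetime(2017, 8, 18, 5, 1)
print(purge_command("cpu_usage", now))
```

The application has to schedule this itself and keep the retention period consistent across every writer.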
2. Modeling time series data with plain key-values
We can model time series data using plain key-values by having a key which includes the name of the time series (e.g. cpu_usage), the window size of the data (minutely, hourly etc.) and the starting timestamp of the window. The value would be a serialized blob containing all the data for that window:
SET cpu_usage:hourly:2017-12-18:17:00 <blob>
This would store all the data points for the time series cpu_usage in the range [2017-12-18:17:00, 2017-12-18:18:00). The starting timestamp of each bucket could be a well-defined timestamp (for example, at the tick of each hour or minute). As a result, for an arbitrary time range, we can retrieve the overlapping buckets and read the appropriate values from them. However, this approach has several drawbacks:
- For the current window, we need to perform a read-modify-write of the entire serialized blob every time we add a new data point, until the window's end time has passed. Since data sometimes arrives out of order, we might also have to update older windows. This is an extremely expensive operation.
- In addition, the read-modify-write needs proper synchronization on the application side if we have multiple threads/processes/machines updating the same data.
- One way to avoid the read-modify-write would be to buffer all data in memory and write it in a single batch. However, if our application crashed before we could write the data, we’d lose all the buffered data.
- Also, keeping the database and cache in sync is tricky in this approach since we need to update both.
- If we store data hourly, reads always fetch data in fixed hourly chunks, so we might read a lot of data that we don't need. To avoid this we could keep multiple buckets with different granularities (e.g. minutely, hourly), although this incurs an overhead during writes, since we need to update multiple key-values.
- We need to build some logic in our application to read data for a custom time range, since we need to read all the relevant time windows and pick out the appropriate values. This operation is also not efficient, since the data is not necessarily contiguous in memory.
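The range-read logic from the last point can be sketched as follows. This is a minimal Python example, assuming hourly buckets and the key layout shown earlier (with plain hyphens in the date); the function name is hypothetical.

```python
from datetime import datetime, timedelta

def hourly_bucket_keys(series, start, end):
    """Keys of the hourly buckets that overlap the half-open range [start, end)."""
    bucket = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while bucket < end:
        keys.append(f"{series}:hourly:" + bucket.strftime("%Y-%m-%d:%H:%M"))
        bucket += timedelta(hours=1)
    return keys

keys = hourly_bucket_keys(
    "cpu_usage",
    datetime(2017, 12, 18, 17, 40),
    datetime(2017, 12, 18, 19, 10),
)
print(keys)
```

A 90-minute query here touches three full hourly blobs, and the application still has to deserialize each one and discard the points outside the requested range.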
YugaByte DB Time Series (TS) data type
Recognizing these issues in modeling time series data on Redis, we have worked closely with various customers to address this use case in YugaByte DB’s Redis compatible YEDIS API.
YugaByte DB’s Redis compatible YEDIS API supports a native time series (TS) data type that can handle all the above requirements with ease of data modeling and very high performance. Also, since YEDIS is actually a “Redis as an elastic, fault-tolerant database”, you do not need to worry about persisting your data in a separate data store!
The YEDIS TS data type is essentially a sorted map from a 64 bit integer to a single object/value. As a result, it’s much easier to model time series data using this data type. The 64 bit integer can be used as the timestamp for the time series data and the value as the associated measurement. This means that adding new data points is very easy (insert into a single map) and they are naturally ordered. Since the map is sorted on the 64 bit integer, it’s easy to retrieve a range of timestamps that we are interested in. In addition to this, we support setting a TTL for each timestamp.
In these examples, the timestamp is encoded as an integer in yyyymmddhhmm format for readability. We can also use unix timestamps if that is better suited.
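Either encoding fits comfortably in a 64 bit integer. A small Python sketch of both conversions is shown below; the helper names are hypothetical, and the unix conversion assumes the datetime is expressed in UTC.

```python
from datetime import datetime, timezone

def to_yyyymmddhhmm(dt):
    """Human-readable integer encoding, minute resolution."""
    return int(dt.strftime("%Y%m%d%H%M"))

def to_unix_seconds(dt):
    """Seconds since the epoch; assumes dt is a naive datetime in UTC."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

reading_time = datetime(2017, 8, 11, 5, 1)
print(to_yyyymmddhhmm(reading_time), to_unix_seconds(reading_time))
```

Both encodings sort in chronological order, which is what a sorted map keyed on the integer relies on.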
To add data to the time series, we can use the TSADD operation. This operation just needs to insert a key-value pair into a single map for each (timestamp, value) pair.
> TSADD cpu_usage 201708110501 "80%" 201708110502 "60%"
"OK"
To look up the value for a single timestamp we can use the TSGET command, which needs to look up a single map to return the value.
> TSGET cpu_usage 201708110501
"80%"
To retrieve all the data within a time range, we can use TSRANGEBYTIME. This operation is efficient since our map is already sorted on the timestamps.
> TSRANGEBYTIME cpu_usage 201708110501 201708110503
1) 201708110501
2) "80%"
3) 201708110502
4) "60%"
To delete a time series entry, we can use TSREM. This operation is efficient since it’s just deleting an entry from a map.
> TSREM cpu_usage 201708110501
"OK"
> TSGET cpu_usage 201708110501
(nil)
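The semantics of these four commands can be modeled with a single sorted structure. The Python sketch below is only a toy model of a sorted map, not YugaByte DB's implementation: the class and method names are hypothetical, Python has no built-in sorted map so two parallel lists stand in for one, and two behaviors are assumptions (TSADD overwrites an existing timestamp, and TSRANGEBYTIME is inclusive at both ends).

```python
import bisect

class ToyTimeSeries:
    """Toy sorted map: timestamps stay sorted, values[i] belongs to timestamps[i]."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def _find(self, ts):
        i = bisect.bisect_left(self.timestamps, ts)
        found = i < len(self.timestamps) and self.timestamps[i] == ts
        return i, found

    def tsadd(self, *args):
        # args alternate: timestamp, value, timestamp, value, ...
        for ts, value in zip(args[0::2], args[1::2]):
            i, found = self._find(ts)
            if found:
                self.values[i] = value  # assumption: overwrite on duplicate
            else:
                self.timestamps.insert(i, ts)  # works for out-of-order arrivals
                self.values.insert(i, value)
        return "OK"

    def tsget(self, ts):
        i, found = self._find(ts)
        return self.values[i] if found else None

    def tsrangebytime(self, lo, hi):
        start = bisect.bisect_left(self.timestamps, lo)
        end = bisect.bisect_right(self.timestamps, hi)  # assumption: inclusive
        return list(zip(self.timestamps[start:end], self.values[start:end]))

    def tsrem(self, ts):
        i, found = self._find(ts)
        if found:
            del self.timestamps[i]
            del self.values[i]
        return "OK"

cpu_usage = ToyTimeSeries()
cpu_usage.tsadd(201708110502, "60%", 201708110501, "80%")  # arrives out of order
in_range = cpu_usage.tsrangebytime(201708110501, 201708110503)
cpu_usage.tsrem(201708110501)
after_rem = cpu_usage.tsget(201708110501)
print(in_range, after_rem)
```

Unlike the sorted set model, each write and delete touches one ordered structure, and duplicate values at different timestamps need no prefixing.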
We can use the EXPIRE_IN and EXPIRE_AT options of TSADD to set the TTL (time-to-live) for each timestamp.
// This entry would expire in 3600 seconds (1 hour)
> TSADD cpu_usage 201708110504 "40%" EXPIRE_IN 3600
"OK"
// This entry would expire at the unix timestamp 1513642307
> TSADD cpu_usage 201708110505 "30%" EXPIRE_AT 1513642307
"OK"
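The difference from a set-level TTL is that each data point carries its own expiry. The following Python sketch models that idea only; it is not how YEDIS stores or enforces TTLs. The helper names, the assumed current time `now`, and the third data point are all hypothetical.

```python
def absolute_expiry(now, expire_in=None, expire_at=None):
    """Turn either option into an absolute unix timestamp (None = never expires)."""
    if expire_in is not None:
        return now + expire_in
    return expire_at

def live_points(points, now):
    """Keep only the points whose own expiry has not yet passed."""
    return {
        ts: value
        for ts, (value, expires) in points.items()
        if expires is None or expires > now
    }

now = 1513638707  # assumed "current" unix time for this illustration
points = {
    201708110504: ("40%", absolute_expiry(now, expire_in=3600)),
    201708110505: ("30%", absolute_expiry(now, expire_at=1513642307)),
    201708110506: ("55%", absolute_expiry(now, expire_in=60)),  # hypothetical point
}
survivors = live_points(points, now + 120)
print(survivors)
```

Two minutes later only the short-lived point has expired; the other two remain readable until their own deadlines.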
In this post, we saw why modeling time series data in vanilla Redis is complicated and not performant, whereas YugaByte DB’s TS data type allows for simple and efficient storage and retrieval of time series data. You can find the full documentation for the TS commands in our docs.
If you are interested in modeling time series data in your Redis application, we encourage you to try the TS datatype in YugaByte DB’s YEDIS API by following our Quick Start guide.