A short survey of the timeseries databases of today
Having worked with some of the significant time-series database products and having explored others from a tech enthusiast’s point of view, I thought it might be helpful to summarize what I have learned in a quick write-up. I grew fond of time-series databases when I started mucking around with InfluxDB in 2017 for an algorithmic options trading platform a colleague wanted to build. We did a couple of days’ worth of brainstorming. Nothing came of it, except that my mucking around with InfluxDB sparked my interest in time-series databases. Since then, I have kept myself up to date with the goings-on in the world of TSDBs.
I have written about time-series databases previously. I have covered their unprecedented rise in popularity and some other specialized or, as AWS likes to call them, purpose-built databases. I have also talked about specific features of some of the popular databases like InfluxDB (Influx Line Protocol), TimescaleDB (Comparison using TSBS), and QuestDB (Ingestion patterns). In this article, I will go through the most popular time-series databases and several databases that were not built to solve the time-series problem specifically but can handle it pretty well. So, without further ado, let’s dive right into it.
Scalable datastore for metrics, events, and real-time analytics.
InfluxDB is an open-source database. It is, by far, the most popular and most widely used time-series database in the world. InfluxData earns money by offering enterprise versions of this database on-prem and in the cloud. Like most other TSDBs, you can deploy InfluxDB on any of the three major cloud platforms: AWS, Google Cloud, and Azure.
InfluxDB also provides you with templates for popular use cases. One fun use case is monitoring gaming metrics for Counter-Strike.
Without a doubt, InfluxDB has been instrumental in the rapid progress of time-series databases over the last couple of years. One example is the InfluxDB line protocol, a lightweight text-based protocol for sending timeseries data to a timeseries database. Another major timeseries database, QuestDB, which also supports the PostgreSQL wire protocol, decided to support and push for the InfluxDB line protocol because of its implementation and performance.
You can read more about their journey on this blog.
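To make the line protocol mentioned above a little more concrete, here is a minimal Python sketch that formats one data point as a single line of text. The function names (`to_line_protocol` and its helpers) are made up for this illustration and are not part of any InfluxDB or QuestDB client library. The sketch only covers the basic escaping rules and field types as I understand them, so treat it as a rough approximation rather than a reference implementation.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape every character from `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build: measurement,tag=value field=value timestamp (hypothetical helper)."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys keep the output deterministic
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line_protocol(
        "cpu",
        {"host": "server01", "region": "us-west"},
        {"usage": 0.64, "cores": 4},
        1556813561098000000,
    ))
```

Each line is self-describing (measurement, tags, fields, timestamp), which is a large part of why the format is cheap to generate and to parse.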
An open-source SQL database designed to process time-series data faster
Vlad Ilyushchenko, working in the financial services industry in the early 2010s, had a good idea about the importance of low latency in the financial sector. He started QuestDB as a side project to create a superfast timeseries database that works at scale. Recently, Vlad talked about QuestDB’s architecture and exciting use cases at the Carnegie Mellon University Database Group.
QuestDB uses a columnar structure to store data. QuestDB appends any new data at the bottom of each column to preserve the natural time order of the ingested data. This database also supports relational modeling for timeseries data, which means that you can write joins and use SQL queries to read your data.
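The append-at-the-bottom idea described above can be sketched in a few lines of Python. This is a toy model, not QuestDB's storage engine: the class name `ColumnarTable` and its methods are invented for illustration, and the sketch simply rejects out-of-order rows instead of handling them the way a real database would.

```python
from bisect import bisect_left, bisect_right


class ColumnarTable:
    """Toy column store: one list per column, rows appended in time order."""

    def __init__(self, column_names):
        self.timestamps = []
        self.columns = {name: [] for name in column_names}

    def append(self, ts, **values):
        # Simplifying assumption: out-of-order rows are rejected outright.
        if self.timestamps and ts < self.timestamps[-1]:
            raise ValueError("row is older than the last appended row")
        if set(values) != set(self.columns):
            raise ValueError("row must provide a value for every column")
        self.timestamps.append(ts)
        for name, column in self.columns.items():
            column.append(values[name])  # new data goes to the bottom

    def column_between(self, name, start, end):
        """Return values of one column for start <= ts <= end."""
        # Timestamps are sorted, so binary search finds the slice bounds
        # and only the requested column is ever touched.
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return self.columns[name][lo:hi]


if __name__ == "__main__":
    trades = ColumnarTable(["price", "size"])
    trades.append(1, price=10.0, size=5)
    trades.append(2, price=10.5, size=3)
    trades.append(3, price=10.2, size=7)
    print(trades.column_between("price", 2, 3))
```

Because the data is already in time order, a time-range query becomes a binary search followed by a contiguous read of a single column.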
Like many other timeseries databases today, QuestDB is also inspired by the open-source relational database behemoth, PostgreSQL. Just as you can run most of your PostgreSQL queries in Redshift, you can run most of them in QuestDB, because it supports the Postgres wire protocol. This Postgres compatibility also means that most of the drivers you use to connect to PostgreSQL will work with QuestDB. You can find the advantages and limitations of QuestDB’s PostgreSQL compatibility here.
It’s been a while, but I wrote about using TSBS (Time Series Benchmark Suite) for . You can essentially use TSBS to compare most of the timeseries databases on the market. Here’s the GitHub repo for that.
An open-source time-series SQL database optimized for fast ingest and complex queries. It is packaged as a PostgreSQL extension.
One of the great features of PostgreSQL is its extensibility. It’s no coincidence that many amazing databases have been built on PostgreSQL. Redshift is one example; TimescaleDB is another among many. TimescaleDB is essentially an extension packaged on top of PostgreSQL that targets the specific read-write pattern of time-series data. Erik Nordström has a video talk on how TimescaleDB works and how it compares with its competitors.
TimescaleDB can be downloaded and self-hosted. It can also be hosted in the cloud on a platform of your choice via the multi-cloud management platform Aiven. I have talked about the different deployment options on AWS for many of the databases discussed here in this post:
Timeseries Databases on AWS
And the newest TSDBs on the AWS Marketplace
TimescaleDB also offers Promscale, a database backend for Prometheus to handle OpenTelemetry data efficiently. You can learn more about the finer details of how TimescaleDB differs from PostgreSQL, and about the other product offerings, on the official blog.
This is a unique one. kdb+ is a columnar time-series database that supports relational modeling and in-memory computing. It has been known in the high-tech financial trading industry for quite a few years now. kdb+ is written in a rarely used programming language called k. This language is known for its array-processing capabilities. The query language is also based on a variant of k, called q. See for yourself the esoteric nature of both these languages in this tutorial.
In all honesty, I don’t know anyone who works or has worked on kdb+, which goes to show that this is a niche product far from the mainstream. Still, you’ll find enough companies using it with less common architecture patterns, such as the one described here, which involves using Amazon FSx for Lustre with many EC2 instances, S3 buckets, etc. for kdb+’s services like Tickerplant, Realtime database, Historical database, Complex event processing (CEP), and Gateway. Similar architectures are possible on Azure, Google Cloud, and Digital Ocean.
High-speed Financial Time-series Analysis With KX and Databricks
Databricks partnered with KX, the company behind kdb+, to enable ultra-fast financial timeseries data analytics using Spark. You can use Databricks with kdb+ through several options, including PyQ, Databricks APIs, and federated queries using JDBC. The database ranking website db-engines.com ranks kdb+ second to InfluxDB in the time-series database category. Although I reasonably doubt that you’d need kdb+ in your next job, it doesn’t hurt to know what it is.
Druid isn’t necessarily only a timeseries database, but it is commonly used for superfast aggregations over time-ordered data. Hence, Druid is better identified as a time-based analytics database, and its architectural feat lies in the variety of use cases it serves. A few of those use cases overlap with timeseries databases like InfluxDB and TimescaleDB, such as network telemetry analysis and application performance analytics. Because of this overlap, Druid says that it combines a data warehouse, a time-series database, and a search system.
The Guide to Apache Druid Architectures
Druid’s core focus on being a distributed MPP system with columnar storage and both real-time and batch ingestion makes it an exciting tool for your data engineering stack. For superfast queries, Druid uses compressed bitmap indexes and time-based partitions to prune the data you don’t need. Druid’s native query language is JSON-based, similar to what you might have seen in MongoDB or Elasticsearch. Still, because everyone knows SQL and interacts with data using SQL, Druid also offers Druid SQL, a layer on top of the native query engine.
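To illustrate how time-based partitions and bitmap indexes work together, here is a small Python sketch. It is only a mental model under loud assumptions: the `Segment` class and `count_matching` function are hypothetical, plain Python integers stand in for compressed bitmaps, and none of this mirrors Druid's actual segment format or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time partition: rows plus a bitmap per (dimension, value) pair."""
    start: int  # inclusive
    end: int    # exclusive
    rows: list = field(default_factory=list)
    bitmaps: dict = field(default_factory=dict)

    def add(self, ts, **dims):
        position = len(self.rows)
        self.rows.append({"ts": ts, "dims": dims})
        for dim, value in dims.items():
            key = (dim, value)
            # Set the bit for this row in the bitmap of (dimension, value).
            self.bitmaps[key] = self.bitmaps.get(key, 0) | (1 << position)


def count_matching(segments, start, end, filters):
    """Count rows in [start, end) matching all filters.

    Returns (matching_rows, segments_scanned).
    """
    total, scanned = 0, 0
    for seg in segments:
        if seg.end <= start or seg.start >= end:
            continue  # pruned: the segment cannot contain matching rows
        scanned += 1
        mask = (1 << len(seg.rows)) - 1  # start with every row selected
        for dim, value in filters.items():
            mask &= seg.bitmaps.get((dim, value), 0)  # AND the bitmaps
        position = 0
        while mask:
            if mask & 1 and start <= seg.rows[position]["ts"] < end:
                total += 1
            mask >>= 1
            position += 1
    return total, scanned


if __name__ == "__main__":
    day1, day2 = Segment(0, 100), Segment(100, 200)
    day1.add(10, country="US", device="ios")
    day1.add(20, country="DE", device="ios")
    day2.add(110, country="US", device="android")
    day2.add(120, country="US", device="ios")
    print(count_matching([day1, day2], 100, 200, {"country": "US", "device": "ios"}))
```

The query over the second day never touches the first segment, and within the scanned segment the filter is resolved with bitwise ANDs before any row is inspected.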
Companies like Netflix, Airbnb, Salesforce, Booking, Appsflyer, Criteo, and PayPal use Druid in production. Here’s an example of a case study from Netflix’s tech blog talking about how they used Druid with Kafka to deliver insights in real-time:
How Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience
By Ben Sykes
I’d also suggest another blog post by Roman Leventov that talks about the difference between Druid, Pinot, and ClickHouse.
Timeseries databases are great for many use cases, especially when your data arrives in a natural time order by default. Because of the demand, there’s been an uptick in the number of timeseries databases popping up every now and then. This has also resulted in some cloud platforms coming up with their own timeseries databases, partly or wholly inspired by their open-source counterparts. We will see more adoption in the timeseries database domain over the next few years. Being a SQL person myself, I might be biased, but I do think that most of those timeseries databases will try to support the ANSI SQL standard. Let’s see.
Time series data is often a continuous flow of data like measurements from sensors and intraday stock prices. A time-series database lets you store large volumes of timestamped data in a format that allows fast insertion and fast retrieval to support complex analysis on that data.
What is a time series database?
A time series database (TSDB) is a software system optimized to sort and organize information measured by time. A time series is a collection of data points that are gathered at successive intervals and recorded in time order.
Why is Cassandra good for time series data?
Cassandra has good support for modelling time series data wherein each row can have a dynamic number of columns. The viewing history data write-to-read ratio is about 9:1. Since Cassandra is highly efficient with writes, this write-heavy workload is a good fit for Cassandra.
Where are time series databases used?
A time-series database (TSDB) is a database management system optimized for storing and querying data that changes over time. Time-series data is often used in monitoring applications where it's important to quickly retrieve information about a system's current state and trends and patterns over time.
Which type of plot is best used for time series data?
A line graph is probably the simplest way to visualize time series data. It uses connected points to illustrate the changes. As the independent variable, time is always presented on the horizontal axis of a line graph.
What is the main purpose of time series?
There are two main goals of time series analysis: identifying the nature of the phenomenon represented by the sequence of observations, and forecasting (predicting future values of the time series variable).
What are the 3 key characteristics of time series data?
- Seasonal and nonseasonal cycles.
- Pulses and steps.
- Secular trend, which describes the long-term movement of the series;
- Seasonal variations, which represent seasonal changes;
- Cyclical fluctuations, which correspond to periodical but not seasonal variations;
- Irregular variations, which are other nonrandom sources of variations of series.
A time series is a sequence of data points where each point is a pair: a timestamp and a numeric value. A time series database stores a separate time series for each metric, allowing you to then query and graph the values over time.
What is a time series, in short?
A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. For example, measuring the value of retail sales each month of the year would comprise a time series.
In plain language, time-series data is a dataset that tracks a sample over time and is collected regularly. Examples are commodity prices, stock prices, house prices over time, weather records, company sales data, and patient health metrics like ECG.
Why is time series data difficult?
The difficulty with time series is that it is not a binary task. If your test forecast is the same as your original data, there is a great chance that your model is overfitting your data. And how do you evaluate your forecast?
Why is time series forecasting the best?
Analysts can tell the difference between random fluctuations or outliers, and can separate genuine insights from seasonal variations. Time series analysis shows how data changes over time, and good forecasting can identify the direction in which the data is changing.
Why does Netflix use Cassandra?
Netflix uses Cassandra for its scalability, its lack of single points of failure, and its cross-regional deployments: “In effect, a single global Cassandra cluster can simultaneously service applications and asynchronously replicate data across multiple geographic locations.”
Which method is used in time series data?
AutoRegressive Integrated Moving Average (ARIMA) models are among the most widely used time series forecasting techniques. In an autoregressive model, the forecasts correspond to a linear combination of past values of the variable.
How do you explain a time series plot?
A time series chart, also called a time series graph or time series plot, is a data visualization tool that illustrates data points at successive intervals of time. Each point on the chart corresponds to both a time and a quantity that is being measured.
What are the limitations of time series?
Time series analysis also suffers from a number of weaknesses, including problems with generalization from a single study, difficulty in obtaining appropriate measures, and problems with accurately identifying the correct model to represent the data.
What are the types of time series?
The three main types of time series models are moving average, exponential smoothing, and ARIMA. The crucial thing is to choose the right forecasting method as per the characteristics of the time series data.
How much data does a time series have?
For most time series applications, this means that the submitted data should have as many observations as the period of the maximum expected seasonality. For example, if you have daily sales data and you expect that it exhibits annual seasonality, you should have more than 365 data points to train a successful model.
What format is time series data?
Time series data usually has one of two formats: wide format or long format.
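As a rough illustration of the two formats: in wide format each row holds one timestamp with one column per metric, while in long format each row holds a single observation of one metric at one timestamp. The Python sketch below converts between the two. The function names and the `metric` and `value` keys are assumptions chosen for this example, not a standard.

```python
def wide_to_long(rows, time_key="timestamp"):
    """Turn one row per timestamp into one row per (timestamp, metric)."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name != time_key:
                long_rows.append(
                    {time_key: row[time_key], "metric": name, "value": value}
                )
    return long_rows


def long_to_wide(rows, time_key="timestamp"):
    """Group (timestamp, metric, value) rows back into one row per timestamp."""
    by_time = {}
    for row in rows:
        wide_row = by_time.setdefault(row[time_key], {time_key: row[time_key]})
        wide_row[row["metric"]] = row["value"]
    return [by_time[ts] for ts in sorted(by_time)]


if __name__ == "__main__":
    wide = [
        {"timestamp": 1, "temperature": 21.5, "humidity": 40},
        {"timestamp": 2, "temperature": 21.7, "humidity": 41},
    ]
    for row in wide_to_long(wide):
        print(row)
```

Long format tends to suit databases that store one series per metric, while wide format is often handier for analysis across several metrics at the same timestamp.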
By a time series plot, we simply mean that the variable is plotted against time. Some features of the plot: There is no consistent trend (upward or downward) over the entire time span. The series appears to slowly wander up and down.
What is a time series pattern?
Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
What is the minimum sample size for time series analysis?
“40 observations is often mentioned as the minimum number of observations for a time-series analysis” (Poole et al., 2002).
How can time series data be improved?
Do automatic data preparation: correlate the data sets and look for anomalies that can affect the forecast. Train the machine learning models: use a variety of algorithms to derive a forecast, test for accuracy, and select the best models for the use case.
Is time series easy?
It is unfortunate that intro to data science is often split into regression / classification / clustering. There should be a fourth major category: time series. Time series are difficult to model. They are a completely different beast from the typical flat, tabular data.
How do you evaluate a time series?
Steps for validating the time-series model
Compare the predictions of your model against actual data. Use rolling windows to test how well the model performs on data that is one step or several steps ahead of the current time point. Compare the predictions of your model against those made by a human expert.
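The rolling-window idea above can be sketched as follows: train on everything up to a cut-off, forecast one or several steps ahead, compare with the actual value, then move the cut-off forward by one point. The function names and the choice of mean absolute error are assumptions made for this illustration, and the naive last-value forecaster is only a stand-in for whatever model you actually want to validate.

```python
def naive_forecast(history, horizon):
    """Baseline model: predict that the last observed value persists."""
    return history[-1]


def rolling_origin_mae(series, forecast, min_train=3, horizon=1):
    """Mean absolute error of `forecast` over a rolling forecast origin."""
    errors = []
    for split in range(min_train, len(series) - horizon + 1):
        train = series[:split]                  # data known at forecast time
        actual = series[split + horizon - 1]    # value `horizon` steps ahead
        errors.append(abs(forecast(train, horizon) - actual))
    if not errors:
        raise ValueError("series is too short for the requested settings")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    sales = [10, 12, 11, 13, 12, 14]
    print(rolling_origin_mae(sales, naive_forecast, min_train=3, horizon=1))
    print(rolling_origin_mae(sales, naive_forecast, min_train=3, horizon=2))
```

Unlike a random train/test split, this never lets the model see data from after the point it is asked to predict.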
Time series forecasting is a technique for predicting future events by analyzing past trends, based on the assumption that future trends will hold similar to historical trends. Forecasting involves using models fit on historical data to predict future values.
Does Facebook use Cassandra?
Inbox Search was launched in June of 2008 for around 100 million users and today we are at over 250 million users and Cassandra has kept up the promise so far. Cassandra is now deployed as the backend storage system for multiple services within Facebook.
What language is Cassandra written in?
Cassandra is an open-source, distributed, wide-column NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is written in Java and developed by the Apache Software Foundation.
Does Netflix use Kafka?
Netflix uses Apache Kafka throughout the organization. Consequently, a bridge converts MQTT messages to Kafka records.
Time Series With SQL
Working with a time series dataset can be conducive to your SQL learning for many reasons. Time series data, by nature, store records that are not independent of each other. Analyzing such data will require conducting more complex calculations between columns and between rows.
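As one example of a calculation between rows, the sketch below computes the day-over-day change in a closing price with the `LAG` window function. It uses Python's built-in `sqlite3` module with an in-memory database purely for convenience; the `prices` table and its columns are made up for this illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3


def daily_changes(rows):
    """Return (day, close, change_from_previous_day) tuples ordered by day."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE prices (day TEXT PRIMARY KEY, close REAL)")
        conn.executemany("INSERT INTO prices VALUES (?, ?)", rows)
        query = """
            SELECT day,
                   close,
                   close - LAG(close) OVER (ORDER BY day) AS change
            FROM prices
            ORDER BY day
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("2024-01-01", 100.0),
        ("2024-01-02", 102.5),
        ("2024-01-03", 101.0),
    ]
    for row in daily_changes(sample):
        print(row)
```

The first row has no previous day, so its change comes back as `NULL` (`None` in Python).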
Time Series Data in MongoDB
MongoDB is a document-based general purpose database with flexible schema design and a rich query language. As of MongoDB 5.0, MongoDB natively supports time series data. You can create a new time series collection with the createCollection() command.
MySQL and a number of its variants can be used as a time-series database. Using the MySQL example employees database we are going to provide a list of time-series analyses that we want performed, giving you a chance to try writing the SQL yourself.
What is the best time series model?
AutoRegressive Integrated Moving Average (ARIMA) models are among the most widely used time series forecasting techniques. In an autoregressive model, the forecasts correspond to a linear combination of past values of the variable.
What is time series good for?
Time series analysis helps organizations understand the underlying causes of trends or systemic patterns over time. Using data visualizations, business users can see seasonal trends and dig deeper into why these trends occur. With modern analytics platforms, these visualizations can go far beyond line graphs.
Why is a real-time database important?
Real-time customer analytics are essential for improving experiences across marketing touchpoints. They can also ensure marketers serve the right information to the right customer at the right time.
Which database has the best performance?
- Oracle. Oracle is the most widely used commercial relational database management system, built in languages such as C, C++, and Java. ...
- MySQL. ...
- MS SQL Server. ...
- PostgreSQL. ...
- MongoDB. ...
- IBM DB2. ...
- Redis. ...
A real-time database is a database system which uses real-time processing to handle workloads whose state is constantly changing. This differs from traditional databases containing persistent data, mostly unaffected by time. For example, a stock market changes very rapidly and is dynamic.
What are the 3 components of time series?
An observed time series can be decomposed into three components: the trend (long term direction), the seasonal (systematic, calendar related movements) and the irregular (unsystematic, short term fluctuations).
Is Snowflake a time-series database?
From a data management perspective, Snowflake can serve as the central time-series data repository or data can be loaded in inexpensive cloud storage from Amazon S3, Azure Data Lake Storage or Google Cloud storage.
Time series databases ingest data and queries faster, and compress data more strongly. As a result, they are ideal for processing massive volumes of real-time data that can be used to improve the safety of self-driving vehicles.