TSDBs at Scale – Part Two

This is the second half of a two-part series focusing on the challenges of Time Series Databases (TSDBs) at scale. This half focuses on the challenges of balancing read vs. write performance, data aggregation, large dataset analysis, and operational complexity in TSDBs.

Balancing Read vs. Write Performance

Time series databases are tasked with ingesting concurrent metric streams, often in large volumes. This data ultimately needs to be persisted to permanent storage, where it can later be retrieved. While incoming data may be temporarily aggregated in memory during ingestion, certain workloads require either write queuing or a high-speed storage layer to keep up with high inbound data volumes.

Data structures such as Adaptive Radix Trees (ARTs) and Log-Structured Merge-trees (LSMs) provide a good starting point for in-memory and memory/disk indexed data stores. However, the requirement to quickly persist large volumes of data presents a conundrum of read/write asymmetry. The greater the capacity to ingest and store metrics, the larger the volume of data available for analysis, creating challenges for read-based data analysis and visualization.

Analyzing time-series data reveals an inherent constraint: data must be read back at a rate orders of magnitude higher than the rate at which it was ingested. For example, supporting visualization and analysis may require retrieving a week’s worth of time-series data within a single second.

Read/write asymmetry of analyzing one week’s worth of time-series telemetry data

How do you scale reads for large amounts of data in a non-volatile storage medium? The typical solution to this asymmetry is data aggregation — reducing the requisite volume of read data while simultaneously attempting to maintain its fidelity.

Data Aggregation

Data aggregation is a crucial component of performant reads. Many TSDBs define downsampling aggregation policies, storing distinct sampling resolutions for various retention periods. These aggregations can be asynchronously applied during ingestion, allowing for write performance optimizations. Recently ingested data is often left close to sample resolution, as it loses value when aggregated or downsampled.

These aggregations are accomplished by applying an aggregation function over data spanning a time interval. Averaging is the most common aggregation function used, but certain TSDBs such as IRONdb and OpenTSDB provide the ability to implement other aggregation functions such as max(), sum(), or histogram merges. The table below lists some well-known TSDBs and the aggregation methods they use.

TSDB/Monitoring Platform Aggregation Method
IRONdb Automatic rollups of raw data after configurable time range
DalmatinerDB DQL aggregation via query clause
Graphite (default without IRONdb) In memory rollups via carbon aggregator
InfluxDB InfluxQL user defined queries run periodically within a database, GROUP BY time()
OpenTSDB Batch processed, queued TSDs, or stream based aggregation via Spark (or other)
Riak User defined SQL type aggregations
TimescaleDB SQL based aggregation functions
M3DB User defined rollup rules applied at ingestion time, data resolution per time interval

As mentioned in part one, histograms are useful in improving storage efficiency, as are other approximation approaches. These techniques often provide significant read performance optimizations as well, and IRONdb uses log linear histograms for exactly this purpose. Log linear histograms allow one to store large volumes of numeric data which can be statistically analyzed with a quantifiable error rate: roughly 0-5% at the bottom of each power of ten, and 0-0.5% at the top.
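To make those error bounds concrete, here is a minimal sketch of log linear binning. The bin layout and the llhist_bin helper are assumptions for illustration, not IRONdb's actual implementation:

import math

def llhist_bin(value):
    # Split each power of ten into 90 linear bins of width 10^exponent / 10,
    # e.g. [1.0, 1.1), [1.1, 1.2), ... [9.9, 10.0), scaled by the exponent.
    exponent = math.floor(math.log10(value))
    mantissa = value / (10 ** exponent)          # in [1.0, 10.0)
    bin_low = math.floor(mantissa * 10) / 10
    return bin_low * 10 ** exponent, (bin_low + 0.1) * 10 ** exponent

low, high = llhist_bin(1.04)       # bottom of a decade: bin [1.0, 1.1)
print((high - low) / low)          # ~0.10 -- a ~10% wide bin, at most ~5% error from its midpoint

low, high = llhist_bin(9.97)       # top of a decade: bin [9.9, 10.0)
print((high - low) / low)          # ~0.01 -- a ~1% wide bin, at most ~0.5% error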

Approximate histograms such as t-digest are storage efficient, but can exhibit error rates upwards of 50% for median quantile calculations. Uber’s M3DB uses Bloom filters to speed data access; they exhibit single-digit false positive error rates for large data sets in return for storage efficiency advantages. Efficiency versus accuracy tradeoffs should be understood when choosing an approximation-based aggregation implementation.

Default aggregation policies, broken down by type, within IRONdb

It is important to note a crucial trade-off of data aggregation: spike erosion. Spike erosion is a phenomenon in which visualizations of data aggregated over wide intervals display lower maximums than the underlying samples actually reached. It occurs when averages are used as the aggregation function, which is the case for most TSDBs. Using histograms as the data source guards against spike erosion by allowing a max() aggregation function to be applied per interval. However, a histogram is significantly larger to store than a rollup, so that accuracy comes at a cost.
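A toy demonstration of spike erosion (the numbers are made up purely for illustration):

# One hour of per-minute latency samples containing a single spike.
samples = [100.0] * 60
samples[17] = 9500.0                 # a one-minute spike

avg_rollup = sum(samples) / len(samples)
max_rollup = max(samples)

print(avg_rollup)   # ~256.7 -- the spike has eroded into the hourly average
print(max_rollup)   # 9500.0 -- a max() aggregation over the interval preserves it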

Analysis of Large Datasets

One of the biggest challenges with analyzing epic data sets is that moving or copying data for analysis becomes impractical due to the sheer mass of the dataset. Waiting days or weeks for these operations to complete is incompatible with the needs of today’s data scientists.

The platform must not only handle large volumes of data, but also provide tools to perform internally consistent statistical analyses. Workarounds won’t suffice here, as cheap tricks such as averaging percentiles produce mathematically incorrect results. Meeting this requirement means performing computations across raw data, or rollups that have not suffered a loss of fidelity from averaging calculations.
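To see why averaging percentiles goes wrong, consider a quick sketch (synthetic numbers, purely for illustration):

import statistics

# Two nodes report the same latency metric with very different distributions.
node_a = [10] * 99 + [500]     # mostly fast, one slow request
node_b = [100] * 100           # uniformly slower

p99_a = statistics.quantiles(node_a, n=100)[98]
p99_b = statistics.quantiles(node_b, n=100)[98]

# The cheap trick: average the per-node 99th percentiles.
print((p99_a + p99_b) / 2)                               # ~297.6

# The mathematically correct answer: a percentile over the merged data.
print(statistics.quantiles(node_a + node_b, n=100)[98])  # 100.0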

Additionally, a human-readable interface is required, affording users the ability to query and introspect their datasets in arbitrary ways. Many TSDBs use the “in place” query approach. Since data cannot be easily moved at scale, you have to bring the computation to the data.

PromQL, from Prometheus, is one example of such a query language. IRONdb, on the other hand, uses the Circonus Analytics Query Language (CAQL), which affords custom user-definable functions via Lua.

Anyone who has worked with relational databases and non-procedural languages has experienced the benefits of this “in place” approach. It is much more performant to delegate analytics workloads to resources which are computationally closer to the data. Sending gigabytes of data over the wire for transformation is grossly inefficient when it can be reduced at the source.

Operational Complexity

Operational complexity is not necessarily a “hard problem,” but is sadly an often ignored one. Many TSDBs will eventually come close to the technical limits imposed by information theory. The primary differentiator then becomes efficiency and overall complexity of operation.

In an optimal operational scenario, a TSDB could automatically scale up and down as additional storage or compute resources are needed. Of course, this type of idealized infrastructure is only present in trade-show marketing literature. In the real world, operators are needed, and generally some level of specialized knowledge is required to keep the infrastructure running properly.

Let’s take a quick look at what’s involved in scaling out some of the more common TSDBs in the market:

TSDB/Monitoring Platform Scaling Approach
IRONdb Generate new topology config, kick off
DalmatinerDB Rebalance via REST call
Graphite (default without IRONdb) Manual, file based. Add HAProxy or other stateless load balancer
InfluxDB Configure additional name and/or data nodes
OpenTSDB Expand your HBASE cluster
Riak RiakTS cluster tools
TimescaleDB Write clustering in development
M3DB M3DB Docs

There are other notable aspects of operational complexity. For example, what data protection mechanisms are in-place?

For most distributed TSDBs, retaining a fully active replica in a second availability zone is sufficient. When that’s not enough (or if you don’t have an online backup), ZFS snapshots offer another solution. There are, unfortunately, few other alternatives to consider. Typical data volumes are often large enough that snapshots and redundant availability zones are the only practical options.

A key ingestion performance visualization for IRONdb, a PUT latency histogram, shown as it appears within IRONdb’s management console

Another important consideration is observability of the system, especially for distributed TSDBs. Each of the previously mentioned options conveniently expose some form of performance metrics, providing a way through which one may monitor the health of the system. IRONdb is no exception, offering a wealth of performance metrics and associated visualizations such that one can easily operate and monitor it.

Conclusion

There are a number of factors to consider when either building your own TSDB, or choosing an open-source or commercial option. It’s important to remember that your needs may differ from those of Very Large Companies. These companies often have significant engineering and operations resources to support the creation of their own bespoke implementations. However, these same companies often have niche requirements that prevent them from using some of the readily available options in the market, requirements that smaller companies simply don’t have.

If you have questions about this article, or Time Series Databases in general, feel free to join our Slack channel and ask us. To see what we’ve been working on that inspired this article, feel free to have a look here.

TSDBs at Scale – Part One

This two-part series examines the technical challenges of Time Series Databases (TSDBs) at scale. Like relational databases, TSDBs face certain technical challenges at scale, especially when the dataset becomes too large to host on a single server. In such a case you usually end up with a distributed system, and distributed systems are subject to many challenges, including those described by the CAP theorem. This first half focuses on the challenges of data storage and data safety in TSDBs.

What is a Time Series Database?

Time series databases are databases designed to track events, or “metrics,” over time. The three most often used types of metrics (counters, gauges, and distributions) were popularized in 2011 with the advent of statsd by Etsy.


Time Series graph of API response latency median, 90th, and 99th percentiles.

The design goals of a TSDB are different from those of an RDBMS (relational database management system). RDBMSs organize data into tables of columns and rows, with unique identifiers linking together rows in different tables. In contrast, TSDB data is indexed primarily by event timestamps. As a result, RDBMSs are generally ill-suited for storing time series data, just as TSDBs are ill-suited for storing relational data.

TSDB design has been driven in large part by industry, as opposed to academic theory. Database design has traditionally flowed the other way; PostgreSQL, for example, evolved from academia to industry, having started as Stonebraker’s Ingres and then Postgres (“post-Ingres”). But sometimes ideas flow from industry back to academia: Facebook’s Gorilla is an example of practitioner-driven development that has become noted in academia.

Data Storage

Modern TSDBs must be able to store hundreds of millions of distinct metrics. Operational telemetry from distributed systems can generate thousands of samples per host, thousands of times per second, for thousands of hosts. Historically successful storage architectures now face challenges of scale orders of magnitude higher than they were designed for.


How do you store massive volumes of high-precision time-series data such that you can run analysis on them after the fact? Storing averages or rollups usually won’t cut it, as you lose the fidelity required to do a mathematically correct analysis.

From 2009 to 2012, Circonus stored time series data in PostgreSQL. We encountered a number of operational challenges, the biggest of which was data storage volume. A meager 10K concurrent metric streams, sampling once every 60 seconds, generates over 5 billion data points per year, and redundantly storing those streams required approximately 500GB of disk space. This solution scaled, but the operational overhead for rollups and associated operations was significant. Storage nodes would go down, master/slave swaps took time, and promoting a read-only copy to write master could take hours. That is an unexpectedly high amount of overhead, especially if you want to reduce or eliminate downtime, all to handle “just” 10K concurrent metric streams.
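As a quick back-of-the-envelope check on those figures (the bytes-per-point number is an inference from the figures above, not a measured breakdown):

streams = 10_000                                # concurrent metric streams
samples_per_year = streams * 60 * 24 * 365      # one sample per minute, per stream
print(samples_per_year)                         # 5,256,000,000 -- about 5 billion points

# The ~500GB quoted above then works out to roughly 95 bytes of disk per
# point once replication, indexes, and row overhead are included.
print(500e9 / samples_per_year)                 # ~95 bytes per point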

While PostgreSQL is able to meet these challenges, the implementation ends up being cost prohibitive to operate at scale. Circonus evaluated other time series storage solutions as well, but none sufficiently addressed the challenges listed above.

So how do we store trillions of metric samples?

TSDBs typically store numeric data (i.e. gauge data) for a given metric as a set of aggregates (average, 50th / 95th / 99th percentiles) over a short time interval, say 10 seconds. This approach can generally yield a footprint of three to four bytes per sample. While this may seem like an efficient use of space, because aggregates are being stored, only the average value can be rolled up to longer time intervals. Percentiles cannot be rolled up except in the rare cases where the distributions which generated them are identical. So 75% of that storage footprint is only usable within the original sampling interval, and unusable for analytics purposes outside that interval.

A more practical, yet still storage-efficient, approach is to store any high frequency numeric data as histograms. In particular, log linear histograms, which store a count of data points for every tenth power in approximately 100 bins. A histogram bin can be expressed with 1 byte for the sample value, 1 byte for the bin exponent (factor of 10), and 8 bytes for the bin sample count. At 10 bytes per sample, this approach may initially seem less efficient, but because histograms are mergeable, samples from multiple sources can be merged together to obtain roughly an N-fold increase in efficiency, where N is the number of sample sources. Log linear histograms also offer the ability to calculate quantiles over arbitrary sample intervals, providing a significant advantage over storing aggregates.
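Mergeability is what makes this pay off. The sketch below uses a toy in-memory structure keyed by (bin value, exponent), not IRONdb’s actual on-disk format, to show how histograms from two hosts combine and how an approximate quantile falls out:

from collections import Counter

def merge(*hists):
    # Merging histograms is just summing per-bin counts,
    # regardless of which host produced them.
    merged = Counter()
    for h in hists:
        merged.update(h)
    return merged

def quantile(hist, q):
    # Walk the bins in value order until the cumulative count
    # crosses q, then return that bin's lower bound.
    total = sum(hist.values())
    seen = 0
    for (bin_low, exp), count in sorted(hist.items(), key=lambda kv: kv[0][0] * 10 ** kv[0][1]):
        seen += count
        if seen >= q * total:
            return bin_low * 10 ** exp

# Latency histograms (in seconds) from two hypothetical hosts.
host_a = Counter({(1.2, -3): 900, (2.5, -3): 90, (1.1, -1): 10})
host_b = Counter({(1.3, -3): 800, (4.0, -3): 190, (2.0, -2): 10})
print(quantile(merge(host_a, host_b), 0.99))    # 0.004 -- p99 across both hosts

Because the merge is just per-bin addition, the quantile can be computed over any combination of sources and any time range without re-reading the raw samples.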

Data is rolled up dynamically, generally at 1 minute, 30 minute, and 12 hour intervals, lending itself to analysis and visualization. These intervals are configurable, and can be changed by the operator to provide optimal results for their particular workloads. These rollups are a necessary evil in a high volume environment, largely due to the read vs write performance challenges when handling time series data. We’ll discuss this further in part 2.

Data Safety

Data safety is a problem common to all distributed systems. How do you deal with failures in a distributed system? How do you deal with data that arrives out of order? If you put the data in, will you be able to get it back out?

All computer systems can fail, but distributed systems tend to fail in new and spectacular ways. Kyle Kingsbury’s series “Call Me Maybe” is a great educational read on some of the more entertaining ways that distributed databases can fail.

Before developing IRONdb in 2011, Circonus examined the technical considerations of data safety, and we quickly realized that the guarantees and constraints of ACID databases can be relaxed. You’re not doing left outer joins across a dozen tables; that’s a solution for a different class of problem.

Can a solution scale to hundreds of millions of metrics per second WITHOUT sacrificing data safety? Some of the existing solutions could scale to that volume, but they sacrificed data safety in return.

Kyle Kingsbury’s series provides hours of reading detailing the various failure modes experienced when data safety was lost or abandoned.

IRONdb uses OpenZFS as its filesystem. OpenZFS is the open source successor to the ZFS filesystem, which was developed by Sun Microsystems for the Solaris operating system. OpenZFS compares the checksums of incoming writes to checksums of the existing on-disk data and avoids issuing any write I/O for data that has not changed. It supports on-the-fly LZ4 compression of all user data. It also provides repair on reads, and the ability to scrub the filesystem and repair bit rot.

To make sure that whatever data is written to the system can be read back out, IRONdb writes data to multiple nodes. The number of nodes depends on the replication factor, which is 2 in this example diagram. On each node, data is mirrored to multiple disks for redundancy.

Data is written to multiple nodes, with multiple disks per node and a replication factor of 2. With cross data center replication, no single node can be on both sides.

The CAP theorem says that because all distributed systems experience network partitions, they must sacrifice either availability or consistency. The effects of the CAP theorem can be minimized, as shown in Google’s Cloud Spanner blog post by Eric Brewer (the creator of the CAP theorem). But in systems that provide high availability, it’s inevitable that some of the data will arrive out of order.

Dealing with data that arrives out of order is a difficult problem in computing. Most systems attempt to resolve this problem through the use of consensus algorithms such as Raft or Paxos. Managing consistency is a HARD problem. A system can give up availability in favor of consistency, but this ultimately sacrifices scalability. The alternative is to give up consistency and present possibly inconsistent data to the end user. Again, see Kyle Kingsbury’s series above for examples of the many ways that goes wrong.

IRONdb tries to avoid these issues altogether through the use of commutative operations. This means that the final result recorded by the system is independent of the order in which operations are applied. Commutative operations can be expressed mathematically as f(A,B) = f(B,A). This attribute separates IRONdb from pretty much every other TSDB.
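As a toy illustration of why commutativity sidesteps ordering problems (a sketch, not IRONdb’s actual write path; keeping one value per timestamp and merging duplicates with max() is just one possible commutative rule):

from functools import reduce

def apply(state, write):
    # Keep one value per timestamp; duplicate timestamps merge via max(),
    # so applying writes in any order yields the same final state.
    ts, value = write
    state = dict(state)
    state[ts] = max(value, state.get(ts, float("-inf")))
    return state

writes = [(1000, 42.0), (1010, 43.5), (1000, 41.0), (1020, 40.2)]

forward = reduce(apply, writes, {})
backward = reduce(apply, list(reversed(writes)), {})
print(forward == backward)   # True: replicas that saw different orderings converge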

The use of commutative operations provides the core of IRONdb’s data safety, and avoids the most complicated part of developing and operating a distributed system. The net result is improved reliability and operational simplicity, allowing operators to focus on producing business value as opposed to rote system administration.

There are a lot of other ways to try and solve the consistency problem. All have weaknesses and are operationally expensive. Here’s a comparison of known consensus algorithms used by other TSDBs:

TSDB/Monitoring Platform Solution to Consistency Problem
IRONdb Avoids problem via commutative operations
DalmatinerDB Multi-Paxos via Riak core
DataDog Unknown, rumored to use Cassandra (Paxos)
Graphite (default without IRONdb) None
InfluxDB Raft
OpenTSDB HDFS/HBase (Zookeeper’s Zab)
Riak Multi-Paxos
TimescaleDB Unknown, PostgreSQL based

That concludes part 1 of our series on solving the technical challenges of TSDBs at scale. We’ve covered data storage and data safety. Next time we’ll describe more of the technical challenges inherent in distributed systems and how we address them in IRONdb. Check back for part 2, when we’ll cover balancing read vs. write performance, handling sophisticated analysis of large datasets, and operational simplicity.

High Volume Ingest

Like it or not, high-volume time series databases face one relentless challenge: ingestion. Unlike many other databases, TSDBs tend to record things that have already happened. Records are inserted with “past time stamps” (even if they’re often just a few milliseconds old) because the event has already happened. This has a consequence that is particularly hostile: being down, slow, or otherwise unavailable doesn’t mean the event didn’t happen… it still happened, it’s not going to disappear, and when you become available again, it will show up and the database will have to ingest it. Basically, any reduction in availability only serves to increase future load. It is literally an architecture in which you cannot afford to have unavailability. These constraints also require TSDBs to be operated with reasonable, sufficient capacity to accommodate expected events; if you do go offline, you need to be able to “catch-up quick” when service is restored.

Ingestion speed and system resiliency are dual targets for any production TSDB. IRONdb handles resiliency as a core competence through careful distributed system design and battle hardened implementation work that has seen hostile networks, systems, and even operators; but that’s not what we’ll focus on here… This is an article about speed.

First, let’s start with some assumptions. These might be controversial, but they shouldn’t be, since they’ve been assumed and shown correct again and again in large production installs.

  • TSDBs have high write volumes and ultimately store a lot of data (millions of new measurements per second and tens to hundreds of terabytes of data over time).
  • At these write volumes and storage sizes, bit error rate (BER) is a certainty, not a likelihood. Bit errors in a piece of data can cause wrong answers, but errors in metadata can result in catastrophic data loss. Techniques to get data onto disk and maintain its integrity as it rests there are paramount. One must store their data on ZFS or otherwise employ the data safety techniques it does in some ancillary way. To be clear, you don’t specifically need ZFS; it’s just a reference implementation of safety techniques that happens to be free.
  • The data arriving at TSDBs tends to be “recent”. This is to say that you are unlikely to get one day of data for each event source: A0,A1,A2,A3,B0,B1,B2,B3,C0,C1,C2,C3. We call this pattern long-striped. You are instead likely to get all data from all event sources for a short time period, then get them again: A0,B0,C0,A1,B1,C1,A2,B2,C2,A3,B3,C3. We call this pattern short-striped.
  • When data is read from TSDBs, the vast majority of operations are sequential per event source: requesting A0,…,An.

These assumptions lead to a well understood, albeit poorly solved, challenge of choosing expensive writes or choosing expensive reads. Given that TSDBs never have a hiatus on their write demands, it is clear that it would be nonsensical to sacrifice anything consequential on the write performance side.

There are a variety of database techniques that can be used and to say that we use only one within IRONdb would be highly disingenuous. The ingestion pipeline is layered to apply time-based partitioning, on-disk format changes optimizing for different workloads, and pipelines of deferred work. That said, the data that arrives “right now” tends to follow a fairly simple path: get it to disk fast, cheap, and accommodate our other layers as efficiently as possible. Today, we use RocksDB for this early stage pipeline.

RocksDB is itself a complicated piece of software, providing its own layers and pipelines. Data put into RocksDB is first written to a write-ahead-log (WAL) unordered, but kept in memory in efficient ordered data structures. Once that in-memory structure gets “too large”, the ordered data is written out to disk and the WAL checkpointed. This leaves us with a whole slew of files on disk that have sorted data in them, but data across the files isn’t sorted (so we need to look in each for the data). As the number of those files becomes too great, we merge sort them into a “level” of larger sorted files… and so on for several levels. As the system runs at high volume, most of the data finds itself in a very large sorted layout. This has the effect (at a cost) of slowly churning data from short-striped to long-striped.
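A deliberately tiny sketch of that memtable-and-sorted-runs flow (illustrative only; the real RocksDB machinery, with its WAL, bloom filters, and leveled compaction, is far more involved):

import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}     # recent writes, arriving in any order
        self.runs = []         # sorted runs flushed to "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # RocksDB would also append to the write-ahead log here.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted run and start fresh.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable, then each sorted run, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge-sort all runs into one big sorted run, churning
        # short-striped data toward a long-striped layout.
        merged = {}
        for run in self.runs:          # oldest first, newer values win
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]

db = TinyLSM()
for key, value in [("A0", 1.0), ("B0", 2.0), ("C0", 3.0), ("A1", 4.0), ("B1", 5.0)]:
    db.put(key, value)
db.compact()
print(db.get("A1"), db.get("B1"))    # 4.0 5.0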

Why is long-striped data better? For long-term storage and servicing read queries, all of the data you tend to read in a single query finds itself “together” on disk. You might say that with SSDs (and even more so with NVMe) random reads are so cheap that you don’t need your data stored sequentially to be performant. You’re not as right as you might think. There are two things at play here. First, at very, very large storage volumes, spinning disks still make sense, so ignoring them completely would effectively sacrifice economic efficiencies by design in the software! Second, disks, HBAs, and filesystems still work together to get you data, and they use sectors, command queues, and blocks/records respectively. Even in the case of NVMe, where the HBA is largely erased from the picture, in order to fetch data the filesystem must pull the relevant blocks, and those blocks are backed by some set of assigned sectors on the underlying storage. Those things need to be big enough to be useful units of work (as we only get so many of those retrievals per second). Typically today, storage sector sizes are 4 kilobytes and filesystem block sizes range from 8 kilobytes to 128 kilobytes.

If I ask for some sequence of data, say A0,… An, I’d like that data to live together in large enough chunks (long enough stripes) that it fills a block on the filesystem.

As a part of our transformation into dense, read-optimized formatting for long-term storage, we need to read “all the data” and convert it. We used to do this with a simple set of nested for loops:

1: Foreach eligible partition:
    2: Foreach stream:
        3: Foreach measurement:
            Collect and transform
        Write range

A challenge with this is that the measurements (each stripe) might not be stored in the same order that they are in our stream iteration. This would cause us to read segments of data (stripes) in for loop 3 that would not be directly sequential to one another. This causes random read I/O which on spinning storage systems exhibits radically poor performance when compared to sequential read I/O. On such mechanical media, you need to wait for the heads to relocate, which induces latency (a “seek penalty”) that is not present on non-mechanical media such as SSDs and NVMe. We changed it thusly:

1: Foreach eligible partition:
    2: Foreach measurement:
        Cache lookup stream
        If stream different: write range
        Collect and transform

This changes the random I/O to the stream lookup (which is a small enough dataset to be cached aggressively) resulting in almost exclusively in-memory operations (or no I/O at all).

The benefits here should be obvious for spinning media. When we start reading the measurements, block reads and read-ahead are always put to full use; no I/O is wasted. The interesting part is the efficiencies gained on SSD, NVMe, and other “random I/O friendly” storage media.

The issue is that RocksDB doesn’t pretend to know the underlying filesystem and block device layers, and pays little attention to the unit sizes of those underlying operations. Assuming that we have a filesystem block size of 8k, we need to read at least that much data at a time. If I read a stripe of data out of RocksDB, the beginning of that data (and the end) are very unlikely to align with the filesystem block boundary. This means that when I issue the read, I’m pulling data with that read that isn’t relevant to the range request. If I read a very large amount of data randomly this way, the read amplification can be significant. In our setup, a day’s worth of data can have up to 17kb in a stripe, which dictates 7k of unrelated read data. That manifests as 1.4x read amplification; for every 10 bytes of data we request, we actually pull 14 bytes of data off disk. This obviously varies with the stripe length, but in bad scenarios where you have short striping (say 32 bytes or less of data in a stripe within an 8k filesystem block) you can have colossal read amplification in the range of 256x. This read amplification is compounded if your filesystem is configured to optimistically “read-ahead” subsequent blocks “just in case” they are going to be read; this is common on rotating media.
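The arithmetic behind those numbers looks roughly like this (a back-of-envelope sketch assuming 8k blocks and block-aligned stripes):

import math

BLOCK = 8 * 1024    # assumed filesystem block size

def read_amplification(stripe_bytes, block=BLOCK):
    # Bytes pulled off disk versus bytes actually requested.
    bytes_read = math.ceil(stripe_bytes / block) * block
    return bytes_read / stripe_bytes

print(read_amplification(17 * 1024))   # ~1.4x for a 17kb stripe (24kb read for 17kb of data)
print(read_amplification(32))          # 256x for a 32 byte stripe (a full 8kb block is read)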

It turns out that if you can long-stripe the data for reads and your reads have a large stride, you pull many sequential blocks. While SSDs and other non-mechanical media have enormous advantages for randomized reads, ignoring read amplification can be a performance killer. In tuning our system for optimal performance on mechanical storage (by driving operations sequentially and read amplification to near 1x), we also gained significantly on SSD and NVMe storage technology.

Introducing IRONdb

Software is eating the world. Devices that run that software are ubiquitous and multiplying rapidly. Without adequate monitoring on these services, operators are mostly flying blind, either relying on customers to report issues or manually jumping on boxes and spot checking. Centralized collection of telemetry data is becoming even more important than it ever was. It is becoming significantly harder to monitor the volume of telemetry generated by the sheer number of devices and the increasingly elastic architecture of modern infrastructure. On top of this, your telemetry data store needs to be extremely reliable and performant in today’s world. The last thing you need while researching and diagnosing a production issue is for the stethoscope into your system to break down or introduce debilitating delays to diagnosing the problem. In a lot of ways, it is the most important piece of infrastructure you can run.

Circonus has been delivering a Time Series Database (TSDB) to our on-prem customers for many years, specifically architected to be reliable, scalable, and fast. Many of these on-prem customers run large installations with millions of time-series and Circonus’s TSDB has been meeting these demands for years. What has been missing up until today was the ability to ingest other sources of data, and to interoperate with other data collection tools and monitoring systems already in place.

Today, starting with Graphite, that is no longer a barrier.

Overview

IRONdb is Circonus’s internally developed TSDB, now with extensions to support the Graphite data format on ingestion and with interoperability with Graphite-web (and Grafana by extension). It is a drop-in solution for organizations struggling to scale Graphite or frustrated with maintaining a high availability metrics infrastructure during surges and outages and even routine maintenance. Highly scalable and robust, IRONdb has support for replication and clustering for redundancy of data, is multi-data center capable, and comes with a full suite of administration tools.

Features

  • Replication – Don’t lose data during routine maintenance or outages.
  • Multi-DC Capabilities – Don’t lose data if Amazon has an outage in us-east-1.
  • Performance – It’s fast to write and equally fast to read.
  • Data Robustness and compression – Keep more data and don’t worry about corruption.
  • Administration tools – View system health and show op latencies.
  • Graphite-web interoperability – Plug it right into your existing tooling.

Replication

One of the major issues with data management in Graphite and other time series databases is the management of nodes. If you lose a node, or have to take it down intentionally for maintenance, what happens to all your data during this outage? Sadly, many users of TSDBs simply live with the outage, and any poll-based alerts that happen to trigger are treated as just part of the quirks of the system.

IRONdb completely sidesteps this problem by keeping multiple copies of your data in the cluster of IRONdb nodes. As data arrives, we store the data on the local node if we determine that it belongs there based on a consistent hash of the incoming metric name, and we also journal it out to other nodes based on your configuration settings for the number of copies you want to keep. We then replicate this data to the other nodes through a background thread process. You can see and hear more details about this process in Circonus Founder and CEO Theo Schlossnagle’s talk on the Architecture of an Analytics and Storage Engine for Time Series Data. Theo and I discuss more TSDB design in this video:
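Conceptually, the node-selection step described above looks something like the following sketch. It is illustrative only: the hash function, virtual node count, and node names (echoing the n2-2 style used in the availability zone diagram later in this post) are assumptions, not IRONdb’s actual topology code.

import hashlib
from bisect import bisect_right

nodes = ["n1-1", "n1-2", "n2-1", "n2-2"]
replication_factor = 2

def _hash(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

# Place each node at several points on a hash ring.
ring = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(16))

def owners(metric_name, k=replication_factor):
    # Walk the ring clockwise from the metric's hash and collect the
    # first k distinct nodes; those nodes own copies of this metric.
    start = bisect_right(ring, (_hash(metric_name), ""))
    chosen = []
    for offset in range(len(ring)):
        node = ring[(start + offset) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == k:
            break
    return chosen

print(owners("web01.cpu.idle"))   # two of the four nodes, chosen deterministically

Every node can compute the same answer from just the metric name and the topology, so no central lookup is needed to decide where a metric lives.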

The HTTP POST of data to the Graphite handler in IRONdb is guaranteed to not respond with a 200 OK until data has both been committed locally and written to the journals for the other nodes it belongs on.

The Graphite network listener cannot make these same claims because there is no acknowledgement mechanism in a plain network socket. It is provided for interoperability purposes, but keep in mind you can lose data if you employ this method of ingestion.

Multi Data Center Capabilities

As an extension of replication, IRONdb can be deployed in a sided configuration which makes it aware that a piece of data must reside on nodes on both sides of the topology.

A piece of data which was destined for n2-2 in the above diagram would be guaranteed to be replicated to a node on the other side of the ring in “Availability Zone 2”. By setting up your IRONdb cluster in this way, you could lose an entire availability zone and still have all of your data available for querying and your cluster available for ingesting. When the downed nodes come back online, the journaled data that has been waiting for them is then replicated in the background until they catch back up.

Since IRONdb is a distributed database, it would not be complete if you had to know where the data was in order to ask for it. You can ask any node in the IRONdb cluster for a time series and range, and it will satisfy as much of the query as it can using local data for performance reasons, proxying to other nodes for time series that don’t live on that node. Keep in mind that in sided configurations where one side is geographically distant, this can lead to a speed-of-light penalty for data fetches. We have plans in the pipeline to fix this weakness and to prefer local replicas if they exist, but for now, if your data centers are far apart and you use a sided config, you will likely pay the speed-of-light RTT to proxy data fetches.

Performance

There are lies, damned lies, and benchmarks as the saying goes. I encourage you to take all of this with a grain of salt and also test for yourself. In the wild, the actual numbers you can achieve with IRONdb are dependent on hardware, your replication factor of data within the cluster, and what your data actually looks like on the way in.

All of that caveat aside, InfluxData has created a nice benchmark suite to compare its time series database to other popular solutions in the open source world. This isn’t an exact analog to IRONdb because IRONdb does not yet support stream based tags (that is coming soon), but the test of ingestion speed with a fixed cardinality can be mimicked. Basically, I synthesized a Graphite metric name from the unique set of tags + fields + measurement name that the Influx benchmark suite uses. This ensured that IRONdb was ingesting the same unique set of metrics that InfluxDB, Cassandra, or OpenTSDB would ingest in the same test.
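The name synthesis was along these lines (a reconstruction of the idea rather than the exact script; the tag names and values shown are assumptions modeled on the benchmark’s “cpu” measurement):

def graphite_name(measurement, tags, field):
    # Flatten an InfluxDB-style (measurement, tags, field) triple into a
    # dotted Graphite metric name with the same unique cardinality.
    tag_part = ".".join(f"{k}.{v}" for k, v in sorted(tags.items()))
    return f"{measurement}.{tag_part}.{field}"

tags = {"hostname": "host_0", "region": "us-west-1", "datacenter": "us-west-1a"}
print(graphite_name("cpu", tags, "usage_user"))
# cpu.datacenter.us-west-1a.hostname.host_0.region.us-west-1.usage_user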

In InfluxData’s comparison of InfluxDB vs. Cassandra, Cassandra achieved 100K metrics ingested per second and InfluxDB achieved 470K metrics per second. I repeated this test using IRONdb:

The Cassandra and InfluxDB numbers were pulled from the original InfluxData post. I did not repeat the benchmark for those two databases on this hardware.

IRONdb (single node) is on the same scale as InfluxDB for the same ingestion set, and perhaps slightly faster. There is no information in the original InfluxData post about the hardware its test was run on. This initial test was from a remote sender sending data to IRONdb over HTTP. I then repeated this test sending data from localhost on the IRONdb node itself:

When eliminating the roundtrip penalty of the benchmark test suite, IRONdb goes significantly faster.

I ran the IRONdb test on a zone on a development box with the following configuration:

CPU:

root@circonus:/root# psrinfo -vp
The physical processor has 8 cores and 16 virtual processors (0-7 16-23)
  The core has 2 virtual processors (0 16)
  The core has 2 virtual processors (1 17)
  The core has 2 virtual processors (2 18)
  The core has 2 virtual processors (3 19)
  The core has 2 virtual processors (4 20)
  The core has 2 virtual processors (5 21)
  The core has 2 virtual processors (6 22)
  The core has 2 virtual processors (7 23)
    x86 (GenuineIntel 306E4 family 6 model 62 step 4 clock 2600 MHz)
      Intel(r) Xeon(r) CPU E5-2650 v2 @ 2.60GHz
The physical processor has 8 cores and 16 virtual processors (8-15 24-31)
  The core has 2 virtual processors (8 24)
  The core has 2 virtual processors (9 25)
  The core has 2 virtual processors (10 26)
  The core has 2 virtual processors (11 27)
  The core has 2 virtual processors (12 28)
  The core has 2 virtual processors (13 29)
  The core has 2 virtual processors (14 30)
  The core has 2 virtual processors (15 31)
    x86 (GenuineIntel 306E4 family 6 model 62 step 4 clock 2600 MHz)
      Intel(r) Xeon(r) CPU E5-2650 v2 @ 2.60GHz

Disk:

       0. c0t5000CCA05CCEDCDDd0 
          /pci@0,0/pci8086,e06@2,2/pci1000,3020@0/iport@ff/disk@w5000cca05ccedcdd,0
       1. c0t5000CCA05CCF8421d0 
          /pci@0,0/pci8086,e06@2,2/pci1000,3020@0/iport@ff/disk@w5000cca05ccf8421,0
       2. c0t5000CCA07313BA35d0 
          /pci@0,0/pci8086,e06@2,2/pci1000,3020@0/iport@ff/disk@w5000cca07313ba35,0
       3. c0t5000CCA073109B75d0 
          /pci@0,0/pci8086,e06@2,2/pci1000,3020@0/iport@ff/disk@w5000cca073109b75,0
       4. c0t5000CCA073111CCDd0 
          /pci@0,0/pci8086,e06@2,2/pci1000,3020@0/iport@ff/disk@w5000cca073111ccd,0
       5. c0t5000CCA073156BFDd0 
          /pci@0,0/pci8086,e06@2,2/pci1000,3020@0/iport@ff/disk@w5000cca073156bfd,0
       ...
       9. c3t55CD2E404C53549Bd0 
          /scsi_vhci/disk@g55cd2e404c53549b

Configured in a 6 way stripe under ZFS with an L2ARC on a single SSD drive:

root@circonus:/root# zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        data                     ONLINE       0     0     0
          c0t5000CCA07313BA35d0  ONLINE       0     0     0
          c0t5000CCA073156BFDd0  ONLINE       0     0     0
          c0t5000CCA073111CCDd0  ONLINE       0     0     0
          c0t5000CCA073109B75d0  ONLINE       0     0     0
          c0t5000CCA05CCEDCDDd0  ONLINE       0     0     0
          c0t5000CCA05CCF8421d0  ONLINE       0     0     0
        cache
          c3t55CD2E404C53549Bd0  ONLINE       0     0     0

The CPUs on this box were mostly bored during this ingestion test:

Being a 16 core box with hyper-threading on, there are 32 vCPUs to address here. The IRONdb process (called snowthd here) is eating about 5 of the cores (mostly in parsing the incoming ASCII text). The drives are pretty busy though:

Data Robustness and Compression

IRONdb runs on OmniOS. OmniOS uses ZFS as its filesystem. ZFS has many amazing features that keep your data safe and small. A few of the important ones:

“A 2012 research showed that neither any of the then-major and widespread filesystems (such as UFS, Ext,[12] XFS, JFS, or NTFS) nor hardware RAID (which has some issues with data integrity) provided sufficient protection against data corruption problems.[13][14][15][16] Initial research indicates that ZFS protects data better than earlier efforts.[17][18] It is also faster than UFS[19][20] and can be seen as its replacement.” – Wikipedia

“In addition to handling whole-disk failures, … can also detect and correct silent data corruption, offering “self-healing data”: when reading a … block, ZFS compares it against its checksum, and if the data disks did not return the right answer, ZFS reads the parity and then figures out which disk returned bad data. Then, it repairs the damaged data and returns good data to the requestor.” – Wikipedia

“ZFS is a 128-bit file system,[31][32] so it can address 1.84 × 10^19 times more data than 64-bit systems such as Btrfs. The limitations of ZFS are designed to be so large that they should never be encountered in practice. For instance, fully populating a single zpool with 2^128 bits of data would require 10^24 3 TB hard disk drives.[33]” – Wikipedia

  • Compression

“Transparent filesystem compression. Supports LZJB, gzip[55] and LZ4.” – Wikipedia

In the above ingestion test, the resultant data occupied 5 bytes per data point due to LZ4 compression enabled on the zpool that the data was written to.

Our CEO has written about this before.

Administration tools

This admin UI gives you introspection into what each IRONdb node is doing. A thorough discussion of the Admin UI would require its own blog post, but at a high level you can see:

  • Current ingest rates and storage space info
  • Replication latency among nodes in the cluster
  • The topology of the cluster
  • Latency measurements of almost every single operation we perform in the database on the “Internals” tab

This last one also provides a nice histogram visualization of the distribution of each operation. Here is an example of the distribution of write operations for the Influx-data ingestion benchmark above:

The 50th to 75th percentile band of write operation latency for batch PUTS to IRONdb was between 570µs and 920µs

Graphite-web Interoperability

IRONdb is compatible with Graphite-web 0.10 and above. We have written a storage finder plugin for use with Graphite-web. Simply install this plugin and configure it to point to one or more of your IRONdb nodes and you can render metrics right from Graphite-web:
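Configuration is a small change to Graphite-web’s local_settings.py. The sketch below is illustrative only: STORAGE_FINDERS is a standard Graphite-web setting, but the finder’s module path, the IRONDB_URLS setting name, and the port are placeholders; consult the plugin’s README for the exact values.

# local_settings.py for Graphite-web (illustrative sketch)
STORAGE_FINDERS = (
    'graphite_irondb.IRONdbFinder',       # hypothetical module path; see the plugin docs
)
IRONDB_URLS = (
    'http://irondb-node1:8112/graphite',  # assumed setting name, host, and port
)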

Or use Grafana with the Graphite datasource:

There is also a Grafana Datasource in the works which will expose even more of the power of IRONdb, so stay tuned!

tl;dr

IRONdb is a replacement for Whisper and Carbon-cache in Graphite that’s faster, more efficient, and easier to operate and scale.

Read the IRONdb Documentation or click below to sign up and install IRONdb today!