Streamlining Long-Term Storage Query Performance for Metrics With Thanos

Hélia Barroso
Published in DevOps.dev · 6 min read · Feb 15, 2024


Introduction

When you first start running Thanos and Prometheus for metric collection and long-term storage, there is a learning curve, especially around Thanos. I still remember the first time Thanos Compact halted because I had mistakenly enabled more than one replica for Compact, and the cleanup that followed.

But after a few of those incidents, once you get more comfortable with your setup, you start considering how to optimize Thanos to improve the querying experience for your daily needs, from dashboards to debugging issues. With this in mind, in this article I'm going to share some of what I've learned in that process.

Disclaimer

This article assumes a basic understanding of Thanos and its main components, as well as Prometheus. You can take a look at this tutorial for a primer.

Improving Querying, one component at a time

Thanos Compact

As with every Thanos component, Compact takes over one of Prometheus's jobs, compacting and downsampling metrics, but instead of doing it in memory, it works on the blocks in your object storage of choice. This is necessary to allow querying over longer periods of time, since downsampling reduces the amount of retrieved data. It is also the component where you define the retention of the metrics in storage.

By default Thanos does not define a retention period, but ideally you should set one, especially to avoid storing unused metrics. Planning how much retention you need starts with overall team needs, but you also have to consider for how long you need the raw metrics. Keep in mind that retrieving raw metrics over long periods produces a lot of data points: if you are querying a couple of months, it rarely makes sense to fetch a data point for every 30s or 1 min, depending on your Prometheus scrape interval.

So, as an example: if you want to store metrics for 1 year, keeping the raw metrics for up to 3 months and 5-minute resolution for up to 180 days, you just need to pass the following flags to Compact:

--retention.resolution-raw=90d
--retention.resolution-5m=180d
--retention.resolution-1h=1y

Also, if Compact runs out of blocks to work on, it will exit. To avoid that, either set the --wait flag to keep it running and waiting for more metrics to handle, or --wait-interval if you prefer to specify a specific interval between runs.
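Putting the pieces together, a minimal Compact invocation could look like the sketch below; the data directory and object storage config paths are placeholders for your own setup:

thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=90d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=1y \
  --wait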

But by far the most important thing is to always keep only one instance running to handle compaction and downsampling for any given Prometheus/object store. And always ensure it has enough local disk space, to avoid halting due to lack of space. You can estimate the amount of local storage you need from the amount of metrics collected by Prometheus; with a bit of digging you might find something like the formula below. From personal experience, add a bit of extra space to be safe, around 10% more, and you're good to go.

sum(
  rate(prometheus_tsdb_head_samples_appended_total[2h]) *
  (rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[2h])
   / rate(prometheus_tsdb_compaction_chunk_samples_sum[2h]))
) * 1.5 / 1024
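One way to read this: it multiplies the samples ingested per second by the average compressed bytes per sample, adds a 1.5x safety factor, and divides by 1024 to express the ingestion rate in KiB per second; roughly speaking, multiplying that rate by the time span of the largest blocks Compact has to process at once approximates the peak disk usage during compaction.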

The last consideration is the amount of resources you might need. Since the CPU count should be in line with the --compact.concurrency flag, which defaults to 1, you will need at least one CPU. Memory is related to the number and size of blocks; you can get an idea from the official documentation, but from personal experience it is better to over-provision at the beginning and adjust down later, to avoid OOM kills when heavier workloads come through.
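If you run Compact on Kubernetes, a conservative starting point could look like the sketch below; the numbers are purely illustrative and should be tuned against your actual block volume:

# Illustrative starting point only; adjust to your block volume
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    memory: 4Gi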

Thanos Store

Thanos Store is responsible for retrieving data from object storage based on the query's time range. To better distribute your query load, you can shard Store by defining the --min-time and --max-time flags, so that each instance serves a dedicated time window. For example, based on the one-year retention defined before, 3 shards could be:

# From now back to 30 days ago:
--min-time=-30d
# From 30 days ago back to 90 days ago:
--min-time=-90d
--max-time=-30d
# From 90 days ago back to 1 year ago:
--min-time=-1y
--max-time=-90d

Of course, some thought needs to go into how you cut the shards: depending on the volume of queries for each time range, you might need to dig a bit into the logs. For example, if you have a high volume of queries within the last 24 hours and very few beyond that, you might want to create a dedicated shard for that time range and adjust the others accordingly.

Furthermore, you can also shard in a similar way as you can with Prometheus, using relabelling, either with hashmod or with plain keep or drop rules. For that you pass the --selector.relabel-config flag with the desired relabel config, for example:

--selector.relabel-config=
  - action: hashmod
    source_labels: ["__block_id"]
    target_label: shard
    modulus: 2
  - action: keep
    source_labels: ["shard"]
    regex: 0
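With modulus: 2, a second Store instance running the same config but with regex: 1 keeps the other half of the blocks, so the two instances split the bucket roughly evenly between them.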

Another thing to have in mind is that Store has in-memory index caching enabled by default, to speed up the retrieval of TSDB blocks. You can set --index-cache-size to define how much data is held in memory, or rely on Memcached or Redis to handle the index cache instead.
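As a minimal sketch, an external index cache can be wired up through --index-cache.config-file; the example below assumes a Memcached instance reachable at memcached:11211, and the values are illustrative:

type: MEMCACHED
config:
  addresses: ["memcached:11211"]
  timeout: 500ms
  max_idle_connections: 100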

Thanos Query

Query is responsible for fanning queries out to everything that implements the Store API: this can be another Thanos Query, if you have one main Querier handling the queries to all the rest, or Store, Sidecar, Ruler, or Receive, since all of those have the Store API built in.
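As a quick sketch, on recent Thanos versions the Store API endpoints are registered on Query with repeated --endpoint flags (older versions used --store); the hostnames here are placeholders:

thanos query \
  --endpoint=sidecar:10901 \
  --endpoint=store:10901 \
  --endpoint=ruler:10901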

By default, Thanos Query relies on the Prometheus PromQL engine, which, from the moment a query starts, builds a tree-like structure of operations to assemble the data to be returned. More recently the Thanos PromQL engine was introduced, which retrieves the data in a distributed way and thus speeds up query execution. It already supports most queries, and since it falls back to the Prometheus engine for unsupported cases, you can simply enable it with:

--query.promql-engine=thanos

Thanos Query Frontend

After going through all the other Thanos components, the final one for optimizing your querying should be Query Frontend, as it breaks your queries into multiple shorter ones and also has caching, either built in or backed by Memcached or Redis.

The splitting of the queries is based on --query-range.split-interval, which defaults to 24h, though --query-range.split-interval=6h has worked quite well for me. I also recommend going with Memcached, as it improves overall query performance. An example of the Memcached configuration:

type: MEMCACHED
config:
  addresses: [memcache-host:port]
  timeout: 3s
  max_idle_connections: 1024
  max_async_concurrency: 20
  max_item_size: 30MB
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
  expiration: 24h
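This configuration can then be handed to Query Frontend, either inline via --query-range.response-cache-config or from a file with --query-range.response-cache-config-file.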

After this you should have a nice setup for querying your metrics. A bonus point would be to put Grafana or Perses in front of it, for some nice dashboards.

Conclusions

I hope this article helps you in some way with your Prometheus/Thanos setup. I took some inspiration to write it after watching the Intro and Deep Dive Into Thanos presentation from last year's KubeCon EU. I highly recommend it, especially for a deeper dive into some of the specifics of Thanos, such as the Thanos PromQL engine, which is an amazing feature.

This week I also read Monitoring Reinvented, which I feel is a good complementary read to this. It provides a good overview with some other important takes, like using hashmod to split targets by address across Prometheus replicas scraping the same applications, reducing the number of targets per replica and improving Prometheus performance.



Passionate about Observability & SRE. I also spend a lot of time on cross stitch and reading books. Observability Engineer @ Lynxmind