Compression in ClickHouse
One of the secrets to ClickHouse query performance is compression.
Less data on disk means less I/O and faster queries and inserts. In most cases, the CPU overhead of a compression algorithm is outweighed by the reduction in I/O. Improving the compression of the data should therefore be the first focus when working to make ClickHouse queries fast.
For why ClickHouse compresses data so well, we recommend this article. In summary, because ClickHouse is a column-oriented database, values are written in column order. If these values are sorted, identical values end up adjacent to each other, and compression algorithms exploit such contiguous patterns of data. On top of this, ClickHouse has codecs and granular data types which allow users to tune the compression techniques further.
Compression in ClickHouse will be impacted by 3 principal factors:
- The ordering key
- The data types
- Which codecs are used
All of these are configured through the schema.
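As a minimal sketch of where each of the three factors appears in a table definition, consider the following. The table `page_views_example` and all of its columns are hypothetical and used only for illustration; they are not part of the dataset discussed below.

```sql
-- Hypothetical table: every name here is an assumption made for illustration.
CREATE TABLE page_views_example
(
    -- Data types: pick the smallest type that fits the values.
    `country`     LowCardinality(String),
    `status_code` UInt16,
    `url`         String,
    -- Codecs: an encoding (Delta) followed by a compression algorithm (ZSTD).
    `event_time`  DateTime CODEC(Delta, ZSTD(1))
)
ENGINE = MergeTree
-- Ordering key: rows are sorted by these expressions before being compressed.
ORDER BY (country, toDate(event_time), status_code);
```

Whether these particular choices help depends on the data, so they should be validated by measuring column sizes after loading.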
Choose the right data type to optimize compression
Let's use the Stack Overflow dataset as an example. Let's compare compression statistics for the following schemas for the posts table:
- posts: a schema with no type optimization and no ordering key.
- posts_v3: a type-optimized schema with the appropriate type and bit size for each column, and the ordering key (PostTypeId, toDate(CreationDate), CommentCount).
Using the following queries, we can measure the current compressed and uncompressed size of each column. Let's examine the size of the initial, non-optimized schema posts with no ordering key.
We show both the compressed and uncompressed sizes here, and both matter. The compressed size is what must be read from disk, something we want to minimize for query performance (and storage cost). This data must be decompressed before it can be processed. The uncompressed size depends on the data types used. Minimizing it reduces the memory overhead of queries and the amount of data the query has to process, improving cache utilization and ultimately query times.
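A per-column measurement of this kind can be sketched as follows. This is an illustrative query rather than necessarily the exact one used to produce the original figures; it assumes the posts table lives in the current database.

```sql
-- Per-column compressed and uncompressed sizes for the posts table.
SELECT
    name,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()  -- assumption: table is in the current database
  AND table = 'posts'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC;
```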
The above query relies on the columns table in the system database. This database is managed by ClickHouse and is a treasure trove of useful information, from query performance metrics to background cluster logs. We recommend "System Tables and a Window into the Internals of ClickHouse" and accompanying articles[1][2] for the curious reader.
To summarize the total size of the table, we can simplify the above query:
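A sketch of such a table-level summary, dropping the per-column grouping and again assuming the table is in the current database:

```sql
-- Total compressed and uncompressed size across all columns of posts.
SELECT
    formatReadableSize(sum(data_compressed_bytes))   AS compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()  -- assumption: table is in the current database
  AND table = 'posts';
```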
Repeating this query for posts_v3, the table with optimized types and an ordering key, we can see a significant reduction in both uncompressed and compressed sizes.
The full column breakdown shows considerable savings for the Body, Title, Tags and CreationDate columns achieved by ordering the data prior to compression and using the appropriate types.
Choosing the right column compression codec
With column compression codecs, we can change the algorithm (and its settings) used to encode and compress each column.
Encodings and compression work slightly differently with the same objective: to reduce our data size. Encodings apply a mapping to our data, transforming the values based on a function by exploiting properties of the data type. Conversely, compression uses a generic algorithm to compress data at a byte level.
Typically, encodings are applied before compression. Since different encodings and compression algorithms are effective on different value distributions, we must understand our data.
ClickHouse supports a large number of codecs and compression algorithms. The following are some recommendations in order of importance:
| Recommendation | Reasoning | 
|---|---|
| ZSTD all the way | ZSTD compression offers the best rates of compression. ZSTD(1) should be the default for most common types. Higher rates of compression can be tried by modifying the numeric value. We rarely see sufficient benefits on values higher than 3 for the increased cost of compression (slower insertion). |
| Delta for date and integer sequences | Delta-based codecs work well whenever you have monotonic sequences or small deltas in consecutive values. More specifically, the Delta codec works well, provided the derivatives yield small numbers. If not, DoubleDelta is worth trying (this typically adds little if the first-level derivative from Delta is already very small). Sequences where the monotonic increment is uniform will compress even better, e.g. DateTime fields. |
| Delta improves ZSTD | ZSTD is an effective codec on delta data; conversely, delta encoding can improve ZSTD compression. In the presence of ZSTD, other codecs rarely offer further improvement. |
| LZ4 over ZSTD if possible | If you get comparable compression between LZ4 and ZSTD, favor the former since it offers faster decompression and needs less CPU. However, ZSTD will outperform LZ4 by a significant margin in most cases. Some of these codecs may work faster in combination with LZ4 while providing similar compression compared to ZSTD without a codec. This will be data specific, however, and requires testing. |
| T64 for sparse or small ranges | T64 can be effective on sparse data or when the range in a block is small. Avoid T64 for random numbers. |
| Gorilla and T64 for unknown patterns? | If the data has an unknown pattern, it may be worth trying Gorilla and T64. |
| Gorilla for gauge data | Gorilla can be effective on floating point data, specifically that which represents gauge readings, i.e. random spikes. |
See here for further options.
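Because several of the recommendations above are data specific, one way to test them is to store the same values under different codecs and compare the resulting sizes. The following is a minimal sketch: the table `codec_test`, its column names, and the synthetic one-second-interval timestamps are all assumptions chosen for illustration, and real data will behave differently.

```sql
-- Hypothetical test table: same values, three different codec settings.
CREATE TABLE codec_test
(
    `ts_default`    DateTime,                           -- server default codec
    `ts_delta_zstd` DateTime CODEC(Delta, ZSTD(1)),     -- encoding, then ZSTD
    `ts_dd_lz4`     DateTime CODEC(DoubleDelta, LZ4)    -- encoding, then LZ4
)
ENGINE = MergeTree
ORDER BY ts_default;

-- Synthetic data: one million timestamps with a uniform one-second increment.
INSERT INTO codec_test
SELECT
    ts AS ts_default,
    ts AS ts_delta_zstd,
    ts AS ts_dd_lz4
FROM
(
    SELECT toDateTime('2024-01-01 00:00:00') + number AS ts
    FROM numbers(1000000)
);

-- Compare the on-disk size of each variant.
SELECT
    name,
    compression_codec,
    formatReadableSize(data_compressed_bytes)   AS compressed_size,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed_size
FROM system.columns
WHERE database = currentDatabase() AND table = 'codec_test'
ORDER BY data_compressed_bytes ASC;
```

Replacing the synthetic subquery with a sample of the real column gives a more representative comparison.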
Below we specify the Delta codec for the Id, ViewCount and AnswerCount, hypothesizing these will be linearly correlated with the ordering key and thus should benefit from Delta encoding.
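A reduced sketch of such a schema is shown here. The table name `posts_codec_sketch`, the column types, and the omission of the remaining posts columns are assumptions made to keep the example short; only the codec placement and the ordering key follow the text.

```sql
-- Hypothetical, reduced schema: only the columns relevant to the codec change.
CREATE TABLE posts_codec_sketch
(
    `Id`           Int32  CODEC(Delta, ZSTD),   -- type is an assumption
    `PostTypeId`   UInt8,                       -- type is an assumption
    `CreationDate` DateTime,                    -- type is an assumption
    `ViewCount`    UInt32 CODEC(Delta, ZSTD),   -- type is an assumption
    `AnswerCount`  UInt16 CODEC(Delta, ZSTD),   -- type is an assumption
    `CommentCount` UInt8                        -- type is an assumption
)
ENGINE = MergeTree
ORDER BY (PostTypeId, toDate(CreationDate), CommentCount);
```

Columns without an explicit CODEC clause keep the server's default compression.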
The compression improvements for these columns are shown below:
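A comparison of this kind can be sketched with a query over system.columns for both tables. The name `posts_v4` for the codec-enabled table is a hypothetical placeholder; substitute whatever the table is actually called.

```sql
-- Side-by-side sizes of the three columns, before and after adding codecs.
SELECT
    `table`,
    name,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND name IN ('Id', 'ViewCount', 'AnswerCount')
  AND `table` IN ('posts_v3', 'posts_v4')  -- posts_v4 is a hypothetical name
GROUP BY `table`, name
ORDER BY name ASC, `table` ASC;
```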
Compression in ClickHouse Cloud
In ClickHouse Cloud, we use the ZSTD compression algorithm (at level 1) by default. While compression speed varies with the compression level (higher = slower), ZSTD has the advantage of being consistently fast on decompression (around 20% variance) and can be parallelized. Our historical tests also suggest that it is often sufficiently effective and can even outperform LZ4 combined with a codec. It is effective on most data types and information distributions, which makes it a sensible general-purpose default and explains why the compression measured earlier is already excellent even without optimization.
