Conversation


@garysassano commented Jul 24, 2025

Summary

This PR aligns ClickHouse DDL schema with the official observability schema design by fixing inconsistencies in timestamp handling and removing unnecessary type conversions.

Changes Made

1. Fixed DateTime Precision Inconsistency

Issue: The traces_trace_id_ts table used DateTime while the main traces table uses DateTime64(9), causing precision loss in the materialized view aggregations.

-- ❌ Before: Inconsistent precision
Start DateTime CODEC(Delta, ZSTD(1)),
End DateTime CODEC(Delta, ZSTD(1)),

-- ✅ After: Consistent precision  
Start DateTime64(9) CODEC(Delta, ZSTD(1)),
End DateTime64(9) CODEC(Delta, ZSTD(1)),

Reference: ClickHouse DateTime64 Documentation
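
To illustrate the loss (a minimal sketch with hypothetical values, not part of this PR), casting DateTime64(9) down to DateTime discards the sub-second component:

-- Hypothetical illustration: the cast drops the fractional seconds, so
-- min/max aggregates written through the MV lose up to 1s of precision.
SELECT
    toDateTime64('2025-07-24 12:00:00.123456789', 9) AS full_precision,  -- 2025-07-24 12:00:00.123456789
    toDateTime(toDateTime64('2025-07-24 12:00:00.123456789', 9)) AS truncated;  -- 2025-07-24 12:00:00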

2. Removed Unnecessary Type Conversions

Issue: TTL expressions and ordering keys used toDateTime() conversions, adding function overhead that reduces performance.

// ❌ Before: Unnecessary conversions
build_ttl_string(ttl, "toDateTime(Timestamp)")
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))

// ✅ After: Direct column usage
build_ttl_string(ttl, "Timestamp")  
ORDER BY (ServiceName, SpanName, Timestamp)

Reference: ClickHouse TTL Documentation
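
As a sketch of the resulting DDL (assuming a ClickHouse version that accepts DateTime64 directly in TTL expressions, and mirroring the 259200-second interval from the exporter DDL quoted later in this thread):

-- Hypothetical generated clause before the change:
TTL toDateTime(Timestamp) + INTERVAL 259200 SECOND

-- After the change, the column is used directly:
TTL Timestamp + INTERVAL 259200 SECOND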

3. Performance Issues with Ordering Keys

Issue: Using toDateTime() in ORDER BY clauses slows queries against DateTime64 columns, as noted in the ClickHouse documentation.

-- ❌ Before: Function overhead in ordering
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))

-- ✅ After: Direct column usage for better performance
ORDER BY (ServiceName, SpanName, Timestamp)

Reference: ClickHouse Primary Keys and Ordering
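
A minimal sketch of the kind of query that benefits (hypothetical service name and time bounds): with the raw column in the ordering key, the range filter can be matched against the primary index directly, without an intermediate conversion:

-- Hypothetical range query; with ORDER BY (ServiceName, SpanName, Timestamp)
-- the Timestamp filter can map straight onto the primary index.
SELECT TraceId, SpanName, Duration
FROM otel_traces
WHERE ServiceName = 'checkout'
  AND Timestamp >= toDateTime64('2025-07-24 00:00:00', 9)
  AND Timestamp <  toDateTime64('2025-07-24 01:00:00', 9);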

4. Metrics Schema Using toUnixTimestamp64Nano()

Note: The metrics schema uses toUnixTimestamp64Nano(TimeUnix) in ORDER BY clauses; its performance impact may warrant future evaluation:

ORDER BY (ServiceName, MetricName, toUnixTimestamp64Nano(TimeUnix))

Reference: ClickHouse Observability Schema Design
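
For context, toUnixTimestamp64Nano() maps DateTime64(9) to Int64 nanoseconds since the epoch, a monotonic transform, so the resulting sort order should match ordering on the raw column (hypothetical value below; output assumes a UTC server timezone):

-- Monotonic mapping from DateTime64(9) to Int64 nanoseconds since epoch.
SELECT toUnixTimestamp64Nano(toDateTime64('2025-07-24 12:00:00.000000001', 9)) AS ns;
-- ns: 1753358400000000001 (with a UTC server timezone)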

Impact

  • Performance: Faster queries due to removed function overhead
  • Consistency: Uniform timestamp precision across trace tables
  • Compliance: Alignment with ClickHouse observability best practices

Files Modified

  • src/bin/clickhouse-ddl/ddl_traces.rs
  • src/bin/clickhouse-ddl/ddl_metrics.rs

Migration Notes

For existing deployments, the DateTime precision change requires data migration. TTL and ordering key changes are backward compatible.
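
For the trace ID lookup table, Start and End participate in the sorting key, so the type can't simply be ALTERed in place; a rebuild-and-swap is needed. A minimal, untested migration sketch (table names taken from the exporter DDL below; TTL omitted for brevity, and the materialized view would also need to be recreated against the new table):

-- Hypothetical migration sketch (untested):
CREATE TABLE otel.otel_traces_trace_id_ts_new
(
    TraceId String CODEC(ZSTD(1)),
    Start DateTime64(9) CODEC(Delta, ZSTD(1)),
    End DateTime64(9) CODEC(Delta, ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Start)
ORDER BY (TraceId, Start);

INSERT INTO otel.otel_traces_trace_id_ts_new
SELECT TraceId, Start, End
FROM otel.otel_traces_trace_id_ts;

RENAME TABLE
    otel.otel_traces_trace_id_ts TO otel.otel_traces_trace_id_ts_old,
    otel.otel_traces_trace_id_ts_new TO otel.otel_traces_trace_id_ts;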

@garysassano (Author) commented

ClickHouse Docs Schema

Regarding point 2, I'm not sure. This is what's present in the docs:

CREATE TABLE otel_traces
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `ParentSpanId` String CODEC(ZSTD(1)),
    `TraceState` String CODEC(ZSTD(1)),
    `SpanName` LowCardinality(String) CODEC(ZSTD(1)),
    `SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` String CODEC(ZSTD(1)),
    `SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `Duration` Int64 CODEC(ZSTD(1)),
    `StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
    `StatusMessage` String CODEC(ZSTD(1)),
    `Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
    `Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
    `Links.TraceId` Array(String) CODEC(ZSTD(1)),
    `Links.SpanId` Array(String) CODEC(ZSTD(1)),
    `Links.TraceState` Array(String) CODEC(ZSTD(1)),
    `Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)

CREATE TABLE otel_traces_trace_id_ts
(
    `TraceId` String CODEC(ZSTD(1)),
    `Start` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `End` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY (TraceId, toUnixTimestamp(Start))

CREATE MATERIALIZED VIEW otel_traces_trace_id_ts_mv TO otel_traces_trace_id_ts
(
    `TraceId` String,
    `Start` DateTime64(9),
    `End` DateTime64(9)
)
AS SELECT
    TraceId,
    min(Timestamp) AS Start,
    max(Timestamp) AS End
FROM otel_traces
WHERE TraceId != ''
GROUP BY TraceId

ClickHouse Exporter Schema

And this is what's generated by the ClickHouse Exporter:

CREATE TABLE IF NOT EXISTS "otel"."otel_traces"  (
    Timestamp DateTime64(9) CODEC(Delta, ZSTD(1)),
    TraceId String CODEC(ZSTD(1)),
    SpanId String CODEC(ZSTD(1)),
    ParentSpanId String CODEC(ZSTD(1)),
    TraceState String CODEC(ZSTD(1)),
    SpanName LowCardinality(String) CODEC(ZSTD(1)),
    SpanKind LowCardinality(String) CODEC(ZSTD(1)),
    ServiceName LowCardinality(String) CODEC(ZSTD(1)),
    ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    ScopeName String CODEC(ZSTD(1)),
    ScopeVersion String CODEC(ZSTD(1)),
    SpanAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    Duration UInt64 CODEC(ZSTD(1)),
    StatusCode LowCardinality(String) CODEC(ZSTD(1)),
    StatusMessage String CODEC(ZSTD(1)),
    Events Nested (
        Timestamp DateTime64(9),
        Name LowCardinality(String),
        Attributes Map(LowCardinality(String), String)
    ) CODEC(ZSTD(1)),
    Links Nested (
        TraceId String,
        SpanId String,
        TraceState String,
        Attributes Map(LowCardinality(String), String)
    ) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_duration Duration TYPE minmax GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))
TTL toDateTime(Timestamp) + INTERVAL 259200 SECOND
SETTINGS index_granularity=8192, ttl_only_drop_parts = 1

CREATE TABLE IF NOT EXISTS "otel"."otel_traces_trace_id_ts"  (
    TraceId String CODEC(ZSTD(1)),
    Start DateTime64(9) CODEC(Delta, ZSTD(1)),
    End DateTime64(9) CODEC(Delta, ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Start)
ORDER BY (TraceId, toDateTime(Start))
TTL toDateTime(Start) + INTERVAL 259200 SECOND
SETTINGS index_granularity=8192, ttl_only_drop_parts = 1

CREATE MATERIALIZED VIEW IF NOT EXISTS "otel"."otel_traces_trace_id_ts_mv" 
TO "otel"."otel_traces_trace_id_ts"
AS SELECT
    TraceId,
    min(Timestamp) as Start,
    max(Timestamp) as End
FROM "otel"."otel_traces"
GROUP BY TraceId

@mheffner (Contributor) commented

Thanks @garysassano for taking a look at this. After a brief review I have a few feedback points, but I wonder if @SpencerTorres would have some insight on this too.

For 1: While there is a precision loss, I wonder how much it truly matters for the traces_trace_id_ts table, since it is primarily used by the Grafana ClickHouse datasource for query optimization. As long as the precision loss wouldn't exclude a trace from a timestamp range query, maybe it's OK to keep the precision lower? Then again, unless the lower precision yields savings for the traces_trace_id_ts table, it may be better to maintain full nanosecond precision.
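
(As a hypothetical illustration of that concern: truncating End to DateTime rounds it down, so a lookup instant falling inside the dropped fraction would not be covered.)

-- Hypothetical: End truncated to second precision rounds down, so a range
-- check against the stored value can exclude a trace that actually covers t.
WITH
    toDateTime64('2025-07-24 12:00:00.900', 9) AS true_end,
    toDateTime(toDateTime64('2025-07-24 12:00:00.900', 9)) AS stored_end
SELECT
    true_end   >= toDateTime64('2025-07-24 12:00:00.500', 9) AS covered_full,      -- 1
    stored_end >= toDateTime64('2025-07-24 12:00:00.500', 9) AS covered_truncated; -- 0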

Regarding the datetime conversions in TTL and ORDER BY clauses, I think the primary concern is maintaining compatibility with existing queries from the Grafana datasources and HyperDX/Clickstack. There seem to be mixed opinions (1 and 2) on the best schema for this data. At a glance these changes wouldn't break queries and would simply improve performance, but I'd want to verify that first.
