Conversation


@garysassano commented Jul 24, 2025

Summary

This PR aligns ClickHouse DDL schema with the official observability schema design by fixing inconsistencies in timestamp handling and removing unnecessary type conversions.

Changes Made

1. Fixed DateTime Precision Inconsistency

Issue: The traces_trace_id_ts table used DateTime while the main traces table uses DateTime64(9), causing precision loss in the materialized view aggregations.

-- ❌ Before: Inconsistent precision
Start DateTime CODEC(Delta, ZSTD(1)),
End DateTime CODEC(Delta, ZSTD(1)),

-- ✅ After: Consistent precision  
Start DateTime64(9) CODEC(Delta, ZSTD(1)),
End DateTime64(9) CODEC(Delta, ZSTD(1)),

Reference: ClickHouse DateTime64 Documentation
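
To illustrate the loss (a minimal sketch with hypothetical values, not part of this PR), casting DateTime64(9) down to DateTime discards the sub-second component:

-- Hypothetical illustration: the cast drops the fractional seconds, so
-- min/max aggregates written through the MV lose up to 1s of precision.
SELECT
    toDateTime64('2025-07-24 12:00:00.123456789', 9) AS full_precision,  -- 2025-07-24 12:00:00.123456789
    toDateTime(toDateTime64('2025-07-24 12:00:00.123456789', 9)) AS truncated;  -- 2025-07-24 12:00:00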

2. Removed Unnecessary Type Conversions

Issue: TTL expressions and ordering keys used toDateTime() conversions, adding function overhead that reduces performance.

// ❌ Before: Unnecessary conversions
build_ttl_string(ttl, "toDateTime(Timestamp)")
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))

// ✅ After: Direct column usage
build_ttl_string(ttl, "Timestamp")  
ORDER BY (ServiceName, SpanName, Timestamp)

Reference: ClickHouse TTL Documentation
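
As a sketch of the resulting DDL (assuming a ClickHouse version that accepts DateTime64 directly in TTL expressions, and mirroring the 259200-second interval from the exporter DDL quoted later in this thread):

-- Hypothetical generated clause before the change:
TTL toDateTime(Timestamp) + INTERVAL 259200 SECOND

-- After the change, the column is used directly:
TTL Timestamp + INTERVAL 259200 SECOND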

3. Performance Issues with Ordering Keys

Issue: Using toDateTime() in ORDER BY clauses slows queries against DateTime64 columns, as noted in the ClickHouse documentation.

-- ❌ Before: Function overhead in ordering
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))

-- ✅ After: Direct column usage for better performance
ORDER BY (ServiceName, SpanName, Timestamp)

Reference: ClickHouse Primary Keys and Ordering
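
A minimal sketch of the kind of query that benefits (hypothetical service name and time bounds): with the raw column in the ordering key, the range filter can be matched against the primary index directly, without an intermediate conversion:

-- Hypothetical range query; with ORDER BY (ServiceName, SpanName, Timestamp)
-- the Timestamp filter can map straight onto the primary index.
SELECT TraceId, SpanName, Duration
FROM otel_traces
WHERE ServiceName = 'checkout'
  AND Timestamp >= toDateTime64('2025-07-24 00:00:00', 9)
  AND Timestamp <  toDateTime64('2025-07-24 01:00:00', 9);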

4. Metrics Schema Using toUnixTimestamp64Nano()

Note: The metrics schema uses toUnixTimestamp64Nano(TimeUnix) in ORDER BY clauses; its performance impact may warrant future evaluation:

ORDER BY (ServiceName, MetricName, toUnixTimestamp64Nano(TimeUnix))

Reference: ClickHouse Observability Schema Design
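
For context, toUnixTimestamp64Nano() maps DateTime64(9) to Int64 nanoseconds since the epoch, a monotonic transform, so the resulting sort order should match ordering on the raw column (hypothetical value below; output assumes a UTC server timezone):

-- Monotonic mapping from DateTime64(9) to Int64 nanoseconds since epoch.
SELECT toUnixTimestamp64Nano(toDateTime64('2025-07-24 12:00:00.000000001', 9)) AS ns;
-- ns: 1753358400000000001 (with a UTC server timezone)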

Impact

  • Performance: Faster queries due to removed function overhead
  • Consistency: Uniform timestamp precision across trace tables
  • Compliance: Alignment with ClickHouse observability best practices

Files Modified

  • src/bin/clickhouse-ddl/ddl_traces.rs
  • src/bin/clickhouse-ddl/ddl_metrics.rs

Migration Notes

For existing deployments, the DateTime precision change requires data migration. TTL and ordering key changes are backward compatible.
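
For the trace ID lookup table, Start and End participate in the sorting key, so the type can't simply be ALTERed in place; a rebuild-and-swap is needed. A minimal, untested migration sketch (table names taken from the exporter DDL below; TTL omitted for brevity, and the materialized view would also need to be recreated against the new table):

-- Hypothetical migration sketch (untested):
CREATE TABLE otel.otel_traces_trace_id_ts_new
(
    TraceId String CODEC(ZSTD(1)),
    Start DateTime64(9) CODEC(Delta, ZSTD(1)),
    End DateTime64(9) CODEC(Delta, ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Start)
ORDER BY (TraceId, Start);

INSERT INTO otel.otel_traces_trace_id_ts_new
SELECT TraceId, Start, End
FROM otel.otel_traces_trace_id_ts;

RENAME TABLE
    otel.otel_traces_trace_id_ts TO otel.otel_traces_trace_id_ts_old,
    otel.otel_traces_trace_id_ts_new TO otel.otel_traces_trace_id_ts;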

@garysassano (Author) commented

ClickHouse Docs Schema

Regarding point 2, I'm not sure. This is what's present in the docs:

CREATE TABLE otel_traces
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `ParentSpanId` String CODEC(ZSTD(1)),
    `TraceState` String CODEC(ZSTD(1)),
    `SpanName` LowCardinality(String) CODEC(ZSTD(1)),
    `SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` String CODEC(ZSTD(1)),
    `SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `Duration` Int64 CODEC(ZSTD(1)),
    `StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
    `StatusMessage` String CODEC(ZSTD(1)),
    `Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
    `Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
    `Links.TraceId` Array(String) CODEC(ZSTD(1)),
    `Links.SpanId` Array(String) CODEC(ZSTD(1)),
    `Links.TraceState` Array(String) CODEC(ZSTD(1)),
    `Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)

CREATE TABLE otel_traces_trace_id_ts
(
    `TraceId` String CODEC(ZSTD(1)),
    `Start` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `End` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY (TraceId, toUnixTimestamp(Start))

CREATE MATERIALIZED VIEW otel_traces_trace_id_ts_mv TO otel_traces_trace_id_ts
(
    `TraceId` String,
    `Start` DateTime64(9),
    `End` DateTime64(9)
)
AS SELECT
    TraceId,
    min(Timestamp) AS Start,
    max(Timestamp) AS End
FROM otel_traces
WHERE TraceId != ''
GROUP BY TraceId

ClickHouse Exporter Schema

And this is what's generated by the ClickHouse Exporter:

CREATE TABLE IF NOT EXISTS "otel"."otel_traces"  (
    Timestamp DateTime64(9) CODEC(Delta, ZSTD(1)),
    TraceId String CODEC(ZSTD(1)),
    SpanId String CODEC(ZSTD(1)),
    ParentSpanId String CODEC(ZSTD(1)),
    TraceState String CODEC(ZSTD(1)),
    SpanName LowCardinality(String) CODEC(ZSTD(1)),
    SpanKind LowCardinality(String) CODEC(ZSTD(1)),
    ServiceName LowCardinality(String) CODEC(ZSTD(1)),
    ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    ScopeName String CODEC(ZSTD(1)),
    ScopeVersion String CODEC(ZSTD(1)),
    SpanAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    Duration UInt64 CODEC(ZSTD(1)),
    StatusCode LowCardinality(String) CODEC(ZSTD(1)),
    StatusMessage String CODEC(ZSTD(1)),
    Events Nested (
        Timestamp DateTime64(9),
        Name LowCardinality(String),
        Attributes Map(LowCardinality(String), String)
    ) CODEC(ZSTD(1)),
    Links Nested (
        TraceId String,
        SpanId String,
        TraceState String,
        Attributes Map(LowCardinality(String), String)
    ) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_duration Duration TYPE minmax GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))
TTL toDateTime(Timestamp) + INTERVAL 259200 SECOND
SETTINGS index_granularity=8192, ttl_only_drop_parts = 1

CREATE TABLE IF NOT EXISTS "otel"."otel_traces_trace_id_ts"  (
    TraceId String CODEC(ZSTD(1)),
    Start DateTime64(9) CODEC(Delta, ZSTD(1)),
    End DateTime64(9) CODEC(Delta, ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Start)
ORDER BY (TraceId, toDateTime(Start))
TTL toDateTime(Start) + INTERVAL 259200 SECOND
SETTINGS index_granularity=8192, ttl_only_drop_parts = 1

CREATE MATERIALIZED VIEW IF NOT EXISTS "otel"."otel_traces_trace_id_ts_mv" 
TO "otel"."otel_traces_trace_id_ts"
AS SELECT
    TraceId,
    min(Timestamp) as Start,
    max(Timestamp) as End
FROM "otel"."otel_traces"
GROUP BY TraceId

@mheffner (Contributor) commented

Thanks @garysassano for taking a look at this. After a brief review I have a few feedback points, but I wonder if @SpencerTorres would have some insight on this too.

For 1: While there is a precision loss, I wonder how much it truly matters for the traces_trace_id_ts table, since it is primarily used by the Grafana ClickHouse datasource for query optimization. As long as the precision loss wouldn't exclude a trace from a timestamp range query, maybe it's OK to keep the precision lower? Then again, unless the lower precision yields savings for the traces_trace_id_ts table, it may be better to maintain full nanosecond precision.
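
(As a hypothetical illustration of that concern: truncating End to DateTime rounds it down, so a lookup instant falling inside the dropped fraction would not be covered.)

-- Hypothetical: End truncated to second precision rounds down, so a range
-- check against the stored value can exclude a trace that actually covers t.
WITH
    toDateTime64('2025-07-24 12:00:00.900', 9) AS true_end,
    toDateTime(toDateTime64('2025-07-24 12:00:00.900', 9)) AS stored_end
SELECT
    true_end   >= toDateTime64('2025-07-24 12:00:00.500', 9) AS covered_full,      -- 1
    stored_end >= toDateTime64('2025-07-24 12:00:00.500', 9) AS covered_truncated; -- 0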

Regarding the datetime conversions in TTL and ORDER BY clauses, I think the primary concern is maintaining compatibility with existing queries from the Grafana datasources and HyperDX/Clickstack. There seem to be mixed opinions (1 and 2) on the best schema for this data. At a glance these changes wouldn't break queries and would simply improve performance, but I'd want to verify that first.
