OpenTelemetry Traces to Parquet: How It Works

Debabrata Panigrahi
April 28, 2025 · Last updated: April 20, 2026
Learn how OpenTelemetry traces can be stored in Apache Parquet for faster queries, better compression, lower storage cost, and scalable trace analysis.

OpenTelemetry Traces to Parquet: Why Columnar Storage Works

Distributed systems emit billions of spans. A single user request can touch ten services, produce dozens of child spans, and carry hundreds of attributes. Multiply that by traffic volume and retention window, and trace storage becomes one of the most expensive and query-intensive parts of an observability stack.

Moving OpenTelemetry traces to Parquet changes the storage model. Instead of storing each span as a JSON blob or a row-oriented record, Apache Parquet lays the data out column by column. Queries that filter by service name, status code, or duration read only those columns — not every field in every span. That changes the economics of trace retention and the speed of trace analysis.

Parseable uses Parquet as its underlying storage format, paired with metadata management, smart caching, and a Rust-based query engine. This article explains how the model works, why it fits OpenTelemetry trace data specifically, and where any Parquet-backed trace system still needs careful engineering.


What you'll learn

  • How OpenTelemetry traces are structured as span trees
  • How span trees are flattened into rows for analytical storage
  • Why Apache Parquet's columnar layout reduces I/O for trace queries
  • How Parseable maps common span attributes into queryable columns
  • Why column pruning, predicate pushdown, compression, and vectorized processing matter for trace workloads
  • Where Parquet trace storage needs careful backend design: compaction, dynamic attributes, trace reconstruction, and query planning

Quick answer: why store OpenTelemetry traces in Parquet?

OpenTelemetry spans are structured records with repeated fields: trace_id, span_id, service.name, span.name, status.code, duration, and resource attributes. In a row-oriented format, a query for all error spans from a specific service reads the entire dataset. In a columnar format like Parquet, the same query reads only the status.code and service.name columns — skipping everything else.

That property makes OTel traces to Parquet a natural fit for high-throughput trace workloads:

  • Column pruning — read only the fields the query touches
  • Predicate pushdown — skip row groups that cannot match the filter before reading them
  • Compression — repeated values like service names and status codes compress extremely well column by column
  • Vectorized scans — modern CPUs can scan and filter columnar data in wide batches

The result is faster queries, lower storage cost, and more practical long-term trace retention. Parquet is not magic on its own — the backend still needs ingestion batching, file compaction, metadata management, and good query planning. But the storage format is the right foundation for analytical trace workloads.


What are OpenTelemetry traces?

OpenTelemetry is a vendor-neutral observability framework that standardizes how applications collect and export telemetry. It defines APIs and SDKs for traces, metrics, and logs, and specifies OTLP as the wire protocol for exporting that data to backends. OpenTelemetry is the collection and transport layer — storage and visualization are handled by observability backends like Parseable.
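As a minimal sketch of the export side, an OpenTelemetry SDK can be pointed at any OTLP-compatible backend. The endpoint and header values below are placeholders, not Parseable-specific settings:

# Minimal OpenTelemetry Python SDK setup exporting spans over OTLP/HTTP.
# The endpoint and headers are illustrative placeholders; use your backend's
# actual OTLP endpoint and authentication scheme.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "auth-service"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://your-backend.example.com/v1/traces",  # placeholder
            headers={"Authorization": "Basic <credentials>"},       # placeholder
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")
with tracer.start_as_current_span("GET /login"):
    with tracer.start_as_current_span("Validate Token"):
        pass  # child span shares trace_id and records the parent's span_id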

How traces and spans work

A trace represents the full journey of a request through a distributed system. It is a tree of spans — individual units of work — connected by shared trace_id and parent-child span_id references.

Each span records:

  • name — the operation (e.g., GET /login, SQL Query)
  • trace_id — shared identifier for the whole trace
  • span_id — unique identifier for this span
  • parent_span_id — the span that triggered this one (null for the root)
  • service.name — which service produced the span
  • start and end timestamps — for duration calculation
  • status — OK, ERROR, or UNSET
  • attributes — key-value pairs like http.method, db.system, k8s.pod.name
  • span events — timestamped log-like records attached to the span
  • links — references to related spans outside this trace

OpenTelemetry's trace specification defines spans as the primary unit of tracing. The hierarchy is maintained through IDs, not nesting — which is what makes flattened storage practical.

Trace tree example

A single login request touching four services might produce this trace:

| trace_id | span_id | parent_span_id | service_name | operation_name | duration_ms | status |
|----------|---------|----------------|--------------|----------------|-------------|--------|
| abc123   | 1       | NULL           | API Gateway  | GET /login     | 142         | OK     |
| abc123   | 2       | 1              | Auth Service | Validate Token | 38          | OK     |
| abc123   | 3       | 1              | User Service | Fetch Profile  | 91          | OK     |
| abc123   | 4       | 3              | DB Service   | SQL Query      | 87          | OK     |

Each row is a span. The parent_span_id column encodes the tree structure. No graph database is required to store or query this — the hierarchy is recoverable from the IDs at query or visualization time.

Why trace depth and width matter for storage

Production traces are rarely four spans deep. A complex microservices request can produce hundreds of spans across dozens of services, each carrying different resource attributes, SDK versions, deployment identifiers, and custom application attributes. The combination of high span volume, deep hierarchies, wide attribute sets, and high cardinality (unique trace IDs, unique request IDs) makes trace data one of the hardest signals to store and query efficiently.




How Parseable stores OpenTelemetry traces in Parquet

Flattening traces into span rows

Parseable receives OpenTelemetry trace data over OTLP, ingests it through its OpenTelemetry ingestion pipeline, and stores spans as rows in Apache Parquet files on object storage.

The trace tree is flattened at ingestion: each span becomes a row, with trace_id, span_id, and parent_span_id preserving the hierarchy. No separate graph storage is needed. The tree can be reconstructed at query time when a trace viewer needs to display the hierarchy.

This model keeps storage simple, analytical, and schema-aligned. Parquet files on object storage are immutable, columnar, and indexable, well-suited for long-term retention and batch analytical queries.
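As a simplified sketch of the flattening step — assuming spans arrive as dictionaries already decoded from OTLP, and using illustrative column names rather than Parseable's internal schema — each span becomes one row and the hierarchy survives as plain ID columns:

# Illustrative flattening sketch: one row per span, written as a columnar file.
# Column names mirror the trace tree example above and are assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

spans = [
    {"trace_id": "abc123", "span_id": "1", "parent_span_id": None,
     "service_name": "API Gateway", "span_name": "GET /login",
     "duration_ms": 142, "status_code": "OK"},
    {"trace_id": "abc123", "span_id": "2", "parent_span_id": "1",
     "service_name": "Auth Service", "span_name": "Validate Token",
     "duration_ms": 38, "status_code": "OK"},
]

table = pa.Table.from_pylist(spans)          # one row per span, columnar in memory
pq.write_table(table, "spans-0001.parquet")  # immutable columnar file for object storage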

Attribute splitting in Parseable

Not all span attributes are equal. Some appear on nearly every span — service.name, telemetry.sdk.language, http.method, status.code. Others are dynamic and inconsistent across services.

Parseable handles this with a two-tier attribute model:

| Raw OpenTelemetry attribute  | How Parseable stores it            | Why it matters                                        |
|------------------------------|------------------------------------|-------------------------------------------------------|
| service.name                 | Dedicated service_name column      | High-frequency filter; benefits from column pruning   |
| telemetry.sdk.language       | Dedicated sdk_language column      | Useful for SDK-level analysis and aggregation         |
| span.name / operation        | Dedicated column                   | Common grouping and filter dimension                  |
| status.code                  | Dedicated column                   | Most error queries start here                         |
| Dynamic / custom attributes  | other_attributes key-value column  | Preserves flexibility without schema explosion        |

Promoting high-frequency attributes to top-level columns means those fields benefit from full column pruning, row-group statistics, and predicate pushdown. Dynamic attributes stay queryable through the other_attributes column without requiring a schema change for every new attribute a team adds.
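A minimal sketch of the two-tier split looks like this. The promoted-key set and output column names here are illustrative; a real backend derives them from its own schema rules:

# Sketch of two-tier attribute splitting. The PROMOTED mapping is an example,
# not Parseable's actual promotion list.
import json

PROMOTED = {
    "service.name": "service_name",
    "telemetry.sdk.language": "sdk_language",
}

def split_attributes(attributes: dict) -> dict:
    row = {}
    other = {}
    for key, value in attributes.items():
        if key in PROMOTED:
            row[PROMOTED[key]] = value   # dedicated, prunable column
        else:
            other[key] = value           # dynamic attribute, no schema change needed
    row["other_attributes"] = json.dumps(other)
    return row

print(split_attributes({
    "service.name": "auth-service",
    "telemetry.sdk.language": "python",
    "feature.flag.variant": "B",         # custom attribute stays queryable as key-value
}))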


Why columnar storage helps OpenTelemetry traces

Selective queries: read only the columns you need

In a row-oriented store, a query like "find all error spans from the auth service in the last hour" reads every field of every span in the time window — timestamps, trace IDs, span IDs, resource attributes, event lists, and everything else — even though the query only cares about service_name, status_code, and p_timestamp.

In Parquet, that same query reads three columns and skips the rest. For trace data with wide attribute sets, the I/O reduction is significant. Queries that filter by service, status, timestamp, or duration — the most common trace queries — touch a small fraction of the stored data.

This is the core value of columnar storage for traces: the query I/O scales with the number of fields queried, not the width of the schema.
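At the file level, column pruning looks like this with pyarrow. The file name and column set are carried over from the flattening sketch above and are assumptions about your layout:

# Column pruning: only the listed columns are read from the Parquet file;
# every other span field is skipped entirely.
import pyarrow.parquet as pq

table = pq.read_table(
    "spans-0001.parquet",  # assumed to contain these columns
    columns=["service_name", "status_code", "duration_ms"],
)
print(table.num_rows, table.schema.names)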

Compression and redundancy in trace data

Parquet compresses each column independently. For trace data, this has a large practical effect because span records carry many repeated values:

  • service_name: a system with 20 services has only 20 distinct values across billions of spans
  • status_code: three possible values (OK, ERROR, UNSET)
  • sdk_language: a handful of values (python, go, java, dotnet)
  • environment: production, staging, dev
  • Kubernetes attributes: pod names, namespace names, and node names repeat heavily within a deployment window

Dictionary encoding and run-length encoding in Parquet exploit this repetition directly. Columns with low cardinality — which describes most resource attributes — compress to a small fraction of their raw size. High-cardinality columns like trace_id compress less aggressively, but they represent a smaller share of total storage than the repeated metadata fields.

In practice, Parquet-stored observability data is substantially more compact than equivalent JSON. The exact ratio depends on attribute density, cardinality distribution, and compression codec. Benchmarks from Parseable's Apache Parquet for observability work show meaningful compression gains for typical telemetry workloads. For authoritative numbers from your own data, compare ingest volume against Parquet file size on object storage after a representative retention window.
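A small illustration of how repeated values encode, using synthetic data; the exact ratios you see will depend on your own attribute distribution and codec choice:

# Dictionary encoding plus a modern codec exploits repetition in columns such
# as service_name and status_code. Data below is synthetic for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "service_name": ["auth-service"] * 1000 + ["user-service"] * 1000,
    "status_code": ["OK"] * 1990 + ["ERROR"] * 10,
    "duration_ms": list(range(2000)),
})

pq.write_table(
    table,
    "spans-encoded.parquet",
    use_dictionary=True,   # dictionary-encode low-cardinality columns
    compression="zstd",    # each column chunk compressed independently
)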

Better disk I/O with predicate pushdown

Parquet files are divided into row groups, each storing column-level statistics: minimum and maximum values for every column in that group. When a query includes a filter like p_timestamp > '2024-01-15T14:00:00Z', the query engine reads those statistics before touching the actual data. Row groups where p_timestamp_max < 14:00:00 are skipped entirely.

For time-bounded trace queries — which is most production queries — predicate pushdown means the engine reads only the row groups that could possibly match. Combined with column pruning, this significantly reduces the data scanned per query.
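The row-group statistics that make this possible can be inspected directly with pyarrow, and a filter passed to the reader lets non-matching groups be skipped. This sketch writes its own small file so the file name and columns are illustrative:

# Write spans with small row groups, inspect per-group statistics, then read
# with a filter: row groups whose min/max ranges cannot match are skipped.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "duration_ms": list(range(4000)),
    "status_code": ["OK"] * 4000,
})
pq.write_table(table, "spans-groups.parquet", row_group_size=1000)

pf = pq.ParquetFile("spans-groups.parquet")
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics  # duration_ms
    print(i, stats.min, stats.max)

slow = pq.read_table(
    "spans-groups.parquet",
    filters=[("duration_ms", ">", 3500)],  # only the last row group can match
)
print(slow.num_rows)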

CPU-efficient scans and vectorized processing

Parquet stores data in contiguous blocks per column. Scanning a column means sequential memory reads rather than scattered random reads across rows. Modern CPUs handle sequential access efficiently, both through prefetching and through SIMD (vectorized) instructions that process multiple values in a single CPU operation.

For a filter like status_code = 'ERROR', vectorized execution can evaluate the filter across a wide batch of values per instruction rather than one at a time. The practical effect is faster scans for high-volume trace data without requiring specialized hardware.
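A tiny illustration of batch-at-a-time evaluation using Arrow compute kernels, which operate over whole column batches rather than one value per iteration:

# Vectorized filtering with Arrow compute kernels over a status_code column.
import pyarrow as pa
import pyarrow.compute as pc

status = pa.array(["OK", "ERROR", "OK", "UNSET", "ERROR"])
mask = pc.equal(status, "ERROR")                    # evaluated in batches
print(mask)                                         # [false, true, false, false, true]
print(pc.sum(pc.cast(mask, pa.int64())).as_py())    # number of error spans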

The cardinality problem and why columns help

High-cardinality trace data creates a specific challenge for row-oriented indexes: a unique trace_id on every span means no index reduces the scan much. But most production trace queries do not query by trace ID directly — they query by service, status, time window, and operation name, then retrieve trace IDs from the results.

Columnar storage sidesteps the trace ID cardinality problem for these common patterns. The engine reads the low-cardinality filter columns first, reduces the candidate set, and only then accesses the trace_id and span_id columns for the matched rows.



Real-world context: trace storage is moving toward columnar formats

Parseable and Parquet-backed observability

Parseable stores all telemetry (logs, traces, and metrics) in Apache Parquet on object storage. The architecture combines Parquet with metadata management, smart caching, and a Rust-based query engine built on DataFusion. Parquet is the foundation; the query acceleration layer handles what Parquet alone cannot: fast metadata pruning across thousands of files, caching for hot query patterns, and efficient ingest batching before files are written.

Grafana Tempo and Parquet trace storage

Grafana's Tempo project moved toward Parquet trace storage starting with its 1.5 release, which introduced experimental Parquet support. Tempo's rationale aligned with the same principle: Parquet lets queries read fewer columns and pull less data from object storage, which matters at the scale of billions of spans. Tempo 2.0 formalized Parquet as the default backend format. The Tempo example matters because it shows that the Parquet trace storage model is not specific to any single vendor — it is emerging as a standard pattern for scalable distributed trace storage.

OpenTelemetry and columnar telemetry formats

The OpenTelemetry project has explored columnar data transport through the OpenTelemetry Protocol with Apache Arrow (OTel Arrow). The focus of that work is efficient columnar transport and reduced wire-format overhead, not Parquet storage directly. The broader pattern — telemetry data represented and processed in columnar layouts — runs through both the transport and storage layers of modern observability infrastructure.


Where Parquet trace storage needs careful design

Parquet is a good foundation, but it does not solve all trace storage problems on its own. Production systems need engineering at several layers above the file format.

Small files and compaction

Writing many tiny Parquet files — one per batch of incoming spans — creates a fragmentation problem for object storage. Object-store metadata operations become expensive, and query planning over thousands of tiny files adds overhead. Production Parquet backends need ingestion batching (accumulate spans before writing) and periodic compaction (merge small files into larger ones). The file layout strategy affects query performance as much as the format itself.
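A highly simplified compaction pass might look like the sketch below. The file naming pattern is an assumption, all inputs are assumed to share one schema, and a real backend coordinates this through metadata and handles concurrent writers and deletes:

# Simplified compaction sketch: merge many small span files into one larger
# Parquet file. File pattern and schema uniformity are assumptions.
import glob
import pyarrow as pa
import pyarrow.parquet as pq

small_files = sorted(glob.glob("spans-batch-*.parquet"))
merged = pa.concat_tables([pq.read_table(f) for f in small_files])
pq.write_table(merged, "spans-compacted.parquet", row_group_size=1_000_000)
# Only after the compacted file is durably written can the small files be removed.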

Dynamic attributes and schema evolution

OpenTelemetry attribute sets are not fixed. Different services attach different attributes, SDKs evolve, and teams add custom dimensions over time. A storage system that requires a fixed schema for all attributes will either reject new attributes or require expensive schema migrations. The two-tier model (promoted columns + other_attributes) handles this, but attribute promotion decisions need thought: promote too few and common queries stay slow; promote too many and the schema becomes unwieldy.

Trace reconstruction

Flattening spans into rows is efficient for storage and analytical queries, but trace visualization requires reconstructing the tree. That means the backend needs to fetch all spans for a given trace_id, sort them by parent relationship, and build the hierarchy in memory or at query time. For deep traces with hundreds of spans, this reconstruction step is where latency is often highest. Good backends cache recently accessed traces and build efficient span-assembly paths.
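A minimal sketch of that assembly step, assuming spans are fetched as dictionaries with the span_id and parent_span_id columns shown earlier:

# Rebuild the span tree for one trace from flattened rows.
from collections import defaultdict

def build_trace_tree(rows):
    children = defaultdict(list)
    by_id = {}
    for row in rows:
        by_id[row["span_id"]] = row
        children[row["parent_span_id"]].append(row["span_id"])

    def render(span_id, depth=0):
        span = by_id[span_id]
        print("  " * depth + f'{span["service_name"]}: {span["span_name"]}')
        for child_id in children[span_id]:
            render(child_id, depth + 1)

    for root_id in children[None]:   # root spans have no parent_span_id
        render(root_id)

build_trace_tree([
    {"span_id": "1", "parent_span_id": None, "service_name": "API Gateway", "span_name": "GET /login"},
    {"span_id": "2", "parent_span_id": "1", "service_name": "Auth Service", "span_name": "Validate Token"},
])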

Query acceleration beyond column pruning

Column pruning and predicate pushdown help, but they are not sufficient for all trace query patterns. Lookups by trace_id bypass columnar advantages because trace_id is high cardinality. Backends need secondary indexing, bloom filters, or metadata caches to make trace-ID lookups fast without full scans. Parseable combines Parquet with metadata management and caching specifically to address these patterns.
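One illustrative way to think about this is a per-file membership index consulted before any Parquet data is read. The sketch below uses a plain Python set per file purely to show the idea; production backends use bloom filters or a metadata store rather than anything like this code:

# Illustrative per-file membership index for trace-ID lookups. Only files that
# can contain the requested trace are opened and scanned.
import pyarrow.parquet as pq

def build_trace_index(files):
    index = {}
    for path in files:
        ids = pq.read_table(path, columns=["trace_id"]).column("trace_id")
        index[path] = set(ids.to_pylist())
    return index

def files_for_trace(index, trace_id):
    return [path for path, ids in index.items() if trace_id in ids]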


How Parseable makes OpenTelemetry traces queryable

Parseable accepts OpenTelemetry trace data over OTLP HTTP and gRPC. Spans are ingested, attribute-split into the two-tier column model described above, and written to Parquet files on object storage. The query layer sits on top of those Parquet files and uses DataFusion for vectorized SQL execution.

Key capabilities for trace workloads:

  • OTLP ingestion — accepts traces, logs, and metrics from any OpenTelemetry-instrumented application or Collector
  • Attribute promotion — common span fields become queryable columns; dynamic attributes go to other_attributes
  • SQL queries — standard SQL over trace data, no custom query language
  • Metadata management — file-level statistics and metadata pruning reduce scan scope before reading Parquet data
  • Smart caching — frequently accessed spans and query results stay warm
  • Object storage backend — long-term Parquet files on S3, GCS, or compatible storage at object-storage pricing
  • Dashboards and alerts — trace query results feed directly into dashboard panels and alerting rules

For teams evaluating trace backends, the observability pricing model matters: Parseable's approach scales with ingest volume on object storage rather than per-host or per-seat pricing.


Example trace queries in Parseable

Before running these queries, inspect your stream schema:

SELECT * FROM "otel-traces" LIMIT 5

Use the actual field names from the response. The examples below use the attribute names emitted by standard OpenTelemetry SDKs after Parseable's attribute normalization.

Find slow spans by service

SELECT
  service_name,
  span_name,
  avg(duration_ms)                                             AS avg_duration_ms,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms)   AS p95_duration_ms,
  count(*)                                                     AS span_count
FROM "otel-traces"
WHERE p_timestamp >= now() - interval '1 hour'
GROUP BY service_name, span_name
ORDER BY p95_duration_ms DESC
LIMIT 20;

Find error spans

SELECT
  p_timestamp,
  service_name,
  span_name,
  trace_id,
  span_id,
  status_code,
  status_message
FROM "otel-traces"
WHERE status_code = 'ERROR'
  AND p_timestamp >= now() - interval '30 minutes'
ORDER BY p_timestamp DESC;

Retrieve all spans for a trace ID

SELECT
  span_id,
  parent_span_id,
  service_name,
  span_name,
  duration_ms,
  status_code,
  p_timestamp
FROM "otel-traces"
WHERE trace_id = 'abc123def456'
ORDER BY p_timestamp ASC;

Group latency by operation

SELECT
  service_name,
  span_name,
  count(*)                                                      AS total_spans,
  avg(duration_ms)                                              AS avg_ms,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms)    AS p99_ms
FROM "otel-traces"
WHERE p_timestamp >= now() - interval '24 hours'
GROUP BY service_name, span_name
ORDER BY p99_ms DESC;

Find services producing the most spans

SELECT
  service_name,
  count(*) AS span_count,
  count(DISTINCT trace_id) AS unique_traces
FROM "otel-traces"
WHERE p_timestamp >= now() - interval '1 hour'
GROUP BY service_name
ORDER BY span_count DESC;

Adapt field names based on your actual stream schema and how your OpenTelemetry attributes are mapped in Parseable. Run a SELECT * LIMIT 5 first and use the column names you see in the response.


Conclusion

Moving OpenTelemetry traces to Parquet is not just a storage decision — it shapes how trace data can be queried, retained, and analyzed at scale. Spans are structured records with repeated fields and wide attribute sets. Parquet stores them column by column, making it possible to filter by service, status, or duration without reading every field in every span.

Column pruning, predicate pushdown, compression on low-cardinality attributes, and vectorized CPU scans all contribute. But Parquet is the foundation, not the full solution. Production trace storage still requires ingestion batching, file compaction, metadata management, smart caching, and query planning that can handle high-cardinality trace-ID lookups alongside analytical aggregations.

Parseable builds on the Parquet model with a Rust-based query engine, two-tier attribute storage, metadata pruning, and object-storage backends — designed to make OpenTelemetry trace storage practical at the volume and retention windows that production systems require.

What's next?

If your tracing backend is becoming expensive or hard to query at scale, try Parseable Pro for 14 days and see how OpenTelemetry traces behave when stored in Apache Parquet on object storage. Parseable pricing scales with ingest volume, not host count.


FAQ

What is OpenTelemetry traces to Parquet?

It means storing OpenTelemetry span data in Apache Parquet files rather than in row-oriented databases or JSON blobs. Each span becomes a row in a columnar Parquet file, which enables efficient filtering, compression, and long-term retention for distributed trace data.

Why use Apache Parquet for trace storage?

Parquet's columnar layout lets trace queries read only the fields they need — service name, status, timestamp, duration — without touching the full span record. This reduces I/O, improves compression on repeated attribute values, and makes predicate pushdown possible for time-bounded queries.

Does OpenTelemetry write to Parquet directly?

No. OpenTelemetry handles collection and export — it sends spans over OTLP to a backend. The backend decides the storage format. Parseable is one backend that stores OTLP-ingested spans in Parquet.

What span attributes does Parseable promote to columns?

Common high-frequency attributes like service.name, span.name, status.code, telemetry.sdk.language, and timestamps become dedicated columns. Less common or dynamic attributes are stored in an other_attributes column as key-value pairs, keeping them queryable without requiring fixed schema.

Is Parquet good for high-cardinality trace data?

For analytical queries (filter by service, status, time, operation), yes. For direct trace-ID lookups, columnar advantages are smaller because trace_id is unique per trace. Production systems combine Parquet with metadata caches and secondary indexes to handle both query patterns efficiently.

