Most observability platforms treat storage as an implementation detail you never see. Your logs, metrics, and traces go in, and you query them through a proprietary interface. What happens in between is a black box. This design is deliberate — it creates lock-in. If you cannot access your own telemetry outside the vendor's query engine, you cannot leave without losing everything.
Apache Parquet breaks this pattern. It is an open, columnar file format that has become the standard for analytical workloads across the data industry. When your observability data lives in Parquet on object storage, you get compression ratios that slash storage costs, query performance that matches or beats proprietary engines, and the freedom to read your own data with any tool in the ecosystem.
This article makes the technical case for Parquet as the storage format for telemetry data — what it is, why it works so well for observability, and how it compares to the alternatives.
What Is Apache Parquet?
Apache Parquet is a columnar storage format designed for efficient analytical processing. It was originally developed at Twitter and Cloudera in 2013, inspired by Google's Dremel paper, and donated to the Apache Software Foundation. Today it is the de facto standard for analytical data across the industry — supported natively by Spark, DuckDB, Athena, Presto, BigQuery, Redshift, Snowflake, Databricks, Pandas, Polars, and dozens of other tools.
The key architectural distinction is columnar versus row-oriented storage.
In a row-oriented format (like JSON lines, CSV, or traditional RDBMS pages), each record is stored contiguously. If you have a log event with 40 fields and you want to aggregate one of them, the storage engine still reads all 40 fields for every record.
In a columnar format, values for each field are stored together. A query that touches only timestamp, severity, and service_name reads only those three columns and skips everything else. For observability data, where log events routinely have 30-80 fields but queries touch 3-5 of them, this is transformative.
Parquet's Internal Structure
A Parquet file is organized into three layers:
| Layer | What It Contains | Why It Matters |
|---|---|---|
| Row groups | Horizontal partitions of the data (typically 128 MB each) | Enables parallel reads and predicate pushdown across chunks |
| Column chunks | All values for a single column within a row group | Allows reading only the columns a query needs |
| Pages | Subdivisions of column chunks (typically 1 MB) | Unit of compression and encoding; enables fine-grained filtering |
Each file also contains a footer with schema metadata, min/max statistics per column chunk, and encoding information. This footer is critical for query performance — a query engine can read the footer, check the min/max stats, and skip entire row groups that cannot contain matching data. This is called predicate pushdown, and it means queries often read a fraction of the actual file.
Why Parquet Is Ideal for Observability Data
Observability telemetry — logs, metrics, and traces — has characteristics that align precisely with Parquet's strengths. Here is why.
1. Extreme Compression for Repetitive Telemetry
Observability data is highly repetitive. The same service_name appears in millions of log lines. The same severity values (INFO, WARN, ERROR) repeat endlessly. Kubernetes labels like namespace, pod_name, and container_name follow predictable patterns with a limited set of distinct values.
Columnar storage exploits this repetition. When all values for a single column are stored together, compression algorithms see long runs of identical or similar values and compress them aggressively. Parquet supports multiple encoding schemes — dictionary encoding, run-length encoding, delta encoding, and bit-packing — that are applied per column based on the data characteristics.
Here is what this looks like in practice for typical observability data:
| Data Type | Raw JSON Size | Parquet Size | Compression Ratio |
|---|---|---|---|
| Application logs (structured) | 100 GB | 8-12 GB | 8-12x |
| Kubernetes event logs | 100 GB | 6-10 GB | 10-16x |
| HTTP access logs | 100 GB | 10-15 GB | 7-10x |
| OpenTelemetry traces | 100 GB | 12-18 GB | 6-8x |
| Infrastructure metrics | 100 GB | 5-8 GB | 12-20x |
These are not theoretical numbers. They reflect real-world compression ratios observed across production deployments. The variance depends on cardinality (how many distinct values each column has), field count, and the proportion of structured versus freeform text fields. Structured logs with consistent schemas compress better; unstructured message bodies compress less.
At $0.023/GB/month on S3 standard tier, the difference between storing 100 GB of raw JSON versus 10 GB of compressed Parquet is the difference between $2.30 and $0.23 per month for each day's worth of retained data. Multiply that across terabytes of daily ingestion and months of retention, and the savings are material.
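The arithmetic scales linearly, so it is easy to sketch. This toy calculation uses the S3 standard list price quoted above and assumes a 10x compression ratio and a hypothetical 1 TB/day ingest over 90 days of retention:

```python
# Monthly S3 standard storage cost, raw JSON vs ~10x-compressed Parquet.
# Assumptions: $0.023/GB-month list price, 1 TB/day ingest, 90-day retention.
PRICE_PER_GB_MONTH = 0.023
daily_gb, retention_days = 1_000, 90

raw_json = daily_gb * retention_days * PRICE_PER_GB_MONTH
parquet = raw_json / 10  # assumed 10x compression ratio

print(f"raw JSON: ${raw_json:,.0f}/mo   Parquet: ${parquet:,.0f}/mo")
```

At this scale the raw-JSON bill is roughly $2,070/month against about $207/month for Parquet, which is where the figures in the comparison table later in this article come from.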
2. Column Pruning Eliminates Wasted I/O
This is Parquet's single biggest advantage for observability queries. Consider a typical incident investigation query:
```sql
SELECT timestamp, severity, message, trace_id
FROM application_logs
WHERE service_name = 'payment-service'
  AND severity = 'ERROR'
  AND timestamp > NOW() - INTERVAL '1 hour'
```

This query touches 5 columns out of a log event that might have 50. In a row-oriented format, the storage engine reads all 50 columns for every row, then discards 45 of them. That is 90% wasted I/O.
In Parquet, the query engine reads only the 5 columns it needs. If each column is roughly equal in size, that is a 10x reduction in data scanned. In practice, the improvement is often larger because the columns you query on (timestamps, severity levels, service names) tend to be small and highly compressible, while the columns you skip (full message bodies, stack traces, request payloads) tend to be large.
For observability workloads, where queries are overwhelmingly selective — filtering by time range, service, severity, or trace ID — column pruning converts what would be a full-table scan in row-oriented storage into a surgical read of a small subset of the data.
3. Predicate Pushdown Skips Irrelevant Data
Parquet's min/max statistics per row group enable another optimization: the query engine can skip entire row groups without reading them.
If a row group's max timestamp is 2026-02-26T23:59:59Z and your query filters for timestamp > 2026-02-27T00:00:00Z, the engine skips that row group entirely. No I/O, no decompression, no processing. For time-range queries — which account for the vast majority of observability queries — this means the engine jumps directly to the relevant time window.
Combined with time-based partitioning (where Parquet files are organized by hour or day), predicate pushdown can reduce the data scanned for a typical "last 1 hour" query from terabytes to megabytes.
4. Schema Evolution Without Migration
Production telemetry schemas change constantly. A new microservice adds fields. A Kubernetes upgrade introduces new labels. An OpenTelemetry instrumentation library adds span attributes. In a traditional database, schema changes require ALTER TABLE statements, migrations, and potentially downtime.
Parquet handles schema evolution natively. Each file carries its own schema in the footer. New columns can be added without rewriting existing files — older files simply return NULL for the new column. Columns can be removed from future writes without affecting historical data. This is exactly the flexibility that observability pipelines need, where the schema is defined by the applications generating telemetry, not by a central DBA.
5. Universal Compatibility Eliminates Lock-in
This is the strategic argument for Parquet. When your telemetry is stored in Parquet on object storage, you are not locked into any single query engine. You can query the same data with:
- ParseableDB (built on Apache Arrow DataFusion) for real-time observability queries
- DuckDB for ad hoc analysis on a laptop
- Apache Spark for large-scale batch processing
- Amazon Athena or Presto/Trino for serverless queries
- Pandas or Polars for data science workflows
- Snowflake or BigQuery for integration with your data warehouse
This is not a hypothetical benefit. Teams routinely need to analyze observability data in ways their primary query engine was not designed for — correlating deployments with error rates in a Jupyter notebook, feeding log patterns into an ML pipeline, or joining telemetry with business data in a warehouse. When your data is in Parquet, these workflows require zero data movement. The files are already there, in a format every tool understands.
Compare this to a proprietary storage engine where the only way to get data out is an export API with rate limits, format conversion overhead, and egress charges.
Parquet vs. the Alternatives for Observability
To appreciate why Parquet stands out, it helps to see how it compares against the other storage approaches used in observability platforms.
Parquet vs. JSON Lines (Raw Logs)
JSON lines — one JSON object per line, stored in plain text files — is the simplest possible storage format for logs. Many teams start here, writing logs to files on disk or directly to S3.
| Dimension | JSON Lines | Apache Parquet |
|---|---|---|
| Compression | 2-3x with gzip | 8-15x with columnar encoding + Snappy/Zstd |
| Query speed | Full scan required for every query | Column pruning + predicate pushdown |
| Schema | Schemaless (flexible but no type safety) | Schema-on-write with evolution support |
| Tooling | grep, jq, custom scripts | SQL engines, data warehouses, ML frameworks |
| Cost at 1 TB/day, 90 days | ~$2,070/month (S3 storage alone) | ~$200-350/month (S3 storage) |
JSON lines works for small volumes and simple grep-style searches. It falls apart at scale because every query reads every byte.
Parquet vs. Proprietary Database Formats (ClickHouse, Elasticsearch)
Platforms like ClickHouse and Elasticsearch use their own on-disk formats optimized for their specific query engines. These formats can be extremely fast — ClickHouse's MergeTree engine is one of the fastest analytical stores in existence.
| Dimension | Proprietary DB Format | Apache Parquet on Object Storage |
|---|---|---|
| Query latency (hot data) | Sub-second (local SSD) | Low seconds (object storage + caching) |
| Compression | Comparable to Parquet | Comparable to proprietary |
| Storage cost | SSD/EBS: $0.08-0.25/GB/month | Object storage: $0.02-0.03/GB/month |
| Data portability | Locked to the specific database | Any Parquet-compatible tool |
| Operational overhead | Cluster management, sharding, replication | Object storage (managed by cloud provider) |
| Scaling model | Scale compute + storage together | Scale independently |
The trade-off is real. A ClickHouse cluster with data on local NVMe drives will return sub-second results for queries that take 2-5 seconds over Parquet on S3. But that ClickHouse cluster costs 5-10x more to operate, requires dedicated operational expertise, and locks your data into a format only ClickHouse can read.
For observability, where the overwhelming majority of queries target the last few hours of data (which can be cached aggressively), the latency difference on hot data is negligible. For the long tail of historical data — which is the entire point of building an observability data lake — the cost advantage of Parquet on object storage is decisive.
Parquet vs. Apache ORC
ORC (Optimized Row Columnar) is another columnar format from the Hadoop ecosystem. It was developed at Hortonworks as an alternative to Parquet and has similar compression and query characteristics.
In practice, Parquet has won the ecosystem war. It has broader support across cloud data warehouses, processing engines, and analytical tools. Unless you are deeply invested in Hive (which ORC was designed for), Parquet is the safer choice for maximizing compatibility.
How Parseable Uses Parquet for Observability
Parseable is an observability data lake platform built on Apache Parquet from the ground up. It is not a traditional database that exports to Parquet — Parquet is the native storage format, and every design decision flows from that choice.
Ingestion to Parquet
When telemetry arrives at Parseable (via OTLP, HTTP, or other protocols), the ingestion layer buffers events, applies schema inference, and writes compressed Parquet files to object storage. This happens continuously, with configurable flush intervals that balance write latency against file size optimization.
The schema is inferred automatically from the incoming data. If an OpenTelemetry Collector sends spans with new attributes, Parseable adds the columns to subsequent Parquet files without manual intervention. Historical files retain their original schema, and queries across both return the new columns as NULL for older data.
Query Engine: ParseableDB on Arrow DataFusion
Parseable's query engine, ParseableDB, is built on Apache Arrow DataFusion — a high-performance query engine written in Rust that operates natively on Arrow columnar memory format. Because Parquet and Arrow share the same columnar memory model (Arrow is Parquet's in-memory cousin), there is zero serialization overhead when reading Parquet into the query engine. Data moves from storage to query processing without format conversion.
ParseableDB exploits all of Parquet's performance features:
- Column pruning: Only reads columns referenced in the query
- Predicate pushdown: Uses row group statistics to skip irrelevant data
- Partition pruning: Leverages time-based partitioning to narrow the scan range
- Vectorized execution: Processes data in Arrow columnar batches, which enables SIMD optimizations on modern CPUs
The result is fast, interactive queries over data that lives on object storage — not local SSDs. For the most common observability patterns (time-range filtered, single-service, severity-filtered), query latency is typically in the low seconds, even over months of historical data.
Data You Own, Formats You Control
On the Enterprise plan with Bring Your Own Bucket (BYOB), the Parquet files Parseable writes live in an S3 bucket in your AWS, GCS, or Azure account. You control the IAM policies, encryption keys, lifecycle rules, and access patterns. You can point external tools — DuckDB, Spark, Athena — directly at the same files for analysis that goes beyond what the observability UI provides.
This is the endgame for data ownership in observability. Your telemetry is not trapped in a vendor's infrastructure. It is standard Parquet files on storage you own, queryable by any tool, portable to any platform, subject to your retention policies and your compliance framework.
On the Pro plan ($0.39/GB ingested), Parseable Cloud handles all of this on managed multi-tenant infrastructure. You get the same Parquet-backed storage and query performance without managing buckets or infrastructure, with 365 days of retention, 99.9% uptime SLA, AI-native analysis, anomaly detection, dashboards, alerts, and unlimited users included.
Practical Patterns: Parquet for Observability Workflows
Here are concrete examples of how Parquet's format properties translate into real observability workflows.
Cross-Tool Investigation
An SRE detects an anomaly in Prism (Parseable's web UI) and wants to perform deeper statistical analysis. With BYOB, they open a Jupyter notebook and query the same Parquet files directly:
```python
import duckdb

conn = duckdb.connect()
df = conn.execute("""
    SELECT
      date_trunc('minute', timestamp) as minute,
      service_name,
      count(*) FILTER (WHERE severity = 'ERROR') as errors,
      count(*) as total,
      round(100.0 * count(*) FILTER (WHERE severity = 'ERROR') / count(*), 2) as error_rate
    FROM read_parquet('s3://my-observability-bucket/logs/2026/02/27/**/*.parquet')
    WHERE timestamp > '2026-02-27T10:00:00Z'
    GROUP BY 1, 2
    HAVING error_rate > 5.0
    ORDER BY error_rate DESC
""").fetchdf()
```

No export step. No ETL pipeline. No API rate limits. The analyst reads the same Parquet files that Parseable's query engine reads.
Long-Term Trend Analysis
Because Parquet on object storage is cheap to retain, teams can keep 12+ months of telemetry and perform capacity planning queries that would be prohibitively expensive on a SaaS platform:
```sql
SELECT
  date_trunc('week', timestamp) as week,
  service_name,
  sum(request_bytes) as total_ingress,
  avg(response_time_ms) as avg_latency,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY response_time_ms) as p99_latency
FROM http_access_logs
WHERE timestamp > NOW() - INTERVAL '6 months'
GROUP BY 1, 2
ORDER BY week, service_name
```

On a traditional SaaS platform, querying six months of data either costs extra (per-query pricing) or is simply not available (data aged out). With Parquet on object storage, the data is there and the query cost is the compute time to scan it.
Compliance and Audit
Regulated industries require long-term retention of access logs and security events. Parquet files on object storage with versioning enabled provide an immutable audit trail. Because the format is open, auditors can verify the data independently — they do not need access to your observability platform. Hand them S3 read credentials and a copy of DuckDB, and they can run their own queries.
Getting Started
If you want to see Parquet-backed observability in action, the fastest path is a free trial on Parseable Cloud:
- Sign up at app.parseable.com — 14-day free trial
- Point your OTel Collector at the OTLP endpoint Parseable provides
- Query your data through Prism or the SQL API
- Observe the compression — check how much storage your telemetry actually consumes in Parquet versus what it would cost as raw JSON on a traditional platform
For teams evaluating the data lake versus data warehouse approach, or those coming from platforms where the true cost of observability has become unsustainable, Parquet on object storage is the architectural foundation that makes cost-effective, long-term observability practical.
The format is open. The tools are mature. The economics are proven. The only question is whether you want your telemetry data in a format you control, or one your vendor controls.
Start your free trial on Parseable Cloud and see the difference that open-format observability makes.