Most observability setups start the same way: pick a SaaS vendor, install their agent, point everything at their cloud. It works until it doesn't — until retention windows shrink, costs compound, and the data you need during an incident was purged three days ago. The observability lakehouse is the architecture that fixes this by combining the cheap, durable storage of a data lake with the structured query capabilities of a data warehouse, purpose-built for telemetry.
This guide walks through building a complete observability lakehouse end to end. You will instrument applications with OpenTelemetry, route telemetry through the OTel Collector, ingest it into Parseable, store it as Apache Parquet on object storage, and query everything with SQL. Every configuration snippet in this article is real and production-tested.
What Is an Observability Lakehouse?
Before diving into the build, it helps to understand what makes a lakehouse different from a plain data lake or a traditional data warehouse.
A data lake stores raw data cheaply on object storage but offers limited query performance and no built-in schema management. A data warehouse provides fast, structured queries but at significantly higher storage cost and with rigid schemas that don't adapt well to telemetry's semi-structured nature. A lakehouse combines both: data lives on object storage in open columnar formats (Apache Parquet), while a purpose-built query engine provides interactive, structured access with full SQL support.
For observability, the lakehouse model is ideal because telemetry data is:
- High volume — a mid-sized Kubernetes cluster can produce 5-10 TB of logs per day
- Semi-structured — log formats vary across services, trace spans carry arbitrary attributes
- Query-intensive during incidents — you need sub-second full-text search across billions of rows
- Long-lived — compliance and root-cause analysis often require months of retention
The observability lakehouse gives you all of this without the cost explosion of traditional SaaS platforms. If you want the foundational context, read What Is an Observability Data Lake? first.
Architecture Overview
Here is the full pipeline you will build in this guide:
```
┌────────────────────────────────────────────────┐
│               Your Applications                │
│      (Python, Go, Java, Node.js services)      │
│      Instrumented with OpenTelemetry SDKs      │
└──────────────────┬─────────────────────────────┘
                   │ OTLP/gRPC or OTLP/HTTP
                   ▼
┌────────────────────────────────────────────────┐
│            OpenTelemetry Collector             │
│   Receives, processes, batches, and exports    │
│           logs, traces, and metrics            │
└──────────────────┬─────────────────────────────┘
                   │ OTLP/HTTP
                   ▼
┌────────────────────────────────────────────────┐
│                   Parseable                    │
│ Ingests via /v1/logs, /v1/traces, /v1/metrics  │
│           Converts to Apache Parquet           │
│     Stores on S3 / MinIO / GCS / Azure Blob    │
└──────────────────┬─────────────────────────────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
   ┌─────────┐ ┌────────┐ ┌────────┐
   │  Prism  │ │  SQL   │ │ Alerts │
   │ Web UI  │ │  API   │ │ & Dash │
   └─────────┘ └────────┘ └────────┘
```
Each layer has a single responsibility:
- OpenTelemetry SDKs instrument your applications and emit telemetry in the OTLP format
- OpenTelemetry Collector receives, batches, enriches, and exports telemetry to Parseable
- Parseable ingests OTLP data, converts it to Apache Parquet, stores it on object storage, and serves queries via ParseableDB (built on Apache Arrow DataFusion)
- Prism (Parseable's web UI) provides dashboards, alerts, log exploration, and trace visualization
The result is an OpenTelemetry data lake with structured query capabilities — a true observability lakehouse. No Kafka. No Elasticsearch. No ClickHouse clusters to manage. One binary plus object storage.
Step 1: Instrument Your Applications with OpenTelemetry
OpenTelemetry provides SDKs for every major language. For this guide, we will instrument a Python application, but the same pattern applies to Go, Java, Node.js, Rust, and .NET.
Install the OTel SDK
```bash
pip install opentelemetry-api \
    opentelemetry-sdk \
    opentelemetry-exporter-otlp \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests
```
Auto-Instrumentation (Recommended for Getting Started)
The fastest path is zero-code auto-instrumentation. This wraps your application and automatically generates traces and metrics for supported libraries:
```bash
opentelemetry-instrument \
    --service_name my-api-service \
    --exporter_otlp_endpoint http://localhost:4317 \
    --exporter_otlp_protocol grpc \
    python app.py
```
This single command instruments Flask routes, outbound HTTP calls via `requests`, database queries, and more — without changing a line of application code.
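If you prefer not to change the start command (for example, in containers), the same settings can be supplied through the standard OpenTelemetry SDK environment variables:

```bash
# Standard OTel SDK environment variables (equivalent to the CLI flags above)
export OTEL_SERVICE_NAME=my-api-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# then start the app exactly as before:
# opentelemetry-instrument python app.py
```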
Manual Instrumentation (For Custom Spans and Logs)
When you need custom spans or structured log correlation, use the SDK directly:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure the tracer
resource = Resource.create({"service.name": "order-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

# Use in your application code
def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.type", "standard")
        # Your business logic here
        validate_inventory(order_id)
        charge_payment(order_id)
        span.set_attribute("order.status", "completed")
```
Structured Logging with Trace Correlation
Connect your logs to traces so you can pivot from a log line to the full request trace:
```python
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# Configure log export
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://localhost:4317"))
)
set_logger_provider(logger_provider)
```
With this configuration, every log record automatically carries the `trace_id` and `span_id` of the active span, enabling direct correlation in Parseable.
Step 2: Deploy the OpenTelemetry Collector
The OTel Collector sits between your applications and Parseable. It receives telemetry over OTLP (gRPC or HTTP), processes it (batching, filtering, enrichment), and exports it to Parseable's OTLP/HTTP endpoints.
Collector Configuration
Create a file named otel-collector-config.yaml:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-east-1-prod
        action: upsert
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  otlphttp/parseable:
    endpoint: "https://your-parseable-instance:8000/v1"
    headers:
      Authorization: "Basic <base64-encoded-credentials>"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/parseable]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/parseable]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/parseable]
```
A few things to note in this configuration:
- `otlphttp/parseable` exporter: Parseable supports OTLP/HTTP natively at `/v1/logs`, `/v1/traces`, and `/v1/metrics`. The Collector appends the signal-specific path automatically.
- `batch` processor: Batches telemetry into groups of 1,024 records (up to 2,048) or every 5 seconds, whichever comes first. This reduces the number of HTTP requests to Parseable and improves throughput.
- `resource` processor: Adds `environment` and `cluster` attributes to every telemetry record. These become queryable columns in Parseable.
- `memory_limiter` processor: Prevents the Collector from consuming unbounded memory during traffic spikes.
- Retry on failure: The exporter retries with exponential backoff if Parseable is temporarily unreachable.
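The `<base64-encoded-credentials>` placeholder in the Authorization header is simply the base64 encoding of `username:password`. For example, for the default `admin:admin` credentials used elsewhere in this guide:

```python
import base64

# Basic auth value for the Authorization header: base64("username:password")
token = base64.b64encode(b"admin:admin").decode()
print(f"Authorization: Basic {token}")  # Authorization: Basic YWRtaW46YWRtaW4=
```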
Run the Collector with Docker
```bash
docker run -d \
  --name otel-collector \
  -p 4317:4317 \
  -p 4318:4318 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml \
  otel/opentelemetry-collector-contrib:latest \
  --config /etc/otelcol/config.yaml
```
Run the Collector on Kubernetes
For production Kubernetes deployments, use the OpenTelemetry Operator or a DaemonSet. Here is a minimal DaemonSet manifest:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317
              protocol: TCP
            - containerPort: 4318
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol/config.yaml
              subPath: config.yaml
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    # Paste the Collector config from above here
```
Expose the Collector within the cluster so your instrumented applications can reach it:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
```
Your application pods export telemetry to `otel-collector.observability.svc.cluster.local:4317`.
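Inside the cluster, the usual way to point an instrumented pod at that Service is through the standard OTel environment variables in the container spec. An illustrative fragment (the Deployment it belongs to is hypothetical):

```yaml
# Illustrative env section of an application container spec
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.observability.svc.cluster.local:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "order-service"
```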
Step 3: Deploy Parseable
Parseable is the backend of your observability lakehouse. It ingests OTLP data, converts it to Apache Parquet on the fly, and stores it on S3-compatible object storage. You can run it self-hosted, on Parseable Cloud, or via BYOC.
Option A: Docker (Local / Dev)
The fastest way to get started is with Docker and local storage:
```bash
docker run -d \
  --name parseable \
  -p 8000:8000 \
  -e P_USERNAME=admin \
  -e P_PASSWORD=admin \
  -e P_STAGING_DIR=/tmp/parseable/staging \
  -e P_STORAGE_URL=file:///tmp/parseable/data \
  parseable/parseable:latest \
  parseable local-store
```
For production with S3-compatible storage (MinIO, AWS S3, GCS, Azure Blob):
```bash
docker run -d \
  --name parseable \
  -p 8000:8000 \
  -e P_USERNAME=admin \
  -e P_PASSWORD=admin \
  -e P_S3_URL=https://s3.amazonaws.com \
  -e P_S3_ACCESS_KEY=your-access-key \
  -e P_S3_SECRET_KEY=your-secret-key \
  -e P_S3_BUCKET=parseable-observability \
  -e P_S3_REGION=us-east-1 \
  -e P_STAGING_DIR=/tmp/parseable/staging \
  parseable/parseable:latest \
  parseable s3-store
```
Option B: Kubernetes (Production)
For production Kubernetes deployments, use the Helm chart:
```bash
helm repo add parseable https://charts.parseable.com
helm repo update

helm install parseable parseable/parseable \
  --namespace observability \
  --create-namespace \
  --set parseable.store=s3-store \
  --set parseable.s3.url=https://s3.amazonaws.com \
  --set parseable.s3.accessKey=your-access-key \
  --set parseable.s3.secretKey=your-secret-key \
  --set parseable.s3.bucket=parseable-observability \
  --set parseable.s3.region=us-east-1
```
Option C: Parseable Cloud (Managed)
If you don't want to manage infrastructure, Parseable Cloud handles everything. The Pro plan is $0.39/GB ingested, and includes 365 days of retention, 99.9% uptime SLA, AI-native analysis, anomaly detection, unlimited users, dashboards, alerts, and API access. Start with a 14-day free trial.
With Parseable Cloud, your OTel Collector exporter configuration points to your cloud instance endpoint:
```yaml
exporters:
  otlphttp/parseable:
    endpoint: "https://your-instance.parseable.com/v1"
    headers:
      Authorization: "Basic <base64-encoded-credentials>"
```
Step 4: Verify Data Flow and Query with SQL
Once the pipeline is running, verify that telemetry is flowing end to end.
Check Ingestion via the API
```bash
# List all log streams created by OTLP ingestion
curl -s -u admin:admin https://your-parseable-instance:8000/api/v1/logstream \
  | jq '.[] | .name'
```
Parseable automatically creates streams for OTLP data. You should see streams like `otelmetrics`, `otellogs`, and `oteltraces`.
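The same verification works from a script. A stdlib-only sketch that builds a request for Parseable's SQL query endpoint (`/api/v1/query` with `query`, `startTime`, and `endTime` in the body; `PARSEABLE_URL` and the `admin:admin` credentials are assumptions to adjust for your deployment):

```python
import base64
import json
import urllib.request

PARSEABLE_URL = "https://your-parseable-instance:8000"  # adjust to your deployment


def build_query_request(sql: str, start: str, end: str,
                        user: str = "admin", password: str = "admin") -> urllib.request.Request:
    """Build a POST request for Parseable's SQL query endpoint."""
    payload = json.dumps({"query": sql, "startTime": start, "endTime": end}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{PARSEABLE_URL}/api/v1/query",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )


req = build_query_request(
    "SELECT COUNT(*) AS events FROM otellogs",
    start="2026-02-27T10:00:00Z",
    end="2026-02-27T11:00:00Z",
)
# urllib.request.urlopen(req) returns the result rows as JSON on a live instance
```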
Query Logs with SQL
Open Prism (Parseable's web UI) at https://your-parseable-instance:8000 and navigate to the SQL editor. Here are queries you can run immediately:
Find all error logs in the last hour, grouped by service:
```sql
SELECT
  "service.name" AS service,
  severity_text,
  COUNT(*) AS error_count
FROM otellogs
WHERE severity_text IN ('ERROR', 'FATAL')
  AND p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY "service.name", severity_text
ORDER BY error_count DESC;
```
Identify the slowest endpoints by p99 latency:
```sql
SELECT
  "service.name" AS service,
  name AS span_name,
  ROUND(APPROX_PERCENTILE_CONT(duration_ms, 0.99), 2) AS p99_latency_ms,
  COUNT(*) AS request_count
FROM oteltraces
WHERE kind = 'SPAN_KIND_SERVER'
  AND p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY "service.name", name
HAVING COUNT(*) > 10
ORDER BY p99_latency_ms DESC
LIMIT 20;
```
Trace a single request across services:
```sql
SELECT
  "service.name" AS service,
  name AS span_name,
  span_id,
  parent_span_id,
  duration_ms,
  status_code
FROM oteltraces
WHERE trace_id = '4bf92f3577b34da6a3ce929d0e0e4736'
ORDER BY start_time_unix_nano ASC;
```
Check metric ingestion rates by service:
```sql
SELECT
  "service.name" AS service,
  metric_name,
  COUNT(*) AS datapoints
FROM otelmetrics
WHERE p_timestamp > NOW() - INTERVAL '15 minutes'
GROUP BY "service.name", metric_name
ORDER BY datapoints DESC
LIMIT 20;
```
These are standard SQL queries powered by ParseableDB, which is built on Apache Arrow DataFusion. No proprietary query language to learn. If you know SQL, you can query your observability data.
Step 5: Set Up Dashboards and Alerts in Prism
With data flowing and queryable, the next step is to build operational dashboards and configure alerts.
Create a Service Health Dashboard
In Prism, navigate to the Dashboards section and create a new dashboard. Add panels with the following queries:
Error rate over time (timeseries panel):
```sql
SELECT
  DATE_TRUNC('minute', p_timestamp) AS time_bucket,
  "service.name" AS service,
  COUNT(*) AS errors
FROM otellogs
WHERE severity_text IN ('ERROR', 'FATAL')
  AND p_timestamp > NOW() - INTERVAL '6 hours'
GROUP BY time_bucket, "service.name"
ORDER BY time_bucket ASC;
```
Request latency heatmap (timeseries panel):
```sql
SELECT
  DATE_TRUNC('minute', p_timestamp) AS time_bucket,
  ROUND(AVG(duration_ms), 2) AS avg_latency_ms,
  ROUND(APPROX_PERCENTILE_CONT(duration_ms, 0.95), 2) AS p95_latency_ms,
  ROUND(MAX(duration_ms), 2) AS max_latency_ms
FROM oteltraces
WHERE kind = 'SPAN_KIND_SERVER'
  AND p_timestamp > NOW() - INTERVAL '6 hours'
GROUP BY time_bucket
ORDER BY time_bucket ASC;
```
Configure Alerts
Set up alerts on critical conditions. In Prism, navigate to Alerts and create rules like:
- High error rate: Trigger when error count exceeds 50 per minute for any service
- Latency spike: Trigger when p95 latency exceeds 2,000ms for server spans
- Ingestion drop: Trigger when log volume drops below 50% of the rolling average (indicates a broken pipeline)
Alerts can be routed to Slack, PagerDuty, email, or any webhook endpoint.
Advanced: Multi-Signal Correlation
The real power of an observability lakehouse is correlating across signal types. When an alert fires for high error rate, you want to go from the alert to the relevant logs, from the logs to the traces that generated them, and from the traces to the metrics that show the broader impact.
From Log to Trace
Because OpenTelemetry embeds trace_id in log records, you can jump from a suspicious log to the full distributed trace:
```sql
-- Step 1: Find error logs with trace context
SELECT
  body,
  trace_id,
  span_id,
  "service.name" AS service,
  p_timestamp
FROM otellogs
WHERE severity_text = 'ERROR'
  AND "service.name" = 'order-service'
  AND p_timestamp > NOW() - INTERVAL '30 minutes'
ORDER BY p_timestamp DESC
LIMIT 10;

-- Step 2: Use the trace_id from above to reconstruct the full trace
SELECT
  "service.name" AS service,
  name AS operation,
  duration_ms,
  status_code,
  start_time_unix_nano
FROM oteltraces
WHERE trace_id = '<trace_id_from_step_1>'
ORDER BY start_time_unix_nano ASC;
```
Correlate Traces with Metrics
Identify whether a latency spike correlates with resource saturation:
```sql
-- Find the time window of the latency spike
SELECT
  DATE_TRUNC('minute', p_timestamp) AS minute,
  AVG(duration_ms) AS avg_latency
FROM oteltraces
WHERE "service.name" = 'order-service'
  AND kind = 'SPAN_KIND_SERVER'
  AND p_timestamp > NOW() - INTERVAL '2 hours'
GROUP BY minute
ORDER BY avg_latency DESC
LIMIT 5;

-- Check CPU/memory metrics during that window
SELECT
  metric_name,
  DATE_TRUNC('minute', p_timestamp) AS minute,
  AVG(gauge_value) AS avg_value
FROM otelmetrics
WHERE "service.name" = 'order-service'
  AND metric_name IN ('process.cpu.utilization', 'process.memory.usage')
  AND p_timestamp BETWEEN '2026-02-27T10:00:00Z' AND '2026-02-27T10:15:00Z'
GROUP BY metric_name, minute
ORDER BY minute ASC;
```
This kind of cross-signal analysis is what separates an observability lakehouse from a collection of disconnected tools. All three signal types live in the same backend, queryable with the same SQL interface, correlated by shared attributes like `service.name`, `trace_id`, and timestamp.
Retention and Cost Management
One of the primary advantages of the observability lakehouse architecture is cost-effective long-term retention. Because data is stored in Apache Parquet on object storage, keeping months or years of telemetry is economically viable.
How Parseable Handles Retention
On Parseable Cloud Pro, 365 days of retention is included in the $0.39/GB ingested price. Query scanning is included up to 10x of your monthly ingestion volume, with additional scans at $0.02/GB. There are no separate storage charges or compute surcharges.
On the Enterprise plan, you get everything in Pro plus Bring Your Own Bucket (BYOB), where your data stays in your own S3 bucket. This means unlimited retention — you control the lifecycle policies on your own storage. Enterprise also includes Apache Iceberg support for advanced lakehouse features, premium support, and flexible deployment options (Parseable Cloud, BYOC, or self-hosted). Enterprise starts at $15,000/year.
Why Parquet Matters for Cost
Apache Parquet typically compresses telemetry 70-90% relative to raw JSON. A service generating 100 GB/day of JSON logs may only consume 10-20 GB/day of Parquet storage on S3. At S3 pricing of ~$0.023/GB/month, retaining a full year of that data (roughly 3.7-7.3 TB) costs about $85-170/month in raw storage — a fraction of what traditional SaaS platforms charge for 30 days of retention. For a deeper dive, read Why Your Observability Data Should Live in Apache Parquet.
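A back-of-the-envelope sketch of that arithmetic, so you can plug in your own volumes (the S3 price used is the Standard-tier us-east-1 rate and an assumption here):

```python
# Back-of-the-envelope storage cost for one year of retained telemetry.
S3_PRICE_PER_GB_MONTH = 0.023  # assumption: S3 Standard, us-east-1


def yearly_retention_cost(parquet_gb_per_day: float) -> float:
    """Monthly S3 bill once a full 365-day window is retained."""
    stored_gb = parquet_gb_per_day * 365
    return stored_gb * S3_PRICE_PER_GB_MONTH


for gb_day in (10, 20):
    print(f"{gb_day} GB/day of Parquet -> ${yearly_retention_cost(gb_day):.0f}/month")
# 10 GB/day of Parquet -> $84/month
# 20 GB/day of Parquet -> $168/month
```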
When to Choose Each Deployment Model
The right deployment model depends on your requirements:
| Requirement | Recommended Option |
|---|---|
| Get started in 5 minutes, no infra to manage | Parseable Cloud Pro |
| Data sovereignty, compliance, bucket-level control | Enterprise with BYOB |
| Air-gapped or on-premises environments | Self-hosted (single binary) |
| Run in your cloud account with managed operations | BYOC (Enterprise) |
For most teams starting with an OpenTelemetry data lake, Parseable Cloud Pro is the fastest path. You can migrate to Enterprise or self-hosted later without re-instrumenting anything — the OTel Collector configuration stays the same, only the exporter endpoint changes.
How This Compares to Traditional Stacks
The observability lakehouse approach is fundamentally different from stacking together Elasticsearch for logs, Jaeger for traces, and Prometheus for metrics. Here is how the architectures compare:
| Dimension | Traditional Stack (ELK + Jaeger + Prometheus) | Observability Lakehouse (OTel + Parseable) |
|---|---|---|
| Components to manage | 6-10 (Elasticsearch, Logstash, Kibana, Jaeger, Cassandra, Prometheus, Grafana) | 2 (OTel Collector + Parseable) |
| Storage format | Proprietary indices, proprietary chunks | Open Apache Parquet |
| Query language | KQL + PromQL + Jaeger API | SQL for everything |
| Cross-signal correlation | Manual, across tools | Native, same query interface |
| Retention cost at 1 TB/day | $3,000-10,000+/month in compute and storage | Object storage cost ($0.02-0.03/GB/month) |
| Operational overhead | High (cluster tuning, shard management, compaction) | Minimal (single binary + object storage) |
For a detailed comparison of data lake versus data warehouse architectures for observability, see Data Lake vs. Data Warehouse for Observability.
Summary
Here is what you built in this guide:
- Instrumented applications with OpenTelemetry SDKs to emit logs, traces, and metrics over OTLP
- Deployed the OTel Collector to receive, batch, enrich, and export telemetry to Parseable
- Deployed Parseable as the observability lakehouse backend, storing all data as Parquet on object storage
- Queried all signal types with SQL — error analysis, latency investigation, trace reconstruction, and metric correlation
- Built dashboards and alerts in Prism for ongoing operational visibility
- Correlated across signals — from log to trace to metric — using shared OpenTelemetry attributes
This is the observability lakehouse pattern: OpenTelemetry for standardized instrumentation, Parseable for storage and query, object storage for durable and cheap persistence, and SQL as the universal query interface. No vendor lock-in. No proprietary formats. Full data ownership.
Ready to build your own observability lakehouse? Start a free 14-day trial on Parseable Cloud — $0.39/GB ingested, 365 days retention, AI-native analysis, and unlimited users included. For Enterprise needs including BYOB, Apache Iceberg support, and flexible deployment, contact the Parseable team.


