Building an Observability Lakehouse with OpenTelemetry

Debabrata Panigrahi
February 27, 2026
Step-by-step guide to building an OpenTelemetry data lake with Parseable. Instrument, collect, ingest, store, and query all telemetry with SQL.

Most observability setups start the same way: pick a SaaS vendor, install their agent, point everything at their cloud. It works until it doesn't — until retention windows shrink, costs compound, and the data you need during an incident was purged three days ago. The observability lakehouse is the architecture that fixes this by combining the cheap, durable storage of a data lake with the structured query capabilities of a data warehouse, purpose-built for telemetry.

This guide walks through building a complete observability lakehouse end to end. You will instrument applications with OpenTelemetry, route telemetry through the OTel Collector, ingest it into Parseable, store it as Apache Parquet on object storage, and query everything with SQL. Every configuration snippet in this article is real and production-tested.

What Is an Observability Lakehouse?

Before diving into the build, it helps to understand what makes a lakehouse different from a plain data lake or a traditional data warehouse.

A data lake stores raw data cheaply on object storage but offers limited query performance and no built-in schema management. A data warehouse provides fast, structured queries but at significantly higher storage cost and with rigid schemas that don't adapt well to telemetry's semi-structured nature. A lakehouse combines both: data lives on object storage in open columnar formats (Apache Parquet), while a purpose-built query engine provides interactive, structured access with full SQL support.

For observability, the lakehouse model is ideal because telemetry data is:

  • High volume — a mid-sized Kubernetes cluster can produce 5-10 TB of logs per day
  • Semi-structured — log formats vary across services, trace spans carry arbitrary attributes
  • Query-intensive during incidents — you need sub-second full-text search across billions of rows
  • Long-lived — compliance and root-cause analysis often require months of retention

The observability lakehouse gives you all of this without the cost explosion of traditional SaaS platforms. If you want the foundational context, read What Is an Observability Data Lake? first.

Architecture Overview

Here is the full pipeline you will build in this guide:

┌────────────────────────────────────────────────┐
│         Your Applications                      │
│   (Python, Go, Java, Node.js services)         │
│   Instrumented with OpenTelemetry SDKs         │
└──────────────────┬─────────────────────────────┘
                   │ OTLP/gRPC or OTLP/HTTP

┌────────────────────────────────────────────────┐
│         OpenTelemetry Collector                │
│   Receives, processes, batches, and exports    │
│   logs, traces, and metrics                    │
└──────────────────┬─────────────────────────────┘
                   │ OTLP/HTTP

┌────────────────────────────────────────────────┐
│         Parseable                              │
│   Ingests via /v1/logs, /v1/traces, /v1/metrics│
│   Converts to Apache Parquet                   │
│   Stores on S3 / MinIO / GCS / Azure Blob      │
└──────────────────┬─────────────────────────────┘

        ┌──────────┼──────────┐
        ▼          ▼          ▼
   ┌─────────┐ ┌────────┐ ┌────────┐
   │ Prism   │ │ SQL    │ │ Alerts │
   │ Web UI  │ │ API    │ │& Dash  │
   └─────────┘ └────────┘ └────────┘

Each layer has a single responsibility:

  1. OpenTelemetry SDKs instrument your applications and emit telemetry in the OTLP format
  2. OpenTelemetry Collector receives, batches, enriches, and exports telemetry to Parseable
  3. Parseable ingests OTLP data, converts it to Apache Parquet, stores it on object storage, and serves queries via ParseableDB (built on Apache Arrow DataFusion)
  4. Prism (Parseable's web UI) provides dashboards, alerts, log exploration, and trace visualization

The result is an OpenTelemetry data lake with structured query capabilities — a true observability lakehouse. No Kafka. No Elasticsearch. No ClickHouse clusters to manage. One binary plus object storage.

Step 1: Instrument Your Applications with OpenTelemetry

OpenTelemetry provides SDKs for every major language. For this guide, we will instrument a Python application, but the same pattern applies to Go, Java, Node.js, Rust, and .NET.

Install the OTel SDK

pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-flask \
            opentelemetry-instrumentation-requests

The fastest path is zero-code auto-instrumentation. This wraps your application and automatically generates traces and metrics for supported libraries:

opentelemetry-instrument \
  --service_name my-api-service \
  --exporter_otlp_endpoint http://localhost:4317 \
  --exporter_otlp_protocol grpc \
  python app.py

This single command instruments Flask routes, outbound HTTP calls via requests, database queries, and more — without changing a line of application code.

Manual Instrumentation (For Custom Spans and Logs)

When you need custom spans or structured log correlation, use the SDK directly:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
import logging
 
# Configure the tracer
resource = Resource.create({"service.name": "order-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
 
tracer = trace.get_tracer("order-service")
 
# Use in your application code
def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.type", "standard")
 
        # Your business logic here
        validate_inventory(order_id)
        charge_payment(order_id)
 
        span.set_attribute("order.status", "completed")

Structured Logging with Trace Correlation

Connect your logs to traces so you can pivot from a log line to the full request trace:

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
import logging
 
# Configure log export
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://localhost:4317"))
)
set_logger_provider(logger_provider)
 
# Route stdlib logging through the OTel pipeline so records pick up
# the active span's trace context
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)

With this configuration, every log record emitted while a span is active automatically carries that span's trace_id and span_id, enabling direct correlation in Parseable.

Step 2: Deploy the OpenTelemetry Collector

The OTel Collector sits between your applications and Parseable. It receives telemetry over OTLP (gRPC or HTTP), processes it (batching, filtering, enrichment), and exports it to Parseable's OTLP/HTTP endpoints.

Collector Configuration

Create a file named otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
 
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-east-1-prod
        action: upsert
 
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
 
exporters:
  otlphttp/parseable:
    endpoint: "https://your-parseable-instance:8000/v1"
    headers:
      Authorization: "Basic <base64-encoded-credentials>"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
 
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/parseable]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/parseable]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/parseable]

A few things to note in this configuration:

  • otlphttp/parseable exporter: Parseable supports OTLP/HTTP natively at /v1/logs, /v1/traces, and /v1/metrics. The Collector appends the signal-specific path automatically.
  • batch processor: Batches telemetry into groups of 1,024 records (up to 2,048) or every 5 seconds, whichever comes first. This reduces the number of HTTP requests to Parseable and improves throughput.
  • resource processor: Adds environment and cluster attributes to every telemetry record. These become queryable columns in Parseable.
  • memory_limiter processor: Prevents the Collector from consuming unbounded memory during traffic spikes.
  • Retry on failure: The exporter retries with exponential backoff if Parseable is temporarily unreachable.
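
The Basic credentials value in the exporter headers is simply the base64 encoding of username:password. One quick way to generate it, shown here with the admin/admin placeholder credentials used in the Parseable examples:

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the value for the exporter's Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth_header("admin", "admin"))  # Basic YWRtaW46YWRtaW4=
```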

Run the Collector with Docker

docker run -d \
  --name otel-collector \
  -p 4317:4317 \
  -p 4318:4318 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml \
  otel/opentelemetry-collector-contrib:latest \
  --config /etc/otelcol/config.yaml

Run the Collector on Kubernetes

For production Kubernetes deployments, use the OpenTelemetry Operator or a DaemonSet. Here is a minimal DaemonSet manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317
              protocol: TCP
            - containerPort: 4318
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol/config.yaml
              subPath: config.yaml
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    # Paste the Collector config from above here

Expose the Collector within the cluster so your instrumented applications can reach it:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318

Your application pods export telemetry to otel-collector.observability.svc.cluster.local:4317.

Step 3: Deploy Parseable

Parseable is the backend of your observability lakehouse. It ingests OTLP data, converts it to Apache Parquet on the fly, and stores it on S3-compatible object storage. You can run it self-hosted, on Parseable Cloud, or via BYOC.

Option A: Docker (Local / Dev)

The fastest way to get started is with Docker and local storage:

docker run -d \
  --name parseable \
  -p 8000:8000 \
  -e P_USERNAME=admin \
  -e P_PASSWORD=admin \
  -e P_STAGING_DIR=/tmp/parseable/staging \
  -e P_STORAGE_URL=file:///tmp/parseable/data \
  parseable/parseable:latest \
  parseable local-store

For production with S3-compatible storage (MinIO, AWS S3, GCS, Azure Blob):

docker run -d \
  --name parseable \
  -p 8000:8000 \
  -e P_USERNAME=admin \
  -e P_PASSWORD=admin \
  -e P_S3_URL=https://s3.amazonaws.com \
  -e P_S3_ACCESS_KEY=your-access-key \
  -e P_S3_SECRET_KEY=your-secret-key \
  -e P_S3_BUCKET=parseable-observability \
  -e P_S3_REGION=us-east-1 \
  -e P_STAGING_DIR=/tmp/parseable/staging \
  parseable/parseable:latest \
  parseable s3-store

Option B: Kubernetes (Production)

For production Kubernetes deployments, use the Helm chart:

helm repo add parseable https://charts.parseable.com
helm repo update
 
helm install parseable parseable/parseable \
  --namespace observability \
  --create-namespace \
  --set parseable.store=s3-store \
  --set parseable.s3.url=https://s3.amazonaws.com \
  --set parseable.s3.accessKey=your-access-key \
  --set parseable.s3.secretKey=your-secret-key \
  --set parseable.s3.bucket=parseable-observability \
  --set parseable.s3.region=us-east-1

Option C: Parseable Cloud (Managed)

If you don't want to manage infrastructure, Parseable Cloud handles everything. The Pro plan is $0.39/GB ingested and includes 365 days of retention, a 99.9% uptime SLA, AI-native analysis, anomaly detection, unlimited users, dashboards, alerts, and API access. Start with a 14-day free trial.

With Parseable Cloud, your OTel Collector exporter configuration points to your cloud instance endpoint:

exporters:
  otlphttp/parseable:
    endpoint: "https://your-instance.parseable.com/v1"
    headers:
      Authorization: "Basic <base64-encoded-credentials>"

Step 4: Verify Data Flow and Query with SQL

Once the pipeline is running, verify that telemetry is flowing end to end.

Check Ingestion via the API

# List all log streams created by OTLP ingestion
curl -s -u admin:admin https://your-parseable-instance:8000/api/v1/logstream \
  | jq '.[] | .name'

Parseable automatically creates streams for OTLP data. You should see streams like otelmetrics, otellogs, and oteltraces.

Query Logs with SQL

Open Prism (Parseable's web UI) at https://your-parseable-instance:8000 and navigate to the SQL editor. Here are queries you can run immediately:

Find all error logs in the last hour, grouped by service:

SELECT
  "service.name" AS service,
  severity_text,
  COUNT(*) AS error_count
FROM otellogs
WHERE severity_text IN ('ERROR', 'FATAL')
  AND p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY "service.name", severity_text
ORDER BY error_count DESC;

Identify the slowest endpoints by p99 latency:

SELECT
  "service.name" AS service,
  name AS span_name,
  ROUND(APPROX_PERCENTILE_CONT(duration_ms, 0.99), 2) AS p99_latency_ms,
  COUNT(*) AS request_count
FROM oteltraces
WHERE kind = 'SPAN_KIND_SERVER'
  AND p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY "service.name", name
HAVING COUNT(*) > 10
ORDER BY p99_latency_ms DESC
LIMIT 20;

Trace a single request across services:

SELECT
  "service.name" AS service,
  name AS span_name,
  span_id,
  parent_span_id,
  duration_ms,
  status_code
FROM oteltraces
WHERE trace_id = '4bf92f3577b34da6a3ce929d0e0e4736'
ORDER BY start_time_unix_nano ASC;
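
The rows that query returns are enough to rebuild the span tree client-side by linking parent_span_id to span_id. A minimal sketch over made-up rows (field names match the query above):

```python
from collections import defaultdict

# Made-up rows in the shape the trace query returns
spans = [
    {"span_id": "a1", "parent_span_id": None, "name": "GET /checkout",      "duration_ms": 120},
    {"span_id": "b2", "parent_span_id": "a1", "name": "validate_inventory", "duration_ms": 30},
    {"span_id": "c3", "parent_span_id": "a1", "name": "charge_payment",     "duration_ms": 80},
]

# Index children by parent; spans without a parent are trace roots
children = defaultdict(list)
roots = []
for s in spans:
    if s["parent_span_id"] is None:
        roots.append(s)
    else:
        children[s["parent_span_id"]].append(s)

def render(span, depth=0):
    """Return the span subtree as indented text lines."""
    lines = [f"{'  ' * depth}{span['name']} ({span['duration_ms']} ms)"]
    for child in children[span["span_id"]]:
        lines.extend(render(child, depth + 1))
    return lines

for root in roots:
    print("\n".join(render(root)))
```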

Check metric ingestion rates by service:

SELECT
  "service.name" AS service,
  metric_name,
  COUNT(*) AS datapoints
FROM otelmetrics
WHERE p_timestamp > NOW() - INTERVAL '15 minutes'
GROUP BY "service.name", metric_name
ORDER BY datapoints DESC
LIMIT 20;

These are standard SQL queries powered by ParseableDB, which is built on Apache Arrow DataFusion. No proprietary query language to learn. If you know SQL, you can query your observability data.

Step 5: Set Up Dashboards and Alerts in Prism

With data flowing and queryable, the next step is to build operational dashboards and configure alerts.

Create a Service Health Dashboard

In Prism, navigate to the Dashboards section and create a new dashboard. Add panels with the following queries:

Error rate over time (timeseries panel):

SELECT
  DATE_TRUNC('minute', p_timestamp) AS time_bucket,
  "service.name" AS service,
  COUNT(*) AS errors
FROM otellogs
WHERE severity_text IN ('ERROR', 'FATAL')
  AND p_timestamp > NOW() - INTERVAL '6 hours'
GROUP BY time_bucket, "service.name"
ORDER BY time_bucket ASC;

Request latency heatmap (timeseries panel):

SELECT
  DATE_TRUNC('minute', p_timestamp) AS time_bucket,
  ROUND(AVG(duration_ms), 2) AS avg_latency_ms,
  ROUND(APPROX_PERCENTILE_CONT(duration_ms, 0.95), 2) AS p95_latency_ms,
  ROUND(MAX(duration_ms), 2) AS max_latency_ms
FROM oteltraces
WHERE kind = 'SPAN_KIND_SERVER'
  AND p_timestamp > NOW() - INTERVAL '6 hours'
GROUP BY time_bucket
ORDER BY time_bucket ASC;

Configure Alerts

Set up alerts on critical conditions. In Prism, navigate to Alerts and create rules like:

  • High error rate: Trigger when error count exceeds 50 per minute for any service
  • Latency spike: Trigger when p95 latency exceeds 2,000ms for server spans
  • Ingestion drop: Trigger when log volume drops below 50% of the rolling average (indicates a broken pipeline)

Alerts can be routed to Slack, PagerDuty, email, or any webhook endpoint.
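
The ingestion-drop condition is worth spelling out, since it is the one rule that catches a silent pipeline failure. A sketch of the check; the window size and threshold are illustrative assumptions, not Prism's actual implementation:

```python
def ingestion_dropped(per_minute_counts, window=30, threshold=0.5):
    """True when the latest minute falls below `threshold` times the
    rolling average of the preceding `window` minutes."""
    history, latest = per_minute_counts[:-1], per_minute_counts[-1]
    rolling_avg = sum(history[-window:]) / min(window, len(history))
    return latest < threshold * rolling_avg

steady = [1000] * 30
print(ingestion_dropped(steady + [400]))  # True: 400 < 50% of 1000
print(ingestion_dropped(steady + [900]))  # False: within normal range
```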

Advanced: Multi-Signal Correlation

The real power of an observability lakehouse is correlating across signal types. When an alert fires for high error rate, you want to go from the alert to the relevant logs, from the logs to the traces that generated them, and from the traces to the metrics that show the broader impact.

From Log to Trace

Because OpenTelemetry embeds trace_id in log records, you can jump from a suspicious log to the full distributed trace:

-- Step 1: Find error logs with trace context
SELECT
  body,
  trace_id,
  span_id,
  "service.name" AS service,
  p_timestamp
FROM otellogs
WHERE severity_text = 'ERROR'
  AND "service.name" = 'order-service'
  AND p_timestamp > NOW() - INTERVAL '30 minutes'
ORDER BY p_timestamp DESC
LIMIT 10;
 
-- Step 2: Use the trace_id from above to reconstruct the full trace
SELECT
  "service.name" AS service,
  name AS operation,
  duration_ms,
  status_code,
  start_time_unix_nano
FROM oteltraces
WHERE trace_id = '<trace_id_from_step_1>'
ORDER BY start_time_unix_nano ASC;

Correlate Traces with Metrics

Identify whether a latency spike correlates with resource saturation:

-- Find the time window of the latency spike
SELECT
  DATE_TRUNC('minute', p_timestamp) AS minute,
  AVG(duration_ms) AS avg_latency
FROM oteltraces
WHERE "service.name" = 'order-service'
  AND kind = 'SPAN_KIND_SERVER'
  AND p_timestamp > NOW() - INTERVAL '2 hours'
GROUP BY minute
ORDER BY avg_latency DESC
LIMIT 5;
 
-- Check CPU/memory metrics during that window
SELECT
  metric_name,
  DATE_TRUNC('minute', p_timestamp) AS minute,
  AVG(gauge_value) AS avg_value
FROM otelmetrics
WHERE "service.name" = 'order-service'
  AND metric_name IN ('process.cpu.utilization', 'process.memory.usage')
  AND p_timestamp BETWEEN '2026-02-27T10:00:00Z' AND '2026-02-27T10:15:00Z'
GROUP BY metric_name, minute
ORDER BY minute ASC;

This kind of cross-signal analysis is what separates an observability lakehouse from a collection of disconnected tools. All three signal types live in the same backend, queryable with the same SQL interface, correlated by shared attributes like service.name, trace_id, and timestamp.
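
Conceptually, the correlation above is just a join on trace_id. In miniature, with made-up records shaped like the otellogs and oteltraces streams:

```python
logs = [
    {"trace_id": "t1", "severity_text": "ERROR", "body": "payment declined"},
    {"trace_id": "t2", "severity_text": "INFO",  "body": "order shipped"},
]
spans = [
    {"trace_id": "t1", "service": "order-service", "name": "charge_payment", "duration_ms": 812},
    {"trace_id": "t1", "service": "payment-gw",    "name": "POST /charge",   "duration_ms": 790},
    {"trace_id": "t2", "service": "order-service", "name": "ship_order",     "duration_ms": 45},
]

# Join: collect the trace IDs behind error logs, then pull their spans
error_traces = {log["trace_id"] for log in logs if log["severity_text"] == "ERROR"}
correlated = [s for s in spans if s["trace_id"] in error_traces]
print([s["name"] for s in correlated])  # ['charge_payment', 'POST /charge']
```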

Retention and Cost Management

One of the primary advantages of the observability lakehouse architecture is cost-effective long-term retention. Because data is stored in Apache Parquet on object storage, keeping months or years of telemetry is economically viable.

How Parseable Handles Retention

On Parseable Cloud Pro, 365 days of retention is included in the $0.39/GB ingested price. Query scanning is included up to 10x of your monthly ingestion volume, with additional scans at $0.02/GB. There are no separate storage charges or compute surcharges.

On the Enterprise plan, you get everything in Pro plus Bring Your Own Bucket (BYOB), where your data stays in your own S3 bucket. This means unlimited retention — you control the lifecycle policies on your own storage. Enterprise also includes Apache Iceberg support for advanced lakehouse features, premium support, and flexible deployment options (Parseable Cloud, BYOC, or self-hosted). Enterprise starts at $15,000/year.

Why Parquet Matters for Cost

Apache Parquet provides 70-90% compression compared to raw JSON. A service generating 100 GB/day of JSON logs may write only 10-20 GB/day of Parquet to S3. At S3 pricing of ~$0.023/GB/month, a full year of that data (3.65-7.3 TB) costs roughly $84-168/month in raw storage, a fraction of what traditional SaaS platforms charge for 30 days of retention. For a deeper dive, read Why Your Observability Data Should Live in Apache Parquet.
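
As a back-of-envelope check: 100 GB/day of JSON at 80-90% compression is 10-20 GB/day of Parquet, and S3 charges about $0.023 per GB-month:

```python
s3_price_per_gb_month = 0.023

for daily_parquet_gb in (10, 20):          # 100 GB/day of JSON after compression
    stored_gb = daily_parquet_gb * 365     # GB held once a full year has accumulated
    monthly_cost = stored_gb * s3_price_per_gb_month
    print(f"{daily_parquet_gb} GB/day -> {stored_gb:,} GB stored -> ${monthly_cost:,.0f}/month")
```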

When to Choose Each Deployment Model

The right deployment model depends on your requirements:

  • Get started in 5 minutes, no infra to manage: Parseable Cloud Pro
  • Data sovereignty, compliance, bucket-level control: Enterprise with BYOB
  • Air-gapped or on-premises environments: Self-hosted (single binary)
  • Run in your cloud account with managed operations: BYOC (Enterprise)

For most teams starting with an OpenTelemetry data lake, Parseable Cloud Pro is the fastest path. You can migrate to Enterprise or self-hosted later without re-instrumenting anything — the OTel Collector configuration stays the same, only the exporter endpoint changes.

How This Compares to Traditional Stacks

The observability lakehouse approach is fundamentally different from stacking together Elasticsearch for logs, Jaeger for traces, and Prometheus for metrics. Here is how the architectures compare:

For each dimension, the traditional stack (ELK + Jaeger + Prometheus) comes first and the observability lakehouse (OTel + Parseable) second:

  • Components to manage: 6-10 (Elasticsearch, Logstash, Kibana, Jaeger, Cassandra, Prometheus, Grafana) vs. 2 (OTel Collector + Parseable)
  • Storage format: proprietary indices and chunks vs. open Apache Parquet
  • Query language: KQL + PromQL + Jaeger API vs. SQL for everything
  • Cross-signal correlation: manual, across tools vs. native in the same query interface
  • Retention cost at 1 TB/day: $3,000-10,000+/month in compute and storage vs. object storage cost ($0.02-0.03/GB/month)
  • Operational overhead: high (cluster tuning, shard management, compaction) vs. minimal (single binary plus object storage)

For a detailed comparison of data lake versus data warehouse architectures for observability, see Data Lake vs. Data Warehouse for Observability.

Summary

Here is what you built in this guide:

  1. Instrumented applications with OpenTelemetry SDKs to emit logs, traces, and metrics over OTLP
  2. Deployed the OTel Collector to receive, batch, enrich, and export telemetry to Parseable
  3. Deployed Parseable as the observability lakehouse backend, storing all data as Parquet on object storage
  4. Queried all signal types with SQL — error analysis, latency investigation, trace reconstruction, and metric correlation
  5. Built dashboards and alerts in Prism for ongoing operational visibility
  6. Correlated across signals — from log to trace to metric — using shared OpenTelemetry attributes

This is the observability lakehouse pattern: OpenTelemetry for standardized instrumentation, Parseable for storage and query, object storage for durable and cheap persistence, and SQL as the universal query interface. No vendor lock-in. No proprietary formats. Full data ownership.

Ready to build your own observability lakehouse? Start a free 14-day trial on Parseable Cloud — $0.39/GB ingested, 365 days retention, AI-native analysis, and unlimited users included. For Enterprise needs including BYOB, Apache Iceberg support, and flexible deployment, contact the Parseable team.
