Our experience with Claude Opus 4.6

Debabrata Panigrahi
February 7, 2026
Our experience evaluating Claude Opus 4.6 across 10 real-world observability workflows with Parseable as the backend, covering log analysis, SQL generation, trace reconstruction, incident RCA, and OTel instrumentation.

Opus is Anthropic's smartest model family. With Claude Opus 4.6 recently released to much fanfare, it was natural for us to be curious about how it would perform on real-world observability tasks.

I believe you are curious too. So we designed a series of experiments to test Opus 4.6 across the spectrum of observability and SRE workflows: from log analysis and anomaly detection, to SQL query generation, to distributed trace reconstruction, to multi-signal incident root cause analysis, and even OpenTelemetry instrumentation assistance.

But before getting into the numbers, what is different about Opus 4.6 that makes it worth testing? Three key capabilities:

First, its extended context window lets us feed substantial amounts of observability data in a single prompt: hundreds of log lines, full trace trees, multi-signal incident reports.

Second, its reasoning capabilities handle the kind of multi-step, cross-domain analysis that observability demands: connecting a log error to a trace span to a metric anomaly.

Third, it is currently the strongest model available for complex technical tasks that require deep domain knowledge.

This post is an honest assessment. We show every prompt we sent. We include successes and failures. We measure accuracy where we can and give qualitative evaluations where we can't. And every experiment is reproducible: you can deploy the same setup and validate every claim yourself.


Experiment setup

All experiments were run against a deployment of the OpenTelemetry Demo application, a polyglot microservices system that simulates an e-commerce platform. The architecture includes 10+ services (frontend, cart, checkout, payment, shipping, email, recommendation, ad, currency, product catalog) instrumented with OpenTelemetry SDKs.

Architecture:

OTel-instrumented services
    -> OpenTelemetry Collector
        -> Parseable (OTLP HTTP ingestion on port 8000)
            -> Logs stored as datasets (one per service)
            -> Traces stored as datasets
            -> Metrics queryable via SQL

All telemetry flows into Parseable via its native OTLP HTTP endpoint. Parseable stores the data as compressed Parquet on object storage (or local disk), and exposes it through PostgreSQL-compatible SQL. Each service gets its own dataset (e.g., checkout, payment, frontend), and every record automatically gets a p_timestamp field for time-based filtering.

This matters because it means we can ask Opus 4.6 to generate standard SQL queries, run them directly in Parseable's SQL editor (Prism UI), and validate the results. No translation layer needed.

You can reproduce every experiment in this post. Here's the setup:

# 1. Run Parseable locally
docker run -p 8000:8000 \
  parseable/parseable:latest \
  parseable local-store
 
# 2. Clone and run the OTel Demo
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
 
# 3. Configure the OTel Collector to send to Parseable
# In the collector config, add/update the OTLP HTTP exporter:
# otel-collector-config.yaml (exporter section)
exporters:
  otlphttp/parseable:
    endpoint: "http://host.docker.internal:8000/v1"
    headers:
      Authorization: "Basic cGFyc2VhYmxlOnBhcnNlYWJsZQ=="
      X-P-Stream: "otel-demo"
 
service:
  pipelines:
    traces:
      exporters: [otlphttp/parseable]
    logs:
      exporters: [otlphttp/parseable]

Once running, open Parseable Prism UI at http://localhost:8000 (default credentials: admin / admin). You should see datasets populating with telemetry data from the OTel Demo services.

Parseable Prism UI showing astronomy-shop datasets with real-time data

How experiments were run: Each experiment follows the same pattern. We crafted a prompt containing real or realistic observability data exported from Parseable, sent it to Claude Opus 4.6 via the API, and recorded the response verbatim. For "live" experiments, we evaluated the output against a known ground truth (injected faults, pre-validated queries). For scenario-based experiments, we evaluated against what an experienced SRE would produce. All prompts are shown in full.
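For reference, the whole loop fits in a short script: export data from Parseable with SQL, wrap it in the experiment prompt, and send it to Claude. The sketch below is a minimal version of that harness; it assumes Parseable's query API at POST /api/v1/query (accepting query, startTime, and endTime fields) and uses a placeholder model id for Opus 4.6, so adjust both to your deployment.

# Minimal experiment harness (sketch): export rows from Parseable, send them to Claude.
# Assumptions: Parseable query API at POST /api/v1/query with query/startTime/endTime,
# the parseable:parseable credentials used elsewhere in this post, and a placeholder
# Opus 4.6 model id -- substitute the current one.
import json
import requests
from anthropic import Anthropic

PARSEABLE = "http://localhost:8000"
AUTH = ("parseable", "parseable")

def export_from_parseable(sql, start, end):
    """Run a SQL query against Parseable's query API and return the JSON result."""
    resp = requests.post(
        f"{PARSEABLE}/api/v1/query",
        auth=AUTH,
        json={"query": sql, "startTime": start, "endTime": end},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

def ask_opus(prompt):
    """Send a single prompt to Claude and return the text of the response."""
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-opus-4-6",  # placeholder -- use the current Opus 4.6 model id
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

if __name__ == "__main__":
    sql = """SELECT * FROM "checkout" WHERE level = 'error' ORDER BY p_timestamp DESC LIMIT 200"""
    rows = export_from_parseable(sql, "2026-02-07T14:00:00Z", "2026-02-07T15:00:00Z")
    prompt = (
        "You are an SRE analyzing application logs from a microservices system.\n"
        "Identify anomalous patterns and error clusters in the logs below.\n\n"
        + json.dumps(rows, indent=2)
    )
    print(ask_opus(prompt))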


Log analysis: Pattern recognition and anomaly detection

We constructed a dataset of 200 JSON-structured application log lines from the checkout and payment datasets in Parseable. The logs contained a mix of normal operation, three injected anomaly patterns, and some noise.

Injected anomalies:

  1. A burst of connection_timeout errors from the payment service between 14:32:01 and 14:32:18 (12 occurrences)
  2. A gradual increase in slow_query warnings from the cart service, escalating from 1/min to 8/min over 10 minutes
  3. A single out_of_memory error on the recommendation service at 14:35:42, preceded by steadily growing heap usage logged at debug level

Prompt sent to Opus 4.6:

You are an SRE analyzing application logs from a microservices system. Below are
200 JSON-formatted log lines from the last 30 minutes. Analyze these logs and:
 
1. Identify anomalous patterns and error clusters
2. For each anomaly, explain the likely root cause
3. Prioritize findings by severity (critical / warning / info)
4. Note any patterns that suggest cascading failures
 
Output your analysis as structured JSON with fields: anomalies[], each containing
{description, severity, affected_service, time_range, evidence_count, root_cause_hypothesis}.
 
[200 lines of JSON logs omitted for brevity -- full dataset available in our repo]

Response evaluation:

| Anomaly | Present | Detected | Severity Correct | Root Cause Identified |
|---|---|---|---|---|
| Payment timeout burst | Yes | Yes | Yes (Critical) | Yes: identified connection pool / downstream timeout |
| Cart slow query escalation | Yes | Yes | Yes (Warning) | Partial: noted the trend but attributed it to "increased load" rather than a missing index |
| OOM on recommendation svc | Yes | Yes | Yes (Critical) | Yes: connected debug-level heap logs to the OOM event |
| False positive: normal 404s | No | No | | |

Score: 9 anomalies detected, 0 false positives, 2 cascading failure patterns identified.

The most impressive result was anomaly #3. The OOM error was a single log line, but Opus 4.6 traced backward through the debug-level heap usage logs and identified the memory growth pattern leading up to the crash. This required correlating events across a 10-minute window and understanding that monotonically increasing heap allocations are a precursor to OOM.

The partial miss on anomaly #2 is instructive. Without schema or query plan information, the model defaulted to the generic "increased load" explanation. When we followed up with a prompt that included the database schema, it correctly suggested a missing index on the cart_items.session_id column.

Key takeaway: Opus 4.6 handles structured JSON logs very well. It correctly clusters related events, identifies temporal patterns, and produces structured output. The context window advantage is real: feeding 200 log lines in a single prompt avoids the information loss you'd get from chunking. Token consumption for this experiment: ~45K input, ~2K output.

Claude analyzing real-time logs from Parseable

Try it yourself

  1. Open Parseable Prism UI at http://localhost:8000
  2. Navigate to the SQL editor
  3. Export recent logs from the checkout dataset:
    SELECT * FROM "checkout"
    WHERE level = 'error'
    ORDER BY p_timestamp DESC
    LIMIT 200
  4. Copy the JSON results and send to Claude with the prompt above
  5. Compare Claude's analysis against what you see in the data

In production, this analysis runs through Parseable's Keystone (available in Parseable Cloud and Enterprise editions), an AI orchestration layer that uses LLMs like Claude under the hood. Ask Keystone "What anomalies are in the checkout logs from the last hour?" directly from the Prism UI, and it will handle schema resolution and query execution automatically.


Query generation: SQL and PromQL

We provided Opus 4.6 with a Parseable dataset schema and asked it to generate queries from natural language descriptions. Parseable auto-detects schema from ingested JSON, and you can retrieve any dataset's schema via the API:

# Fetch the schema for a dataset
curl -s http://localhost:8000/api/v1/logstream/traces/schema \
  -u parseable:parseable

Schema provided (Parseable dataset format):

-- Parseable dataset: "traces"
-- Fields auto-detected from OTLP ingestion
-- Key columns:
--   p_timestamp    (TIMESTAMP)   -- auto-added by Parseable
--   trace_id       (VARCHAR)
--   span_id        (VARCHAR)
--   parent_span_id (VARCHAR)
--   service_name   (VARCHAR)
--   operation_name (VARCHAR)
--   duration_ms    (DOUBLE)
--   status_code    (VARCHAR)
--   http_method    (VARCHAR)
--   http_status    (INTEGER)

Request 1: "Top 10 slowest API endpoints with average and p99 latency"

-- Opus 4.6 first attempt (correct on first try):
SELECT
    service_name,
    operation_name,
    AVG(duration_ms) AS avg_latency_ms,
    APPROX_PERCENTILE_CONT(duration_ms, 0.99) AS p99_latency_ms,
    COUNT(*) AS request_count
FROM "traces"
WHERE p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY service_name, operation_name
ORDER BY p99_latency_ms DESC
LIMIT 10;

This query uses valid PostgreSQL-compatible SQL: APPROX_PERCENTILE_CONT for percentile calculation (the standard SQL equivalent of ClickHouse's quantile()), COUNT(*) instead of ClickHouse's count(), and INTERVAL '1 hour' with proper quoting. It ran successfully in Parseable's SQL editor on the first try.

Request 2: "PromQL alert rule: fire when error rate exceeds 5% over a 5-minute window"

# Opus 4.6 first attempt:
groups:
  - name: error_rate_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."

This is valid PromQL wrapped in a proper Prometheus alerting rule format. One nuance: the for: 5m clause means the alert fires only after the condition has been true for 5 consecutive minutes, which combined with the [5m] rate window means the actual detection latency is closer to 10 minutes. When we pointed this out, Opus 4.6 correctly explained the distinction and offered to reduce the for duration to 1m or 2m.

Worth noting: Parseable has its own alerting system that doesn't require PromQL. You can configure threshold alerts, anomaly detection alerts, and forecasting alerts directly in Prism UI with support for 8 alert targets (Webhook, Slack, Alertmanager, Email, PagerDuty, Opsgenie, Teams, Discord). For many use cases, Parseable's built-in alerts are simpler than maintaining Prometheus alerting rules.

Request 3: "Correlate high-latency spans with their parent traces"

-- Opus 4.6 first attempt (had an error):
SELECT
    t1.trace_id,
    t1.service_name AS slow_service,
    t1.operation_name AS slow_operation,
    t1.duration_ms AS span_duration_ms,
    t2.service_name AS parent_service,
    t2.operation_name AS parent_operation,
    t2.duration_ms AS parent_duration_ms
FROM "traces" t1
JOIN "traces" t2 ON t1.parent_span_id = t2.span_id
    AND t1.trace_id = t2.trace_id
WHERE t1.duration_ms > 1000
ORDER BY t1.duration_ms DESC
LIMIT 20;

The first attempt was syntactically valid SQL, but it missed the case where parent_span_id is empty (root spans). When we ran it in Parseable and reported the issue, the corrected version added AND t1.parent_span_id != '' to the filter. This "first attempt, feedback, correction" loop is a realistic pattern for how query generation works in practice.

Score: 2/3 correct on first attempt; the third query needed one round of feedback to handle root spans correctly.

Key takeaway: Schema context is everything. Without the schema, Opus 4.6 generates plausible but often wrong column names. With the Parseable dataset schema in the prompt, first-attempt accuracy is high. Because Parseable uses PostgreSQL-compatible SQL, the model's extensive training on PostgreSQL syntax pays off directly: no need to learn vendor-specific query languages.

In production, Parseable's Keystone (available in Parseable Cloud and Enterprise editions) handles Text-to-SQL conversationally: it orchestrates LLMs (including Claude) with built-in schema awareness, so users never need to provide schema context manually. The experiments here test the raw LLM capability that powers features like Keystone. A more meaningful comparison is between general-purpose LLMs (like Claude) and specialized, purpose-built models for observability, such as Datadog's Toto, a time-series foundation model trained on 750 billion observability data points. General-purpose LLMs offer flexibility across SQL generation, log analysis, and trace reasoning; specialized models trade that breadth for deeper performance on narrow tasks like time-series forecasting.

Try it yourself

  1. Open the SQL editor in Parseable Prism UI
  2. Run the p99 latency query above against your traces dataset
  3. Verify the results make sense for your OTel Demo traffic
  4. Try asking Keystone (available in Parseable Cloud and Enterprise editions) the same question in natural language: it generates equivalent SQL using its built-in schema awareness, demonstrating how these LLM capabilities are delivered in the production UI

Alert correlation and noise reduction

We constructed a realistic alert storm from a cascading failure scenario. The root cause: a PostgreSQL connection pool on the checkout service reached its maximum (20 connections) after the payment service started responding slowly due to a TLS certificate verification issue introduced by a config change.

In a Parseable-native setup, these alerts could be configured directly in Prism UI. Parseable supports threshold alerts (e.g., error count > N), anomaly detection alerts (automatic baseline deviation), and forecasting alerts (predict future threshold breaches). Each alert can target Webhook, Slack, Alertmanager, Email, PagerDuty, Opsgenie, Teams, or Discord.

The 18 alerts (chronological):

[
  {"time": "14:30:02", "alert": "PaymentServiceLatencyHigh", "service": "payment", "value": "p99=3200ms"},
  {"time": "14:30:15", "alert": "PaymentServiceErrorRate", "service": "payment", "value": "error_rate=8.2%"},
  {"time": "14:31:01", "alert": "CheckoutDBPoolExhausted", "service": "checkout", "value": "active=20/20"},
  {"time": "14:31:03", "alert": "CheckoutLatencyHigh", "service": "checkout", "value": "p99=5100ms"},
  {"time": "14:31:15", "alert": "CheckoutErrorRate", "service": "checkout", "value": "error_rate=12%"},
  {"time": "14:31:30", "alert": "CartServiceTimeout", "service": "cart", "value": "timeout_count=45"},
  {"time": "14:31:45", "alert": "FrontendErrorRate", "service": "frontend", "value": "error_rate=15%"},
  {"time": "14:32:00", "alert": "SLOBurnRateCheckout", "service": "checkout", "value": "burn_rate=14.2x"},
  {"time": "14:32:00", "alert": "SLOBurnRateFrontend", "service": "frontend", "value": "burn_rate=8.1x"},
  {"time": "14:32:10", "alert": "CheckoutCPUHigh", "service": "checkout", "value": "cpu=89%"},
  {"time": "14:32:15", "alert": "PaymentCircuitBreakerOpen", "service": "payment", "value": "state=open"},
  {"time": "14:32:30", "alert": "CheckoutGoroutineCount", "service": "checkout", "value": "goroutines=2847"},
  {"time": "14:32:45", "alert": "FrontendLatencyHigh", "service": "frontend", "value": "p99=8200ms"},
  {"time": "14:33:00", "alert": "EmailServiceQueueDepth", "service": "email", "value": "depth=342"},
  {"time": "14:33:15", "alert": "ShippingServiceTimeout", "service": "shipping", "value": "timeout_count=23"},
  {"time": "14:33:30", "alert": "RecommendationLatencyHigh", "service": "recommendation", "value": "p99=1200ms"},
  {"time": "14:34:00", "alert": "SLOBurnRateOverall", "service": "platform", "value": "burn_rate=6.3x"},
  {"time": "14:34:30", "alert": "PodRestartCount", "service": "checkout", "value": "restarts=3"}
]

Prompt:

You are an SRE responding to an alert storm. Below are 18 alerts that fired within
a 5-minute window. Analyze them and:
 
1. Group alerts by probable root cause
2. Identify the trigger event (the first domino)
3. Distinguish actionable signal from noise
4. Recommend which 2-3 alerts the on-call engineer should focus on
5. Suggest immediate remediation steps
 
[alert data above]

Opus 4.6 response (summarized):

The model correctly grouped the 18 alerts into three root cause clusters:

| Group | Root Cause | Alerts | Classification |
|---|---|---|---|
| Group 1: Payment service degradation (TRIGGER) | Payment service latency spike | PaymentServiceLatencyHigh, PaymentServiceErrorRate, PaymentCircuitBreakerOpen | Actionable signal |
| Group 2: Checkout cascade | DB pool exhaustion from slow payment calls | CheckoutDBPoolExhausted, CheckoutLatencyHigh, CheckoutErrorRate, CheckoutCPUHigh, CheckoutGoroutineCount, PodRestartCount, CartServiceTimeout | Symptoms of Group 1 |
| Group 3: Downstream impact | Propagated failures from checkout | FrontendErrorRate, FrontendLatencyHigh, all SLO burn alerts, EmailServiceQueueDepth, ShippingServiceTimeout, RecommendationLatencyHigh | Noise (resolve when root cause is fixed) |

The model identified PaymentServiceLatencyHigh as the trigger event and correctly reasoned about the temporal ordering: payment slowed first, which caused checkout to hold DB connections longer, exhausting the pool, which then cascaded to all upstream callers.

What it missed: Without an explicit service dependency graph, it could not definitively determine that cart depends on checkout rather than directly on payment. It hedged this with "based on temporal correlation, checkout appears to be in the call path between cart and payment." This is a reasonable inference but not certain.
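One way to close that gap is to derive a coarse service dependency map from the trace data already in Parseable and paste it into the prompt alongside the alerts. A rough sketch, assuming flat span dicts with service_name, span_id, and parent_span_id as in the traces schema shown earlier:

# Derive caller -> callee service edges from flat spans exported from the traces dataset.
from collections import Counter

def service_dependencies(spans):
    """Count cross-service caller -> callee edges by joining each span to its parent."""
    by_span_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_span_id.get(span.get("parent_span_id") or "")
        if parent and parent["service_name"] != span["service_name"]:
            edges[(parent["service_name"], span["service_name"])] += 1
    return edges

# Feed the resulting edge list to the model as explicit topology context, e.g.:
# for (caller, callee), count in service_dependencies(spans).most_common():
#     print(f"{caller} -> {callee} ({count} calls)")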

Recommended focus: It suggested focusing on (1) the payment service latency root cause, (2) the checkout DB pool configuration, and (3) monitoring the SLO burn rate for recovery confirmation. This matches what an experienced SRE would prioritize.

Try it yourself

  1. In Parseable Prism UI, configure a sample threshold alert on the payment dataset:
    • Condition: COUNT(*) WHERE level = 'error' exceeds 10 in 5 minutes
    • Target: Slack or Webhook
  2. Generate some test traffic with the OTel Demo to trigger the alert
  3. Export the alert history and send to Claude for correlation analysis

Incident root cause analysis

We created a synthetic incident with a known injected fault: a CPU throttling event on the payment service caused by a misconfigured Kubernetes resource limit (CPU limit set to 200m instead of 2000m after a YAML edit error). This caused payment processing latency to spike, triggering a cascade.

Incident data provided to Opus 4.6:

All data was exported from Parseable datasets using SQL queries:

  • Symptoms: Checkout success rate dropped from 99.2% to 71.3% over 8 minutes
  • Log excerpts: 40 lines from the payment and checkout datasets showing timeout errors, retry exhaustion, and gRPC deadline exceeded messages
  • Trace data: 3 sample traces from the traces dataset showing payment spans taking 4200ms, 5100ms, and 3800ms (normal: ~200ms)
  • Metric data: Payment service CPU usage pegged at 100% (of the 200m limit), request queue depth growing linearly, response time distribution shifting from tight ~200ms cluster to wide 2000-6000ms spread

Prompt:

You are an SRE investigating a production incident. Analyze the following multi-signal
incident data and:
 
1. Identify the root cause
2. Explain the full failure chain from trigger to customer impact
3. Provide confidence level for your diagnosis
4. Suggest immediate remediation and long-term prevention
 
## Symptoms
[checkout success rate data]
 
## Log Excerpts (40 lines from Parseable payment and checkout datasets)
[log data with payment timeouts, gRPC deadline exceeded, retry exhaustion]
 
## Trace Samples (3 traces from Parseable traces dataset)
[trace JSON showing slow payment spans]
 
## Metric Snapshots
[CPU, queue depth, latency distribution data]

Diagnosis by Opus 4.6:

Root Cause (High Confidence): The payment service is CPU-throttled. CPU utilization is at 100% of its allocated limit, causing all request processing to slow dramatically. The 200m CPU limit is significantly undersized for this workload; this is likely a misconfiguration rather than a genuine resource constraint, as the service was previously handling traffic at ~200ms latency.

Failure Chain:

  1. Payment service CPU limit reached -> request processing slows from ~200ms to 3800-5100ms
  2. Checkout service calls to payment start exceeding timeout thresholds
  3. Checkout retries amplify load on payment service (retry storm)
  4. Checkout exhausts its own retry budget -> returns errors to frontend
  5. Customer-facing checkout success rate drops to 71.3%

Immediate Remediation: Increase the payment service CPU limit (or remove the limit temporarily) and restart the pods. Monitor for recovery.

Long-term Prevention: Add resource limit validation to CI/CD pipeline. Set up alerts on CPU throttling (container_cpu_cfs_throttled_periods_total). Review resource requests/limits across all services.

Evaluation: This is a correct diagnosis. The model identified CPU throttling as the root cause, correctly explained the cascade, and even called out the retry storm amplification effect. The long-term suggestion to alert on container_cpu_cfs_throttled_periods_total is a specific, actionable Kubernetes metric, not a generic recommendation.

The model could not identify the specific YAML edit error (200m vs 2000m) because that information wasn't in the provided data. But it correctly inferred that the limit was misconfigured based on the mismatch between historical performance and current throttling.

Try it yourself

Pull the multi-signal incident data from Parseable using these SQL queries:

-- Pull error logs from the payment dataset
SELECT * FROM "payment"
WHERE level = 'error'
  AND p_timestamp BETWEEN '2026-01-15T14:30:00Z' AND '2026-01-15T14:40:00Z'
ORDER BY p_timestamp ASC
 
-- Pull slow traces
SELECT trace_id, span_id, duration_ms, service_name, operation_name
FROM "traces"
WHERE duration_ms > 1000
  AND p_timestamp BETWEEN '2026-01-15T14:30:00Z' AND '2026-01-15T14:40:00Z'
ORDER BY duration_ms DESC

Export the results, combine them into a single prompt, and send to Claude. In production, this same multi-signal analysis runs through Parseable's Keystone (available in Parseable Cloud and Enterprise editions) directly in the Prism UI: try asking "What caused the latency spike on the payment service between 14:30 and 14:40?" Keystone's multi-agent pipeline (Intent, SQL, Visualization) orchestrates the LLM with automatic schema resolution and query execution.


Distributed trace analysis

We exported a checkout flow trace spanning 5 services from Parseable's traces dataset as flat JSON span data (33 spans total). Traces ingested via OTLP are stored as datasets in Parseable, queryable with the same SQL engine as everything else.

The trace represented: frontend -> checkout -> cart -> product-catalog, checkout -> payment -> currency, and checkout -> shipping -> email.

Prompt:

Below is a distributed trace exported as flat JSON spans from Parseable. Each span has:
trace_id, span_id, parent_span_id, service_name, operation_name, p_timestamp,
duration_ms, status_code, and attributes.
 
Analyze this trace and:
1. Reconstruct the service call tree
2. Identify the critical path (longest chain of sequential calls)
3. Find bottleneck services (highest self-time)
4. Flag any anomalous spans (errors, unusual duration)
5. Provide a plain-English narrative of what happened in this request
 
[33 spans of JSON trace data]

Key results:

Opus 4.6 correctly reconstructed the full trace tree from the flat span data by following parent_span_id relationships. It produced an ASCII representation:

frontend (GET /checkout) [1847ms]
  └── checkout (ProcessCheckout) [1790ms]
        ├── cart (GetCart) [142ms]
        │     └── product-catalog (GetProducts) [89ms]
        ├── payment (ProcessPayment) [1203ms]   <-- BOTTLENECK
        │     └── currency (Convert) [34ms]
        └── shipping (GetQuote) [287ms]
              └── email (SendConfirmation) [156ms]

It correctly identified the critical path as frontend -> checkout -> payment (sequential, totaling 1203ms of the 1847ms end-to-end time) and flagged the payment service as the bottleneck, with 1169ms of self-time (1203ms total minus 34ms in the currency call).

It also caught that shipping and payment were called in parallel by checkout (their start times overlapped), which meant the actual critical path depended on payment finishing last. This is a nuanced observation that requires understanding concurrency in distributed systems.

Key takeaway: Reconstructing trace trees from flat span data is a strength. The model handles parent-child relationships well and can reason about sequential vs. parallel execution. The narrative output, which turns raw JSON into "the checkout service called payment and shipping in parallel, but payment was the bottleneck at 1.2 seconds", is genuinely useful for incident review.
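If you want to sanity-check the model's tree against ground truth, the same reconstruction takes only a few lines. A small sketch, again assuming flat span dicts with span_id, parent_span_id, service_name, operation_name, and duration_ms:

# Rebuild a trace tree from flat spans and print per-span total and self-time
# (duration minus the durations of direct children -- approximate when children overlap).
from collections import defaultdict

def build_tree(spans):
    children = defaultdict(list)
    for s in spans:
        children[s.get("parent_span_id") or ""].append(s)
    return children

def print_tree(children, span=None, depth=0):
    if span is None:  # root spans have an empty parent_span_id
        for root in children.get("", []):
            print_tree(children, root, 0)
        return
    kids = children.get(span["span_id"], [])
    self_time = span["duration_ms"] - sum(k["duration_ms"] for k in kids)
    print("  " * depth + f"{span['service_name']} ({span['operation_name']}) "
          f"[{span['duration_ms']:.0f}ms total, {self_time:.0f}ms self]")
    for kid in sorted(kids, key=lambda k: k["duration_ms"], reverse=True):
        print_tree(children, kid, depth + 1)

# print_tree(build_tree(spans)) reproduces the tree above; the self-time column makes
# the payment bottleneck (1203ms - 34ms = 1169ms) explicit.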

Try it yourself

  1. In Parseable's SQL editor, export trace data for a specific trace:
    SELECT trace_id, span_id, parent_span_id, service_name,
           operation_name, duration_ms, status_code, p_timestamp
    FROM "traces"
    WHERE trace_id = '<pick-a-trace-id>'
    ORDER BY p_timestamp ASC
  2. Copy the results as JSON and send to Claude with the analysis prompt above
  3. Verify the reconstructed call tree against what you know about the OTel Demo architecture

OpenTelemetry instrumentation assistance

We provided a bare Python Flask service (a simplified product catalog endpoint) and asked Opus 4.6 to add OpenTelemetry instrumentation, targeting Parseable as the backend.

Input code:

from flask import Flask, jsonify
import requests
 
app = Flask(__name__)
 
@app.route("/products/<product_id>")
def get_product(product_id):
    # Fetch from inventory service
    inventory = requests.get(f"http://inventory:8080/stock/{product_id}")
    product = {
        "id": product_id,
        "name": "Sample Product",
        "in_stock": inventory.json().get("available", False)
    }
    return jsonify(product)

Prompt: "Add OpenTelemetry instrumentation to this Flask service. Include tracing with proper semantic conventions, context propagation for the outgoing HTTP call, and appropriate span attributes. Configure the OTLP HTTP exporter to send traces to Parseable."

Opus 4.6 output (key sections):

from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes
from opentelemetry.semconv.trace import SpanAttributes
import requests
 
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "product-catalog",
    ResourceAttributes.SERVICE_VERSION: "1.0.0",
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})
 
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://parseable:8000/v1/traces")
))
trace.set_tracer_provider(provider)
 
tracer = trace.get_tracer(__name__)
 
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
 
@app.route("/products/<product_id>")
def get_product(product_id):
    with tracer.start_as_current_span(
        "get_product",
        attributes={
            "product.id": product_id,
            SpanAttributes.HTTP_REQUEST_METHOD: "GET",
        }
    ) as span:
        inventory = requests.get(f"http://inventory:8080/stock/{product_id}")
        span.set_attribute("inventory.status_code", inventory.status_code)
        product = {
            "id": product_id,
            "name": "Sample Product",
            "in_stock": inventory.json().get("available", False),
        }
        span.set_attribute("product.in_stock", product["in_stock"])
        return jsonify(product)

Evaluation:

| Criterion | Result | Notes |
|---|---|---|
| Auto-instrumentation (Flask + requests) | Correct | Used both FlaskInstrumentor and RequestsInstrumentor |
| Context propagation | Correct | RequestsInstrumentor handles W3C trace context headers automatically |
| Semantic conventions | Mostly correct | Used HTTP_REQUEST_METHOD (new convention). Custom attributes (product.id) follow naming guidelines |
| Resource attributes | Correct | SERVICE_NAME, SERVICE_VERSION, DEPLOYMENT_ENVIRONMENT |
| Exporter configuration | Correct | OTLP/HTTP pointing to Parseable's endpoint (http://parseable:8000/v1/traces) |
| Custom span for business logic | Good | Added meaningful custom attributes for the product lookup |

One minor issue: the manual span wrapping inside the route handler is somewhat redundant with FlaskInstrumentor, which already creates a span for each request. The manual span creates a child span, which is fine for additional business logic context but could be simplified. When we pointed this out, Opus 4.6 acknowledged the redundancy and offered a version that added attributes to the Flask-created span instead.
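For reference, the simplified variant looks roughly like this: instead of opening a new span, attach the business attributes to the server span FlaskInstrumentor already created for the request. This is a sketch of that follow-up suggestion, not verbatim model output; the TracerProvider and OTLP exporter setup stay the same as above.

from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests

# TracerProvider / OTLP exporter setup identical to the version above, omitted here.

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/products/<product_id>")
def get_product(product_id):
    # Reuse the server span FlaskInstrumentor created for this request
    span = trace.get_current_span()
    span.set_attribute("product.id", product_id)

    inventory = requests.get(f"http://inventory:8080/stock/{product_id}")
    span.set_attribute("inventory.status_code", inventory.status_code)

    product = {
        "id": product_id,
        "name": "Sample Product",
        "in_stock": inventory.json().get("available", False),
    }
    span.set_attribute("product.in_stock", product["in_stock"])
    return jsonify(product)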

Try it yourself

  1. Deploy the instrumented service with the Parseable endpoint configured
  2. Send a few requests to the /products/<id> endpoint
  3. Open Parseable Prism UI and check the traces dataset for new spans:
    SELECT service_name, operation_name, duration_ms, p_timestamp
    FROM "traces"
    WHERE service_name = 'product-catalog'
    ORDER BY p_timestamp DESC
    LIMIT 10
  4. Verify the spans appear with correct attributes and parent-child relationships

Dataset schema design and SLO/SLI framework

We described a payment processing microservice and asked Opus 4.6 to design a structured logging schema suitable for a Parseable dataset, along with an SLO/SLI framework.

Prompt context: A Go-based payment service handling credit card transactions, communicating with an external payment gateway, storing transaction records in PostgreSQL. Data will be ingested into Parseable as a dedicated dataset.

Parseable supports both dynamic schema (auto-detect fields from ingested JSON) and static schema (enforce a fixed schema on a dataset). We asked Opus 4.6 to design for the static schema mode, which is better suited for high-cardinality production workloads where you want schema consistency.

Structured logging schema (excerpt from response):

{
  "timestamp": "2026-01-15T14:32:01.847Z",
  "level": "info",
  "message": "payment processed",
  "service": "payment-service",
  "version": "2.4.1",
  "trace_id": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
  "span_id": "1a2b3c4d5e6f7a8b",
  "transaction_id": "txn_20260115_x7k9m",
  "correlation_id": "order_98765",
  "payment_method": "credit_card",
  "payment_gateway": "stripe",
  "amount_cents": 4999,
  "currency": "USD",
  "payment_status": "succeeded",
  "gateway_latency_ms": 342,
  "customer_id": "cust_hashed_abc123",
  "environment": "production",
  "region": "us-east-1"
}

Note: Parseable auto-adds p_timestamp on ingestion, so the schema doesn't need to duplicate the timestamp for query purposes. The timestamp field in the JSON is preserved as-is for application-level use, while p_timestamp is used for time-based filtering and partitioning in SQL queries.

Opus 4.6 recommended flattening nested objects (e.g., payment.method becomes payment_method) for better query performance, a practical suggestion since flat columns are more efficient to query than nested JSON.
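If your application emits nested JSON, a small flattening step before ingestion is enough to follow that advice. A minimal sketch:

# Flatten nested log objects into underscore-joined columns before ingestion,
# e.g. {"payment": {"method": "credit_card"}} -> {"payment_method": "credit_card"}.
def flatten(event, prefix=""):
    flat = {}
    for key, value in event.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

print(flatten({"payment": {"method": "credit_card", "gateway": "stripe"}, "amount_cents": 4999}))
# {'payment_method': 'credit_card', 'payment_gateway': 'stripe', 'amount_cents': 4999}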

SLO/SLI design (excerpt):

| SLI | Measurement (Parseable SQL) | SLO Target | Window |
|---|---|---|---|
| Availability | SELECT COUNT(*) FILTER (WHERE payment_status != 'error') * 100.0 / COUNT(*) FROM "payment" | 99.95% | 30-day rolling |
| Latency | SELECT APPROX_PERCENTILE_CONT(gateway_latency_ms, 0.99) FROM "payment" | p99 < 2000ms | 30-day rolling |
| Correctness | Payments with matching amounts between request and gateway confirmation | 99.99% | 30-day rolling |
| Error budget burn rate | Rate of SLO consumption vs. expected linear burn | Alert at 14.4x (1h), 6x (6h) | Multi-window |

The schema correctly includes trace context fields (trace_id, span_id), a correlation_id for business-level correlation across services, and avoids logging raw PII (customer ID is hashed). The SLO design uses multi-window burn rate alerting, which aligns with Google's SRE workbook recommendations. The correctness SLI checking amount consistency between request and gateway confirmation is a domain-specific insight that goes beyond generic SLO templates.
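The burn rate thresholds in that table follow the standard multi-window arithmetic: burn rate is the observed error ratio divided by the error ratio the SLO allows. A quick sketch:

# Error budget burn rate: observed error ratio / error ratio allowed by the SLO.
# A 99.95% availability SLO allows 0.05% errors; a 14.4x burn rate over one hour means
# the 30-day budget is being consumed 14.4 times faster than a linear burn would allow.
def burn_rate(error_ratio, slo):
    return error_ratio / (1.0 - slo)

# Example: 0.72% errors observed in the last hour against a 99.95% SLO
print(round(burn_rate(0.0072, 0.9995), 1))  # 14.4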

Try it yourself

  1. Create a new dataset in Parseable with static schema via the API:
    curl -X PUT http://localhost:8000/api/v1/logstream/payment \
      -u parseable:parseable \
      -H "Content-Type: application/json"
  2. Ingest a test log event and verify the schema is detected correctly
  3. Run the SLO monitoring queries in the SQL editor to confirm they work against your data

Runbook generation and on-call copilot

We simulated a multi-turn on-call copilot conversation starting from an alert. In this scenario, Parseable is the observability backend, so all diagnostic queries target Parseable datasets and APIs.

Turn 1 (alert context):

ALERT: CheckoutDBPoolExhausted
Service: checkout-service
Current value: active_connections=20/20
Threshold: active_connections >= 20
Duration: firing for 3 minutes
Environment: production, us-east-1

"Generate a runbook for this alert. Include Parseable SQL queries for diagnostics alongside infrastructure commands. Then help me triage it."

Opus 4.6 generated a structured runbook:

## Runbook: CheckoutDBPoolExhausted

### Severity: High
### Impact: Checkout requests will queue and timeout, directly affecting revenue.

### Triage Steps (first 5 minutes)

1. Check recent errors in the Parseable checkout dataset:
   ```sql
   SELECT level, message, p_timestamp
   FROM "checkout"
   WHERE level IN ('error', 'warn')
     AND p_timestamp > NOW() - INTERVAL '10 minutes'
   ORDER BY p_timestamp DESC
   LIMIT 50
   ```
2. Check for timeout patterns indicating a slow downstream dependency:
   ```sql
   SELECT message, COUNT(*) as occurrences
   FROM "checkout"
   WHERE p_timestamp > NOW() - INTERVAL '10 minutes'
     AND (message LIKE '%timeout%' OR message LIKE '%deadline%')
   GROUP BY message
   ORDER BY occurrences DESC
   ```
3. Verify the alert with infrastructure commands:
   kubectl exec -n production deploy/checkout-service -- curl -s localhost:8080/healthz | jq .db_pool
4. Check current database connection state:
   psql -h $DB_HOST -U readonly -c "SELECT state, count(*) FROM pg_stat_activity WHERE datname='checkout' GROUP BY state;"
5. Identify long-running queries holding connections:
   psql -h $DB_HOST -U readonly -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE datname='checkout' AND state != 'idle' ORDER BY duration DESC LIMIT 10;"
6. Check Parseable dataset health:
   curl -s http://parseable:8000/api/v1/logstream/checkout/stats -u parseable:parseable | jq .

### Escalation Criteria

  • If pool has been exhausted for > 10 minutes: page the checkout team lead
  • If pg_stat_activity shows locks: page the DBA on-call
  • If downstream dependency is the cause: page the owning team

### Remediation Options

  • Quick fix: Restart checkout pods to reset connections: kubectl rollout restart -n production deploy/checkout-service
  • If caused by slow queries: Kill long-running queries: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE duration > interval '60 seconds';
  • If caused by connection leak: Increase pool size temporarily: Update PGX_MAX_CONNS env var and restart

Turn 2: "I ran the Parseable query and the psql check. I see 18 connections stuck in 'idle in transaction' state with queries that started 4 minutes ago. The Parseable logs show a spike in 'context deadline exceeded' errors from the payment service."

Opus 4.6 response: The model correctly identified this as a likely transaction leak: queries that opened a transaction but never committed or rolled back, likely triggered by the payment service timeouts. It suggested checking for missing defer tx.Rollback() patterns in the Go code, specifically in error paths. It provided the exact pg_terminate_backend command to kill the idle-in-transaction sessions as an immediate fix, and recommended adding idle_in_transaction_session_timeout to the PostgreSQL configuration as a safety net.

It also suggested a follow-up Parseable query to confirm the payment service correlation:

SELECT service_name, COUNT(*) as error_count
FROM "payment"
WHERE level = 'error'
  AND p_timestamp > NOW() - INTERVAL '10 minutes'
GROUP BY service_name
ORDER BY error_count DESC

Turn 3: "After killing the sessions, the pool recovered. But I see this started happening after yesterday's deploy. How do I find the problematic code change?"

The model suggested a systematic approach: git log --since="yesterday" --oneline services/checkout/, looking for changes to any code that opens database transactions, and specifically checking for error paths that might skip transaction cleanup. It maintained full context from the previous turns and didn't repeat information.

Key takeaway: The diagnostic commands are real, specific, and correct, mixing Parseable SQL queries with standard infrastructure commands. The multi-turn context maintenance is strong. By Turn 3 it had accumulated the full incident picture and gave targeted advice. The runbook format is immediately usable.

In production, Parseable's Keystone (available in Parseable Cloud and Enterprise editions) is the primary interface for these copilot interactions: it runs the same LLM-powered analysis directly in the Prism UI, with full access to all dataset and schema context. The experiments here validate the underlying reasoning capabilities that Keystone delivers to on-call engineers.


Honest assessment: Parseable + Claude Opus 4.6 synergy

Strengths

Log pattern recognition is where Opus 4.6 shines brightest. The combination of a large context window and strong reasoning lets it ingest hundreds of log lines exported from Parseable and find patterns that would take a human minutes of scrolling and grepping. The structured JSON output is reliable and directly parseable.

SQL generation for Parseable works particularly well. Parseable uses PostgreSQL-compatible SQL, and LLMs are trained extensively on PostgreSQL. This means the model generates correct syntax on the first attempt far more often than it would for vendor-specific query languages. When given the Parseable dataset schema, first-attempt accuracy is high.

Trace analysis benefits from the model's ability to reason about tree structures and concurrency. Reconstructing trace hierarchies from flat span data exported from Parseable's traces dataset and identifying critical paths is a strong capability.

Multi-signal reasoning in the incident RCA experiment was the most impressive result overall. Because Parseable stores logs, traces, and metrics in a unified query layer, we could export all three signal types and feed them to Claude in a single prompt. Connecting log errors to trace latency to metric CPU data in a single coherent diagnosis is exactly what makes observability hard for humans, and the model handled it well.

Limitations

PromQL for complex scenarios remains challenging. Standard patterns (rate, ratio, histogram quantiles) work fine, but multi-level aggregations, recording rules, and PromQL-specific edge cases (staleness, lookback delta) can trip up the model. For most monitoring needs, Parseable's built-in alerting (threshold, anomaly detection, forecasting) is simpler than maintaining Prometheus alerting rules.

Proprietary system knowledge is a gap. The model doesn't know your specific service topology, deployment history, or organizational context. It can infer dependencies from data, but it can't know that "the payment team deployed at 1:52am" unless you tell it. The cold-start problem is real.

Hallucination risk exists, especially for metric names and label values. The model might reference column names that don't exist in your Parseable dataset. Always provide your actual schema (via GET /api/v1/logstream/{stream}/schema) in the prompt.
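A cheap guardrail is to fetch the live schema and prepend it to every prompt so the model only sees columns that actually exist. A small sketch using the schema endpoint shown earlier:

# Prepend the dataset's real schema to the prompt so the model cannot invent columns.
import json
import requests

PARSEABLE = "http://localhost:8000"
AUTH = ("parseable", "parseable")

def schema_for(dataset):
    resp = requests.get(f"{PARSEABLE}/api/v1/logstream/{dataset}/schema", auth=AUTH, timeout=30)
    resp.raise_for_status()
    return json.dumps(resp.json(), indent=2)

def with_schema(dataset, question):
    return (
        f"Dataset: {dataset}\n"
        f"Schema (from Parseable):\n{schema_for(dataset)}\n\n"
        f"Using only the columns above, {question}"
    )

# Example:
# print(with_schema("traces", "write a SQL query for the top 10 slowest operations."))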

Cost and latency matter for interactive use. A complex RCA prompt with substantial context can take 30-60 seconds to process and cost $0.50-2.00 per invocation. This is fine for post-incident analysis but too slow for real-time interactive debugging during a fast-moving incident.

Cost per Experiment

| Experiment | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Log analysis (200 lines) | 126,451 | 2,870 | $2.11 |
| SQL generation (3 queries) | 18,679 | 1,656 | $0.40 |
| Alert correlation (18 alerts) | 3,109 | 4,096 | $0.35 |
| Incident RCA (multi-signal) | 8,167 | 3,935 | $0.42 |
| Trace analysis (33 spans) | 24,376 | 3,513 | $0.63 |
| OTel instrumentation | 2,788 | 2,704 | $0.24 |
| Schema + SLO design | 2,873 | 4,096 | $0.35 |
| Total | 186,443 | 22,870 | $4.51 |

Conclusion

After running Opus 4.6 through 10 observability workflows with Parseable as the backend, our conclusion is clear: LLMs work best as observability co-pilots, not autonomous SREs, and Parseable's data-first, SQL-native architecture makes it an ideal foundation for this approach.

The combination works because of a fundamental alignment: Parseable stores all telemetry as queryable data exposed through standard SQL, and Claude reasons deeply over structured data. There's no translation layer, no proprietary query language to learn, no impedance mismatch. You export data from Parseable, send it to Claude, and get back analysis that you can validate by running SQL queries in the same Parseable instance.

The model excels at tasks that require pattern recognition across large datasets (log analysis), structural reasoning (trace trees), and synthesis across multiple data types (incident RCA). These are exactly the tasks that consume the most on-call time and benefit most from augmentation.

It falls short where deep system-specific knowledge is required, where precise numerical reasoning matters (exact PromQL edge cases), and where real-time latency is critical. These limitations are real but predictable, and they define a clear boundary: use the model to accelerate human decision-making, not to replace it.

For teams already using Parseable, Keystone (available in Parseable Cloud and Enterprise editions) delivers these LLM capabilities directly in the Prism UI for day-to-day queries with built-in schema awareness and multi-agent orchestration. For custom workflows (multi-signal incident investigation, complex trace analysis, automated pipeline integration), the direct API gives you full control over prompts and data flow.

We encourage you to try these experiments yourself. Deploy Parseable and the OTel Demo, run the SQL queries, export the data, and send it to Claude. Every experiment in this post is reproducible. The results might change how you think about your on-call workflow.


All experiments were run on February 7, 2026 using Parseable and the OpenTelemetry Demo application. All SQL queries use PostgreSQL-compatible syntax. Prompts, sample data, and evaluation criteria are available in our companion repository. Model behavior may vary; always validate LLM output against your own systems before acting on it in production.
