
Inferencing LLM models with vLLM and Parseable: Complete Observability for AI Workloads

Ompragash Viswanathan (Guest) and Debabrata Panigrahi
September 30, 2025

Introduction

Modern AI inference is moving into production at scale. As teams deploy powerful open models such as GPT‑OSS‑20B on high‑performance GPU infrastructure and serve them with vLLM, observability and monitoring become essential.

In this post, we'll show you how to set up end‑to‑end metrics collection and monitoring for vLLM using OpenTelemetry to collect and export metrics in OTel JSON format, and Parseable to store, query, and visualize the data. By the end, you’ll have a working stack, ready‑made dashboards, and a cost analysis workflow.

What is Inferencing and What's vLLM?

Inferencing refers to the process of using a pre-trained machine learning model to make predictions or decisions on new data. In the context of large language models, this means taking a trained model like GPT-OSS-20B and using it to generate text, answer questions, or perform other language tasks based on user inputs.

vLLM: Fast and Easy LLM Inference

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.


Overview

This solution provides a complete observability stack for vLLM services by:

  • Proxying vLLM metrics with OTel JSON format compatibility fixes
  • Collecting metrics using OTel Collector's efficient scraping capabilities
  • Storing metrics in Parseable for analysis and visualization
  • Deploying the whole stack as containers with Podman/Docker Compose for easy setup

Whether you're running GPT-OSS-20B, Llama models, or any other LLM through vLLM, this stack ensures you have complete visibility into your inference operations.

Why Monitor vLLM Inference?

Open-source models like GPT-OSS-20B deployed on high-performance hardware (GPUs via RunPod, AWS, or on-premises) deliver exceptional capabilities, but metrics are what reveal what actually happens under the hood: request latency, queue depth, GPU cache utilization, and ultimately cost per request.

Cost Analysis: GPT-OSS-20B on A100 PCIe

The following analysis demonstrates the cost-effectiveness of GPT-OSS-20B inference using real production metrics from a 3.15-hour deployment window.

Performance Metrics Table

Metric | Value | Unit

Infrastructure
Instance Type | A100 PCIe | -
Hourly Cost | $1.64 | USD/hour
Deployment Duration | 3.15 | hours
Total Infrastructure Cost | $5.166 | USD

Request Performance
Total Requests Processed | 5,138 | requests
Requests per Hour | 1,631 | requests/hour
Average Request Rate | 0.453 | requests/second

Cost Efficiency
Cost per Request | $0.001005 | USD/request
Cost per 1,000 Requests | $1.005 | USD/1K requests
Cost per Million Requests | $1,005 | USD/1M requests

Token Economics
Cost per Request-Hour | $0.001005 | USD/(req·hr)
Throughput Efficiency | 995.1 | requests/USD

Token Usage Analysis

The performance metrics demonstrate strong cost-effectiveness for GPT-OSS-20B inference on A100 PCIe hardware. With 5,138 requests processed during the monitoring period, the deployment achieved a cost of $0.001005 per request at a sustained throughput of 1,631 requests per hour. At $1.64 per hour for the A100 PCIe instance, this compares favorably with commercial API pricing models that typically charge $0.002-$0.02 per 1K tokens. Because the infrastructure cost is fixed per hour, cost per request falls as request volume grows, so keeping the GPU consistently utilized is the main lever for efficiency.
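
To make the arithmetic behind the table explicit, here is a small Python sketch that recomputes the derived figures from the three raw inputs (hourly price, deployment duration, and total requests). The numbers are the ones reported above, not live metrics.

# Reproduce the cost-analysis figures from the table above.
HOURLY_COST_USD = 1.64      # A100 PCIe hourly price
DURATION_HOURS = 3.15       # deployment window
TOTAL_REQUESTS = 5_138      # requests processed in that window

total_cost = HOURLY_COST_USD * DURATION_HOURS        # ≈ $5.166
requests_per_hour = TOTAL_REQUESTS / DURATION_HOURS  # ≈ 1,631
cost_per_request = total_cost / TOTAL_REQUESTS       # ≈ $0.001005
cost_per_million = cost_per_request * 1_000_000      # ≈ $1,005
requests_per_usd = TOTAL_REQUESTS / total_cost       # ≈ 995

print(f"Total cost:       ${total_cost:.3f}")
print(f"Requests/hour:    {requests_per_hour:.0f}")
print(f"Cost per request: ${cost_per_request:.6f}")
print(f"Cost per 1M reqs: ${cost_per_million:.0f}")
print(f"Requests per USD: {requests_per_usd:.1f}")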

Now let's dive into how we set up the complete observability stack and vLLM metrics collection used to gather the cost analysis data above.

Architecture

The solution follows a streamlined data pipeline architecture:

┌─────────────┐     ┌──────────────┐     ┌────────────┐     ┌────────────┐
│    vLLM     │────▶│   Metrics    │────▶│    OTel    │────▶│ Parseable  │
│   Service   │     │    Proxy     │     │  Collector │     │            │
└─────────────┘     └──────────────┘     └────────────┘     └────────────┘
      ↓                    ↓                    ↓                   ↓
   Metrics           Sanitization          Collection          Observability

Data Flow

  1. vLLM Service exposes raw metrics in Prometheus format
  2. Metrics Proxy sanitizes metric names for compatibility
  3. OTel Collector scrapes and forwards metrics via OpenTelemetry
  4. Parseable stores and provides query interface for analysis

Components

1. Metrics Proxy (proxy.py)

The metrics proxy serves as a critical compatibility layer:

Features:

  • Flask-based HTTP proxy service
  • Sanitizes vLLM metric names by replacing colons with underscores
  • Ensures Prometheus-format compatibility
  • Runs on port 9090
  • Includes health check endpoint for monitoring
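
The full proxy.py lives in the repository used in the Quick Start below. As a rough, hedged sketch of the idea (not the repository's exact implementation), a minimal Flask proxy that fetches the vLLM /metrics output, replaces colons with underscores, and exposes a health endpoint could look like this:

# Illustrative sketch only; the real proxy.py in the repository may differ.
import os

import requests
from flask import Flask, Response

app = Flask(__name__)

# Upstream vLLM metrics endpoint, e.g. http://localhost:8000/metrics (required).
VLLM_METRICS_URL = os.environ["VLLM_METRICS_URL"]


@app.route("/metrics")
def metrics():
    upstream = requests.get(VLLM_METRICS_URL, timeout=5)
    # vLLM names metrics like "vllm:num_requests_running"; a naive global
    # replace is enough for a sketch, though a real proxy would only touch
    # metric names, not label values.
    sanitized = upstream.text.replace(":", "_")
    return Response(sanitized, mimetype="text/plain")


@app.route("/health")
def health():
    # Used by the container healthcheck in the compose file.
    return {"status": "ok"}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PROXY_PORT", "9090")))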

2. OTel Collector

The OTel Collector is an observability agent that scrapes metrics from the metrics proxy and forwards them to Parseable. Its configuration is shown in the OTel Collector Configuration section later in this post.

Capabilities:

  • Scrapes metrics every 2 seconds (configurable)
  • Forwards metrics via OpenTelemetry protocol
  • Adds custom labels for filtering
  • Automatic retry and buffering

3. Parseable

Parseable is a unified observability platform that handles high volumes of metrics and logs on top of cost-effective object storage. It provides a web UI (Prism) for visualizing and analyzing metrics and logs.

Features:

  • Time-series data storage optimized for metrics
  • Web UI available on port 8080
  • SQL-based query interface
  • Real-time streaming and historical analysis
  • Stores metrics in the vLLMmetrics stream

Parseable Dashboard

Prerequisites

Before deploying the monitoring stack, ensure you have:

  • Container runtime: Podman with Podman Compose (or Docker with Docker Compose)
  • Network access: Open ports 9090 (proxy) and 8080 (Parseable UI)
  • vLLM deployment: Running vLLM service with metrics endpoint accessible
  • System resources: Minimum 2GB RAM, 10GB storage for metrics retention

Quick Start

1. Clone the Repository

git clone https://github.com/opensourceops/vllm-inference-metrics.git
cd vllm-inference-metrics

2. Configure vLLM Endpoint

Edit the compose file to point to your vLLM deployment:

services:
  proxy:
    environment:
      - VLLM_METRICS_URL=https://your-vllm-endpoint/metrics

For local vLLM deployments:

environment:
  - VLLM_METRICS_URL=http://localhost:8000/metrics

3. Start the Stack

Using Podman:

podman compose -f compose-otel.yml up -d

Using Docker:

docker compose -f compose-otel.yml up -d

4. Access Services

  • Parseable UI: localhost:8080 (credentials: admin/admin)
  • Metrics endpoint: localhost:9090/metrics
  • Health check: localhost:9090/health

5. Verify Metrics Collection

Check that metrics are flowing:

# View proxy metrics
curl http://localhost:9090/metrics

# Check OTel Collector logs
podman compose logs -f otel-collector

# Query metrics in Parseable
curl -X POST http://localhost:8080/api/v1/query \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d '{"query": "SELECT * FROM vLLMmetrics LIMIT 10"}'

Configuration

Environment Variables

Configure the stack through environment variables:

Variable | Description | Default

VLLM_METRICS_URL | vLLM metrics endpoint URL | Required
P_USERNAME | Parseable username | admin
P_PASSWORD | Parseable password | admin
P_ADDR | Parseable listen address | 0.0.0.0:8000
P_STAGING_DIR | Parseable staging directory | /staging
PROXY_PORT | Metrics proxy port | 9090
SCRAPE_INTERVAL | Metrics collection interval | 2s
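
The compose example in the next section reads Parseable's settings from a parseable.env file. A minimal example matching the defaults in this table (use real credentials for anything beyond local testing) could look like:

P_USERNAME=admin
P_PASSWORD=admin
P_ADDR=0.0.0.0:8000
P_STAGING_DIR=/staging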

Docker Compose Configuration

Complete compose.yml example:

services:
  parseable:
    image: parseable/parseable:edge
    command: ["parseable", "local-store"]
    env_file: ./parseable.env
    volumes:
      - parseable-staging:/staging
    ports: ["8080:8000"]
    restart: unless-stopped

  proxy:
    image: python:3.11-alpine
    volumes: ["./proxy.py:/app/proxy.py:ro"]
    environment:
      - VLLM_METRICS_URL=<vllm_metrics_url>
    command: >
      sh -c "pip install --no-cache-dir flask requests && python /app/proxy.py"
    ports: ["9090:9090"]
    restart: unless-stopped
    depends_on: [parseable]
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "python - <<'PY'\nimport requests;print(requests.get('http://localhost:9090/metrics',timeout=3).status_code)\nPY",
        ]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml:ro
    ports:
      - "4317:4317"   # OTLP/gRPC in
      - "4318:4318"   # OTLP/HTTP in
      - "8888:8888"   # Prometheus metrics for the collector itself (optional)
    restart: unless-stopped
    depends_on:
      proxy:
        condition: service_healthy
      parseable:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:13133/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

volumes:
  parseable-staging:

OTel Collector Configuration

Here's the OTel Collector configuration for this stack; the batch processor settings can be tuned for production deployments:

receivers:
  # OTLP receiver that accepts JSON format
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 10s
    send_batch_size: 100

exporters:
  otlphttp/parseablemetrics:
    endpoint: "http://parseable:8000"  # Parseable's service name on the compose network
    headers:
      Authorization: "Basic YWRtaW46YWRtaW4="
      X-P-Stream: vLLMmetrics
      X-P-Log-Source: otel-metrics
    tls:
      insecure: true

service:
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/parseablemetrics]

Monitoring

Key Metrics to Track

Monitor these critical vLLM metrics for optimal performance (the full reference table appears below, and a query sketch follows this list):

  • Request metrics: vllm_num_requests_running for active load and vllm_request_latency_seconds for end-to-end latency
  • Request queue monitoring: vllm_num_requests_waiting for queue depth and vllm_num_preemptions_total for scheduling pressure
  • Token generation performance: vllm_prompt_tokens_total, vllm_generation_tokens_total, vllm_time_to_first_token_seconds, and vllm_time_per_output_token_seconds
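
To start pulling this data, here is a hedged Python sketch that mirrors the curl query from the Quick Start: it fetches recent rows from the vLLMmetrics stream through Parseable's query API using the default credentials. The exact column layout of OTLP metrics in your stream may differ, so inspect a few rows before building dashboard queries, and note that depending on your Parseable version the request body may also need explicit startTime/endTime fields.

# Sketch: query recent vLLM metrics from Parseable (quick-start defaults: admin/admin, port 8080).
import requests

PARSEABLE_QUERY_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")  # same credentials as the Quick Start

# Schema-agnostic query; narrow it down once you know your stream's columns.
sql = "SELECT * FROM vLLMmetrics WHERE p_timestamp >= NOW() - INTERVAL '5 minutes' LIMIT 100"

resp = requests.post(PARSEABLE_QUERY_URL, json={"query": sql}, auth=AUTH, timeout=10)
resp.raise_for_status()

# Inspect the shape of the returned rows before wiring up dashboard panels.
print(resp.json())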

Creating Dashboards

Use Parseable's visualization capabilities to create comprehensive dashboards:

  1. Real-time Performance Dashboard

    • Request latency histogram
    • Token generation rate time series
    • Active request count gauge
    • Error rate percentage
  2. Resource Utilization Dashboard

    • GPU memory usage over time
    • GPU utilization percentage
    • CPU and system memory metrics
    • Model loading times

  3. Business Metrics Dashboard

    • Total requests served
    • Token usage by model
    • Cost per request calculations
    • User request distribution

Metrics Format

The proxy transforms vLLM metric names to ensure compatibility, replacing the colons in vLLM's native naming with underscores. For example:

Original vLLM Format

vllm:num_requests_running
vllm:time_to_first_token_seconds

Transformed Prometheus-Compatible Format

vllm_num_requests_running
vllm_time_to_first_token_seconds

Complete Metrics Reference

Key vLLM metrics available for monitoring:

Metric Name | Type | Description

vllm_num_requests_running | Gauge | Active inference requests
vllm_num_requests_waiting | Gauge | Queued requests
vllm_gpu_cache_usage_perc | Gauge | GPU KV-cache utilization
vllm_num_preemptions_total | Counter | Request preemptions
vllm_prompt_tokens_total | Counter | Total prompt tokens processed
vllm_generation_tokens_total | Counter | Total tokens generated
vllm_request_latency_seconds | Histogram | End-to-end request latency
vllm_model_forward_time_seconds | Histogram | Model forward pass duration
vllm_time_to_first_token_seconds | Histogram | Time-to-first-token (TTFT) latency
vllm_time_per_output_token_seconds | Histogram | Inter-token latency

Real-World Use Cases

Use Case 1: Multi-Model Serving Optimization

Scenario: Running multiple models (GPT-OSS-20B, Llama-70B, CodeLlama) on shared GPU infrastructure.

Monitoring Strategy: label each model's metrics with a per-instance model label (as in the multi-instance OTel Collector example in the FAQ below), then compare vllm_request_latency_seconds, vllm_gpu_cache_usage_perc, and token throughput per model to see which workloads compete for the shared GPUs.

Optimization Actions:

  • Adjust model-specific batch sizes based on latency targets
  • Implement dynamic model loading based on request patterns
  • Scale GPU resources per model based on utilization metrics

Use Case 2: Cost-Optimized Inference

Scenario: Minimizing GPU costs while maintaining SLA targets.

Monitoring Strategy: track vllm_gpu_cache_usage_perc and vllm_num_requests_waiting to identify under-utilized and saturated periods, and combine total request counts with your hourly instance cost (as in the cost analysis above) to track cost per request over time.

Optimization Actions:

  • Implement request batching during low-utilization periods
  • Use spot instances for batch processing workloads
  • Autoscale based on queue depth and utilization thresholds

Use Case 3: Real-Time Chat Application

Scenario: Supporting a customer service chatbot with strict latency requirements.

Monitoring Strategy: watch vllm_time_to_first_token_seconds and vllm_time_per_output_token_seconds against your latency targets, and alert when vllm_num_requests_waiting starts to grow; a hedged query sketch follows below.
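
As an illustration only (the column names metric_name and value below are hypothetical; check how OTLP metrics actually land in your vLLMmetrics stream), a small check that flags time-to-first-token samples above a latency threshold might look like:

# Hypothetical sketch; adjust column names to your actual stream schema.
import requests

PARSEABLE_QUERY_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")
TTFT_SLA_SECONDS = 1.0  # example SLA target, not a figure from this post

sql = "SELECT * FROM vLLMmetrics WHERE p_timestamp >= NOW() - INTERVAL '5 minutes' LIMIT 1000"
# Assumes the API returns a JSON array of row objects, as in the curl example above.
rows = requests.post(PARSEABLE_QUERY_URL, json={"query": sql}, auth=AUTH, timeout=10).json()

# Assumed columns: "metric_name" identifies the metric, "value" holds the sample.
ttft = [r for r in rows if "time_to_first_token" in str(r.get("metric_name", ""))]
breaches = [r for r in ttft if float(r.get("value", 0) or 0) > TTFT_SLA_SECONDS]
print(f"{len(breaches)} of {len(ttft)} TTFT samples exceeded {TTFT_SLA_SECONDS}s")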

Optimization Actions:

  • Prioritize interactive requests over batch jobs
  • Implement streaming token generation
  • Cache common prompt prefixes

FAQ

What is OpenTelemetry and why use it for vLLM monitoring?

OpenTelemetry (OTel) is an open-source observability framework that provides vendor-neutral APIs, SDKs, and tools for collecting telemetry data (metrics, logs, and traces). For vLLM monitoring, OTel offers:

  • Vendor neutrality: Switch between observability backends without changing instrumentation
  • Standardized format: Consistent metric naming and data structures
  • Rich ecosystem: Wide adoption and extensive tooling support
  • Future-proof: Industry-standard approach backed by CNCF

How does this differ from Prometheus-based monitoring?

While vLLM natively exposes Prometheus metrics, using OpenTelemetry offers several advantages:

  • Protocol flexibility: OTel supports multiple protocols (gRPC, HTTP) and formats (JSON, Protobuf)
  • Unified observability: Collect metrics, logs, and traces through a single pipeline
  • Advanced processing: Built-in processors for filtering, aggregation, and transformation
  • Push vs Pull: OTel supports both push and pull models, offering more deployment flexibility

What are the hardware requirements for running this stack?

Minimum requirements:

  • CPU: 2 cores
  • RAM: 4GB (2GB for Parseable, 1GB for OTel Collector, 1GB for proxy)
  • Storage: 10GB for metrics retention (adjust based on scrape interval and retention policy)
  • Network: Stable connectivity to vLLM endpoint

Recommended for production:

  • CPU: 4+ cores
  • RAM: 8GB+
  • Storage: 50GB+ with SSD for better query performance

How long are metrics retained in Parseable?

Parseable stores metrics in object storage (S3 or local filesystem) with configurable retention policies. By default:

  • Hot data: Recent metrics in memory/local cache for fast queries
  • Warm data: Older metrics in staging directory
  • Cold data: Archived metrics in object storage

You can query historical data directly from object storage, making long-term retention cost-effective.

Can I monitor multiple vLLM instances with one monitoring stack?

Yes! Configure multiple scrape jobs in the OTel Collector configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'vllm-instance-1'
          static_configs:
            - targets: ['proxy-1:9090']
              labels:
                instance: 'vllm-1'
                model: 'gpt-oss-20b'
        - job_name: 'vllm-instance-2'
          static_configs:
            - targets: ['proxy-2:9090']
              labels:
                instance: 'vllm-2'
                model: 'llama-70b'

What's the overhead of metrics collection on vLLM performance?

The metrics collection overhead is minimal:

  • vLLM: <0.1% CPU overhead for exposing metrics
  • Proxy: <50MB RAM, negligible CPU for sanitization
  • OTel Collector: <100MB RAM, <5% CPU for scraping and forwarding

The proxy runs as a separate service and doesn't impact vLLM inference performance.

How do I secure the metrics pipeline?

Implement these security best practices:

  1. Authentication: Use basic auth or API keys for Parseable
  2. TLS/SSL: Enable HTTPS for all service-to-service communication
  3. Network isolation: Deploy services in a private network
  4. RBAC: Configure role-based access control in Parseable
  5. Secrets management: Use environment variables or secret managers for credentials

Can I use this with vLLM deployed on cloud platforms (AWS, GCP, Azure)?

Absolutely! The stack works with vLLM deployed anywhere:

  • Cloud VMs: Point VLLM_METRICS_URL to your instance's public/private IP
  • Kubernetes: Deploy the monitoring stack in the same cluster or externally
  • Managed services: Works with RunPod, Lambda Labs, or any vLLM hosting provider
  • Multi-cloud: Monitor vLLM instances across different cloud providers

How do I troubleshoot if metrics aren't appearing in Parseable?

Follow these diagnostic steps:

  1. Check vLLM metrics endpoint:

    curl http://your-vllm-host:8000/metrics
    
  2. Verify proxy is running:

    curl http://localhost:9090/health
    curl http://localhost:9090/metrics
    
  3. Check OTel Collector logs:

    podman compose -f compose-otel.yml logs otel-collector
    
  4. Verify Parseable connectivity:

    curl -X GET http://localhost:8080/api/v1/logstream/vLLMmetrics \
      -H "Authorization: Basic YWRtaW46YWRtaW4="
    
  5. Check for data in Parseable:

    SELECT COUNT(*) FROM vLLMmetrics WHERE p_timestamp >= NOW() - INTERVAL '5 minutes';
    

Can I integrate this with existing monitoring tools (Grafana, Datadog, etc.)?

Yes! You have several options:

  1. Query Parseable from Grafana: Use Parseable's PostgreSQL-compatible interface
  2. Dual export: Configure OTel Collector to send metrics to multiple destinations
  3. Parseable as primary: Query and aggregate in Parseable, then forward to other tools

What's the cost of running this monitoring stack?

The stack uses open-source components, so the only costs are infrastructure:

  • Compute: ~$10-30/month for a small VM (2-4 cores, 4-8GB RAM)
  • Storage: ~$0.02-0.05/GB/month for object storage (S3, MinIO)
  • Network: Minimal egress costs for metrics data

For comparison, managed observability solutions charge $0.10-1.00+ per GB ingested, making this stack significantly more cost-effective for high-volume metrics.

Conclusion

Monitoring vLLM inference with Parseable provides the observability foundation necessary for operating production AI workloads. This solution offers:

  • Complete visibility into model serving performance
  • Actionable insights for optimization and troubleshooting
  • Scalable architecture that grows with your deployment
  • Cost-effective monitoring using open-source components

As AI inference becomes central to modern applications, having robust monitoring is no longer optional—it's essential for delivering reliable, performant, and cost-effective AI services.
