Introduction
Modern AI inference is moving into production at scale. As teams deploy powerful open models such as GPT‑OSS‑20B on high‑performance GPU infrastructure and serve them with vLLM, observability and monitoring become essential.
In this post, we'll show you how to set up end‑to‑end metrics collection and monitoring for vLLM using OpenTelemetry to collect and export metrics in OTel JSON format, and Parseable to store, query, and visualize the data. By the end, you’ll have a working stack, ready‑made dashboards, and a cost analysis workflow.
What is Inferencing and What's vLLM?
Inferencing refers to the process of using a pre-trained machine learning model to make predictions or decisions on new data. In the context of large language models, this means taking a trained model like GPT-OSS-20B and using it to generate text, answer questions, or perform other language tasks based on user inputs.
vLLM: Fast and Easy LLM Inference
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
Overview
This solution provides a complete observability stack for vLLM services by:
- Proxying vLLM metrics with OTel JSON format compatibility fixes
- Collecting metrics using OTel Collector's efficient scraping capabilities
- Storing metrics in Parseable for analysis and visualization
- Deploying the whole stack as containers with Podman/Docker Compose for easy setup
Whether you're running GPT-OSS-20B, Llama models, or any other LLM through vLLM, this stack ensures you have complete visibility into your inference operations.
Why Monitor vLLM Inference?
Open-source models like GPT-OSS-20B deployed on high-performance hardware (GPUs via RunPod, AWS, or on-premise) deliver exceptional capabilities, but metrics are what tell you what is actually happening under the hood, and that visibility is critical for operating them well.
Cost Analysis: GPT-OSS-20B on A100 PCIe
The following analysis demonstrates the cost-effectiveness of GPT-OSS-20B inference using real production metrics from a 3.15-hour deployment window.
Performance Metrics Table
Metric | Value | Unit |
---|---|---|
Infrastructure | ||
Instance Type | A100 PCIe | - |
Hourly Cost | $1.64 | USD/hour |
Deployment Duration | 3.15 | hours |
Total Infrastructure Cost | $5.166 | USD |
Request Performance | ||
Total Requests Processed | 5,138 | requests |
Requests per Hour | 1,631 | requests/hour |
Average Request Rate | 0.453 | requests/second |
Cost Efficiency | ||
Cost per Request | $0.001005 | USD/request |
Cost per 1,000 Requests | $1.005 | USD/1K requests |
Cost per Million Requests | $1,005 | USD/1M requests |
Token Economics | ||
Cost per Request-Hour | $0.001005 | USD/(req·hr) |
Throughput Efficiency | 995.1 | requests/USD |
Token Usage Analysis
The performance metrics show strong cost-effectiveness for GPT-OSS-20B inference on A100 PCIe hardware. With 5,138 requests processed during the monitoring period, the deployment achieved a cost of $0.001005 per request at a sustained throughput of 1,631 requests per hour. At $1.64 per hour for an A100 PCIe instance, this compares favorably with commercial API pricing, which typically runs $0.002-$0.02 per 1K tokens. Because the infrastructure cost is fixed per hour, the cost per request falls directly as request volume rises, so keeping the GPU well utilized is what drives these economics; a short calculation reproducing the table's figures follows.
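A minimal sketch, using only the numbers from the table above; swap in your own hourly rate, deployment window, and request count to rerun the arithmetic for a different deployment:

```python
# Reproduce the cost-efficiency figures from the table above.
HOURLY_COST_USD = 1.64    # A100 PCIe hourly rate
DURATION_HOURS = 3.15     # deployment window
TOTAL_REQUESTS = 5_138    # requests processed in that window

total_cost = HOURLY_COST_USD * DURATION_HOURS                    # ~= $5.166
requests_per_hour = TOTAL_REQUESTS / DURATION_HOURS              # ~= 1,631 req/hour
requests_per_second = TOTAL_REQUESTS / (DURATION_HOURS * 3600)   # ~= 0.453 req/s
cost_per_request = total_cost / TOTAL_REQUESTS                   # ~= $0.001005
cost_per_million = cost_per_request * 1_000_000                  # ~= $1,005
requests_per_dollar = TOTAL_REQUESTS / total_cost                # ~= 995 req/USD

print(f"Total cost:           ${total_cost:.3f}")
print(f"Cost per request:     ${cost_per_request:.6f}")
print(f"Cost per 1M requests: ${cost_per_million:,.0f}")
print(f"Requests per USD:     {requests_per_dollar:.1f}")
```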
Now let's dive into how we set up the complete observability stack and collected the vLLM metrics behind the cost analysis above.
Architecture
The solution follows a streamlined data pipeline architecture:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    vLLM     │────▶│   Metrics   │────▶│    OTel     │────▶│  Parseable  │
│   Service   │     │    Proxy    │     │  Collector  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       ↓                   ↓                   ↓                   ↓
    Metrics          Sanitization         Collection        Observability
Data Flow
- vLLM Service exposes raw metrics in Prometheus format
- Metrics Proxy sanitizes metric names for compatibility
- OTel Collector scrapes and forwards metrics via OpenTelemetry
- Parseable stores and provides query interface for analysis
Components
1. Metrics Proxy (proxy.py)
The metrics proxy serves as a critical compatibility layer (a minimal sketch of the idea follows the feature list):
Features:
- Flask-based HTTP proxy service
- Sanitizes vLLM metric names by replacing colons with underscores
- Ensures Prometheus-format compatibility
- Runs on port 9090
- Includes health check endpoint for monitoring
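The proxy.py in the repository is the source of truth; the sketch below only illustrates the approach (fetch the raw Prometheus exposition text, rewrite colons in metric names to underscores, re-serve it on port 9090), and the helper names here are illustrative rather than the repo's actual code.

```python
# Illustrative sketch of a metric-name-sanitizing proxy (not the repo's proxy.py verbatim).
import os

import requests
from flask import Flask, Response

app = Flask(__name__)
VLLM_METRICS_URL = os.environ.get("VLLM_METRICS_URL", "http://localhost:8000/metrics")
PROXY_PORT = int(os.environ.get("PROXY_PORT", "9090"))


def sanitize_name(name: str) -> str:
    # vLLM exposes names like "vllm:num_requests_running"; downstream tooling
    # is happier with underscores.
    return name.replace(":", "_")


def sanitize_line(line: str) -> str:
    # HELP/TYPE comment lines carry the metric name as the third token.
    if line.startswith("# HELP ") or line.startswith("# TYPE "):
        parts = line.split(" ", 3)
        if len(parts) >= 3:
            parts[2] = sanitize_name(parts[2])
        return " ".join(parts)
    if not line or line.startswith("#"):
        return line
    # Sample lines: the metric name ends at the first '{' or space.
    for i, ch in enumerate(line):
        if ch in "{ ":
            return sanitize_name(line[:i]) + line[i:]
    return sanitize_name(line)


@app.route("/metrics")
def metrics():
    upstream = requests.get(VLLM_METRICS_URL, timeout=5)
    upstream.raise_for_status()
    body = "\n".join(sanitize_line(l) for l in upstream.text.splitlines()) + "\n"
    return Response(body, mimetype="text/plain")


@app.route("/health")
def health():
    return {"status": "ok"}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=PROXY_PORT)
```

Sanitizing only the metric-name portion of each line avoids accidentally rewriting colons inside label values or HELP text.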
2. OTel Collector
The OTel Collector is an observability agent that scrapes metrics from the metrics proxy and forwards them to Parseable; the full configuration is shown in the Configuration section below.
Capabilities:
- Scrapes metrics every 2 seconds (configurable)
- Forwards metrics via OpenTelemetry protocol
- Adds custom labels for filtering
- Automatic retry and buffering
3. Parseable
Parseable is a unified observability platform that handles high volumes of metrics and logs, backed by cost-effective object storage. It provides a web UI (Prism) for visualizing and analyzing metrics and logs.
Features:
- Time-series data storage optimized for metrics
- Web UI available on port 8080
- SQL-based query interface
- Real-time streaming and historical analysis
- Stores metrics in the vLLMmetrics stream
Prerequisites
Before deploying the monitoring stack, ensure you have:
- Container runtime: Podman with Podman Compose (or Docker with Docker Compose)
- Network access: Open ports 9090 (proxy) and 8080 (Parseable UI)
- vLLM deployment: Running vLLM service with metrics endpoint accessible
- System resources: Minimum 2GB RAM, 10GB storage for metrics retention
Quick Start
1. Clone the Repository
git clone https://github.com/opensourceops/vllm-inference-metrics.git
cd vllm-inference-metrics
2. Configure vLLM Endpoint
Edit compose.yml to point to your vLLM deployment:
services:
  proxy:
    environment:
      - VLLM_METRICS_URL=https://your-vllm-endpoint/metrics
For local vLLM deployments:
environment:
  - VLLM_METRICS_URL=http://localhost:8000/metrics
3. Start the Stack
Using Podman:
podman compose -f compose-otel.yml up -d
Using Docker:
docker compose -f compose-otel.yml up -d
4. Access Services
- Parseable UI: localhost:8080 (credentials: admin/admin)
- Metrics endpoint: localhost:9090/metrics
- Health check: localhost:9090/health
5. Verify Metrics Collection
Check that metrics are flowing:
# View proxy metrics
curl http://localhost:9090/metrics
# Check OTel Collector logs
podman compose logs -f otel-collector
# Query metrics in Parseable
curl -X POST http://localhost:8080/api/v1/query \
-H "Authorization: Basic YWRtaW46YWRtaW4=" \
-d '{"query": "SELECT * FROM vLLMmetrics LIMIT 10"}'
Configuration
Environment Variables
Configure the stack through environment variables:
Variable | Description | Default |
---|---|---|
VLLM_METRICS_URL | vLLM metrics endpoint URL | Required |
P_USERNAME | Parseable username | admin |
P_PASSWORD | Parseable password | admin |
P_ADDR | Parseable listen address | 0.0.0.0:8000 |
P_STAGING_DIR | Parseable staging directory | /staging |
PROXY_PORT | Metrics proxy port | 9090 |
SCRAPE_INTERVAL | Metrics collection interval | 2s |
Docker Compose Configuration
Complete compose.yml example:
services:
  parseable:
    image: parseable/parseable:edge
    command: ["parseable", "local-store"]
    env_file: ./parseable.env
    volumes:
      - parseable-staging:/staging
    ports: ["8080:8000"]
    restart: unless-stopped

  proxy:
    image: python:3.11-alpine
    volumes: ["./proxy.py:/app/proxy.py:ro"]
    environment:
      - VLLM_METRICS_URL=<vllm_metrics_url>
    command: >
      sh -c "pip install --no-cache-dir flask requests && python /app/proxy.py"
    ports: ["9090:9090"]
    restart: unless-stopped
    depends_on: [parseable]
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "python - <<'PY'\nimport requests;print(requests.get('http://localhost:9090/metrics',timeout=3).status_code)\nPY",
        ]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml:ro
    ports:
      - "4317:4317" # OTLP/gRPC in
      - "4318:4318" # OTLP/HTTP in
      - "8888:8888" # Prometheus metrics for the collector itself (optional)
    restart: unless-stopped
    depends_on:
      proxy:
        condition: service_healthy
      parseable:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:13133/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

volumes:
  parseable-staging:
OTel Collector Configuration
For production deployments, consider these OTel Collector optimizations:
extensions:
  # health_check backs the compose healthcheck that probes :13133/healthz
  health_check:
    endpoint: "0.0.0.0:13133"
    path: "/healthz"

receivers:
  # OTLP receiver that accepts JSON format
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 10s
    send_batch_size: 100

exporters:
  otlphttp/parseablemetrics:
    # When the collector runs inside the compose network, use the Parseable
    # service name instead of localhost (e.g. http://parseable:8000)
    endpoint: "http://localhost:8000"
    headers:
      Authorization: "Basic YWRtaW46YWRtaW4="
      X-P-Stream: vLLMmetrics
      X-P-Log-Source: otel-metrics
    tls:
      insecure: true

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/parseablemetrics]
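To verify the collector-to-Parseable leg independently of vLLM, you can push a single hand-written OTLP/JSON metric at the collector's HTTP receiver (port 4318 as configured above). The metric and attribute names below are made up purely for the smoke test.

```python
# Push one synthetic gauge metric to the OTel Collector's OTLP/HTTP endpoint.
import time

import requests

payload = {
    "resourceMetrics": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "pipeline-smoke-test"}}
        ]},
        "scopeMetrics": [{
            "metrics": [{
                "name": "smoke_test_gauge",
                "gauge": {"dataPoints": [{
                    "asDouble": 42.0,
                    "timeUnixNano": str(time.time_ns()),
                }]},
            }],
        }],
    }],
}

resp = requests.post("http://localhost:4318/v1/metrics", json=payload, timeout=5)
print(resp.status_code, resp.text)
# A 2xx here plus a matching row in the vLLMmetrics stream confirms the
# collector -> Parseable pipeline is wired correctly.
```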
Monitoring
Key Metrics to Track
Monitor these critical vLLM metrics for optimal performance (a query sketch follows this list):
- Request metrics: vllm_num_requests_running and vllm_request_latency_seconds for active load and end-to-end latency
- Request queue monitoring: vllm_num_requests_waiting and vllm_num_preemptions_total for queue depth and scheduling pressure
- Token generation performance: vllm_prompt_tokens_total, vllm_generation_tokens_total, vllm_time_to_first_token_seconds, and vllm_time_per_output_token_seconds for throughput and per-token latency
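As a starting point, the sketch below pulls recent queue-depth samples out of Parseable over its HTTP query API (the same endpoint used in the Quick Start). The column names (metric_name, value) are assumptions about how the OTel exporter lands data in the vLLMmetrics stream; run SELECT * FROM vLLMmetrics LIMIT 10 first and adjust to the actual schema.

```python
# Query recent queue-depth samples from Parseable.
import requests

PARSEABLE_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")

# Swap the metric name for any metric from the Complete Metrics Reference below.
QUEUE_DEPTH_SQL = """
SELECT p_timestamp, value
FROM vLLMmetrics
WHERE metric_name = 'vllm_num_requests_waiting'
  AND p_timestamp >= NOW() - INTERVAL '15 minutes'
ORDER BY p_timestamp DESC
LIMIT 100
"""

resp = requests.post(PARSEABLE_URL, auth=AUTH, json={"query": QUEUE_DEPTH_SQL})
resp.raise_for_status()
for row in resp.json():  # assumes the API returns a JSON array of row objects
    print(row)
```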
Creating Dashboards
Use Parseable's visualization capabilities to create comprehensive dashboards:
- Real-time Performance Dashboard
  - Request latency histogram
  - Token generation rate time series
  - Active request count gauge
  - Error rate percentage
- Resource Utilization Dashboard
  - GPU memory usage over time
  - GPU utilization percentage
  - CPU and system memory metrics
  - Model loading times
- Business Metrics Dashboard
  - Total requests served
  - Token usage by model
  - Cost per request calculations
  - User request distribution
Metrics Format
The proxy transforms vLLM metric names to keep them Prometheus-compatible: vLLM exposes colon-prefixed names (for example, vllm:num_requests_running), which the proxy rewrites to underscore-separated names (vllm_num_requests_running) before they are collected.
Complete Metrics Reference
Key vLLM metrics available for monitoring (a quick scripted sanity check follows the table):
Metric Name | Type | Description |
---|---|---|
vllm_num_requests_running | Gauge | Active inference requests |
vllm_num_requests_waiting | Gauge | Queued requests |
vllm_gpu_cache_usage_perc | Gauge | GPU KV-cache utilization |
vllm_num_preemptions_total | Counter | Request preemptions |
vllm_prompt_tokens_total | Counter | Total prompt tokens processed |
vllm_generation_tokens_total | Counter | Total tokens generated |
vllm_request_latency_seconds | Histogram | End-to-end request latency |
vllm_model_forward_time_seconds | Histogram | Model forward pass duration |
vllm_time_to_first_token_seconds | Histogram | TTFT latency |
vllm_time_per_output_token_seconds | Histogram | Inter-token latency |
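For a quick sanity check outside any dashboard, the counters above can be read straight off the proxy's /metrics endpoint. The sketch below samples vllm_generation_tokens_total twice and derives a rough generation rate; it uses simple line parsing rather than a Prometheus client library and assumes sample lines carry no trailing timestamps.

```python
# Rough tokens/sec estimate from the sanitized /metrics endpoint.
import time

import requests

METRICS_URL = "http://localhost:9090/metrics"  # the sanitizing proxy from earlier


def read_counter(name: str) -> float:
    """Sum a counter across all of its label sets."""
    total = 0.0
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        # The metric name ends at the first '{' or space; comment lines never match.
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric == name:
            total += float(line.rsplit(" ", 1)[-1])
    return total


before = read_counter("vllm_generation_tokens_total")
time.sleep(10)
after = read_counter("vllm_generation_tokens_total")
print(f"~{(after - before) / 10:.1f} generated tokens/sec over the last 10 seconds")
```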
Real-World Use Cases
Use Case 1: Multi-Model Serving Optimization
Scenario: Running multiple models (GPT-OSS-20B, Llama-70B, CodeLlama) on shared GPU infrastructure.
Monitoring Strategy: segment every panel by model so request rate, latency percentiles, and GPU cache usage can be compared per model on the shared hardware (vLLM attaches a model label, typically model_name, to its metrics).
Optimization Actions:
- Adjust model-specific batch sizes based on latency targets
- Implement dynamic model loading based on request patterns
- Scale GPU resources per model based on utilization metrics
Use Case 2: Cost-Optimized Inference
Scenario: Minimizing GPU costs while maintaining SLA targets.
Monitoring Strategy: watch queue depth (vllm_num_requests_waiting) and KV-cache utilization (vllm_gpu_cache_usage_perc) alongside requests per hour, and correlate them with instance cost to spot under-utilized periods; see the control-loop sketch after this use case.
Optimization Actions:
- Implement request batching during low-utilization periods
- Use spot instances for batch processing workloads
- Autoscale based on queue depth and utilization thresholds
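One way to act on those signals is a small control loop that reads queue depth and cache utilization from Parseable and emits a scale decision. The thresholds, the assumed column names (metric_name, value), and the scale_up/scale_down outcomes below are placeholders you would replace with your own schema, limits, and autoscaler or cloud API calls.

```python
# Toy scaling decision based on queue depth and KV-cache utilization.
import requests

PARSEABLE_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")


def latest_value(metric_name: str) -> float:
    # Column names (metric_name, value) are assumptions; adjust to your stream schema.
    sql = (
        "SELECT value FROM vLLMmetrics "
        f"WHERE metric_name = '{metric_name}' "
        "ORDER BY p_timestamp DESC LIMIT 1"
    )
    rows = requests.post(PARSEABLE_URL, auth=AUTH, json={"query": sql}).json()
    return float(rows[0]["value"]) if rows else 0.0


def decide() -> str:
    waiting = latest_value("vllm_num_requests_waiting")
    cache_used = latest_value("vllm_gpu_cache_usage_perc")  # assumed 0-1 fraction
    if waiting > 20 or cache_used > 0.9:
        return "scale_up"    # queue building or KV cache nearly full
    if waiting == 0 and cache_used < 0.3:
        return "scale_down"  # spare capacity; candidate for downscale or spot
    return "hold"


if __name__ == "__main__":
    # Wire this decision into your autoscaler or cloud API of choice.
    print(decide())
```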
Use Case 3: Real-Time Chat Application
Scenario: Supporting a customer service chatbot with strict latency requirements.
Monitoring Strategy: track time-to-first-token (vllm_time_to_first_token_seconds) and inter-token latency (vllm_time_per_output_token_seconds) percentiles against the chat SLA, plus queue depth during traffic spikes.
Optimization Actions:
- Prioritize interactive requests over batch jobs
- Implement streaming token generation
- Cache common prompt prefixes
FAQ
What is OpenTelemetry and why use it for vLLM monitoring?
OpenTelemetry (OTel) is an open-source observability framework that provides vendor-neutral APIs, SDKs, and tools for collecting telemetry data (metrics, logs, and traces). For vLLM monitoring, OTel offers:
- Vendor neutrality: Switch between observability backends without changing instrumentation
- Standardized format: Consistent metric naming and data structures
- Rich ecosystem: Wide adoption and extensive tooling support
- Future-proof: Industry-standard approach backed by CNCF
How does this differ from Prometheus-based monitoring?
While vLLM natively exposes Prometheus metrics, using OpenTelemetry offers several advantages:
- Protocol flexibility: OTel supports multiple protocols (gRPC, HTTP) and formats (JSON, Protobuf)
- Unified observability: Collect metrics, logs, and traces through a single pipeline
- Advanced processing: Built-in processors for filtering, aggregation, and transformation
- Push vs Pull: OTel supports both push and pull models, offering more deployment flexibility
What are the hardware requirements for running this stack?
Minimum requirements:
- CPU: 2 cores
- RAM: 4GB (2GB for Parseable, 1GB for OTel Collector, 1GB for proxy)
- Storage: 10GB for metrics retention (adjust based on scrape interval and retention policy)
- Network: Stable connectivity to vLLM endpoint
Recommended for production:
- CPU: 4+ cores
- RAM: 8GB+
- Storage: 50GB+ with SSD for better query performance
How long are metrics retained in Parseable?
Parseable stores metrics in object storage (S3 or local filesystem) with configurable retention policies. By default:
- Hot data: Recent metrics in memory/local cache for fast queries
- Warm data: Older metrics in staging directory
- Cold data: Archived metrics in object storage
You can query historical data directly from object storage, making long-term retention cost-effective.
Can I monitor multiple vLLM instances with one monitoring stack?
Yes! Configure multiple scrape jobs in the OTel Collector configuration:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'vllm-instance-1'
          static_configs:
            - targets: ['proxy-1:9090']
              labels:
                instance: 'vllm-1'
                model: 'gpt-oss-20b'
        - job_name: 'vllm-instance-2'
          static_configs:
            - targets: ['proxy-2:9090']
              labels:
                instance: 'vllm-2'
                model: 'llama-70b'
What's the overhead of metrics collection on vLLM performance?
The metrics collection overhead is minimal:
- vLLM: <0.1% CPU overhead for exposing metrics
- Proxy: <50MB RAM, negligible CPU for sanitization
- OTel Collector: <100MB RAM, <5% CPU for scraping and forwarding
The proxy runs as a separate service and doesn't impact vLLM inference performance.
How do I secure the metrics pipeline?
Implement these security best practices:
- Authentication: Use basic auth or API keys for Parseable
- TLS/SSL: Enable HTTPS for all service-to-service communication
- Network isolation: Deploy services in a private network
- RBAC: Configure role-based access control in Parseable
- Secrets management: Use environment variables or secret managers for credentials
Can I use this with vLLM deployed on cloud platforms (AWS, GCP, Azure)?
Absolutely! The stack works with vLLM deployed anywhere:
- Cloud VMs: Point VLLM_METRICS_URL to your instance's public/private IP
- Kubernetes: Deploy the monitoring stack in the same cluster or externally
- Managed services: Works with RunPod, Lambda Labs, or any vLLM hosting provider
- Multi-cloud: Monitor vLLM instances across different cloud providers
How do I troubleshoot if metrics aren't appearing in Parseable?
Follow these diagnostic steps:
1. Check the vLLM metrics endpoint:
   curl http://your-vllm-host:8000/metrics
2. Verify the proxy is running:
   curl http://localhost:9090/health
   curl http://localhost:9090/metrics
3. Check the OTel Collector logs:
   podman compose -f compose-otel.yml logs otel-collector
4. Verify Parseable connectivity:
   curl -X GET http://localhost:8080/api/v1/logstream/vLLMmetrics \
     -H "Authorization: Basic YWRtaW46YWRtaW4="
5. Check for data in Parseable:
   SELECT COUNT(*) FROM vLLMmetrics WHERE p_timestamp >= NOW() - INTERVAL '5 minutes';
Can I integrate this with existing monitoring tools (Grafana, Datadog, etc.)?
Yes! You have several options:
- Query Parseable from Grafana: Use Parseable's PostgreSQL-compatible interface
- Dual export: Configure OTel Collector to send metrics to multiple destinations
- Parseable as primary: Query and aggregate in Parseable, then forward to other tools
What's the cost of running this monitoring stack?
The stack uses open-source components, so the only costs are infrastructure:
- Compute: ~$10-30/month for a small VM (2-4 cores, 4-8GB RAM)
- Storage: ~$0.02-0.05/GB/month for object storage (S3, MinIO)
- Network: Minimal egress costs for metrics data
For comparison, managed observability solutions charge $0.10-1.00+ per GB ingested, making this stack significantly more cost-effective for high-volume metrics.
Conclusion
Monitoring vLLM inference with Parseable provides the observability foundation necessary for operating production AI workloads. This solution offers:
- Complete visibility into model serving performance
- Actionable insights for optimization and troubleshooting
- Scalable architecture that grows with your deployment
- Cost-effective monitoring using open-source components
As AI inference becomes central to modern applications, having robust monitoring is no longer optional—it's essential for delivering reliable, performant, and cost-effective AI services.