Introduction
Modern AI inference is moving into production at scale. As teams deploy powerful open models such as GPT‑OSS‑20B on high‑performance GPU infrastructure and serve them with vLLM, observability and monitoring become essential.
In this post, we'll show you how to set up end‑to‑end metrics collection and monitoring for vLLM using OpenTelemetry to collect and export metrics in OTel JSON format, and Parseable to store, query, and visualize the data. By the end, you’ll have a working stack, ready‑made dashboards, and a cost analysis workflow.
What is Inferencing and What's vLLM?
Inferencing refers to the process of using a pre-trained machine learning model to make predictions or decisions on new data. In the context of large language models, this means taking a trained model like GPT-OSS-20B and using it to generate text, answer questions, or perform other language tasks based on user inputs.
vLLM: Fast and Easy LLM Inference
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
Overview
This solution provides a complete observability stack for vLLM services by:
- Proxying vLLM metrics with OTel JSON format compatibility fixes
- Collecting metrics using OTel Collector's efficient scraping capabilities
- Storing metrics in Parseable for analysis and visualization
- Deploying the whole stack as containers with Podman/Docker Compose for easy setup
Whether you're running GPT-OSS-20B, Llama models, or any other LLM through vLLM, this stack ensures you have complete visibility into your inference operations.
Why Monitor vLLM Inference?
Open-source models like GPT-OSS-20B deployed on high-performance hardware (GPUs via RunPod, AWS, or on-premise) deliver exceptional capabilities, but metrics are what tell you what is actually happening under the hood, and that visibility is critical for operating them well.
Cost Analysis: GPT-OSS-20B on A100 PCIe
The following analysis demonstrates the cost-effectiveness of GPT-OSS-20B inference using real production metrics from a 3.15-hour deployment window.
Performance Metrics Table
Metric | Value | Unit |
---|---|---|
Infrastructure | ||
Instance Type | A100 PCIe | - |
Hourly Cost | $1.64 | USD/hour |
Deployment Duration | 3.15 | hours |
Total Infrastructure Cost | $5.166 | USD |
Request Performance | ||
Total Requests Processed | 5,138 | requests |
Requests per Hour | 1,631 | requests/hour |
Average Request Rate | 0.453 | requests/second |
Cost Efficiency | ||
Cost per Request | $0.001005 | USD/request |
Cost per 1,000 Requests | $1.005 | USD/1K requests |
Cost per Million Requests | $1,005 | USD/1M requests |
Token Economics | ||
Cost per Request-Hour | $0.001005 | USD/(req·hr) |
Throughput Efficiency | 995.1 | requests/USD |
Token Usage Analysis
The performance metrics show strong cost-effectiveness for GPT-OSS-20B inference on A100 PCIe hardware. With 5,138 requests processed during the monitoring period, the deployment achieved a cost of $0.001005 per request at a sustained throughput of 1,631 requests per hour. At $1.64 per hour for an A100 PCIe instance, this compares favorably with commercial API pricing, which typically runs $0.002-$0.02 per 1K tokens. Because the infrastructure cost is fixed per hour, the cost per request falls directly as request volume rises, so keeping the GPU well utilized is what drives these economics; a short calculation reproducing the table's figures follows.
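A minimal sketch, using only the numbers from the table above; swap in your own hourly rate, deployment window, and request count to rerun the arithmetic for a different deployment:

```python
# Reproduce the cost-efficiency figures from the table above.
HOURLY_COST_USD = 1.64    # A100 PCIe hourly rate
DURATION_HOURS = 3.15     # deployment window
TOTAL_REQUESTS = 5_138    # requests processed in that window

total_cost = HOURLY_COST_USD * DURATION_HOURS                    # ~= $5.166
requests_per_hour = TOTAL_REQUESTS / DURATION_HOURS              # ~= 1,631 req/hour
requests_per_second = TOTAL_REQUESTS / (DURATION_HOURS * 3600)   # ~= 0.453 req/s
cost_per_request = total_cost / TOTAL_REQUESTS                   # ~= $0.001005
cost_per_million = cost_per_request * 1_000_000                  # ~= $1,005
requests_per_dollar = TOTAL_REQUESTS / total_cost                # ~= 995 req/USD

print(f"Total cost:           ${total_cost:.3f}")
print(f"Cost per request:     ${cost_per_request:.6f}")
print(f"Cost per 1M requests: ${cost_per_million:,.0f}")
print(f"Requests per USD:     {requests_per_dollar:.1f}")
```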
Now let's dive into how we set up the complete observability stack and collected the vLLM metrics behind the cost analysis above.
Architecture
The solution follows a streamlined data pipeline architecture:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    vLLM     │────▶│   Metrics   │────▶│    OTel     │────▶│  Parseable  │
│   Service   │     │    Proxy    │     │  Collector  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       ↓                   ↓                   ↓                   ↓
    Metrics          Sanitization         Collection        Observability
Data Flow
- vLLM Service exposes raw metrics in Prometheus format
- Metrics Proxy sanitizes metric names for compatibility
- OTel Collector scrapes and forwards metrics via OpenTelemetry
- Parseable stores and provides query interface for analysis
Components
1. Metrics Proxy (proxy.py)
The metrics proxy serves as a critical compatibility layer (a minimal sketch of the idea follows the feature list):
Features:
- Flask-based HTTP proxy service
- Sanitizes vLLM metric names by replacing colons with underscores
- Ensures Prometheus-format compatibility
- Runs on port 9090
- Includes health check endpoint for monitoring
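The proxy.py in the repository is the source of truth; the sketch below only illustrates the approach (fetch the raw Prometheus exposition text, rewrite colons in metric names to underscores, re-serve it on port 9090), and the helper names here are illustrative rather than the repo's actual code.

```python
# Illustrative sketch of a metric-name-sanitizing proxy (not the repo's proxy.py verbatim).
import os

import requests
from flask import Flask, Response

app = Flask(__name__)
VLLM_METRICS_URL = os.environ.get("VLLM_METRICS_URL", "http://localhost:8000/metrics")
PROXY_PORT = int(os.environ.get("PROXY_PORT", "9090"))


def sanitize_name(name: str) -> str:
    # vLLM exposes names like "vllm:num_requests_running"; downstream tooling
    # is happier with underscores.
    return name.replace(":", "_")


def sanitize_line(line: str) -> str:
    # HELP/TYPE comment lines carry the metric name as the third token.
    if line.startswith("# HELP ") or line.startswith("# TYPE "):
        parts = line.split(" ", 3)
        if len(parts) >= 3:
            parts[2] = sanitize_name(parts[2])
        return " ".join(parts)
    if not line or line.startswith("#"):
        return line
    # Sample lines: the metric name ends at the first '{' or space.
    for i, ch in enumerate(line):
        if ch in "{ ":
            return sanitize_name(line[:i]) + line[i:]
    return sanitize_name(line)


@app.route("/metrics")
def metrics():
    upstream = requests.get(VLLM_METRICS_URL, timeout=5)
    upstream.raise_for_status()
    body = "\n".join(sanitize_line(l) for l in upstream.text.splitlines()) + "\n"
    return Response(body, mimetype="text/plain")


@app.route("/health")
def health():
    return {"status": "ok"}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=PROXY_PORT)
```

Sanitizing only the metric-name portion of each line avoids accidentally rewriting colons inside label values or HELP text.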
2. OTel Collector
The OTel Collector is an observability agent that scrapes metrics from the metrics proxy and forwards them to Parseable; the full configuration is shown in the Configuration section below.
Capabilities:
- Scrapes metrics every 2 seconds (configurable)
- Forwards metrics via OpenTelemetry protocol
- Adds custom labels for filtering
- Automatic retry and buffering
3. Parseable
Parseable is a unified observability platform that handles high volumes of metrics and logs, backed by cost-effective object storage. It provides a web UI (Prism) for visualizing and analyzing metrics and logs.
Features:
- Time-series data storage optimized for metrics
- Web UI available on port 8080
- SQL-based query interface
- Real-time streaming and historical analysis
- Stores metrics in the vLLMmetrics stream
Prerequisites
Before deploying the monitoring stack, ensure you have:
- Container runtime: Podman with Podman Compose (or Docker with Docker Compose)
- Network access: Open ports 9090 (proxy) and 8080 (Parseable UI)
- vLLM deployment: Running vLLM service with metrics endpoint accessible
- System resources: Minimum 2GB RAM, 10GB storage for metrics retention
Quick Start
1. Clone the Repository
git clone https://github.com/opensourceops/vllm-inference-metrics.git
cd vllm-inference-metrics
2. Configure vLLM Endpoint
Edit compose.yml to point to your vLLM deployment:
services:
  proxy:
    environment:
      - VLLM_METRICS_URL=https://your-vllm-endpoint/metrics
For local vLLM deployments:
environment:
  - VLLM_METRICS_URL=http://localhost:8000/metrics
3. Start the Stack
Using Podman:
podman compose -f compose-otel.yml up -d
Using Docker:
docker compose -f compose-otel.yml up -d
4. Access Services
- Parseable UI: localhost:8080 (credentials: admin/admin)
- Metrics endpoint: localhost:9090/metrics
- Health check: localhost:9090/health
5. Verify Metrics Collection
Check that metrics are flowing:
# View proxy metrics
curl http://localhost:9090/metrics
# Check OTel Collector logs
podman compose logs -f otel-collector
# Query metrics in Parseable
curl -X POST http://localhost:8080/api/v1/query \
-H "Authorization: Basic YWRtaW46YWRtaW4=" \
-d '{"query": "SELECT * FROM vLLMmetrics LIMIT 10"}'
Configuration
Environment Variables
Configure the stack through environment variables:
Variable | Description | Default |
---|---|---|
VLLM_METRICS_URL | vLLM metrics endpoint URL | Required |
P_USERNAME | Parseable username | admin |
P_PASSWORD | Parseable password | admin |
P_ADDR | Parseable listen address | 0.0.0.0:8000 |
P_STAGING_DIR | Parseable staging directory | /staging |
PROXY_PORT | Metrics proxy port | 9090 |
SCRAPE_INTERVAL | Metrics collection interval | 2s |
Docker Compose Configuration
Complete compose.yml example:
services:
  parseable:
    image: parseable/parseable:edge
    command: ["parseable", "local-store"]
    env_file: ./parseable.env
    volumes:
      - parseable-staging:/staging
    ports: ["8080:8000"]
    restart: unless-stopped

  proxy:
    image: python:3.11-alpine
    volumes: ["./proxy.py:/app/proxy.py:ro"]
    environment:
      - VLLM_METRICS_URL=<vllm_metrics_url>
    command: >
      sh -c "pip install --no-cache-dir flask requests && python /app/proxy.py"
    ports: ["9090:9090"]
    restart: unless-stopped
    depends_on: [parseable]
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "python - <<'PY'\nimport requests;print(requests.get('http://localhost:9090/metrics',timeout=3).status_code)\nPY",
        ]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml:ro
    ports:
      - "4317:4317" # OTLP/gRPC in
      - "4318:4318" # OTLP/HTTP in
      - "8888:8888" # Prometheus metrics for the collector itself (optional)
    restart: unless-stopped
    depends_on:
      proxy:
        condition: service_healthy
      parseable:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:13133/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

volumes:
  parseable-staging:
OTel Collector Configuration
For production deployments, consider these OTel Collector optimizations:
extensions:
  # health_check backs the compose healthcheck that probes :13133/healthz
  health_check:
    endpoint: "0.0.0.0:13133"
    path: "/healthz"

receivers:
  # OTLP receiver that accepts JSON format
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 10s
    send_batch_size: 100

exporters:
  otlphttp/parseablemetrics:
    # When the collector runs inside the compose network, use the Parseable
    # service name instead of localhost (e.g. http://parseable:8000)
    endpoint: "http://localhost:8000"
    headers:
      Authorization: "Basic YWRtaW46YWRtaW4="
      X-P-Stream: vLLMmetrics
      X-P-Log-Source: otel-metrics
    tls:
      insecure: true

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/parseablemetrics]
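To verify the collector-to-Parseable leg independently of vLLM, you can push a single hand-written OTLP/JSON metric at the collector's HTTP receiver (port 4318 as configured above). The metric and attribute names below are made up purely for the smoke test.

```python
# Push one synthetic gauge metric to the OTel Collector's OTLP/HTTP endpoint.
import time

import requests

payload = {
    "resourceMetrics": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "pipeline-smoke-test"}}
        ]},
        "scopeMetrics": [{
            "metrics": [{
                "name": "smoke_test_gauge",
                "gauge": {"dataPoints": [{
                    "asDouble": 42.0,
                    "timeUnixNano": str(time.time_ns()),
                }]},
            }],
        }],
    }],
}

resp = requests.post("http://localhost:4318/v1/metrics", json=payload, timeout=5)
print(resp.status_code, resp.text)
# A 2xx here plus a matching row in the vLLMmetrics stream confirms the
# collector -> Parseable pipeline is wired correctly.
```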
Monitoring
Key Metrics to Track
Monitor these critical vLLM metrics for optimal performance (a query sketch follows this list):
- Request metrics: vllm_num_requests_running and vllm_request_latency_seconds for active load and end-to-end latency
- Request queue monitoring: vllm_num_requests_waiting and vllm_num_preemptions_total for queue depth and scheduling pressure
- Token generation performance: vllm_prompt_tokens_total, vllm_generation_tokens_total, vllm_time_to_first_token_seconds, and vllm_time_per_output_token_seconds for throughput and per-token latency
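As a starting point, the sketch below pulls recent queue-depth samples out of Parseable over its HTTP query API (the same endpoint used in the Quick Start). The column names (metric_name, value) are assumptions about how the OTel exporter lands data in the vLLMmetrics stream; run SELECT * FROM vLLMmetrics LIMIT 10 first and adjust to the actual schema.

```python
# Query recent queue-depth samples from Parseable.
import requests

PARSEABLE_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")

# Swap the metric name for any metric from the Complete Metrics Reference below.
QUEUE_DEPTH_SQL = """
SELECT p_timestamp, value
FROM vLLMmetrics
WHERE metric_name = 'vllm_num_requests_waiting'
  AND p_timestamp >= NOW() - INTERVAL '15 minutes'
ORDER BY p_timestamp DESC
LIMIT 100
"""

resp = requests.post(PARSEABLE_URL, auth=AUTH, json={"query": QUEUE_DEPTH_SQL})
resp.raise_for_status()
for row in resp.json():  # assumes the API returns a JSON array of row objects
    print(row)
```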
Creating Dashboards
Use Parseable's visualization capabilities to create comprehensive dashboards:
- Real-time Performance Dashboard
  - Request latency histogram
  - Token generation rate time series
  - Active request count gauge
  - Error rate percentage
- Resource Utilization Dashboard
  - GPU memory usage over time
  - GPU utilization percentage
  - CPU and system memory metrics
  - Model loading times
- Business Metrics Dashboard
  - Total requests served
  - Token usage by model
  - Cost per request calculations
  - User request distribution
Metrics Format
The proxy transforms vLLM metric names to keep them Prometheus-compatible: vLLM exposes colon-prefixed names (for example, vllm:num_requests_running), which the proxy rewrites to underscore-separated names (vllm_num_requests_running) before they are collected.
Complete Metrics Reference
Key vLLM metrics available for monitoring (a quick scripted sanity check follows the table):
Metric Name | Type | Description |
---|---|---|
vllm_num_requests_running | Gauge | Active inference requests |
vllm_num_requests_waiting | Gauge | Queued requests |
vllm_gpu_cache_usage_perc | Gauge | GPU KV-cache utilization |
vllm_num_preemptions_total | Counter | Request preemptions |
vllm_prompt_tokens_total | Counter | Total prompt tokens processed |
vllm_generation_tokens_total | Counter | Total tokens generated |
vllm_request_latency_seconds | Histogram | End-to-end request latency |
vllm_model_forward_time_seconds | Histogram | Model forward pass duration |
vllm_time_to_first_token_seconds | Histogram | TTFT latency |
vllm_time_per_output_token_seconds | Histogram | Inter-token latency |
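For a quick sanity check outside any dashboard, the counters above can be read straight off the proxy's /metrics endpoint. The sketch below samples vllm_generation_tokens_total twice and derives a rough generation rate; it uses simple line parsing rather than a Prometheus client library and assumes sample lines carry no trailing timestamps.

```python
# Rough tokens/sec estimate from the sanitized /metrics endpoint.
import time

import requests

METRICS_URL = "http://localhost:9090/metrics"  # the sanitizing proxy from earlier


def read_counter(name: str) -> float:
    """Sum a counter across all of its label sets."""
    total = 0.0
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        # The metric name ends at the first '{' or space; comment lines never match.
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric == name:
            total += float(line.rsplit(" ", 1)[-1])
    return total


before = read_counter("vllm_generation_tokens_total")
time.sleep(10)
after = read_counter("vllm_generation_tokens_total")
print(f"~{(after - before) / 10:.1f} generated tokens/sec over the last 10 seconds")
```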
Real-World Use Cases
Use Case 1: Multi-Model Serving Optimization
Scenario: Running multiple models (GPT-OSS-20B, Llama-70B, CodeLlama) on shared GPU infrastructure.
Monitoring Strategy: segment every panel by model so request rate, latency percentiles, and GPU cache usage can be compared per model on the shared hardware (vLLM attaches a model label, typically model_name, to its metrics).
Optimization Actions:
- Adjust model-specific batch sizes based on latency targets
- Implement dynamic model loading based on request patterns
- Scale GPU resources per model based on utilization metrics
Use Case 2: Cost-Optimized Inference
Scenario: Minimizing GPU costs while maintaining SLA targets.
Monitoring Strategy: watch queue depth (vllm_num_requests_waiting) and KV-cache utilization (vllm_gpu_cache_usage_perc) alongside requests per hour, and correlate them with instance cost to spot under-utilized periods; see the control-loop sketch after this use case.
Optimization Actions:
- Implement request batching during low-utilization periods
- Use spot instances for batch processing workloads
- Autoscale based on queue depth and utilization thresholds
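One way to act on those signals is a small control loop that reads queue depth and cache utilization from Parseable and emits a scale decision. The thresholds, the assumed column names (metric_name, value), and the scale_up/scale_down outcomes below are placeholders you would replace with your own schema, limits, and autoscaler or cloud API calls.

```python
# Toy scaling decision based on queue depth and KV-cache utilization.
import requests

PARSEABLE_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")


def latest_value(metric_name: str) -> float:
    # Column names (metric_name, value) are assumptions; adjust to your stream schema.
    sql = (
        "SELECT value FROM vLLMmetrics "
        f"WHERE metric_name = '{metric_name}' "
        "ORDER BY p_timestamp DESC LIMIT 1"
    )
    rows = requests.post(PARSEABLE_URL, auth=AUTH, json={"query": sql}).json()
    return float(rows[0]["value"]) if rows else 0.0


def decide() -> str:
    waiting = latest_value("vllm_num_requests_waiting")
    cache_used = latest_value("vllm_gpu_cache_usage_perc")  # assumed 0-1 fraction
    if waiting > 20 or cache_used > 0.9:
        return "scale_up"    # queue building or KV cache nearly full
    if waiting == 0 and cache_used < 0.3:
        return "scale_down"  # spare capacity; candidate for downscale or spot
    return "hold"


if __name__ == "__main__":
    # Wire this decision into your autoscaler or cloud API of choice.
    print(decide())
```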
Use Case 3: Real-Time Chat Application
Scenario: Supporting a customer service chatbot with strict latency requirements.
Monitoring Strategy: track time-to-first-token (vllm_time_to_first_token_seconds) and inter-token latency (vllm_time_per_output_token_seconds) percentiles against the chat SLA, plus queue depth during traffic spikes.
Optimization Actions:
- Prioritize interactive requests over batch jobs
- Implement streaming token generation
- Cache common prompt prefixes
FAQ
What is OpenTelemetry and why use it for vLLM monitoring?
OpenTelemetry (OTel) is an open-source observability framework that provides vendor-neutral APIs, SDKs, and tools for collecting telemetry data (metrics, logs, and traces). For vLLM monitoring, OTel offers:
- Vendor neutrality: Switch between observability backends without changing instrumentation
- Standardized format: Consistent metric naming and data structures
- Rich ecosystem: Wide adoption and extensive tooling support
- Future-proof: Industry-standard approach backed by CNCF
How does this differ from Prometheus-based monitoring?
While vLLM natively exposes Prometheus metrics, using OpenTelemetry offers several advantages:
- Protocol flexibility: OTel supports multiple protocols (gRPC, HTTP) and formats (JSON, Protobuf)
- Unified observability: Collect metrics, logs, and traces through a single pipeline
- Advanced processing: Built-in processors for filtering, aggregation, and transformation
- Push vs Pull: OTel supports both push and pull models, offering more deployment flexibility
What are the hardware requirements for running this stack?
Minimum requirements:
- CPU: 2 cores
- RAM: 4GB (2GB for Parseable, 1GB for OTel Collector, 1GB for proxy)
- Storage: 10GB for metrics retention (adjust based on scrape interval and retention policy)
- Network: Stable connectivity to vLLM endpoint
Recommended for production:
- CPU: 4+ cores
- RAM: 8GB+
- Storage: 50GB+ with SSD for better query performance
How long are metrics retained in Parseable?
Parseable stores metrics in object storage (S3 or local filesystem) with configurable retention policies. By default:
- Hot data: Recent metrics in memory/local cache for fast queries
- Warm data: Older metrics in staging directory
- Cold data: Archived metrics in object storage
You can query historical data directly from object storage, making long-term retention cost-effective.
Can I monitor multiple vLLM instances with one monitoring stack?
Yes! Configure multiple scrape jobs in the OTel Collector configuration:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'vllm-instance-1'
          static_configs:
            - targets: ['proxy-1:9090']
              labels:
                instance: 'vllm-1'
                model: 'gpt-oss-20b'
        - job_name: 'vllm-instance-2'
          static_configs:
            - targets: ['proxy-2:9090']
              labels:
                instance: 'vllm-2'
                model: 'llama-70b'
What's the overhead of metrics collection on vLLM performance?
The metrics collection overhead is minimal:
- vLLM: <0.1% CPU overhead for exposing metrics
- Proxy: <50MB RAM, negligible CPU for sanitization
- OTel Collector: <100MB RAM, <5% CPU for scraping and forwarding
The proxy runs as a separate service and doesn't impact vLLM inference performance.
How do I secure the metrics pipeline?
Implement these security best practices:
- Authentication: Use basic auth or API keys for Parseable
- TLS/SSL: Enable HTTPS for all service-to-service communication
- Network isolation: Deploy services in a private network
- RBAC: Configure role-based access control in Parseable
- Secrets management: Use environment variables or secret managers for credentials
Can I use this with vLLM deployed on cloud platforms (AWS, GCP, Azure)?
Absolutely! The stack works with vLLM deployed anywhere:
- Cloud VMs: Point VLLM_METRICS_URL to your instance's public/private IP
- Kubernetes: Deploy the monitoring stack in the same cluster or externally
- Managed services: Works with RunPod, Lambda Labs, or any vLLM hosting provider
- Multi-cloud: Monitor vLLM instances across different cloud providers
How do I troubleshoot if metrics aren't appearing in Parseable?
Follow these diagnostic steps:
1. Check the vLLM metrics endpoint:
   curl http://your-vllm-host:8000/metrics
2. Verify the proxy is running:
   curl http://localhost:9090/health
   curl http://localhost:9090/metrics
3. Check the OTel Collector logs:
   podman compose -f compose-otel.yml logs otel-collector
4. Verify Parseable connectivity:
   curl -X GET http://localhost:8080/api/v1/logstream/vLLMmetrics \
     -H "Authorization: Basic YWRtaW46YWRtaW4="
5. Check for data in Parseable:
   SELECT COUNT(*) FROM vLLMmetrics WHERE p_timestamp >= NOW() - INTERVAL '5 minutes';
Can I integrate this with existing monitoring tools (Grafana, Datadog, etc.)?
Yes! You have several options:
- Query Parseable from Grafana: Use Parseable's PostgreSQL-compatible interface
- Dual export: Configure OTel Collector to send metrics to multiple destinations
- Parseable as primary: Query and aggregate in Parseable, then forward to other tools
What's the cost of running this monitoring stack?
The stack uses open-source components, so the only costs are infrastructure:
- Compute: ~$10-30/month for a small VM (2-4 cores, 4-8GB RAM)
- Storage: ~$0.02-0.05/GB/month for object storage (S3, MinIO)
- Network: Minimal egress costs for metrics data
For comparison, managed observability solutions charge $0.10-1.00+ per GB ingested, making this stack significantly more cost-effective for high-volume metrics.
Conclusion
Monitoring vLLM inference with Parseable provides the observability foundation necessary for operating production AI workloads. This solution offers:
- Complete visibility into model serving performance
- Actionable insights for optimization and troubleshooting
- Scalable architecture that grows with your deployment
- Cost-effective monitoring using open-source components
As AI inference becomes central to modern applications, having robust monitoring is no longer optional—it's essential for delivering reliable, performant, and cost-effective AI services.