Monitor vLLM inference with OpenTelemetry and Parseable


Monitor vLLM inference workloads with OpenTelemetry and Parseable. Collect metrics, build dashboards, and analyze GPU costs for production LLM serving.

Overview

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed at UC Berkeley, vLLM has evolved into a community-driven project for high-performance model serving.

Integrate vLLM with Parseable to:

  • Monitor Inference Performance - Track latency, throughput, and GPU utilization
  • Analyze Token Usage - Measure input/output tokens and costs
  • Debug Issues - Identify slow requests and errors
  • Optimize Resources - Right-size GPU infrastructure based on metrics

Architecture

┌─────────────┐    ┌──────────────┐    ┌────────────┐    ┌────────────┐
│    vLLM     │───▶│   Metrics    │───▶│    OTel    │───▶│  Parseable │
│   Service   │    │    Proxy     │    │  Collector │    │            │
└─────────────┘    └──────────────┘    └────────────┘    └────────────┘
       ↓                  ↓                  ↓                  ↓
    Metrics         Sanitization       Collection        Observability
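
In this pipeline the proxy fetches vLLM's raw Prometheus metrics, sanitizes them, and re-exposes them for the OTel Collector to scrape. The repository ships this logic as proxy.py; the sketch below is only an approximation of that idea, assuming a small Flask app (the compose file later installs flask and requests) that rewrites vLLM's vllm:-prefixed metric names to the vllm_* form used in this guide and serves /metrics and /health on port 9090.

# Approximate sketch of a metrics proxy (not a copy of the repository's proxy.py):
# fetch vLLM's Prometheus metrics, keep only the vllm series, and re-expose them.
import os

import requests
from flask import Flask, Response

VLLM_METRICS_URL = os.environ.get("VLLM_METRICS_URL", "http://localhost:8000/metrics")

app = Flask(__name__)

@app.route("/metrics")
def metrics():
    # Pull the raw Prometheus exposition text from vLLM.
    raw = requests.get(VLLM_METRICS_URL, timeout=5).text
    cleaned = []
    for line in raw.splitlines():
        # One plausible "sanitization" step: vLLM publishes metrics under the
        # "vllm:" namespace; rewrite the prefix so names match the vllm_*
        # columns used in the queries later in this guide.
        line = line.replace("vllm:", "vllm_")
        if line.startswith(("vllm_", "# HELP vllm_", "# TYPE vllm_")):
            cleaned.append(line)
    return Response("\n".join(cleaned) + "\n", mimetype="text/plain")

@app.route("/health")
def health():
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9090)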

Prerequisites

  • vLLM deployment with metrics endpoint accessible
  • Docker or Podman with Compose
  • Parseable instance (local or cloud)

Quick Start

1. Clone the Repository

git clone https://github.com/opensourceops/vllm-inference-metrics.git
cd vllm-inference-metrics

2. Configure vLLM Endpoint

Edit compose-otel.yml to point the proxy at your vLLM deployment:

services:
  proxy:
    environment:
      - VLLM_METRICS_URL=https://your-vllm-endpoint/metrics

For local vLLM deployments:

environment:
  - VLLM_METRICS_URL=http://localhost:8000/metrics

3. Start the Stack

Using Docker:

docker compose -f compose-otel.yml up -d

Using Podman:

podman compose -f compose-otel.yml up -d

4. Access Services

Service             URL                       Credentials
Parseable UI        localhost:8080            admin/admin
Metrics endpoint    localhost:9090/metrics    -
Health check        localhost:9090/health     -

5. Verify Metrics Collection

# View proxy metrics
curl http://localhost:9090/metrics

# Check OTel Collector logs
docker compose logs -f otel-collector

# Query metrics in Parseable
curl -X POST http://localhost:8080/api/v1/query \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT * FROM vLLMmetrics LIMIT 10"}'

Docker Compose Configuration

Complete compose-otel.yml example:

services:
  parseable:
    image: parseable/parseable:edge
    command: ["parseable", "local-store"]
    env_file: ./parseable.env
    volumes:
      - parseable-staging:/staging
    ports: ["8080:8000"]
    restart: unless-stopped

  proxy:
    image: python:3.11-alpine
    volumes: ["./proxy.py:/app/proxy.py:ro"]
    environment:
      - VLLM_METRICS_URL=<your-vllm-metrics-url>
    command: >
      sh -c "pip install --no-cache-dir flask requests && python /app/proxy.py"
    ports: ["9090:9090"]
    # A healthcheck is required for the otel-collector's service_healthy condition below.
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:9090/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    restart: unless-stopped
    depends_on: [parseable]

  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml:ro
    ports:
      - "4317:4317"  # OTLP/gRPC
      - "4318:4318"  # OTLP/HTTP
    restart: unless-stopped
    depends_on:
      proxy:
        condition: service_healthy
      parseable:
        condition: service_started

volumes:
  parseable-staging:

Environment Variables

Variable            Description                    Default
VLLM_METRICS_URL    vLLM metrics endpoint URL      Required
P_USERNAME          Parseable username             admin
P_PASSWORD          Parseable password             admin
SCRAPE_INTERVAL     Metrics collection interval    2s
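
P_USERNAME and P_PASSWORD are also what the Basic auth header in the curl examples encodes: the header value is simply the base64 of user:password. A quick way to regenerate it if you change the defaults (a small sketch, not part of the repository):

# Regenerate the "Authorization: Basic ..." value used in the curl examples.
# With the default admin/admin credentials this prints YWRtaW46YWRtaW4=.
import base64
import os

user = os.environ.get("P_USERNAME", "admin")
password = os.environ.get("P_PASSWORD", "admin")
token = base64.b64encode(f"{user}:{password}".encode()).decode()
print(f"Authorization: Basic {token}")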

Key Metrics

Metric Name                           Type       Description
vllm_num_requests_running             Gauge      Active inference requests
vllm_num_requests_waiting             Gauge      Queued requests
vllm_gpu_cache_usage_perc             Gauge      GPU KV-cache utilization
vllm_num_preemptions_total            Counter    Request preemptions
vllm_prompt_tokens_total              Counter    Total prompt tokens processed
vllm_generation_tokens_total          Counter    Total tokens generated
vllm_request_latency_seconds          Histogram  End-to-end request latency
vllm_time_to_first_token_seconds      Histogram  TTFT latency
vllm_time_per_output_token_seconds    Histogram  Inter-token latency
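
To confirm these series are actually flowing, fetch the proxy's /metrics endpoint and look for the names above. A small sketch, assuming the Quick Start stack is running locally:

# Spot-check: fetch the proxy's Prometheus endpoint and print the key series
# from the table above. Assumes the Quick Start stack is running on localhost.
import requests

KEY_METRICS = (
    "vllm_num_requests_running",
    "vllm_num_requests_waiting",
    "vllm_gpu_cache_usage_perc",
    "vllm_prompt_tokens_total",
    "vllm_generation_tokens_total",
)

text = requests.get("http://localhost:9090/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith(KEY_METRICS):
        print(line)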

Example Queries

Request Latency Analysis

SELECT 
  DATE_TRUNC('minute', p_timestamp) AS minute,
  AVG(vllm_request_latency_seconds) AS avg_latency,
  MAX(vllm_request_latency_seconds) AS max_latency
FROM vLLMmetrics
WHERE p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY minute
ORDER BY minute;

GPU Cache Utilization

SELECT 
  p_timestamp,
  vllm_gpu_cache_usage_perc
FROM vLLMmetrics
WHERE p_timestamp > NOW() - INTERVAL '30 minutes'
ORDER BY p_timestamp;

Token Throughput

SELECT 
  DATE_TRUNC('hour', p_timestamp) AS hour,
  -- token counters are cumulative, so take the increase within each hour
  MAX(vllm_prompt_tokens_total) - MIN(vllm_prompt_tokens_total) AS input_tokens,
  MAX(vllm_generation_tokens_total) - MIN(vllm_generation_tokens_total) AS output_tokens
FROM vLLMmetrics
GROUP BY hour
ORDER BY hour DESC
LIMIT 24;

Cost Analysis

Track GPU costs with token metrics:

SELECT 
  DATE_TRUNC('day', p_timestamp) AS day,
  -- token counters are cumulative, so take the increase within each day
  (MAX(vllm_prompt_tokens_total) - MIN(vllm_prompt_tokens_total))
    + (MAX(vllm_generation_tokens_total) - MIN(vllm_generation_tokens_total)) AS total_tokens,
  -- one dedicated GPU running all day, e.g. A100 PCIe at $1.64/hour
  ROUND(24 * 1.64, 2) AS estimated_gpu_cost_usd
FROM vLLMmetrics
GROUP BY day
ORDER BY day DESC;
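
To turn GPU-hour pricing into a cost-per-token figure, divide the daily GPU spend by the token delta the query returns. A small worked example using the $1.64/hour A100 PCIe rate from the comment above; the token volume is illustrative:

# Derive an approximate cost per 1K tokens from GPU-hour pricing and the
# daily token totals returned by the cost-analysis query above.
GPU_HOURLY_RATE_USD = 1.64   # A100 PCIe example rate from the query comment
HOURS_PER_DAY = 24           # one dedicated GPU running all day

def cost_per_1k_tokens(total_tokens_per_day: int) -> float:
    daily_gpu_cost = GPU_HOURLY_RATE_USD * HOURS_PER_DAY
    return daily_gpu_cost * 1000 / total_tokens_per_day

# Example: 50M tokens served per day on one GPU
print(f"${cost_per_1k_tokens(50_000_000):.4f} per 1K tokens")  # ~$0.0008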

Alerting

Set up alerts in Parseable for conditions such as the following; a simple query-polling sketch for the queue check appears after the list:

  • High Latency: Alert when vllm_request_latency_seconds exceeds threshold
  • Queue Buildup: Alert when vllm_num_requests_waiting grows
  • GPU Memory: Alert when vllm_gpu_cache_usage_perc approaches 100%
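
The sketch below is not Parseable's built-in alerting; it simply polls the same query API used in the verification step to check the queue-buildup condition from outside. It assumes the API returns rows as a JSON array of objects; the threshold, window, and poll interval are arbitrary illustrative values.

# Naive external check for queue buildup: poll Parseable's query API (the same
# endpoint used in the verification step) and warn when waiting requests stay high.
import time

import requests

PARSEABLE_QUERY_URL = "http://localhost:8080/api/v1/query"
AUTH = ("admin", "admin")   # P_USERNAME / P_PASSWORD defaults
WAITING_THRESHOLD = 10      # warn when more than 10 requests are queued

QUERY = (
    "SELECT MAX(vllm_num_requests_waiting) AS waiting "
    "FROM vLLMmetrics "
    "WHERE p_timestamp > NOW() - INTERVAL '5 minutes'"
)

while True:
    resp = requests.post(PARSEABLE_QUERY_URL, auth=AUTH, json={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    rows = resp.json()   # assumed shape: [{"waiting": <number>}]
    waiting = rows[0].get("waiting") if rows else None
    if waiting is not None and waiting > WAITING_THRESHOLD:
        print(f"ALERT: {waiting} requests waiting in the vLLM queue")
    time.sleep(60)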

