
Inferencing GPT-OSS-20B with vLLM and Parseable: Complete Observability for AI Workloads

Ompragash Viswanathan (Guest) and Debabrata Panigrahi
August 18, 2025

Introduction

As organizations deploy increasingly capable open models like GPT-OSS-20B on high-performance GPU infrastructure, comprehensive observability of the inference stack becomes critical. This guide presents a complete metrics collection and monitoring solution for vLLM deployments using Fluent Bit and Parseable, with a small proxy handling Prometheus-format compatibility.

What is Inferencing and What's vLLM?

Inferencing refers to the process of using a pre-trained machine learning model to make predictions or decisions on new data. In the context of large language models, this means taking a trained model like GPT-OSS-20B and using it to generate text, answer questions, or perform other language tasks based on user inputs.

vLLM: Fast and Easy LLM Inference

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.


Overview

This solution provides a complete observability stack for vLLM services by:

  • Proxying vLLM metrics with Prometheus-format compatibility fixes
  • Collecting metrics using Fluent Bit's efficient scraping capabilities
  • Storing metrics in Parseable for analysis and visualization
  • Containerized deployment with Podman/Docker Compose for easy setup

Whether you're running GPT-OSS-20B, Llama models, or any other LLM through vLLM, this stack ensures you have complete visibility into your inference operations.

Why Monitor vLLM Inference?

Open models like GPT-OSS-20B deployed on high-performance hardware (GPUs via RunPod, AWS, or on-premise) deliver exceptional capabilities, but only the metrics reveal what happens under the hood: whether the deployment is fast, stable, and cost-effective.

Cost Analysis: GPT-OSS-20B on A100 PCIe

The following analysis demonstrates the cost-effectiveness of GPT-OSS-20B inference using real production metrics from a 3.15-hour deployment window.

Performance Metrics Table

| Metric | Value | Unit |
|---|---|---|
| Infrastructure | | |
| Instance Type | A100 PCIe | - |
| Hourly Cost | $1.64 | USD/hour |
| Deployment Duration | 3.15 | hours |
| Total Infrastructure Cost | $5.166 | USD |
| Request Performance | | |
| Total Requests Processed | 5,138 | requests |
| Requests per Hour | 1,631 | requests/hour |
| Average Request Rate | 0.453 | requests/second |
| Cost Efficiency | | |
| Cost per Request | $0.001005 | USD/request |
| Cost per 1,000 Requests | $1.005 | USD/1K requests |
| Cost per Million Requests | $1,005 | USD/1M requests |
| Token Economics | | |
| Cost per Request-Hour | $0.001005 | USD/(req·hr) |
| Throughput Efficiency | 995.1 | requests/USD |

Token Usage Analysis

The metrics show strong cost-effectiveness for GPT-OSS-20B inference on A100 PCIe hardware. With 5,138 requests processed during the monitoring period, the deployment came out to $0.001005 per request at a sustained throughput of 1,631 requests per hour. At $1.64 per hour for the A100 PCIe instance, this compares favorably with commercial API pricing, which typically runs $0.002 to $0.02 per 1K tokens (the exact comparison depends on tokens per request). Per-request cost held at roughly a tenth of a cent across the entire deployment window.
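To make the table reproducible, here is the arithmetic behind it, taking the hourly price, window length, and request count as inputs. This is a quick sanity check rather than part of the stack:

# Back-of-the-envelope check of the cost table above
hourly_cost_usd = 1.64        # A100 PCIe hourly price
hours = 3.15                  # deployment window
total_requests = 5_138

total_cost = hourly_cost_usd * hours            # 5.166 USD
cost_per_request = total_cost / total_requests  # ~0.001005 USD/request
requests_per_hour = total_requests / hours      # ~1,631 requests/hour
requests_per_usd = total_requests / total_cost  # ~995 requests/USD (the table's 995.1 reflects slightly different rounding)

print(f"Total cost:        ${total_cost:.3f}")
print(f"Cost per request:  ${cost_per_request:.6f}")
print(f"Requests per hour: {requests_per_hour:.0f}")
print(f"Requests per USD:  {requests_per_usd:.1f}")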

Now let's deep-dive into how we set up the complete observability stack and the vLLM metrics collection that produced the cost analysis above.

Architecture

The solution follows a streamlined data pipeline architecture:

┌─────────────┐     ┌──────────────┐     ┌────────────┐     ┌────────────┐
│    vLLM     │────▶│   Metrics    │────▶│  Fluent    │────▶│ Parseable  │
│   Service   │     │    Proxy     │     │    Bit     │     │            │
└─────────────┘     └──────────────┘     └────────────┘     └────────────┘
      ↓                    ↓                    ↓                   ↓
   Metrics           Sanitization          Collection          Observability

Data Flow

  1. vLLM Service exposes raw metrics in Prometheus format
  2. Metrics Proxy sanitizes metric names for compatibility
  3. Fluent Bit scrapes and forwards metrics via OpenTelemetry
  4. Parseable stores and provides query interface for analysis

Components

1. Metrics Proxy (proxy.py)

The metrics proxy serves as a critical compatibility layer (a minimal sketch follows the feature list):

Features:

  • Flask-based HTTP proxy service
  • Sanitizes vLLM metric names by replacing colons with underscores
  • Ensures Prometheus-format compatibility
  • Runs on port 9090
  • Includes health check endpoint for monitoring
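For reference, here's a minimal sketch of what proxy.py could look like, assuming the sanitization is a plain colon-to-underscore rename of metric names. The actual implementation in the repository may differ:

import os
import re

import requests
from flask import Flask, Response

app = Flask(__name__)
VLLM_METRICS_URL = os.environ["VLLM_METRICS_URL"]

# Matches the metric name at the start of a sample line or after # HELP / # TYPE.
METRIC_NAME = re.compile(r"^(# (?:HELP|TYPE) )?([A-Za-z_:][A-Za-z0-9_:]*)")

def sanitize(line: str) -> str:
    # vLLM emits names like "vllm:num_requests_running"; rewrite the colons
    # so downstream Prometheus-format parsers accept them.
    return METRIC_NAME.sub(lambda m: (m.group(1) or "") + m.group(2).replace(":", "_"), line)

@app.route("/metrics")
def metrics():
    upstream = requests.get(VLLM_METRICS_URL, timeout=10)
    upstream.raise_for_status()
    body = "\n".join(sanitize(line) for line in upstream.text.splitlines()) + "\n"
    return Response(body, content_type="text/plain; version=0.0.4")

@app.route("/health")
def health():
    # Health check endpoint used by the compose healthcheck.
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PROXY_PORT", "9090")))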

2. Fluent Bit

Fluent Bit is an observability agent that scrapes metrics from the metrics proxy and forwards them to Parseable. Here's the Fluent Bit configuration used by this stack.
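The exact file ships with the repository; a minimal sketch, trimmed down from the production-tuned version shown later under Fluent Bit Advanced Configuration, looks roughly like this:

[SERVICE]
    Flush           1
    Log_Level       info

[INPUT]
    Name            prometheus_scrape
    Tag             vllm.metrics
    Host            proxy
    Port            9090
    Metrics_Path    /metrics
    Scrape_Interval 2s

[OUTPUT]
    Name            opentelemetry
    Match           vllm.metrics
    Host            parseable
    Port            8000
    Metrics_uri     /v1/metrics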

Note: This configuration uses OpenTelemetry format output, which requires Parseable Enterprise for full compatibility.

Capabilities:

  • Scrapes metrics every 2 seconds (configurable)
  • Forwards metrics via OpenTelemetry protocol
  • Adds custom labels for filtering
  • Automatic retry and buffering

3. Parseable

Parseable is a unified observability platform that handles high volumes of metrics and logs, backed by cost-effective object storage. It provides a web UI (Prism) for visualizing and analyzing metrics and logs.

Features:

  • Time-series data storage optimized for metrics
  • Web UI available on port 8080
  • SQL-based query interface
  • Real-time streaming and historical analysis
  • Stores metrics in the vLLMmetrics stream

Parseable Dashboard

Prerequisites

Before deploying the monitoring stack, ensure you have:

  • Container runtime: Podman with Podman Compose (or Docker with Docker Compose)
  • Network access: Open ports 9090 (proxy) and 8080 (Parseable UI)
  • vLLM deployment: Running vLLM service with metrics endpoint accessible
  • System resources: Minimum 2GB RAM, 10GB storage for metrics retention

Quick Start

1. Clone the Repository

git clone https://github.com/opensourceops/vllm-inference-metrics.git
cd vllm-inference-metrics

2. Configure vLLM Endpoint

Edit compose.yml to point to your vLLM deployment:

services:
  proxy:
    environment:
      - VLLM_METRICS_URL=https://your-vllm-endpoint/metrics

For local vLLM deployments:

environment:
  - VLLM_METRICS_URL=http://localhost:8000/metrics
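If you don't yet have a vLLM server running, a typical local launch of GPT-OSS-20B with vLLM's OpenAI-compatible server looks like the following (illustrative; it assumes the openai/gpt-oss-20b checkpoint from Hugging Face):

vllm serve openai/gpt-oss-20b --host 0.0.0.0 --port 8000

Once it's running, http://localhost:8000/metrics should return Prometheus-format text, which matches the VLLM_METRICS_URL shown above.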

3. Start the Stack

Using Podman:

podman compose up -d

Using Docker:

docker compose up -d

4. Access Services

Once the containers are up, the services are available on the ports mapped in compose.yml:

  • Parseable UI (Prism): http://localhost:8080 (default credentials admin / admin)
  • Metrics proxy: http://localhost:9090/metrics

5. Verify Metrics Collection

Check that metrics are flowing:

# View proxy metrics
curl http://localhost:9090/metrics

# Check Fluent Bit logs
podman compose logs -f fluentbit

# Query metrics in Parseable
curl -X POST http://localhost:8080/api/v1/query \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d '{"query": "SELECT * FROM vLLMmetrics LIMIT 10"}'

Configuration

Environment Variables

Configure the stack through environment variables:

| Variable | Description | Default |
|---|---|---|
| VLLM_METRICS_URL | vLLM metrics endpoint URL | Required |
| P_USERNAME | Parseable username | admin |
| P_PASSWORD | Parseable password | admin |
| P_ADDR | Parseable listen address | 0.0.0.0:8000 |
| P_STAGING_DIR | Parseable staging directory | /staging |
| PROXY_PORT | Metrics proxy port | 9090 |
| SCRAPE_INTERVAL | Metrics collection interval | 2s |
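For reference, a minimal parseable.env (the file referenced by env_file in the compose example below) that sticks to these defaults could look like this; change the credentials for anything beyond local testing:

P_USERNAME=admin
P_PASSWORD=admin
P_ADDR=0.0.0.0:8000
P_STAGING_DIR=/staging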

Docker Compose Configuration

Complete compose.yml example:

version: '3.8'

services:
  parseable:
    image: parseable/parseable:edge
    ports:
      - "8080:8000"
    env_file:
      - parseable.env
    volumes:
      - parseable-data:/staging
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  proxy:
    build: .
    ports:
      - "9090:9090"
    environment:
      - VLLM_METRICS_URL=${VLLM_METRICS_URL}
    depends_on:
      parseable:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  fluentbit:
    image: fluent/fluent-bit:latest
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
    depends_on:
      proxy:
        condition: service_healthy
    restart: unless-stopped

volumes:
  parseable-data:

Fluent Bit Advanced Configuration

For production deployments, consider these Fluent Bit optimizations:

[SERVICE]
    Flush                1
    Daemon               Off
    Log_Level            info
    Parsers_File         parsers.conf
    Plugins_File         plugins.conf
    HTTP_Server          On
    HTTP_Listen          0.0.0.0
    HTTP_Port            2020
    Storage.metrics      On
    Storage.path         /var/log/fluentbit-storage
    Storage.sync         normal
    Storage.checksum     On
    Storage.max_chunks_up 128
    Storage.backlog.mem_limit 5M

[INPUT]
    Name                 prometheus_scrape
    Tag                  vllm.metrics
    Host                 proxy
    Port                 9090
    Metrics_Path         /metrics
    Scrape_Interval      2s
    Buffer_Max_Size      2MB
    
[FILTER]
    Name                 record_modifier
    Match                vllm.metrics
    Record               cluster ${CLUSTER_NAME}
    Record               region ${AWS_REGION}
    Record               model ${MODEL_NAME}

[OUTPUT]
    Name                 opentelemetry
    Match                vllm.metrics
    Host                 parseable
    Port                 8000
    Metrics_uri          /v1/metrics
    Retry_Limit          5
    Compress             gzip

Monitoring

Key Metrics to Track

Monitor these critical vLLM metrics for optimal performance:

Request Metrics

  • vllm_num_requests_running: concurrency of in-flight inference requests
  • vllm_e2e_request_latency_seconds: end-to-end request latency
  • vllm_time_to_first_token_seconds: responsiveness as perceived by clients

Request Queue Monitoring

  • vllm_num_requests_waiting: queue depth ahead of the scheduler
  • vllm_num_preemptions_total: preemptions caused by KV-cache pressure
  • vllm_gpu_cache_usage_perc: GPU KV-cache headroom

Token Generation Performance

  • vllm_prompt_tokens_total and vllm_generation_tokens_total: token throughput
  • vllm_time_per_output_token_seconds: inter-token latency during generation
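To confirm these series are landing in the vLLMmetrics stream, a query along the following lines (using the same "metric_name" and "p_timestamp" columns as the troubleshooting queries later in this guide) can be run from Prism or the query API:

-- Recent samples for the key request and queue metrics
SELECT *
FROM "vLLMmetrics"
WHERE "metric_name" IN (
    'vllm_num_requests_running',
    'vllm_num_requests_waiting',
    'vllm_e2e_request_latency_seconds',
    'vllm_time_to_first_token_seconds'
)
  AND "p_timestamp" >= (NOW() - INTERVAL '15 minutes')
ORDER BY "p_timestamp" DESC
LIMIT 100;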

Creating Dashboards

Use Parseable's visualization capabilities to create comprehensive dashboards:

  1. Real-time Performance Dashboard

    • Request latency histogram
    • Token generation rate time series
    • Active request count gauge
    • Error rate percentage

  2. Resource Utilization Dashboard

    • GPU memory usage over time
    • GPU utilization percentage
    • CPU and system memory metrics
    • Model loading times

  3. Business Metrics Dashboard

    • Total requests served
    • Token usage by model
    • Cost per request calculations
    • User request distribution

Metrics Format

The proxy transforms vLLM metric names to ensure compatibility: vLLM emits names containing colons, and the proxy rewrites the colons to underscores while leaving labels and values untouched.

Original vLLM Format

vllm:num_requests_running 3.0

Transformed Prometheus-Compatible Format

vllm_num_requests_running 3.0

Complete Metrics Reference

Key vLLM metrics available for monitoring:

| Metric Name | Type | Description |
|---|---|---|
| vllm_num_requests_running | Gauge | Active inference requests |
| vllm_num_requests_waiting | Gauge | Queued requests |
| vllm_gpu_cache_usage_perc | Gauge | GPU KV-cache utilization |
| vllm_num_preemptions_total | Counter | Request preemptions |
| vllm_prompt_tokens_total | Counter | Total prompt tokens processed |
| vllm_generation_tokens_total | Counter | Total tokens generated |
| vllm_e2e_request_latency_seconds | Histogram | End-to-end request latency |
| vllm_model_forward_time_seconds | Histogram | Model forward pass duration |
| vllm_time_to_first_token_seconds | Histogram | TTFT latency |
| vllm_time_per_output_token_seconds | Histogram | Inter-token latency |

Real-World Use Cases

Use Case 1: Multi-Model Serving Optimization

Scenario: Running multiple models (GPT-OSS-20B, Llama-70B, CodeLlama) on shared GPU infrastructure.

Monitoring Strategy:
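Compare load across models to decide which deserve more GPU share. This sketch assumes the model record label added by the Fluent Bit filter and the gauge value are ingested as "model" and "value" columns on the vLLMmetrics stream; both names are assumptions, so verify them against your schema first:

-- Average and peak in-flight/queued requests per model over the last hour
-- (assumed columns: "model", "value")
SELECT "model",
       "metric_name",
       AVG("value") AS "avg_requests",
       MAX("value") AS "peak_requests"
FROM "vLLMmetrics"
WHERE "metric_name" IN ('vllm_num_requests_running', 'vllm_num_requests_waiting')
  AND "p_timestamp" >= (NOW() - INTERVAL '1 hour')
GROUP BY "model", "metric_name"
ORDER BY "avg_requests" DESC;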

Optimization Actions:

  • Adjust model-specific batch sizes based on latency targets
  • Implement dynamic model loading based on request patterns
  • Scale GPU resources per model based on utilization metrics

Use Case 2: Cost-Optimized Inference

Scenario: Minimizing GPU costs while maintaining SLA targets.

Monitoring Strategy:
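Correlate GPU KV-cache utilization with queue depth over time to find hours where the instance is underused, which are good candidates for batching or downscaling. This sketch assumes the ingested gauge value is available in a column named "value" (an assumption about the OpenTelemetry-to-Parseable mapping; check your stream's schema in Prism and adjust):

-- Hourly average KV-cache utilization and queue depth (assumed column: "value")
SELECT date_trunc('hour', "p_timestamp") AS "hour",
       AVG(CASE WHEN "metric_name" = 'vllm_gpu_cache_usage_perc' THEN "value" END) AS "avg_kv_cache_usage",
       AVG(CASE WHEN "metric_name" = 'vllm_num_requests_waiting' THEN "value" END) AS "avg_queue_depth"
FROM "vLLMmetrics"
WHERE "metric_name" IN ('vllm_gpu_cache_usage_perc', 'vllm_num_requests_waiting')
  AND "p_timestamp" >= (NOW() - INTERVAL '24 hours')
GROUP BY date_trunc('hour', "p_timestamp")
ORDER BY "hour";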

Optimization Actions:

  • Implement request batching during low-utilization periods
  • Use spot instances for batch processing workloads
  • Autoscale based on queue depth and utilization thresholds

Use Case 3: Real-Time Chat Application

Scenario: Supporting a customer service chatbot with strict latency requirements.

Monitoring Strategy:
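Time-to-first-token is what a chat user actually feels, so track that series together with queue depth during peak hours. Percentile analysis depends on how the histogram buckets are ingested, but a quick check that the relevant series are flowing uses only the columns shown elsewhere in this guide:

-- Sample volume and freshness for chat-critical metrics over the last 15 minutes
SELECT "metric_name",
       COUNT(*) AS "samples",
       MAX("p_timestamp") AS "latest_sample"
FROM "vLLMmetrics"
WHERE "metric_name" IN ('vllm_time_to_first_token_seconds', 'vllm_num_requests_waiting')
  AND "p_timestamp" >= (NOW() - INTERVAL '15 minutes')
GROUP BY "metric_name";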

Optimization Actions:

  • Prioritize interactive requests over batch jobs
  • Implement streaming token generation
  • Cache common prompt prefixes

Troubleshooting

Common Issues and Solutions

1. Connection Refused to vLLM

Symptoms: Proxy returns 502 errors, no metrics collected

Diagnosis:

# Test vLLM endpoint directly
curl -v https://your-vllm-endpoint/metrics

# Check proxy logs
podman compose logs proxy | grep ERROR

# Verify network connectivity
podman exec proxy ping your-vllm-host

Solutions:

  • Verify VLLM_METRICS_URL is correct
  • Check firewall rules and security groups
  • Ensure vLLM's metrics endpoint is enabled (the OpenAI-compatible server exposes /metrics by default; make sure stats logging hasn't been disabled with --disable-log-stats)

2. Parseable Not Receiving Data

Symptoms: No data visible in Parseable UI

Diagnosis:

# Check Fluent Bit logs
podman compose logs -f fluentbit

# Verify proxy health
curl http://localhost:9090/health

# Test Parseable API
curl -X GET http://localhost:8080/api/v1/health \
  -H "Authorization: Basic YWRtaW46YWRtaW4="

Solutions:

  • Verify Fluent Bit configuration syntax
  • Check Parseable authentication credentials
  • Ensure proper network connectivity between services

3. High Memory Usage

Symptoms: Container OOM kills, system slowdown

Diagnosis:

# Monitor container resources
podman stats

# Check Parseable storage
podman exec parseable df -h /staging

# Review Fluent Bit buffer usage
podman exec fluentbit ls -la /var/log/fluentbit-storage

Solutions:

  • Implement retention policies in Parseable
  • Adjust Fluent Bit buffer limits
  • Add resource limits to containers

4. Metrics Lag or Missing Data

Symptoms: Metrics appear delayed or have gaps

Diagnosis:

-- Check metric ingestion lag
SELECT 
    MAX("p_timestamp") as "latest_metric",
    NOW() - MAX("p_timestamp") as "lag"
FROM "vLLMmetrics";

-- Check data availability by metric type for your dataset
SELECT 
    "metric_name",
    COUNT(*) as "record_count",
    MIN("p_timestamp") as "earliest_record",
    MAX("p_timestamp") as "latest_record"
FROM "vLLMmetrics"
WHERE "p_timestamp" >= (NOW() - INTERVAL '1 hour')
  AND "metric_name" IN (
    'vllm_e2e_request_latency_seconds',
    'vllm_time_to_first_token_seconds',
    'vllm_num_requests_running',
    'vllm_num_requests_waiting'
)
GROUP BY "metric_name"
ORDER BY "metric_name";

Solutions:

  • Reduce scrape interval if metrics change rapidly
  • Increase Fluent Bit retry limits
  • Check for network packet loss

Debug Mode

Enable debug logging for detailed troubleshooting:

# In compose.yml
services:
  proxy:
    environment:
      - FLASK_ENV=development
      - LOG_LEVEL=DEBUG
  
  fluentbit:
    command: ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf", "-v", "debug"]

Conclusion

Monitoring vLLM inference with Parseable provides the observability foundation necessary for operating production AI workloads. This solution offers:

  • Complete visibility into model serving performance
  • Actionable insights for optimization and troubleshooting
  • Scalable architecture that grows with your deployment
  • Cost-effective monitoring using open-source components

As AI inference becomes central to modern applications, having robust monitoring is no longer optional—it's essential for delivering reliable, performant, and cost-effective AI services.
