Baseten Monitoring: Complete Observability for AI Model Inference

Debabrata Panigrahi
November 6, 2025
Learn how to set up end-to-end metrics collection and monitoring for Baseten using Fluent Bit and Parseable.
Introduction

Modern AI inference is moving into production at scale. As teams deploy powerful models on platforms like Baseten, observability becomes essential for tracking performance, cost, and reliability.

In this post, we'll show you how to set up end‑to‑end metrics collection and monitoring for Baseten using Fluent Bit to scrape metrics from Baseten's API in Prometheus format, and Parseable to store, query, and visualize the data. By the end, you'll have a working monitoring stack for your AI model inference workloads.

Why Monitor Baseten Deployments?

When running AI models in production on Baseten, you need visibility into:

  • Performance metrics: Track inference latency, throughput, and response times
  • Resource utilization: Monitor GPU/CPU usage, memory consumption, and queue depths
  • Cost optimization: Understand usage patterns to optimize your deployment configuration
  • Reliability: Detect errors, timeouts, and anomalies before they impact users

Baseten provides a Prometheus-compatible metrics endpoint, making it easy to integrate with modern observability tools. In this guide, we'll build a lightweight monitoring pipeline using Fluent Bit and Parseable.
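
To make "Prometheus-compatible" concrete, here is a minimal sketch of parsing the text exposition format that such an endpoint returns. The sample payload and the metric names in it are made up for illustration and are not actual Baseten output. The parser is deliberately simplified: it assumes no escaped quotes or commas inside label values and no trailing timestamps.

```python
import re

# Hypothetical sample in Prometheus text exposition format (not real Baseten output).
SAMPLE = """\
# HELP example_requests_total Total requests served.
# TYPE example_requests_total counter
example_requests_total{model="demo",status="200"} 42
example_requests_total{model="demo",status="500"} 3
example_queue_depth 7
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple exposition lines into dicts; skips comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append({"name": name, "labels": labels, "value": float(value)})
    return samples


if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE):
        print(sample)
```

Each parsed sample is a flat record of name, labels, and value, which is roughly the shape a collector needs before forwarding metrics as JSON.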

Architecture Overview

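Our monitoring stack consists of three components working together:

  • Baseten: Exposes model metrics through a Prometheus-format endpoint
  • Fluent Bit: Scrapes the endpoint on a fixed interval and forwards the metrics
  • Parseable: Stores, queries, and visualizes the data
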
Key benefits of this architecture:

  • Lightweight: Fluent Bit has minimal resource footprint
  • Flexible: Easy to add additional data sources or outputs
  • Cost-effective: Parseable stores data efficiently in object storage
  • Scalable: Handles high-volume metrics without performance degradation

Prerequisites

Before we begin, ensure you have:

  1. Baseten Account: Active deployment with models running
  2. Baseten API Key: Available from your Baseten dashboard
  3. Parseable Instance: Running locally or in the cloud (see the Parseable installation guide)
  4. Fluent Bit: Version 2.0 or higher (see the Fluent Bit download page)

Step 1: Configure Fluent Bit

Create a configuration file named fluent-bit-baseten.conf:

[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info

[INPUT]
    Name              prometheus_scrape
    Host              app.baseten.co
    Port              443
    Scrape_Interval   60
    Metrics_Path      /metrics
    HTTP_User         ${BASETEN_API_KEY}
    HTTP_Passwd       ""
    tls               On
    tls.verify        On

[OUTPUT]
    Name              http
    Match             *
    Host              localhost
    Port              8000
    URI               /api/v1/ingest
    Format            json
    Header            X-P-Stream baseten-metrics
    http_User         admin
    http_Passwd       admin
    tls               Off

Configuration breakdown:

  • Scrape_Interval: 60 seconds, that is, one request per minute, which stays well within Baseten's rate limit (6 requests/minute)
  • Metrics_Path: Baseten's Prometheus endpoint
  • HTTP_User: Your Baseten API key for authentication
  • X-P-Stream: The Parseable stream name where metrics will be stored
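
The relationship between the scrape interval and the rate limit is simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes every scraper counts against the same 6 requests/minute budget, which may not match how Baseten applies the limit across keys or accounts.

```python
import math

RATE_LIMIT_PER_MINUTE = 6  # the limit quoted in this guide


def requests_per_minute(scrape_interval_s, scrapers=1):
    """Requests per minute generated by `scrapers` inputs at the same interval."""
    return scrapers * 60 / scrape_interval_s


def min_safe_interval(scrapers=1, limit=RATE_LIMIT_PER_MINUTE):
    """Smallest whole-second interval that keeps all scrapers within the limit."""
    return math.ceil(scrapers * 60 / limit)


if __name__ == "__main__":
    print(requests_per_minute(60))  # one scraper at 60 s
    print(min_safe_interval(1))     # tightest interval for one scraper
    print(min_safe_interval(3))     # tightest interval for three scrapers
```

With a single input, anything at or above 10 seconds fits the budget, so the 60-second setting leaves headroom for a second scraper or manual curl checks.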

Step 2: Set Up Environment Variables

Export your Baseten API key as an environment variable:

export BASETEN_API_KEY="your_baseten_api_key_here"

Security tip: Never hardcode API keys in configuration files. Use environment variables or a secrets management system.

Step 3: Start the Monitoring Pipeline

Create a startup script start-baseten-monitoring.sh:

#!/bin/bash

# Check if Parseable is running
if ! curl -s http://localhost:8000/api/v1/about > /dev/null; then
    echo "Error: Parseable is not running on localhost:8000"
    echo "Please start Parseable first"
    exit 1
fi

# Check if API key is set
if [ -z "$BASETEN_API_KEY" ]; then
    echo "Error: BASETEN_API_KEY environment variable is not set"
    exit 1
fi

echo "Starting Baseten metrics collection..."
fluent-bit -c fluent-bit-baseten.conf

Make it executable and run:

chmod +x start-baseten-monitoring.sh
./start-baseten-monitoring.sh

You should see output indicating Fluent Bit is scraping metrics and sending them to Parseable.

Step 4: Verify Data in Parseable

  1. Open your browser and navigate to http://localhost:8000
  2. Log in with your Parseable credentials (default: admin/admin)
  3. Look for the baseten-metrics stream in the streams list
  4. Click on the stream to view incoming metrics

You should see metrics flowing in with fields like:

  • model_inference_latency_ms: Time taken for model inference
  • request_count: Number of requests processed
  • error_rate: Percentage of failed requests
  • gpu_utilization_percent: GPU usage metrics
  • queue_depth: Number of requests waiting in queue
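
The same check can be scripted instead of done in the browser. The sketch below builds a request for Parseable's SQL query endpoint. It assumes the endpoint is POST /api/v1/query with query, startTime, and endTime fields and basic authentication; confirm these details against the Parseable API documentation for your version. The helper names are hypothetical.

```python
import base64
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_query_request(sql, minutes=10, base_url="http://localhost:8000",
                        user="admin", password="admin", now=None):
    """Build (but do not send) a POST to the assumed Parseable query endpoint."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    body = {
        "query": sql,
        "startTime": start.isoformat(),  # RFC 3339 timestamps
        "endTime": end.isoformat(),
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/query",
        data=json.dumps(body).encode(),
        method="POST",
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
    )


def count_recent_rows(**kwargs):
    """Send the query; requires a running Parseable instance."""
    req = build_query_request(
        'SELECT COUNT(*) AS n FROM "baseten-metrics"', **kwargs)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Calling count_recent_rows() a few minutes after starting Fluent Bit should return a non-zero count if the pipeline is working.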

Understanding Baseten Metrics

Baseten exposes a comprehensive set of metrics for monitoring your AI deployments:

Performance Metrics

  • Inference Latency: Time from request to response
  • Throughput: Requests processed per second
  • Cold Start Time: Time to initialize a new model instance

Resource Metrics

  • GPU Utilization: Percentage of GPU compute being used
  • Memory Usage: RAM and VRAM consumption
  • CPU Usage: Host CPU utilization

Reliability Metrics

  • Error Rate: Failed requests as a percentage of total
  • Timeout Rate: Requests that exceeded time limits
  • Queue Depth: Backlog of pending requests

For a complete list, refer to Baseten's metrics documentation.
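
Throughput and error rate are usually derived rather than read directly, since Prometheus-style endpoints tend to expose cumulative counters. The sketch below shows the arithmetic between two scrapes. The Snapshot fields are hypothetical names, not Baseten metric names, and counter resets (for example after a restart) are not handled.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Cumulative counters read at one scrape (hypothetical field names)."""
    timestamp_s: float
    requests_total: int
    errors_total: int


def throughput_rps(prev, curr):
    """Requests per second between two scrapes."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return (curr.requests_total - prev.requests_total) / elapsed


def error_rate_percent(prev, curr):
    """Failed requests as a percentage of requests in the interval."""
    requests = curr.requests_total - prev.requests_total
    if requests == 0:
        return 0.0  # no traffic: report 0 rather than dividing by zero
    return 100.0 * (curr.errors_total - prev.errors_total) / requests


if __name__ == "__main__":
    prev = Snapshot(timestamp_s=0, requests_total=1000, errors_total=10)
    curr = Snapshot(timestamp_s=60, requests_total=1600, errors_total=25)
    print(throughput_rps(prev, curr), error_rate_percent(prev, curr))
```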

Querying Metrics with SQL

One of Parseable's key features is the ability to query metrics using SQL. The stream name baseten-metrics contains a hyphen, so it must be double-quoted when used as a table name. Here are some useful queries:

Average Inference Latency Over Time

SELECT
    DATE_TRUNC('minute', p_timestamp) AS time_bucket,
    AVG(model_inference_latency_ms) AS avg_latency_ms
FROM "baseten-metrics"
WHERE p_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY time_bucket
ORDER BY time_bucket DESC;

Error Rate Analysis

SELECT 
    DATE_TRUNC('hour', p_timestamp) AS hour,
    SUM(CASE WHEN status_code >= 400 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS error_rate_percent
FROM "baseten-metrics"
WHERE p_timestamp > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;

Peak Usage Identification

SELECT 
    DATE_TRUNC('hour', p_timestamp) AS hour,
    MAX(request_count) AS peak_requests,
    MAX(gpu_utilization_percent) AS peak_gpu_usage
FROM "baseten-metrics"
WHERE p_timestamp > NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY peak_requests DESC
LIMIT 10;

Setting Up Alerts

You can configure alerts in Parseable to notify you of critical issues:

High Error Rate Alert

SELECT COUNT(*) AS error_count
FROM "baseten-metrics"
WHERE p_timestamp > NOW() - INTERVAL '5 minutes'
  AND status_code >= 400
HAVING COUNT(*) > 10;

High Latency Alert

SELECT AVG(model_inference_latency_ms) AS avg_latency
FROM "baseten-metrics"
WHERE p_timestamp > NOW() - INTERVAL '5 minutes'
HAVING AVG(model_inference_latency_ms) > 1000;
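
The logic behind both alert queries can be sketched in Python to make the thresholds explicit. This is an illustration of the conditions only, not how Parseable evaluates alerts internally. The function and alert names are hypothetical, and the inputs stand for the rows in one 5-minute window.

```python
def evaluate_alerts(status_codes, latencies_ms,
                    max_errors=10, max_avg_latency_ms=1000):
    """Return names of alerts that fire for one 5-minute window."""
    fired = []

    # High error rate: more than `max_errors` responses with status >= 400.
    error_count = sum(1 for code in status_codes if code >= 400)
    if error_count > max_errors:
        fired.append("high_error_rate")

    # High latency: mean latency above the threshold (skipped if no samples).
    if latencies_ms:
        avg_latency = sum(latencies_ms) / len(latencies_ms)
        if avg_latency > max_avg_latency_ms:
            fired.append("high_latency")

    return fired


if __name__ == "__main__":
    print(evaluate_alerts([500] * 11 + [200] * 5, [1200, 900, 1500]))
```

Both comparisons are strict, matching the queries: exactly 10 errors or an average of exactly 1000 ms does not fire.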

Cost Optimization Tips

Monitoring helps you optimize costs:

  1. Identify Low-Traffic Periods: Scale down during off-peak hours
  2. Detect Inefficient Models: Find models with high latency or resource usage
  3. Optimize Batch Sizes: Analyze throughput vs. latency trade-offs
  4. Right-Size Instances: Match instance types to actual resource needs

Example: Daily Cost Analysis Query

SELECT
    DATE_TRUNC('day', p_timestamp) AS day,
    SUM(request_count) AS total_requests,
    AVG(gpu_utilization_percent) AS avg_gpu_usage,
    -- Placeholder rate of $0.001 per request; substitute your actual Baseten pricing
    SUM(request_count) * 0.001 AS estimated_cost_usd
FROM "baseten-metrics"
WHERE p_timestamp > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day DESC;
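
The first tip, identifying low-traffic periods, can be sketched as a small post-processing step over hourly request counts exported from a query. The function name, the input shape of (hour_of_day, request_count) pairs, and the 25% threshold are all assumptions chosen for illustration.

```python
from statistics import mean


def low_traffic_hours(hourly_requests, fraction=0.25):
    """Return hours of day whose mean traffic is below `fraction` of the busiest hour.

    `hourly_requests` is an iterable of (hour_of_day, request_count) pairs,
    typically several days of data so each hour has multiple observations.
    """
    by_hour = {}
    for hour, count in hourly_requests:
        by_hour.setdefault(hour, []).append(count)
    if not by_hour:
        return []

    averages = {hour: mean(counts) for hour, counts in by_hour.items()}
    peak = max(averages.values())
    return sorted(hour for hour, avg in averages.items() if avg < fraction * peak)


if __name__ == "__main__":
    data = [(2, 10), (2, 20), (14, 400), (14, 600), (20, 200)]
    print(low_traffic_hours(data))
```

Hours returned here are candidates for scaling down; whether that is worthwhile depends on cold start times for your models.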

Troubleshooting

Parseable Connection Issues

Symptom: Fluent Bit cannot connect to Parseable

Solutions:

  1. Verify Parseable is running: curl http://localhost:8000/api/v1/about
  2. Check credentials match your Parseable configuration
  3. Ensure no firewall is blocking port 8000
  4. Review Fluent Bit logs for specific error messages
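
To separate Fluent Bit problems from Parseable problems, it can help to send one test event directly, mirroring the [OUTPUT] section of the configuration. The sketch below builds the same kind of request using only the standard library. The helper names and the test event are hypothetical, and the credentials are the defaults used in this guide.

```python
import base64
import json
import urllib.request


def build_ingest_request(events, stream="baseten-metrics",
                         base_url="http://localhost:8000",
                         user="admin", password="admin"):
    """Build (but do not send) a POST mirroring the Fluent Bit http output."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ingest",
        data=json.dumps(events).encode(),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "X-P-Stream": stream,  # target stream, as in the Header line
        },
    )


def send_test_event(**kwargs):
    """Send one hypothetical test event; requires a running Parseable instance."""
    req = build_ingest_request([{"source": "connectivity-check"}], **kwargs)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

If send_test_event() succeeds but Fluent Bit still cannot deliver data, the problem is more likely in the Fluent Bit output settings than in Parseable itself.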

No Metrics Appearing

Symptom: Stream exists but no data is flowing

Solutions:

  1. Verify Baseten API key is valid and has metrics access
  2. Check Fluent Bit logs for authentication errors
  3. Ensure your Baseten models are deployed and receiving traffic
  4. Confirm the metrics endpoint is accessible: curl -H "Authorization: Api-Key $BASETEN_API_KEY" https://app.baseten.co/metrics

Rate Limit Errors (HTTP 429)

Symptom: Baseten returns 429 Too Many Requests

Solutions:

  • Keep scrape interval at 60 seconds or higher (Baseten allows 6 requests/minute)
  • Check if multiple scrapers are running simultaneously
  • Review Baseten's rate limit documentation for your plan tier
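
If you run a custom scraper alongside Fluent Bit, backing off after a 429 avoids hammering the endpoint. The sketch below computes a wait time using exponential backoff with jitter. The function name and defaults are hypothetical; the 10-second base simply reflects 6 requests/minute, and whether Baseten sends a Retry-After header is an assumption.

```python
import random


def retry_delay_s(attempt, retry_after=None, base_s=10.0, cap_s=300.0, jitter=True):
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # Honor the server's hint when it is a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to backoff

    # Exponential backoff: base, 2x base, 4x base, ... capped at cap_s.
    delay = min(cap_s, base_s * (2 ** attempt))
    if jitter:
        # Spread retries out so parallel scrapers do not retry in lockstep.
        delay = random.uniform(delay / 2, delay)
    return delay


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, retry_delay_s(attempt, jitter=False))
```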

High Memory Usage

Symptom: Fluent Bit consuming excessive memory

Solutions:

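  • Limit input buffering in the Fluent Bit configuration (for example with Mem_Buf_Limit on the input)
  • Keep the flush interval short so buffered data is sent promptly instead of accumulating in memory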
  • Consider filtering out unnecessary metrics

Advanced Configuration

Filtering Specific Metrics

To reduce data volume, filter only the metrics you need:

[FILTER]
    Name    grep
    Match   *
    Regex   __name__ (model_inference_latency_ms|request_count|error_rate)

Adding Custom Labels

Enrich metrics with additional context:

[FILTER]
    Name    modify
    Match   *
    Add     environment production
    Add     region us-west-2
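
The Add rule only inserts a key when it is not already present, whereas the filter's Set rule overwrites. The Python sketch below mimics the Add behavior on a single record to show the effect. The function name and the sample record are hypothetical.

```python
def apply_add_rules(record, rules):
    """Mimic the modify filter's Add rule: add a key only if it is absent."""
    enriched = dict(record)  # leave the input record untouched
    for key, value in rules:
        enriched.setdefault(key, value)
    return enriched


if __name__ == "__main__":
    rules = [("environment", "production"), ("region", "us-west-2")]
    record = {"name": "request_count", "region": "eu-west-1"}
    print(apply_add_rules(record, rules))
```

In this example the existing region value is kept and only environment is added, so use Set instead of Add if the label should always win.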

Multiple Baseten Deployments

Monitor multiple Baseten accounts by creating separate input sections:

[INPUT]
    Name              prometheus_scrape
    Alias             baseten_production
    Tag               baseten.production
    Host              app.baseten.co
    Port              443
    Metrics_Path      /metrics
    Scrape_Interval   60
    HTTP_User         ${BASETEN_PROD_API_KEY}
    tls               On

[INPUT]
    Name              prometheus_scrape
    Alias             baseten_staging
    Tag               baseten.staging
    Host              app.baseten.co
    Port              443
    Metrics_Path      /metrics
    Scrape_Interval   60
    HTTP_User         ${BASETEN_STAGING_API_KEY}
    tls               On

Conclusion

You now have a complete observability solution for your Baseten AI deployments. This setup provides:

  • Real-time visibility into model performance and resource usage
  • Cost insights to optimize your infrastructure spending
  • Reliability monitoring to catch issues before they impact users
  • Flexible querying with SQL for custom analysis

The combination of Baseten's metrics API, Fluent Bit's lightweight collection, and Parseable's efficient storage creates a powerful, cost-effective monitoring stack that scales with your AI workloads.

For questions or feedback, join the Parseable community or check out the documentation.
