Extending Ray monitoring with Parseable

Debabrata Panigrahi
January 7, 2026
Step-by-step guide to monitoring a Ray deployment with Parseable

Introduction

Ray.io is an open-source, unified compute framework that allows you to scale Python and AI applications from a single machine to a massive cluster with minimal code changes. But once you move beyond the hello-world stage, a new question shows up quickly:

What is my Ray cluster actually doing right now, and how healthy is it?

The built-in Ray Dashboard is great for an at-a-glance view, but for real operations work you need something more:

  • Historical metrics across many runs
  • Centralized storage for all cluster metrics
  • Cross-layer analysis with logs, traces, and audit data

In this post, we’ll set up an observability pipeline that monitors Ray with Parseable, with Fluent Bit handling ingestion:

  • Ray exposes Prometheus metrics on multiple ports.
  • Fluent Bit scrapes those metrics and sends them to Parseable in OpenTelemetry metrics format.
  • Parseable receives them under a dedicated raymetrics dataset.

Let’s walk through the architecture, configuration, and how to actually use these metrics in Parseable.

Architecture Overview

The pipeline looks like this:

  1. Ray
    Ray head and worker components expose Prometheus metrics endpoints on different ports.

  2. Fluent Bit (v4.0.7)
    Uses the prometheus_scrape input plugin to scrape those endpoints every 15 seconds, and the opentelemetry output plugin to send metrics in OTLP format.

  3. Parseable
    Receives OTLP metrics at http://localhost:8000/v1/metrics
    Uses headers to route data into a dedicated raymetrics dataset
    Stores all metrics in object storage (S3) and exposes them via SQL.

Visually:

Ray Prometheus endpoints → Fluent Bit (Prometheus scrape) → OTLP → Parseable (dataset: raymetrics)

Prerequisites

You’ll need:

  • A running Ray cluster (local or remote).
  • Fluent Bit v4.x installed.
  • A running Parseable instance accessible at localhost:8000 (or your own host).
  • Network connectivity from Fluent Bit to Parseable.

We’ll assume:

  • Ray Dashboard: http://localhost:8265
  • Parseable: http://localhost:8000
  • Parseable dataset for Ray metrics: raymetrics

If your hostnames or ports differ, just substitute them in the configuration snippets below.
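A note on ports: Ray picks its metrics ports per session unless you pin them, so the ports you’ll see in the Fluent Bit configuration below (55979, 44217, 44227) are simply the ones our test cluster happened to use. One way to see what your cluster exposes is the Prometheus service-discovery file Ray writes for the current session; this is a sketch that assumes the default /tmp/ray temp directory on a local head node.

# Scrape targets Ray is currently advertising (default temp dir assumed):
cat /tmp/ray/prom_metrics_service_discovery.json

If you’d rather have a predictable port on the head node, you can start Ray with ray start --head --metrics-export-port=8080 and scrape that port instead.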

Configure Fluent Bit to Scrape Ray Metrics

First, create a configuration file, for example fluent-bit-ray.conf, that tells Fluent Bit:

  • How often to flush data.
  • Which Prometheus endpoints to scrape.
  • Where to send data in OTLP format.

Here is the configuration we’ll use:

[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    debug

[INPUT]
    Name              prometheus_scrape
    Host              127.0.0.1
    Port              55979
    Tag               ray.metrics.1
    Scrape_Interval   15

[INPUT]
    Name              prometheus_scrape
    Host              127.0.0.1
    Port              44217
    Tag               ray.metrics.2
    Scrape_Interval   15

[INPUT]
    Name              prometheus_scrape
    Host              127.0.0.1
    Port              44227
    Tag               ray.metrics.3
    Scrape_Interval   15

[OUTPUT]
    Name          opentelemetry
    Match         ray.metrics.*
    Host          localhost
    Port          8000
    Metrics_uri   /v1/metrics
    Log_response_payload True
    Tls           Off
    Http_User     admin
    Http_Passwd   admin
    Header        X-P-Stream raymetrics
    Header        X-P-Log-Source otel-metrics
    compress      gzip
    Retry_Limit   3

A quick breakdown:

  • [SERVICE]
    Flush 5 – send data every 5 seconds.
    Log_Level debug – very useful while you’re wiring everything up.

  • [INPUT] prometheus_scrape
    Host 127.0.0.1 / Port 55979, 44217, 44227 – your Ray metrics endpoints.
    Scrape_Interval 15 – scrape each endpoint every 15 seconds.
    Tag ray.metrics.X – tags used to route metrics to the correct output.

  • [OUTPUT] opentelemetry
    Match ray.metrics.* – send all three input streams to this output.
    Host localhost, Port 8000, Metrics_uri /v1/metrics – your Parseable OTLP metrics endpoint.
    Header X-P-Stream raymetrics – tells Parseable to store everything in stream raymetrics.
    Header X-P-Log-Source otel-metrics – optional metadata about the source.
    Log_response_payload True – log Parseable’s response body (handy for debugging).
    Retry_Limit 3 – retry failed sends a few times before giving up.
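Before starting Fluent Bit, a quick sanity check that the scrape targets actually answer saves a debugging round-trip. The ports below are the ones from the [INPUT] sections above; substitute your own.

# Each endpoint should answer in Prometheus text format (# HELP / # TYPE lines);
# an empty section means that port is not serving metrics.
for port in 55979 44217 44227; do
  echo "== port $port =="
  curl -sf --max-time 5 "http://127.0.0.1:${port}/metrics" | head -n 3
done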

Start Fluent Bit and Verify Data Flow

Start Fluent Bit with your configuration:

fluent-bit -c fluent-bit-ray.conf > fluent-bit.log 2>&1 &

Your log file fluent-bit.log should show all the interesting stages:

  • Inputs initialized:
[ info]  inputs:
[ info]      prometheus_scrape
[ info]      prometheus_scrape
[ info]      prometheus_scrape
  • Output configured:
[ info]  outputs:
[ info]      opentelemetry.0
  • Metrics being encoded and sent:
[debug] [output:opentelemetry:opentelemetry.0] cmetrics msgpack size: 87913
[debug] [output:opentelemetry:opentelemetry.0] final payload size: 108896
[debug] [upstream] KA connection #45 to localhost:8000 is connected
[ info] [output:opentelemetry:opentelemetry.0] localhost:8000, HTTP status=200
[debug] [output:opentelemetry:opentelemetry.0] http_post result FLB_OK

Whenever Parseable is temporarily unavailable, you will also see messages like:

[error] [http_client] broken connection to localhost:8000 ?
[error] [output:opentelemetry:opentelemetry.0] localhost:8000, HTTP status=0
[debug] [output:opentelemetry:opentelemetry.0] http_post result FLB_ERROR

…and then Fluent Bit retries until it can reconnect and see HTTP status=200 again.

This alone is useful telemetry:

  • You know Fluent Bit is scraping Ray (tasks being created for prometheus_scrape inputs).
  • You know Parseable is receiving metrics (HTTP status=200).
  • You can see payload sizes and retry patterns when the connection is unstable.
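A couple of grep one-liners against fluent-bit.log turn those log lines into a rough health check for the transport itself:

# Successful OTLP posts to Parseable so far:
grep -c "HTTP status=200" fluent-bit.log

# Attempts that never reached Parseable (status=0, broken connection):
grep -c "HTTP status=0" fluent-bit.log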

At this point, your setup is:

  • Ray Dashboard: http://127.0.0.1:8265
  • Fluent Bit: running in the background, scraping 3 Ray metrics endpoints every 15 seconds.
  • Parseable: receiving OTLP metrics at localhost:8000/v1/metrics in stream raymetrics.

Confirm Ray Metrics in Parseable

In Parseable, you should now see a dataset called raymetrics populated with OpenTelemetry metrics from Ray.

Assuming you created the dataset (or Parseable auto-created it on first write), you can:

  1. Open the Parseable UI and select the raymetrics stream.
  2. Filter on recent timestamps to verify new data is flowing.
  3. Inspect a few records to see what the schema looks like for your OTLP metrics.

Ray metrics in Parseable
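You can do the same check from the command line. The snippet below is a sketch that assumes Parseable’s stream-listing endpoint at /api/v1/logstream and the default admin/admin credentials used in the Fluent Bit output above; adjust both for your deployment.

# The raymetrics stream should appear once the first batch has been ingested:
curl -s -u admin:admin http://localhost:8000/api/v1/logstream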

The Ray metrics dataset includes columns like:

Column                Description
metric_name           Metric identifier (e.g., ray_node_cpu_utilization)
data_point_value      The metric value
p_timestamp           Parseable ingestion timestamp
time_unix_nano        Original metric timestamp
metric_NodeAddress    Ray node IP address
metric_JobId          Ray job identifier
metric_State          Task/actor state
metric_ObjectState    Object store state
metric_SessionName    Ray cluster session name
metric_WorkerId       Worker process identifier
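To see which metric names are actually arriving (and therefore what you can build dashboards on), you can also run SQL over HTTP. This is a sketch assuming Parseable’s query API at /api/v1/query, which takes a SQL string plus an explicit time range; substitute your own window and credentials.

# Distinct Ray metric names ingested in the chosen time range:
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:8000/api/v1/query \
  -d '{
    "query": "SELECT DISTINCT metric_name FROM raymetrics",
    "startTime": "2026-01-07T10:00:00.000Z",
    "endTime": "2026-01-07T11:00:00.000Z"
  }'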

Core Ray Health Dashboards in Parseable

Once Ray metrics land in Parseable, you can start building Ray health views that go beyond the built-in dashboard.

1. Node CPU and Memory Utilization

You want a quick answer to:

"Which Ray nodes are under the heaviest CPU and memory load?"

Conceptually, you can:

  • Filter ray_node_cpu_utilization by node.
  • Bucket in time (for example 1-minute windows).
  • Plot average CPU over time per node.

SELECT
  DATE_TRUNC('minute', raymetrics.p_timestamp) AS minute,
  raymetrics."metric_NodeAddress" AS node,
  AVG(raymetrics.data_point_value) AS avg_cpu
FROM raymetrics
WHERE raymetrics."metric_name" = 'ray_node_cpu_utilization'
GROUP BY minute, node
ORDER BY minute, node;

Ray node CPU utilization

Do the same for memory utilization metrics (for example ray_node_mem_used) to catch nodes that are close to OOM before they cause job failures.

2. Task Throughput and Failures

Ray emits metrics about tasks by state (running, queued, failed). You can:

  • Track how many tasks enter a failed state over time.
  • Group by job or task name to identify hotspots.

For example:

SELECT
  DATE_TRUNC('minute', p_timestamp)  AS minute,
  metric_JobId                       AS job,
  metric_State                       AS state,
  SUM(data_point_value)              AS task_count
FROM raymetrics
WHERE metric_name = 'ray_tasks'
  AND metric_State = 'FAILED'
GROUP BY minute, job, state
ORDER BY minute, task_count DESC;

This helps you answer:

  • Which jobs are most error-prone?
  • Did a particular deployment spike task failures?

3. Object Store Pressure

Ray’s performance depends heavily on its object store. Metrics such as "object store memory used" and "spillover" are crucial.

You can:

  • Monitor memory usage as a percentage of the object store.
  • Alert when it exceeds a threshold (for example 80%); a sketch of such a check follows the query below.

SELECT
  DATE_TRUNC('minute', p_timestamp)  AS minute,
  metric_NodeAddress                 AS node,
  metric_ObjectState                 AS object_state,
  AVG(data_point_value)              AS avg_memory
FROM raymetrics
WHERE metric_name = 'ray_object_store_memory'
GROUP BY minute, node, object_state
ORDER BY minute, node;
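To act on the threshold idea from the list above, you can wrap a variant of this query in a small script and run it from cron or your alerting tool. The sketch below is illustrative only: it reuses the assumed /api/v1/query endpoint from earlier and checks a fixed byte threshold rather than a true percentage, since the percentage depends on how you track object store capacity in your cluster.

# Flag 1-minute windows where average object store usage crossed the threshold.
THRESHOLD=8000000000   # bytes; pick a value that fits your object store size
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:8000/api/v1/query -d @- <<EOF
{
  "query": "SELECT DATE_TRUNC('minute', p_timestamp) AS minute, metric_NodeAddress AS node, AVG(data_point_value) AS avg_bytes FROM raymetrics WHERE metric_name = 'ray_object_store_memory' GROUP BY minute, node HAVING AVG(data_point_value) > $THRESHOLD ORDER BY minute",
  "startTime": "2026-01-07T10:00:00.000Z",
  "endTime": "2026-01-07T11:00:00.000Z"
}
EOF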

If object store pressure correlates with spikes in failed tasks, you have just connected cluster health to application reliability.

Extending the Setup

Once the basics are in place, there are several easy extensions.

Add Ray Logs to Parseable

Use Fluent Bit tail or forward inputs to send Ray component logs into a raylogs stream. Now you can correlate metrics spikes with log error patterns.
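As a minimal sketch, you could append something like the following to the existing Fluent Bit configuration. It assumes Ray’s default log location under /tmp/ray/session_latest/logs and Parseable’s JSON ingest endpoint at /api/v1/ingest; in production you’d add parsers, multiline handling, and real credentials.

cat >> fluent-bit-ray.conf <<'EOF'

[INPUT]
    Name              tail
    Path              /tmp/ray/session_latest/logs/*.log
    Tag               ray.logs
    Refresh_Interval  10

[OUTPUT]
    Name          http
    Match         ray.logs
    Host          localhost
    Port          8000
    URI           /api/v1/ingest
    Format        json
    Tls           Off
    Http_User     admin
    Http_Passwd   admin
    Header        X-P-Stream raylogs
EOF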

Add Traces from Ray Workers (If Applicable)

If your Ray workers emit OpenTelemetry traces, point them at Parseable’s OTLP traces endpoint and tie task-level traces back to the node metrics.

Multi-Cluster or Anyscale Deployments

For Ray clusters running on Anyscale or across multiple environments, add cluster labels (for example cluster_id, env) to your metrics and logs. Then:

SELECT
  metric_SessionName        AS cluster,
  metric_NodeAddress        AS node,
  AVG(data_point_value)     AS avg_cpu
FROM raymetrics
WHERE metric_name = 'ray_node_cpu_utilization'
GROUP BY cluster, node;

This gives you a fleet-wide view of Ray health in one place.

Conclusion

You now have a complete monitoring path:

  • Ray exposes Prometheus metrics on several ports.
  • Fluent Bit scrapes those endpoints every 15 seconds using prometheus_scrape.
  • Fluent Bit sends metrics in OpenTelemetry format to Parseable via the opentelemetry output.
  • Parseable stores them in the raymetrics stream, where you can query, visualize, and correlate them with other signals.

With this setup, monitoring Ray stops being an ad-hoc dashboard task and becomes part of your unified observability story:

  • Node health, task reliability, and object store pressure in one place.
  • Fluent Bit transport health along the same timeline.
  • The ability to layer in logs, traces, and audit data as your system grows.