Zero-Shot Forecasting: Our Search for a Time-Series Foundation Model

Anant Vindal · Jun 3, 2025

Introduction

In the last few years, the field of time-series forecasting has seen a fundamental shift. Where we once depended solely on classic statistical methods (think ARIMA, SARIMA, and Prophet), new “foundation” models have emerged, promising to bring the power and flexibility of large language models (LLMs) into the world of time-series data. The allure is obvious: can we build a single, reusable forecasting model that works across a variety of datasets and domains, instead of painstakingly training a new model for every scenario?

Parseable is built to handle our users’ observability data at any scale, a nonstop stream of raw ingest counts, infrastructure vitals, and fine-grained application signals. Running a separate, hand-tuned forecasting model for every slice quickly turns into a treadmill: each new stream or workload tweak demands fresh hyper-params, retrains, and ever-growing config sprawl. All that manual churn slows forecasts and breeds drift, so the results never feel fully trustworthy.

Then came the rise of foundation models, which revolutionized natural language processing by offering strong zero-shot and transfer learning capabilities. Researchers began asking a natural question: if LLMs can generalize to new tasks with minimal retraining, could similar techniques be applied to time-series data? What if you could just hand any telemetry stream to a pre-trained foundation model and immediately get a high-quality forecast, regardless of whether the model had seen data from that source before?

Motivated by this possibility, we set out to benchmark a new generation of time-series foundation models, Amazon Chronos, Google TimesFM, IBM Tiny Time-Mixers, and Datadog Toto. Our goal was to assess how well these models perform on two representative tasks: a relatively straightforward forecasting problem (predicting ingestion volumes) and a more complex multivariate problem (forecasting multiple pod-level metrics). Along the way, we compared them to classical baselines and took note of both practical and technical trade-offs.

This post details our methodology, the challenges we encountered, how we evaluated the models, and what we learned from putting foundation models to the test on real-world observability data.

Why Foundation Models?

The idea of “foundation models” has fundamentally changed how we approach complex machine learning problems. In natural language processing, models like GPT have shown that a single, large model trained on vast and diverse datasets can generalize well to entirely new tasks, sometimes even without fine-tuning. This zero-shot capability means a single model can perform sentiment analysis, summarization, translation, or question-answering, just by changing the prompt.

In the world of time-series forecasting, the appeal of such flexibility is obvious, especially for modern data engineering and observability platforms. Traditionally, every new data stream, whether it’s CPU utilization, request rates, or disk I/O, required its own model, hyperparameter tuning, and regular retraining. For an SRE or platform engineer, this quickly becomes unmanageable as the number of streams explodes. If a pipeline ingests data from a hundred microservices, does every service metric really need its own hand-tuned ARIMA or Prophet model? The answer, up until recently, was “yes.”

Foundation models for time series are built to change that. The core motivation is scalability and adaptability: train a single, large model (often with billions of parameters) on a wide range of time-series datasets and let it learn the underlying “language” of temporal data. Once trained, this model should ideally handle a completely new telemetry stream, even if it has never seen data of that exact shape or domain before. In theory, you could input any new time series, whether it’s network packet counts, database query durations, or energy consumption readings, and get a high-quality forecast without retraining.

This is a huge leap from traditional approaches. Classic statistical models like ARIMA or seasonal decomposition excel when you have clean, stationary data and a well-understood seasonal pattern, but fall short when faced with missing values, sudden regime changes, or non-standard periodicity. More importantly, these models can’t transfer knowledge from one dataset to another; every dataset is a blank slate.

On the other hand, foundation models bring the promise of:

  • Zero-shot forecasting: Run predictions on new datasets without needing to retrain the model.

  • Robustness to data variety: Adapt to changing data distributions, missing values, and previously unseen behaviors.

  • Simplified operations: Lower the engineering overhead of managing hundreds of individual models.

  • Transfer learning: Leverage patterns learned from one domain (e.g., retail traffic) to help forecast another (e.g., system metrics).

In practical terms, this means you can dramatically speed up time to value for forecasting tasks, especially in fast-moving environments like cloud infrastructure and observability, where new streams appear all the time and data is messy by default. Instead of spending weeks building and maintaining forecasting pipelines for each stream, you could plug in a foundation model and start generating insights almost immediately.

Of course, there are open questions: Do these models really perform well on operational data? Can they match (or beat) hand-tuned classic models? And what are the compute and operational trade-offs involved? Our benchmarking journey set out to answer exactly those questions, by seeing how today’s top time-series foundation models actually perform on real, production-scale telemetry.

Models Explored

To understand what “zero-shot” time-series forecasting looks like in practice, we selected a range of recently released foundation models. Each of these models represents a different philosophy and technical approach to the problem, some focus on universality, others on resource efficiency, and a few on tackling multivariate streams directly. Here’s a brief tour of the models we benchmarked:

Amazon Chronos

Chronos is designed to be a universal forecaster for time-series data, capable of handling both univariate and multivariate streams. Built on a transformer-based architecture and trained on a massive collection of open time-series datasets, Chronos aims to generalize well across domains, from finance and retail to infrastructure telemetry. With support for both batch and streaming predictions, Chronos is particularly attractive for use cases where the underlying data distributions can shift rapidly.

Technical highlights

  • Architecture: Transformer-based (details in the official paper)

  • Supports: Univariate & Multivariate forecasting

  • Typical parameter count: Tens to hundreds of millions, version-dependent (the Chronos Bolt base model we benchmarked is ~205M)

  • License: Apache 2.0
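
To make “zero-shot” concrete, here is a minimal sketch using the published ChronosPipeline interface from the open chronos-forecasting package: it loads a pretrained checkpoint once and forecasts a series it has never seen. The checkpoint name and the synthetic input series are illustrative; the Bolt variant we actually benchmarked exposes a similar pipeline.

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a pretrained checkpoint once; it can then forecast arbitrary series zero-shot.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",   # illustrative; we benchmarked the larger Bolt base model
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Stand-in for a real telemetry stream: 512 one-minute ingest totals.
history = torch.tensor(
    np.sin(np.linspace(0, 20 * np.pi, 512)) * 500 + 1000, dtype=torch.float32
)

# Sample-based probabilistic forecast, shape [num_series, num_samples, horizon].
samples = pipeline.predict(history, prediction_length=64)
median_forecast = np.quantile(samples[0].numpy(), 0.5, axis=0)
```

No retraining, no hyperparameter search: the same loaded pipeline can be pointed at the next stream that shows up.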

Google TimesFM

Google’s TimesFM (Time Series Foundation Model) is positioned as the GPT-style model for time-series analysis. Trained on billions of data points, TimesFM leverages a large, attention-based architecture to capture temporal dependencies and seasonalities. It is primarily geared towards univariate forecasting tasks, such as predicting sales, energy usage, or metrics like CPU utilization, and is often used in research for benchmarking “zero-shot” and “few-shot” performance.

Technical highlights

  • Architecture: Large language model adaptation for time-series

  • Supports: Univariate forecasting

  • Parameter count: Hundreds of millions (the TimesFM 2.0 checkpoint we tested is 500M)

  • License: Apache 2.0

IBM Tiny Time-Mixers

IBM’s Tiny Time-Mixers take the opposite approach: Instead of scaling up, they focus on making time-series foundation models small and efficient enough to run on the edge or in resource-constrained environments. Despite their compact size, these models are trained on diverse datasets and often deliver surprising accuracy, making them a good fit for IoT, embedded devices, or any observability scenario where every CPU cycle counts.

Technical highlights

  • Architecture: Lightweight mixer-based neural network

  • Supports: Univariate forecasting

  • Parameter count: Hundreds of thousands to a few million

  • License: Apache 2.0

Datadog Toto

Datadog’s Toto is a production-grade, multivariate time-series foundation model. It’s tailored for real-world infrastructure monitoring, where multiple correlated metrics need to be forecasted together (think CPU, memory, and network traffic). Toto is also designed for scalability and reliability, with emphasis on fast warm-up times and high throughput.

Technical highlights

  • Architecture: Multivariate deep learning model (architecture details partially proprietary)

  • Supports: Multivariate forecasting

  • Parameter count: ~151M for the open base release we benchmarked

  • License: Apache 2.0

| Model | Publisher | Params | Uni/Multi | License | Notable Feature |
| --- | --- | --- | --- | --- | --- |
| Amazon Chronos (Bolt) | AWS | ~205M (base) | Both | Apache 2.0 | General-purpose, scalable |
| Google TimesFM | Google | ~500M (2.0) | Uni | Apache 2.0 | Large, transformer-based |
| IBM Tiny Time-Mixers | IBM | <5M | Uni | Apache 2.0 | Ultra-lightweight, edge-ready |
| Datadog Toto | Datadog | ~151M | Multi | Apache 2.0 | Production multivariate |

These models collectively showcase the diversity in approaches, trade-offs, and intended deployment environments in the modern time-series forecasting landscape. In the following sections, we’ll dive into how we evaluated them and what we learned.

Evaluation Metric

When benchmarking forecasting models, especially across a wide variety of time-series tasks, choosing the right evaluation metric is critical. It needs to be robust, interpretable, and fair across datasets that may vary widely in scale, seasonality, and behavior. For our study, we selected Mean Absolute Percentage Error (MAPE) as the primary metric.

Why MAPE?

MAPE stands out for a few practical reasons. First, it’s easy to interpret: the result is simply the average absolute error, expressed as a percentage of the true values. A MAPE of 5% means, on average, your forecasts are within 5% of the real-world values, intuitive enough for both data scientists and engineers managing the infrastructure. This makes it a great fit when you need to quickly assess whether a model’s performance is “good enough” to trust in production.

Second, MAPE is scale-invariant. That means we can compare errors from streams measured in bytes per second with those measured in milliseconds or CPU units, without worrying that one type of metric will dominate the results simply because of its numeric range. This is especially important in observability, where you might be forecasting everything from request rates (hundreds per second) to memory utilization (gigabytes).

How does MAPE work?
At its core, MAPE calculates the average of the absolute differences between the predicted and actual values, divided by the actual values, and multiplies this ratio by 100 to express it as a percentage:

MAPE = (1/n) × Σ |(actual - forecast) / actual| × 100

Where:

  • actual is the real observed value,

  • forecast is the predicted value,

  • and n is the total number of observations in the test set.

When is MAPE well-suited?
MAPE excels in most practical forecasting scenarios where the underlying data doesn’t have long stretches of zeros (since dividing by zero is undefined) or negative values. In our benchmarking, we filtered or masked data accordingly to ensure fair comparisons.
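
For reference, this is what that computation looks like as a small NumPy helper; the zero-masking mirrors the filtering described above (function and variable names are our own, not from any particular library):

```python
import numpy as np

def mape(actual, forecast) -> float:
    """Mean Absolute Percentage Error, skipping points where the actual value is zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = actual != 0  # avoid division by zero, as discussed above
    return float(np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100)

# A forecast that is consistently off by 5% yields a MAPE of 5.0
print(mape([100, 200, 0, 400], [105, 190, 3, 420]))  # -> 5.0
```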

We also considered other metrics like RMSE (Root Mean Squared Error) and sMAPE (Symmetric MAPE) for completeness, but kept the primary focus on MAPE for its clarity and direct relevance to production observability pipelines. When the goal is to provide easy-to-grok, actionable metrics for engineers and SREs, who might need to explain model accuracy to their teams, MAPE simply makes sense.

Dataset Used

The reliability of any benchmarking effort depends heavily on the diversity and realism of the underlying datasets. For this study, we set out to mimic the kind of time-series data that an observability or infrastructure engineering team actually deals with: no sanitized academic benchmarks, just real production telemetry.

Data Sources

For this benchmark, we focused exclusively on a complex, multivariate forecasting task designed to reflect real-world challenges in modern observability. Our dataset consisted of Kubernetes pod metrics collected from a production retail checkout application. These streams included CPU usage, memory consumption, and request latency, all sampled at one-second intervals. This setup provided a “ground truth” that included steady workloads, sudden spikes, and all the operational quirks you see in live systems, perfect for stress-testing both classic and foundation forecasting models.

Pre-Processing Steps

Time-series forecasting models (especially deep learning ones) are highly sensitive to noise, missing values, and sampling inconsistencies. To level the playing field, we applied the following pre-processing pipeline (a code sketch of these steps follows the list):

  • Resampling:
    All pod metric streams were originally sampled at 1 Hz (one-second intervals). For the benchmarking tasks, we downsampled these to 1-minute averages to reduce noise and make the forecasting task more tractable, while still preserving major trends and events.

  • Missing Value Handling:
    Short gaps in the metrics (caused by scraping hiccups or minor pod restarts) were filled using forward-fill imputation. Larger missing intervals (from major outages or prolonged downtime) were masked out, so models weren’t unfairly penalized for “forecasting the impossible.”

  • Normalization:
    Each metric was normalized using a z-score transformation (mean and standard deviation calculated on the training split). This ensures all models focused on learning actual usage patterns, not just reacting to scale differences between, say, CPU and latency.

  • Sliding Window Split:
    The full dataset for each metric was split into training (70%), validation (15%), and test (15%) windows using a sliding window approach. This setup reflects how forecasting models are actually deployed, predicting the next segment of data given a history of prior values.

  • Multivariate Structuring:
    CPU, memory, and latency metrics were bundled into multivariate records for each pod. Every model was tasked with jointly forecasting all three metrics at each timestep, testing their ability to handle correlated signals, a weak point for traditional “single-metric” forecasters.
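
As referenced above, here is a minimal pandas sketch of the pipeline, assuming a 1 Hz DataFrame of pod metrics indexed by timestamp; the column names, forward-fill gap limit, and window sizes are illustrative rather than our exact configuration:

```python
import pandas as pd

def preprocess_pod_metrics(df: pd.DataFrame, train_frac=0.70, val_frac=0.15):
    """df: 1 Hz pod metrics indexed by timestamp, e.g. columns ['cpu', 'memory', 'latency']."""
    # 1. Downsample one-second samples to one-minute averages
    df = df.resample("1min").mean()

    # 2. Forward-fill short gaps; longer outages stay NaN so they can be masked at evaluation time
    df = df.ffill(limit=5)

    # 3. Chronological 70 / 15 / 15 split
    n = len(df)
    i_train, i_val = int(n * train_frac), int(n * (train_frac + val_frac))
    train, val, test = df.iloc[:i_train], df.iloc[i_train:i_val], df.iloc[i_val:]

    # 4. Z-score normalization using statistics from the training split only
    mean, std = train.mean(), train.std()
    return (train - mean) / std, (val - mean) / std, (test - mean) / std

def sliding_windows(values, context_len=512, horizon=64):
    """Yield (history, target) pairs the way the forecasting models consume them."""
    for start in range(0, len(values) - context_len - horizon + 1, horizon):
        yield (values[start : start + context_len],
               values[start + context_len : start + context_len + horizon])
```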

Ensuring Fair Evaluation

All pre-processing, normalization, and data splits were defined in config files and applied identically across every model. This strict protocol ensured fairness and reproducibility—no hand-tuning for individual models, no hidden data leaks. Our goal was to see how these models really perform on real production telemetry, not just sanitized academic datasets.

Results and Observations

| Model | License | Uni/Multi-variate | Size | Granularity | Input Length | Output Length | MAPE | MAE | HuggingFace Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| amazon chronos bolt | Apache 2.0 | Multivariate | 205M | 1m | 512 | 64 | 0.046 | 0.04395 | chronos-bolt-base |
| | | | | 1h | 220 | 64 | 1.79 | 1.72385 | |
| | | | | 1d | 10 | 3 | 2.697 | 2.6576 | |
| google timesfm | Apache 2.0 | Univariate | 500M | 1m | 512 | 64 | 0.108 | 0.09553 | timesfm-2.0-500m |
| | | | | 1h | 128 | 24 | 0.534 | 0.51253 | |
| | | | | 1d | – | – | – | – | |
| lag-llama | Apache 2.0 | Univariate | 2.45M | 1m | 512 | 64 | 0.537 | 0.47321 | Lag-Llama |
| | | | | 1h | 220 | 24 | 9.983 | 9.5478 | |
| | | | | 1d | – | – | – | – | |
| ibm-ttm | Apache 2.0 | Multivariate | 805K | 1m | 512 | 96 | 1.121 | 1.00742 | granite-timeseries-ttm-r2 |
| | | | | 1h | 180 | 60 | 2.592 | 2.54402 | |
| | | | | 1d | – | – | – | – | |
| datadog toto | Apache 2.0 | Multivariate | 151M | 1m | 512 | 64 | 0.006 | 0.00646 | Toto-Open-Base-1.0 |
| | | | | 1h | 220 | 24 | 3.866 | 3.69394 | |
| | | | | 1d | 10 | 3 | 0.541 | 0.52186 | |

After running our battery of models across the complex multivariate pod-metrics scenario, a few patterns, and some honest surprises, emerged. Our analysis zoomed in not just on headline accuracy, but on where each model type shines, where it falls flat, and what it all means for anyone looking to do this in production.

1. Multivariate Pod-Metrics: Foundation Models in the Trenches

For the hard stuff, jointly forecasting CPU, memory, and latency from real production Kubernetes workloads, the foundation models generally came out ahead.

  • Datadog Toto emerged as the top performer among foundation models. It often matched or outperformed classic baselines like Vector-ARIMA, especially on datasets where metric relationships were stable over time. Toto’s ability to handle multiple, correlated inputs with minimal tuning was a real advantage in high-variety, high-noise environments.

  • Amazon Chronos and IBM TTM posted solid results as well. Chronos worked well across diverse pods, while IBM’s “Tiny Time-Mixer” was especially notable for its efficiency, delivering decent accuracy with minimal compute, making it a great fit for edge or cost-sensitive scenarios.

  • Classic Vector-ARIMA stayed surprisingly competitive, especially for “steady-state” workloads where metric relationships didn’t shift much. In these situations, its simplicity, speed, and predictable performance kept it firmly in the running.

2. Robustness and Real-World Behavior

  • Zero-shot generalization: Even the best foundation models sometimes stumbled on data patterns far outside their training (think: sudden config changes, outages, or highly non-stationary behavior). When that happened, a freshly retrained ARIMA still sometimes pulled ahead, though at the expense of more manual work.

  • Inference latency: Toto and Chronos required a brief “warm-up” before settling into fast predictions. By contrast, classical approaches like Vector-ARIMA and Prophet offered sub-second responses from the jump, useful for latency-critical monitoring loops.

  • Licensing: Open-licensed models (Apache 2.0, MIT) were a no-brainer for production. Anything with a research-only or restrictive license (Moirai, etc.) was dropped from consideration.

3. Qualitative Patterns

  • Handling outliers: Foundation models generally absorbed short-lived outliers better, producing less erratic forecasts when recent history got spiky. In contrast, classical models sometimes “overreacted” to recent volatility.

  • Learning new regimes: No model, foundation or classical, nailed true first-of-its-kind events. But with a bit of fine-tuning, foundation models did recover more gracefully as new patterns emerged.

  • Resource efficiency: IBM’s TTM (“Tiny Time-Mixer”) especially stood out for its low hardware requirements, offering a pragmatic trade-off between accuracy and footprint.

4. When Do Foundation Models Win?

  • If you have a fleet of fast-changing, multivariate streams, and want to avoid constantly retraining classical models, foundation models like Toto or Chronos offer serious operational wins.

  • For predictable, steady-state workloads, classical models still shine for simplicity, speed, and cost.

  • For high-noise, high-variance environments (think retail, real-world infrastructure), foundation models generalize better, but need careful monitoring as workloads evolve.

Conclusion

Foundation models have earned their place in the time-series forecasting toolbox. While they’re not a universal fix, their ability to deliver strong out-of-the-box performance, handle data variety, and reduce operational overhead is a genuine step forward, especially for modern observability and platform engineering teams juggling countless data streams. Classical models still matter, especially for narrow, stable use cases or when resources are tight. But for teams who need flexibility, scale, and less manual tuning, foundation models are rapidly becoming the new default.

We’re excited to see how the landscape evolves, and even more excited to keep building alongside the community. Whether you’re a foundation model skeptic or an enthusiast, now’s the time to experiment, benchmark, and share what works and what doesn’t.