
Time-series foundation models for infrastructure telemetry (TimesFM)

Scope: TimesFM (Google Research), the decoder-only time-series foundation model from arXiv 2310.10688, as the reference architecture for zero-shot forecasting, and what it offers a GPU platform team: forecasting cluster telemetry such as utilization, power, and queue depth without building a per-metric training pipeline. This page covers the patched-decoder architecture and its design ablations, the pretraining corpus, zero-shot accuracy against supervised baselines, the TimesFM 2.5 release, and the operational caveats of trusting zero-shot forecasts for capacity planning and alerting decisions. It complements the metric-collection side covered in the telemetry, monitoring and alerting page and the curve-fitting flavor of forecasting covered in the learning-curve extrapolation page.

The timesfm API snippets are reference templates (checked against the repo README as of 2026-07, unexecuted here; pin the version). The numpy example is executed and asserted. Architecture and benchmark numbers come from the paper; release facts come from the google-research/timesfm README.

flowchart LR
  Y["Context y(1:L), up to 512 points at pretraining"] --> P["Split into input patches of 32"]
  P --> RB["Residual block per patch, to model_dim 1280"]
  RB --> PE["Add positional encoding"]
  PE --> TX["20 causal transformer layers, 16 heads"]
  TX --> ORB["Output residual block"]
  ORB --> OUT["Each token predicts the NEXT 128 points (output patch)"]
  OUT -->|"horizon > 128"| AR["Append forecast to context, decode again"]
  AR --> P

What it is

TimesFM is a 200M-parameter, decoder-only transformer pretrained on a large time-series corpus so that a single checkpoint forecasts previously unseen series zero-shot, across domains, context lengths, horizons, and time granularities.¹ The claim that made it a landmark: on standard public benchmarks its out-of-the-box accuracy comes close to, and sometimes beats, supervised models trained per dataset.

Three design choices define the architecture:

Input patching. The context is broken into contiguous non-overlapping patches of input_patch_len = 32 points; each patch is mapped by a residual block (an MLP with one hidden layer and a skip connection) to a model_dim = 1280 token, plus positional encoding. Patching is the time-series analogue of tokenization and cuts transformer sequence length by the patch factor.
Decoder-only training. Given a sequence of patch tokens, the model predicts the next patch from all past patches with causal attention, in parallel over the whole window, exactly as an LLM trains. This is what lets one checkpoint handle any context length at inference: a random masking strategy (mask the first r points of the first patch, with r drawn at random from 0 to p-1, where p is the input patch length) exposes the model to every context length from 1 to the 512-point maximum during training.
Longer output patches. Each output token predicts the next output_patch_len = 128 points, four times the input patch. One-shot horizon prediction beats step-by-step autoregression on long horizons, but a foundation model cannot fix the horizon in advance; a long output patch is the middle ground. Forecasting 256 points from a 256-point context takes 2 generations with a 128-point output patch versus 8 if the output patch matched the input patch at 32.
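
The masking rule in the decoder-only item can be checked by enumeration. This is a minimal sketch, assuming the 32-point input patch and 512-point maximum context stated above; the function names are illustrative and not part of any TimesFM API. It masks the first r points of the first patch and confirms that, over all r and all patch positions, every visible-history length from 1 to 512 occurs.

```python
import numpy as np

INPUT_PATCH = 32    # input_patch_len, from the text
MAX_CONTEXT = 512   # maximum pretraining context, from the text

def sample_first_patch_mask(rng: np.random.Generator, p: int = INPUT_PATCH) -> np.ndarray:
    """Point mask for the first patch: 1 = masked, 0 = visible."""
    r = int(rng.integers(0, p))          # r drawn uniformly from {0, ..., p-1}
    mask = np.zeros(p, dtype=int)
    mask[:r] = 1                         # hide the first r points
    return mask

def visible_context_lengths(p: int = INPUT_PATCH, max_context: int = MAX_CONTEXT) -> set:
    """Visible-history lengths seen by some output token, over all r and patch positions j."""
    n_patches = max_context // p
    return {j * p - r for j in range(1, n_patches + 1) for r in range(p)}

first_patch_mask = sample_first_patch_mask(np.random.default_rng(0))
lengths = visible_context_lengths()
print(len(lengths), min(lengths), max(lengths))   # prints: 512 1 512
```

The token after patch j sees 32j - r real points, so sweeping r over 0 to 31 fills the gap between consecutive multiples of 32.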

The pretraining corpus is about 100B timepoints: Wikipedia pageviews (which dominate the pool at roughly 100B raw timepoints across granularities), Google Trends (about 1B), 3M synthetic series of 2,048 points each (ARMA processes, seasonal mixtures, trends, step functions), plus M4, Electricity, Traffic, Weather, and LibCity. The training mix samples 40% real and 60% synthetic, with equal weight to hourly-and-finer, daily, weekly, and monthly groups.¹

Why use it

Zero-shot accuracy near supervised baselines. On the Monash archive (18 datasets), TimesFM's scaled MAE (mean across datasets, each normalized by a naive last-value forecaster) is among the top-3 models, ahead of supervised DeepAR and more than 10% better than llmtime's GPT-3 prompting. On ETT long-horizon tasks (horizons 96 and 192, context 512) it is the top model, with supervised PatchTST the only baseline within statistical significance.¹
No per-metric training pipeline. A cluster exports thousands of telemetry series; maintaining a fitted ARIMA or a trained DeepAR per series is an operational tax. A zero-shot forecaster is one stateless inference service instead (application discussion, not a paper claim).
Small and cheap relative to LLM forecasting. 200M parameters against the billions of GPT-3-class prompting approaches (llmtime), with better accuracy; the paper positions purpose-built time-series pretraining as strictly more efficient than repurposing an LLM.¹
The design generalizes. The ablations are the transferable knowledge: error decreases monotonically with the number of parameters (17M to 200M models), longer output patches monotonically improve 512-step ETT forecasts (tested 32 vs 128), input patch 32 beats 8 on accuracy, and removing synthetic data hurts Monash.²

When to use it (and when not)

Use it for many heterogeneous series with shared infrastructure and no labeled history per series: telemetry forecasting, capacity dashboards, anomaly baselines, what-if projections where "good, instant, no training" beats "best possible, weeks of tuning".
Use fine-tuning (the repo ships a LoRA example via Hugging Face Transformers + PEFT) when a domain has enough history and the zero-shot gap matters; the paper's ETT fine-tuning study shows further gains over zero-shot.
Do not use it where covariates carry the signal (price changes, deployments, planned maintenance): TimesFM pretrains without covariates; the 2.5 release adds XReg-style covariate support at inference, but a supervised model with rich features can still win.
Do not use point forecasts alone for SLO decisions. The pretrained objective is MSE point forecasting; quantiles come from an optional head (2.5) and deserve their own validation before feeding burn-rate alerts.
Be skeptical below 10-minute granularity. The corpus is hourly and coarser plus some 10-15 minute data; second-scale scrape intervals are out of distribution.

Architecture

The forward path: split the context into patches of 32, zero out masked entries, map each patch through the input residual block, add positional encoding, run 20 transformer layers of multi-head causal self-attention (16 heads, FFN hidden size equal to model_dim 1280), then map each output token through the output residual block to a 128-point forecast of the window that follows that token's patch.¹ Training minimizes the average MSE of every next-128-point prediction across the window. A binary padding mask rides along the whole way: a patch is attended to only if at least one of its points is real, which is also how contexts that do not divide by 32 are handled (pad to a multiple, mark the padding).
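
The padding rule in the paragraph above can be sketched as a patch-level attention mask. This is a minimal numpy illustration under the stated convention (mask value 1 marks padding); the helper names and the fixed 96-point window are hypothetical and not taken from the TimesFM code.

```python
import numpy as np

def patch_is_padding(point_mask: np.ndarray, p: int) -> np.ndarray:
    """One flag per patch; True only if every point in the patch is padding (mask == 1)."""
    return point_mask.reshape(-1, p).min(axis=1) == 1

def attention_allowed(point_mask: np.ndarray, p: int) -> np.ndarray:
    """Boolean [N, N] matrix; entry (q, k) is True if query patch q may attend to key patch k."""
    padded = patch_is_padding(point_mask, p)
    n = padded.size
    causal = np.tril(np.ones((n, n), dtype=bool))   # key index <= query index
    return causal & ~padded[None, :]                # drop keys that are entirely padding

P = 32
# 70 real points left-padded to 96: patch 0 holds 26 padding points and 6 real ones.
partial = np.concatenate([np.ones(26), np.zeros(70)])
# 40 real points in a fixed 96-point window: patch 0 is entirely padding.
full_pad = np.concatenate([np.ones(56), np.zeros(40)])

allowed_partial = attention_allowed(partial, P)
allowed_full_pad = attention_allowed(full_pad, P)
```

In the first case every patch stays visible because patch 0 still contains real points; in the second, the all-padding patch is removed as a key for every query.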

Normalization is deliberately minimal: each context is scaled by the mean and standard deviation of its first input patch (the standard-normalization half of reversible instance normalization), so a 40-core CPU-utilization series and a megawatt power series land in the same numeric range without dataset statistics.
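
A minimal sketch of that first-patch scaling, assuming an unmasked first patch and a hypothetical guard for a flat first patch (the guard and the function names are illustrative, not the model's code):

```python
import numpy as np

def normalize_by_first_patch(y: np.ndarray, p: int = 32, eps: float = 1e-6):
    """Scale a context by the mean and standard deviation of its first p points."""
    first = y[:p]
    mu = float(first.mean())
    sigma = float(first.std())
    if sigma < eps:              # assumption: avoid dividing by ~0 on a flat first patch
        sigma = 1.0
    return (y - mu) / sigma, mu, sigma

def denormalize(z: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Map model-space values back to the original unit."""
    return z * sigma + mu

rng = np.random.default_rng(0)
cpu_util = rng.uniform(0.0, 40.0, size=256)          # cores in use, synthetic
power_w = 1.2e6 + 5e4 * rng.normal(size=256)         # facility power in watts, synthetic

cpu_z, cpu_mu, cpu_sigma = normalize_by_first_patch(cpu_util)
pow_z, pow_mu, pow_sigma = normalize_by_first_patch(power_w)
flat_z, flat_mu, flat_sigma = normalize_by_first_patch(np.full(64, 7.0))
```

After scaling, both series have a first patch with mean near 0 and standard deviation near 1, and `denormalize` recovers the original values.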

At inference the model decodes autoregressively in output-patch steps: forecast 128 points, append them to the context, forecast the next 128. The executed example below validates the two mechanics that matter operationally, the decode-step arithmetic and the mask-based lossless patching, plus the scaled-MAE metric used in the Monash evaluation:

# timesfm_mechanics.py - validated: patching arithmetic and the scaled-MAE metric. numpy only.
import math
import numpy as np

INPUT_PATCH: int = 32    # TimesFM 200M input_patch_len (paper, Appendix A.3)
OUTPUT_PATCH: int = 128  # TimesFM 200M output_patch_len

def decode_steps(horizon: int, output_patch: int) -> int:
    """Autoregressive generations needed to cover a horizon."""
    assert horizon > 0 and output_patch > 0
    return math.ceil(horizon / output_patch)

def patch(y: np.ndarray, p: int) -> tuple[np.ndarray, np.ndarray]:
    """Non-overlapping patches, left-padded to a multiple of p.
    Returns (patches [N, p], mask [N, p]); mask 1 marks padding to ignore,
    the paper's convention."""
    pad = (-len(y)) % p
    yp = np.concatenate([np.zeros(pad), y])
    m = np.concatenate([np.ones(pad), np.zeros(len(y))])
    return yp.reshape(-1, p), m.reshape(-1, p)

def unpatch(patches: np.ndarray, mask: np.ndarray) -> np.ndarray:
    flat, mflat = patches.reshape(-1), mask.reshape(-1)
    return flat[mflat == 0]

def mae(y: np.ndarray, yhat: np.ndarray) -> float:
    """Paper Eq. 6: mean absolute error over the horizon."""
    assert y.shape == yhat.shape
    return float(np.abs(y - yhat).mean())

def scaled_mae(y: np.ndarray, yhat: np.ndarray, last: float) -> float:
    """MAE scaled by the naive last-value forecaster, as in the Monash evaluation."""
    naive = mae(y, np.full_like(y, last))
    if naive == 0.0:
        raise ValueError("naive MAE is zero: scaled MAE undefined on a constant horizon")
    return mae(y, yhat) / naive

# 1) The paper's decode example: context 256, forecast 256 ahead.
assert decode_steps(256, OUTPUT_PATCH) == 2   # output patch 128: 2 generations
assert decode_steps(256, INPUT_PATCH) == 8    # output patch pinned to 32: 8 generations
assert decode_steps(512, OUTPUT_PATCH) == 4
assert decode_steps(1, OUTPUT_PATCH) == 1     # boundary: a 1-step horizon still costs one pass

# 2) Patch round-trip on divisible and non-divisible lengths.
rng = np.random.default_rng(0)
for L in (512, 100, 33, 1):                   # 512 divides by 32; 100, 33, 1 do not
    y = rng.normal(size=L)
    P, M = patch(y, INPUT_PATCH)
    assert P.shape == (math.ceil(L / INPUT_PATCH), INPUT_PATCH)
    np.testing.assert_array_equal(unpatch(P, M), y)   # lossless reconstruction

# 3) Metric: a perfect forecast scores 0; the naive baseline scores exactly 1.
y = rng.normal(size=96).cumsum() + 10.0
assert scaled_mae(y, y.copy(), last=5.0) == 0.0
assert scaled_mae(y, np.full_like(y, 5.0), last=5.0) == 1.0
try:                                          # adversarial: constant horizon breaks the scale
    scaled_mae(np.full(96, 7.0), np.zeros(96), last=7.0)
    raise AssertionError("degenerate naive baseline must be rejected")
except ValueError:
    pass
print("all TimesFM mechanics assertions passed")

Output: all TimesFM mechanics assertions passed. This validates the mechanics around the model, not the model itself.
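
The append-and-decode loop itself can be sketched with a stand-in for the model. Here `toy_step` is a hypothetical placeholder (a seasonal-naive repeat of the last 24 points), not TimesFM; only the loop structure reflects the inference procedure described above.

```python
import numpy as np

OUTPUT_PATCH = 128   # output_patch_len, from the text

def toy_step(context: np.ndarray, output_patch: int = OUTPUT_PATCH, season: int = 24) -> np.ndarray:
    """Hypothetical stand-in for one model pass: repeat the last `season` points."""
    last_cycle = context[-season:]
    reps = -(-output_patch // season)            # ceiling division
    return np.tile(last_cycle, reps)[:output_patch]

def decode(context, horizon: int, step_fn=toy_step, output_patch: int = OUTPUT_PATCH):
    """Forecast in output-patch steps, appending each forecast to the context."""
    assert horizon > 0
    ctx = np.asarray(context, dtype=float)
    pieces, produced, generations = [], 0, 0
    while produced < horizon:
        nxt = step_fn(ctx, output_patch)
        pieces.append(nxt)
        ctx = np.concatenate([ctx, nxt])         # the forecast becomes part of the context
        produced += len(nxt)
        generations += 1
    return np.concatenate(pieces)[:horizon], generations

t = np.arange(240)
hourly = 50.0 + 20.0 * np.sin(2 * np.pi * t / 24)    # synthetic hourly series, daily cycle
forecast, generations = decode(hourly, horizon=672)  # four weeks of hourly points
print(forecast.shape, generations)   # prints: (672,) 6
```

Every pass after the first conditions on earlier forecasts rather than observations, which is where long-horizon error accumulation comes from.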

How to use it

Reference template for the current 2.5 release (from the repo README; pin the version):

# Reference template (needs timesfm[torch]; unexecuted here). pip install timesfm[torch]
import numpy as np
import timesfm

model = timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch")
model.compile(timesfm.ForecastConfig(
    max_context=1024, max_horizon=256,
    normalize_inputs=True, use_continuous_quantile_head=True,
    force_flip_invariance=True, infer_is_positive=True, fix_quantile_crossing=True,
))
point, quantiles = model.forecast(horizon=12, inputs=[np.linspace(0, 1, 100)])
# point: (1, 12); quantiles: (1, 12, 10) = mean, then 10th to 90th percentiles

TimesFM 2.5 (September 2025) moved to 200M parameters (down from 2.0's 500M), extended context to 16k points, dropped the frequency indicator that 1.0/2.0 required, and added an optional 30M-parameter continuous-quantile head good to 1k horizons. Covariate support (XReg) returned in October 2025. Older checkpoints load via pip install timesfm==1.3.0. Verify the current API on the repo; the project updates frequently (as of 2026-07 the PyPI package is timesfm 2.0.2).
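
Reading the quantile output is mostly index bookkeeping. The sketch below assumes the layout noted in the template comment (last axis of size 10: mean first, then the 10th to 90th percentiles) and uses a synthetic array in place of real model output; `quantile_column` is a hypothetical helper, not a timesfm function.

```python
import numpy as np

def quantile_column(quantiles: np.ndarray, percentile: int) -> np.ndarray:
    """Select one percentile from a (batch, horizon, 10) array laid out as
    [mean, p10, p20, ..., p90] on the last axis (assumed layout)."""
    if percentile % 10 != 0 or not 10 <= percentile <= 90:
        raise ValueError("expected one of 10, 20, ..., 90")
    return quantiles[..., percentile // 10]

# Synthetic stand-in for model output: sorted deciles plus a mean column in front.
rng = np.random.default_rng(0)
deciles = np.sort(rng.normal(loc=100.0, scale=5.0, size=(1, 12, 9)), axis=-1)
mean = deciles.mean(axis=-1, keepdims=True)
quantiles = np.concatenate([mean, deciles], axis=-1)   # shape (1, 12, 10)

p10 = quantile_column(quantiles, 10)
p50 = quantile_column(quantiles, 50)
p90 = quantile_column(quantiles, 90)
headroom = p90 - p50    # upper margin over the median at each horizon step
```

Confirm the column order against the installed version before relying on it; an off-by-one here silently swaps the mean for a quantile.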

How to develop with it

Feeding telemetry (application discussion). Export series from Prometheus or DCGM at a fixed step, forward-fill small gaps, and keep each series' unit consistent; the model normalizes per context, so mixed units across series are fine but unit changes within one series are a regime break.
Fine-tuning. The repo ships a LoRA fine-tuning example (Hugging Face Transformers + PEFT, added April 2026). Fine-tune when a metric family has months of history and systematic zero-shot bias; the paper's ETT study shows fine-tuning beats both zero-shot and per-dataset baselines.
Covariates. Two patterns from the paper: regress residuals on covariates in-context at inference, or add covariates as inputs to the residual blocks during fine-tuning. The 2.5 XReg path implements the former family.
Backtesting is non-negotiable. Before any operational use, run rolling-origin backtests per metric family against the naive last-value and seasonal-naive baselines using the scaled MAE above; a zero-shot model that cannot beat naive on a family should not forecast that family.
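
A rolling-origin backtest can be sketched in a few lines of numpy. In this illustration the candidate forecaster is a seasonal-naive stand-in rather than TimesFM, the series is synthetic, and the window sizes are arbitrary assumptions; in practice the `forecaster` callable would wrap the model.

```python
import numpy as np

def mae(y: np.ndarray, yhat: np.ndarray) -> float:
    return float(np.abs(y - yhat).mean())

def naive_last(context: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the last observed value."""
    return np.full(horizon, context[-1])

def seasonal_naive(context: np.ndarray, horizon: int, season: int = 24) -> np.ndarray:
    """Repeat the last full season."""
    reps = -(-horizon // season)                 # ceiling division
    return np.tile(context[-season:], reps)[:horizon]

def rolling_origin(y, forecaster, context_len: int, horizon: int, stride: int) -> np.ndarray:
    """Scaled MAE (forecaster MAE / last-value MAE) at each forecast origin."""
    scores = []
    for origin in range(context_len, len(y) - horizon + 1, stride):
        ctx = y[origin - context_len:origin]
        actual = y[origin:origin + horizon]
        base = mae(actual, naive_last(ctx, horizon))
        if base == 0.0:                          # constant horizon: ratio undefined, skip
            continue
        scores.append(mae(actual, forecaster(ctx, horizon)) / base)
    return np.array(scores)

rng = np.random.default_rng(0)
t = np.arange(24 * 14)                           # two weeks of hourly points
y = 50.0 + 20.0 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=2.0, size=t.size)

scores = rolling_origin(y, seasonal_naive, context_len=168, horizon=24, stride=24)
beats_naive = bool(scores.mean() < 1.0)          # the gate suggested in the text
```

Reporting the per-origin distribution, not only the mean, makes it easier to see whether a family fails consistently or only around specific events.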

How to maintain it

Pin checkpoint and package versions. Forecast behavior is a model artifact; a silent checkpoint bump changes every downstream threshold. Treat the checkpoint hash like a driver version, and re-run backtests on upgrade (the 2.0 to 2.5 jump changed parameter count, context limit, and API).
Monitor forecast error as a first-class metric. Log MAE-vs-naive per series family daily; drift in that ratio is the earliest sign the workload's statistics moved away from what zero-shot handles.
Re-evaluate after platform changes. Scheduler policy changes, new tenants, and driver upgrades all shift telemetry regimes; the model does not know a change point happened unless the context window shows it.
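
One way to watch the MAE-vs-naive ratio for drift is to compare a recent window against a trailing reference window. The window lengths and the 1.25 tolerance below are placeholder assumptions to tune per metric family, and `ratio_drift` is a hypothetical helper.

```python
import numpy as np

def ratio_drift(ratios, reference_days: int = 28, recent_days: int = 7, tolerance: float = 1.25):
    """Compare the recent median MAE-vs-naive ratio with a trailing reference median.
    Returns (drifted, reference_median, recent_median)."""
    r = np.asarray(ratios, dtype=float)
    if r.size < reference_days + recent_days:
        raise ValueError("not enough daily ratios for the chosen windows")
    reference = float(np.median(r[-(reference_days + recent_days):-recent_days]))
    recent = float(np.median(r[-recent_days:]))
    return recent > tolerance * reference, reference, recent

stable = [0.60] * 35                      # ratio unchanged for five weeks
shifted = [0.60] * 28 + [0.90] * 7        # last week noticeably worse
print(ratio_drift(stable)[0], ratio_drift(shifted)[0])   # prints: False True
```

Medians keep a single bad day from tripping the flag; a sustained shift still shows up within the recent window.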

How to run it in production

Application discussion for a GPU platform (not paper claims): the natural deployment is a small stateless inference service (CPU or one small GPU serves a 200M model comfortably) that batch-forecasts telemetry on a schedule.

Capacity planning. Feed weekly and daily aggregates of GPU-hours consumed, queue wait times, and per-tenant demand into horizon forecasts that gate capacity-add decisions; quantile forecasts (2.5 head) give the safety margin, point forecasts alone do not.
Alert thresholding. Forecast-based dynamic baselines catch "normal for Tuesday 3am" anomalies that static thresholds miss; wire deviations into the burn-rate framework rather than paging on raw deviation.
Power and thermal. Facility power draw is strongly seasonal and trend-driven, a good fit for zero-shot forecasting; forecasted peaks inform power and thermal tuning and maintenance scheduling.
Keep a human on threshold changes. A forecast service can recompute thresholds; letting it silently rewrite paging rules couples an unvalidated model to the incident process.
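
A forecast-based dynamic baseline can be reduced to a band check. This sketch assumes a lower and upper band per step (for example the 10th and 90th percentile forecasts) and flags only sustained excursions; the run length of 3 and the function name are illustrative assumptions, and the result is meant as input to review or burn-rate logic rather than a direct page.

```python
import numpy as np

def sustained_breach(observed, lower, upper, min_run: int = 3):
    """Index where the first run of at least `min_run` consecutive out-of-band
    points starts, or None if there is no such run."""
    observed = np.asarray(observed, dtype=float)
    outside = (observed < np.asarray(lower)) | (observed > np.asarray(upper))
    run = 0
    for i, flag in enumerate(outside):
        run = run + 1 if flag else 0
        if run >= min_run:
            return i - min_run + 1
    return None

observed = [10.0, 11.0, 30.0, 31.0, 32.0, 12.0]   # synthetic queue depth
lower = np.full(6, 8.0)                            # e.g. 10th percentile forecast
upper = np.full(6, 20.0)                           # e.g. 90th percentile forecast

start = sustained_breach(observed, lower, upper)               # run begins at index 2
no_alert = sustained_breach(observed, lower, upper, min_run=4)
```

Requiring a run rather than a single point trades detection latency for fewer spurious flags.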

Failure modes

Regime changes. A new tenant, a rescheduled batch window, or a driver upgrade breaks the stationarity the context implies; the model extrapolates the old regime until the context fills with the new one. Detect change points independently; do not rely on the forecaster to notice.
Constant or near-constant series. The naive baseline's MAE goes to zero and scaled metrics blow up (see the executed guard above); flat series also carry no signal for the model. Filter them out before batch evaluation.
Granularity mismatch. Pretraining granularities are hourly and coarser plus 10-15 minute data; 1-second DCGM scrapes are out of distribution. Aggregate to a supported granularity before forecasting.
Horizon overreach. Autoregressive decode compounds errors every 128 points; a 4-week hourly forecast (672 points, 6 generations) is a different reliability class from a 1-day one. Backtest at the horizon actually used.
Silent unit or counter resets. Prometheus counter resets and unit changes look like level shifts; per-context normalization hides them from casual inspection while still corrupting the forecast. Sanitize with rate() and unit checks upstream.
Treating zero-shot as calibrated. Point MSE training says nothing about tail calibration; using raw quantile-head output for SLO burn decisions without per-family calibration checks is how forecast services page people at 3am.
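
Two of the guards above, filtering flat series and aggregating fine-grained scrapes, are easy to sketch as pre-flight checks. The relative tolerance and the helper names are assumptions for illustration; the 600-second bucket matches the 10-minute granularity mentioned in the text.

```python
import numpy as np

def is_near_constant(y, rel_tol: float = 1e-3) -> bool:
    """Heuristic flat-series check: value range is tiny relative to the mean level."""
    y = np.asarray(y, dtype=float)
    spread = float(y.max() - y.min())
    scale = max(abs(float(y.mean())), 1e-12)
    return spread <= rel_tol * scale

def downsample_mean(y, factor: int) -> np.ndarray:
    """Average consecutive buckets of `factor` points; a trailing partial bucket is dropped."""
    y = np.asarray(y, dtype=float)
    n = (len(y) // factor) * factor
    if n == 0:
        raise ValueError("series shorter than one bucket")
    return y[:n].reshape(-1, factor).mean(axis=1)

rng = np.random.default_rng(0)
per_second = rng.uniform(0.0, 100.0, size=3600)    # one hour of 1-second samples, synthetic
ten_minute = downsample_mean(per_second, 600)      # six 10-minute means

flat = np.full(500, 80.0)
skip_flat = is_near_constant(flat)                 # True: exclude from batch evaluation
keep_noisy = not is_near_constant(per_second)      # True: series carries signal
```

Mean aggregation suits gauges such as utilization; counters would need a rate conversion first, as noted in the reset item above.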

References

Das, Kong, Sen, Zhou (Google Research), A decoder-only foundation model for time-series forecasting (arXiv 2310.10688): https://arxiv.org/abs/2310.10688
ICML 2024 proceedings entry: https://proceedings.mlr.press/v235/das24c.html
Hugging Face paper page: https://huggingface.co/papers/2310.10688
google-research/timesfm (code, checkpoints, release notes): https://github.com/google-research/timesfm
TimesFM 2.5 PyTorch checkpoint: https://huggingface.co/google/timesfm-2.5-200m-pytorch
Google Research blog announcement: https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/
Nie et al., PatchTST: A Time Series is Worth 64 Words (arXiv 2211.14730): https://arxiv.org/abs/2211.14730
Gruver et al., Large Language Models Are Zero-Shot Time Series Forecasters (llmtime, arXiv 2310.07820): https://arxiv.org/abs/2310.07820

¹ TimesFM (arXiv 2310.10688): 200M parameters, 20 layers, 16 heads, model_dim 1280 (FFN hidden equal), input_patch_len 32, output_patch_len 128; residual-block input/output layers; causal attention; MSE next-output-patch loss; random first-patch masking for all context lengths up to 512; corpus of about 100B timepoints (Wiki Pageviews, Google Trends, 3M synthetic series, M4, Electricity, Traffic, Weather, LibCity) mixed 40/60 real/synthetic; zero-shot among the top-3 on Monash (scaled MAE), the top model on ETT horizons 96/192, and within significance of the best on Darts.
² Paper Section 6.2: scaled-MAE error falls monotonically with the number of parameters across 17M/70M/200M models; MAE falls monotonically as output_patch_len grows 32 vs 128 on 512-step ETT; input_patch_len 32 beats 8; removing synthetic data degrades Monash.