KubeRay: Ray on Kubernetes¶
Scope: running Ray on Kubernetes with the KubeRay operator, covering install, the RayCluster / RayJob / RayService CRDs, exposing GPUs and RDMA to Ray pods, autoscaling, and failure modes / maintenance.
The manifests below are reference templates written against real APIs; pin chart and image versions and validate on your hardware before production use.
What it is¶
KubeRay (ray-project/kuberay) is the Kubernetes operator that runs Ray clusters as native Kubernetes objects. It watches three Custom Resource Definitions and reconciles them into pods:
- RayCluster is a long-lived cluster: one head pod (GCS, scheduler, dashboard) plus one or more workerGroupSpecs, each a homogeneous pool of worker pods. This is the unit of GPU/RDMA exposure.
- RayJob runs to completion. It creates (or attaches to) a RayCluster, submits a job, and optionally tears the cluster down (shutdownAfterJobFinishes). This is the natural fit for distributed training and RL post-training.
- RayService is HA serving. It wraps a RayCluster plus a Ray Serve app config, and does zero-downtime upgrades by standing up a new cluster and shifting traffic. It backs OSS model serving (serving OSS models, inference serving).
As of mid-2026 the stable line is KubeRay v1.6.2 (operator and CRDs); the ray-cluster Helm chart tracks the same minor. Pin the exact tag; CRD schemas change across minors. Verify on the KubeRay releases page.
Why it matters¶
- One scheduler, not two. Bare Ray on a Kubernetes node duplicates scheduling and bypasses cluster quota. KubeRay makes Ray pods first-class, so the GPU Operator, Kueue quota / Volcano, and node taints all govern Ray the same as any workload (orchestration overview).
- Declarative lifecycle. RayCluster/RayJob/RayService are reconciled objects: GitOps applies them, the operator heals pod loss, RayService gives HA serving without bespoke glue.
- Autoscaling that reads Ray demand. The in-tree autoscaler scales worker pods on pending Ray tasks/actors, not on CPU%, so a pending placement group that needs 8 GPUs adds enough GPU worker pods to cover it (one 8-GPU pod, or eight 1-GPU pods).
- GPU + RDMA reuse the pod contract. GPUs via nvidia.com/gpu, RDMA via the network operator's resource, the same limits any GPU pod uses (Kubernetes for GPUs, network operator).
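The demand-driven scaling described in the list above can be illustrated with a little arithmetic. The following is a minimal sketch, not the autoscaler's actual algorithm: the function name `worker_pods_for_demand` is hypothetical, it assumes a single homogeneous GPU worker group, and it ignores free capacity on running pods and the bin-packing of placement-group bundles.

```python
import math


def worker_pods_for_demand(pending_gpus: int, running_pods: int,
                           gpus_per_pod: int, min_replicas: int,
                           max_replicas: int) -> int:
    """Hypothetical helper: target replica count for one GPU worker group.

    Simplified model (an assumption, not KubeRay's implementation): every
    pending GPU needs new capacity, and pods are added in whole units.
    """
    if gpus_per_pod <= 0:
        raise ValueError("gpus_per_pod must be positive")
    # Whole pods needed to cover the pending GPU demand.
    extra_pods = math.ceil(pending_gpus / gpus_per_pod)
    target = running_pods + extra_pods
    # The replica count stays inside the [minReplicas, maxReplicas] window.
    return max(min_replicas, min(max_replicas, target))
```

Under these assumptions, a pending request for 8 GPUs yields one extra pod when each worker holds 8 GPUs and eight extra pods when each worker holds 1 GPU; demand beyond maxReplicas stays pending.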
When it is needed (and when not)¶
flowchart LR
Q["Workload on K8s\nneeds Ray?"] -->|"no distributed-Python"| K8S["Plain K8s Deployment\n/ Volcano job"]
Q -->|"yes, Ray runtime"| ON["On K8s already?"]
ON -->|"no, bare metal HPC"| SLURM["Ray on Slurm\n(cluster-slurm)"]
ON -->|"yes"| SHAPE["Workload shape?"]
SHAPE -->|"run-to-completion\n(train / RL / batch)"| RJ["RayJob"]
SHAPE -->|"long-lived interactive\n/ dev cluster"| RC["RayCluster"]
SHAPE -->|"online serving + HA"| RS["RayService"]
- Use KubeRay when you have a Kubernetes cluster and the workload needs the Ray runtime: RL post-training (RL libraries), Ray Train (FSDP, DiLoCo), Ray Data batch GPU inference, or Ray Serve LLM.
- Prefer a plain Deployment / Volcano job when the workload is a containerized service or torchrun gang job with no distributed-Python coordination.
- Prefer Ray on Slurm (Slurm) for bare-metal HPC where Slurm already owns scheduling.
- CRD vs raw ray start. Never run bare Ray on a Kubernetes node in production; you lose quota, healing, and the autoscaler. Use a CRD.
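The flowchart and rules above reduce to a small lookup. This sketch only restates them in code; the function name `choose_target` and the shape labels are hypothetical and are not part of KubeRay.

```python
def choose_target(needs_ray: bool, on_kubernetes: bool, shape: str = "") -> str:
    """Hypothetical restatement of the decision flow above."""
    if not needs_ray:
        # No distributed-Python coordination: plain Kubernetes objects suffice.
        return "Deployment or Volcano job"
    if not on_kubernetes:
        # Bare-metal HPC where Slurm owns scheduling.
        return "Ray on Slurm"
    by_shape = {
        "run-to-completion": "RayJob",   # training, RL, batch
        "interactive": "RayCluster",     # long-lived dev cluster
        "serving": "RayService",         # online serving with HA
    }
    if shape not in by_shape:
        raise ValueError(f"unknown workload shape: {shape!r}")
    return by_shape[shape]
```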
How: implement, integrate, maintain¶
Install the operator¶
KubeRay ships CRDs + operator as a Helm chart. Install CRDs/operator first, then clusters.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# operator (installs the three CRDs). Pin the chart version.
helm install kuberay-operator kuberay/kuberay-operator \
  --version 1.6.2 \
  --namespace kuberay-system --create-namespace
kubectl get crd | grep ray.io # rayclusters / rayjobs / rayservices
kubectl -n kuberay-system rollout status deploy/kuberay-operator
Prerequisites on a GPU cluster: the GPU Operator (advertises nvidia.com/gpu) and, for RDMA, the network operator (advertises the RDMA resource, e.g. rdma/rdma_shared_device_a).
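The `kubectl get crd | grep ray.io` check above can also be scripted, for example as a CI gate before applying manifests. This is a sketch under assumptions: the helper names are hypothetical, and it relies on `kubectl get crd -o name` printing one `customresourcedefinition.apiextensions.k8s.io/<name>` entry per line.

```python
import subprocess

# The three CRDs named in this page.
EXPECTED_CRDS = {"rayclusters.ray.io", "rayjobs.ray.io", "rayservices.ray.io"}


def missing_ray_crds(crd_listing: str) -> set:
    """Return expected KubeRay CRDs absent from `kubectl get crd -o name` output."""
    found = set()
    for line in crd_listing.splitlines():
        # Keep only the part after the resource-type prefix, if there is one.
        name = line.strip().rsplit("/", 1)[-1]
        if name:
            found.add(name)
    return EXPECTED_CRDS - found


def list_crds() -> str:
    """Shell out to kubectl; needs a working kubeconfig and cluster access."""
    result = subprocess.run(
        ["kubectl", "get", "crd", "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A caller would run `missing_ray_crds(list_crds())` and fail the pipeline if the returned set is not empty.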
RayJob: GPU training with RDMA and an autoscaling worker group¶
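A RayJob runs to completion: the operator creates the cluster, runs the entrypoint, and, because shutdownAfterJobFinishes is true, deletes the cluster afterwards. Each worker requests the RDMA resource so that Ray Train's NCCL traffic uses GPUDirect RDMA rather than TCP. In K8sJobMode (the default) the operator creates a submitter Job.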
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rl-train
  namespace: ml
spec:
  entrypoint: python /workspace/train_rl.py
  shutdownAfterJobFinishes: true   # delete the RayCluster once the job finishes
  submissionMode: K8sJobMode       # default; submitter Job runs `ray job submit`
  rayClusterSpec:
    rayVersion: "2.55.1"
    enableInTreeAutoscaling: true
    autoscalerOptions:
      version: v2                  # V2 autoscaler (alpha, Ray 2.10+ / KubeRay 1.4+)
      upscalingMode: Default
      idleTimeoutSeconds: 60
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.55.1-py311-gpu
              resources:
                limits: { cpu: "8", memory: 32Gi }
                requests: { cpu: "8", memory: 32Gi }
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 4                # initial size; start at minReplicas
        minReplicas: 4
        maxReplicas: 16            # autoscaler grows to here on pending demand
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.55.1-py311-gpu
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    rdma/rdma_shared_device_a: "1"   # GPUDirect RDMA into the pod
                    cpu: "48"
                    memory: 512Gi
                  requests:
                    nvidia.com/gpu: "8"
                    rdma/rdma_shared_device_a: "1"
                    cpu: "48"
                    memory: 512Gi
                env:
                  - { name: NCCL_IB_HCA, value: "mlx5" }
                  - { name: NCCL_NET_GDR_LEVEL, value: "SYS" }
kubectl apply -f rayjob.yaml
kubectl -n ml get rayjob rl-train -w # JOB_STATUS -> SUCCEEDED
kubectl -n ml logs job/rl-train-<submitter-hash> # entrypoint logs
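Before applying a manifest like the one above, a few properties of each worker group can be checked offline. The sketch below is illustrative only: `check_worker_group` and `gpu_envelope` are hypothetical helpers, they expect one workerGroupSpecs entry already parsed into a dict with replicas, minReplicas and maxReplicas all set, and they treat any resource name containing `/` as an extended resource (a heuristic). The requests-equal-limits rule for extended resources is standard Kubernetes behaviour.

```python
def check_worker_group(group: dict) -> list:
    """Hypothetical pre-flight checks for one workerGroupSpecs entry."""
    problems = []
    lo, hi = group["minReplicas"], group["maxReplicas"]
    if not lo <= group["replicas"] <= hi:
        problems.append(f"replicas {group['replicas']} outside [{lo}, {hi}]")
    for container in group["template"]["spec"]["containers"]:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for name, value in requests.items():
            # Extended resources (GPU, RDMA) need requests equal to limits.
            if "/" in name and limits.get(name) != value:
                problems.append(f"{container['name']}: {name} request != limit")
    return problems


def gpu_envelope(group: dict) -> tuple:
    """Min and max GPUs the group can hold, useful when sizing quota."""
    per_pod = sum(
        int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
        for c in group["template"]["spec"]["containers"]
    )
    return group["minReplicas"] * per_pod, group["maxReplicas"] * per_pod
```

For the gpu-workers group above (4 to 16 replicas of 8 GPUs each), `gpu_envelope` gives 32 to 128 GPUs, which is the range a Kueue or Volcano quota would need to accommodate.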
RayService: HA OSS model serving (vLLM-backed)¶
serveConfigV2 carries the Ray Serve app graph; the operator does zero-downtime upgrades by standing up a new RayCluster and shifting traffic. See serving OSS models for the engine config.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: qwen3-serve
  namespace: ml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: app:app   # build_openai_app result, see cluster-ray
        route_prefix: /
  rayClusterConfig:
    rayVersion: "2.55.1"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.55.1-py311-gpu
              resources: { limits: { cpu: "8", memory: 32Gi } }
    workerGroupSpecs:
      - groupName: gpu-serve
        replicas: 2
        minReplicas: 2
        maxReplicas: 8
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.55.1-py311-gpu
                resources:
                  limits: { nvidia.com/gpu: "2", cpu: "32", memory: 256Gi }
kubectl apply -f rayservice.yaml
kubectl -n ml get rayservice qwen3-serve # wait for Running / serveStatus HEALTHY
kubectl -n ml port-forward svc/qwen3-serve-serve-svc 8000:8000
curl http://localhost:8000/v1/models # OpenAI-compatible
Integrate¶
- Dashboard / metrics. The head pod exports Prometheus metrics on :8080 (ray_* series). Scrape with a PodMonitor and wire alerts (telemetry, observability). Useful PromQL:
# pending tasks the autoscaler is reacting to
sum(ray_tasks{State="PENDING_NODE_ASSIGNMENT"})
# fraction of Ray GPU resources in use (USED over the total across all states)
sum(ray_resources{Name="GPU", State="USED"})
  / sum(ray_resources{Name="GPU"})
- Quota / gang scheduling. Ray pods are normal pods; place them under Kueue or schedule the worker group with Volcano for gang admission so a partial cluster does not deadlock.
- Health gating. Node taints from GPU health gating keep Ray workers off degraded GPUs; KubeRay reschedules onto healthy nodes.
Maintain: failure modes¶
- Workers without RDMA → Ray Train NCCL silently falls back to TCP and throughput craters. Run nccl-tests inside a worker (fabric bringup, smoke tests) with NCCL_DEBUG=INFO and confirm the transport lines report GDRDMA (for example NET/IB/0/GDRDMA).
- GCS is a single point of failure. Head-pod loss drops the cluster unless GCS fault tolerance (external Redis) is set. HA serving with RayService depends on it; for a RayService or a long-lived RayCluster, set gcsFaultToleranceOptions with a Redis backend.
- Autoscaler not scaling. Start replicas at minReplicas and keep it within [minReplicas, maxReplicas]; the V2 autoscaler only grows on actual pending tasks/actors, not on requests/limits headroom. Check the autoscaler sidecar logs on the head pod.
- Image / Ray version skew. Head and worker images must share the same Ray version, and rayVersion must match the image. Mismatch yields GCS handshake failures.
- Object-store spill. Large objects evicted to disk stall pipelines; size object_store_memory and watch spill metrics.
- Bundle ≠ physical GPU order. Placement-group bundle index does not map to GPU rank; pin device affinity inside the worker, not by bundle index. See topology-aware scheduling.
- SLO breach during serving. RayService traffic shift or pod loss surfaces as latency spikes; follow inference SLO breach. Throughput regressions on training: MFU regression. Track targets in the SLO/SLI catalog.
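The version-skew item in the list above is cheap to catch before deployment. The following sketch compares rayVersion with the version embedded in each group's image tag. It rests on assumptions: the helper names are hypothetical, the spec is already parsed into a dict, image tags begin with the Ray version as in rayproject/ray:2.55.1-py311-gpu, and the Ray container is the first container in each pod template.

```python
import re
from typing import Optional

# Matches a tag that starts with a three-part version, e.g. ":2.55.1-py311-gpu".
_TAG_VERSION = re.compile(r":(\d+\.\d+\.\d+)")


def image_ray_version(image: str) -> Optional[str]:
    """Extract the Ray version from an image tag, or None if absent."""
    match = _TAG_VERSION.search(image)
    return match.group(1) if match else None


def version_skew(cluster_spec: dict) -> list:
    """List head/worker groups whose image version differs from rayVersion."""
    wanted = cluster_spec["rayVersion"]
    groups = [("head", cluster_spec["headGroupSpec"])]
    groups += [(g["groupName"], g) for g in cluster_spec.get("workerGroupSpecs", [])]
    mismatches = []
    for label, group in groups:
        # Assumption: the Ray container is listed first; sidecars are skipped.
        image = group["template"]["spec"]["containers"][0]["image"]
        if image_ray_version(image) != wanted:
            mismatches.append(f"{label}: image {image} vs rayVersion {wanted}")
    return mismatches
```

Images pinned by digest or by a custom tag carry no parseable version and are reported as mismatches, which is a conservative default for a pre-flight check.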
References¶
- KubeRay docs: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
- KubeRay repo / releases: https://github.com/ray-project/kuberay/releases
- RayCluster quickstart: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html
- RayJob quickstart: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayjob-quick-start.html
- RayService quickstart: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html
- Autoscaling config (V2): https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html
- Helm charts: https://github.com/ray-project/kuberay/tree/master/helm-chart · repo https://ray-project.github.io/kuberay-helm/
- GCS fault tolerance: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html
- Ray metrics: https://docs.ray.io/en/latest/cluster/metrics.html
Related: Ray · Kubernetes · Orchestration · RL Libraries · Serving OSS Models · Inference Serving · GPU Health Gating · Glossary