
KubeRay: Ray on Kubernetes

Scope: running Ray on Kubernetes with the KubeRay operator, covering install, the RayCluster / RayJob / RayService CRDs, exposing GPUs and RDMA to Ray pods, autoscaling, and failure modes / maintenance.

The manifests below are reference templates written against real APIs; pin chart and image versions and validate on your hardware before production use.

What it is

KubeRay (ray-project/kuberay) is the Kubernetes operator that runs Ray clusters as native Kubernetes objects. It watches three Custom Resource Definitions and reconciles them into pods:

RayCluster is a long-lived cluster: one head pod (GCS, scheduler, dashboard) plus one or more workerGroupSpecs, each a homogeneous pool of worker pods. This is the unit of GPU/RDMA exposure.
RayJob runs to completion. It creates (or attaches to) a RayCluster, submits a job, and optionally tears the cluster down (shutdownAfterJobFinishes). This is the natural fit for distributed training and RL post-training.
RayService is HA serving. It wraps a RayCluster plus a Ray Serve app config, and performs zero-downtime upgrades by standing up a new cluster and shifting traffic to it. It is the backing primitive for OSS model serving (serving OSS models, inference serving).

As of mid-2026 the stable line is KubeRay v1.6.2 (operator and CRDs); the ray-cluster Helm chart tracks the same minor. Pin the exact tag; CRD schemas change across minors. Verify on the KubeRay releases page.

Why it matters

One scheduler, not two. Bare Ray on a Kubernetes node duplicates scheduling and bypasses cluster quota. KubeRay makes Ray pods first-class, so the GPU Operator, Kueue quota / Volcano, and node taints all govern Ray the same as any workload (orchestration overview).
Declarative lifecycle. RayCluster/RayJob/RayService are reconciled objects: GitOps applies them, the operator heals pod loss, RayService gives HA serving without bespoke glue.
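Autoscaling that reads Ray demand. The in-tree autoscaler scales worker pods on pending Ray tasks, actors, and placement groups, not on CPU%, so a placement-group request for 8 GPUs provisions enough worker pods to supply 8 GPUs (one pod if workers carry 8 GPUs each, eight pods if they carry 1).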
GPU + RDMA reuse the pod contract. GPUs via nvidia.com/gpu, RDMA via the network operator's resource, the same limits any GPU pod uses (Kubernetes for GPUs, network operator).
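
The demand-driven scaling point can be made concrete with a little arithmetic. The sketch below is a simplified illustration, not the autoscaler's actual algorithm (which bin-packs whole resource bundles, including CPU and memory); the function name `workers_for_gpu_demand` and its parameters are hypothetical.

```python
import math


def workers_for_gpu_demand(gpu_demand: int, gpus_per_worker: int,
                           min_replicas: int, max_replicas: int) -> int:
    """Worker pods needed to cover a GPU demand, clamped to the group bounds.

    Simplified illustration only: GPUs are the sole resource considered.
    """
    if gpus_per_worker <= 0:
        raise ValueError("gpus_per_worker must be positive")
    needed = math.ceil(gpu_demand / gpus_per_worker)
    return max(min_replicas, min(needed, max_replicas))


# The same 8-GPU request maps to a different pod count per worker shape.
print(workers_for_gpu_demand(8, 8, min_replicas=0, max_replicas=16))  # 1
print(workers_for_gpu_demand(8, 1, min_replicas=0, max_replicas=16))  # 8
```

The clamp mirrors how minReplicas and maxReplicas bound a worker group: demand beyond maxReplicas stays pending rather than adding pods.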

When it is needed (and when not)

flowchart LR
  Q["Workload on K8s\nneeds Ray?"] -->|"no distributed-Python"| K8S["Plain K8s Deployment\n/ Volcano job"]
  Q -->|"yes, Ray runtime"| ON["On K8s already?"]
  ON -->|"no, bare metal HPC"| SLURM["Ray on Slurm\n(cluster-slurm)"]
  ON -->|"yes"| SHAPE["Workload shape?"]
  SHAPE -->|"run-to-completion\n(train / RL / batch)"| RJ["RayJob"]
  SHAPE -->|"long-lived interactive\n/ dev cluster"| RC["RayCluster"]
  SHAPE -->|"online serving + HA"| RS["RayService"]

Use KubeRay when you have a Kubernetes cluster and the workload needs the Ray runtime: RL post-training (RL libraries), Ray Train (FSDP, DiLoCo), Ray Data batch GPU inference, or Ray Serve LLM.
Prefer a plain Deployment / Volcano job when the workload is a containerized service or torchrun gang job with no distributed-Python coordination.
Prefer Ray on Slurm (Slurm) for bare-metal HPC where Slurm already owns scheduling.
CRD vs raw ray start. Never run bare Ray on a Kubernetes node in production; you lose quota, healing, and the autoscaler. Use a CRD.
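
The flowchart and rules above reduce to a small lookup. This is only a sketch of that decision logic for use in, say, a platform CLI or review checklist; `choose_runtime` and the shape labels are hypothetical names, not part of KubeRay.

```python
def choose_runtime(needs_ray: bool, on_kubernetes: bool, shape: str = "") -> str:
    """Mirror the decision flow: return the suggested way to run a workload."""
    if not needs_ray:
        return "plain Deployment / Volcano job"
    if not on_kubernetes:
        return "Ray on Slurm"
    by_shape = {
        "run-to-completion": "RayJob",    # training, RL, batch
        "interactive": "RayCluster",      # long-lived dev cluster
        "serving": "RayService",          # online serving with HA
    }
    if shape not in by_shape:
        raise ValueError(f"unknown workload shape: {shape!r}")
    return by_shape[shape]


print(choose_runtime(True, True, "run-to-completion"))  # RayJob
```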

How: implement, integrate, maintain

Install the operator

KubeRay ships CRDs + operator as a Helm chart. Install CRDs/operator first, then clusters.

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# operator (installs the three CRDs). Pin the chart version.
helm install kuberay-operator kuberay/kuberay-operator \
  --version 1.6.2 \
  --namespace kuberay-system --create-namespace

kubectl get crd | grep ray.io          # rayclusters / rayjobs / rayservices
kubectl -n kuberay-system rollout status deploy/kuberay-operator

Prerequisites, normally already present on a GPU cluster: the GPU Operator (advertises nvidia.com/gpu) and, for RDMA, the network operator (advertises the RDMA resource, e.g. rdma/rdma_shared_device_a).
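
Before applying any Ray manifests, it can help to confirm that nodes actually advertise these resources. The sketch below inspects the JSON shape returned by `kubectl get nodes -o json` (each node's `status.allocatable` map); the helper name `nodes_missing_resources` and the sample node names are hypothetical, and the RDMA resource name depends on how the network operator is configured.

```python
def nodes_missing_resources(node_list: dict, required: tuple[str, ...]) -> dict[str, list[str]]:
    """Map node name -> required extended resources it does not advertise."""
    missing = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # A resource counts as absent if it is not listed or listed as "0".
        absent = [r for r in required if allocatable.get(r, "0") == "0"]
        if absent:
            missing[node["metadata"]["name"]] = absent
    return missing


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-0"},
         "status": {"allocatable": {"nvidia.com/gpu": "8",
                                    "rdma/rdma_shared_device_a": "63"}}},
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
    ]
}

required = ("nvidia.com/gpu", "rdma/rdma_shared_device_a")
print(nodes_missing_resources(sample, required))
```

On a real cluster, load the kubectl output with `json.load` and restrict the node list to the GPU pool first, since CPU-only nodes would otherwise be reported as missing both resources.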

RayJob: GPU training with RDMA and an autoscaling worker group

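Run-to-completion: the operator creates the cluster, runs the entrypoint, then tears the cluster down. Each worker requests an RDMA device so that Ray Train's NCCL traffic can use GPUDirect RDMA instead of falling back to TCP. In K8sJobMode (the default), the operator creates a submitter Job that runs ray job submit.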

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rl-train
  namespace: ml
spec:
  entrypoint: python /workspace/train_rl.py
  shutdownAfterJobFinishes: true        # delete the RayCluster once the job finishes
  submissionMode: K8sJobMode            # default; submitter Job runs `ray job submit`
  rayClusterSpec:
    rayVersion: "2.55.1"
    enableInTreeAutoscaling: true
    autoscalerOptions:
      version: v2                       # V2 autoscaler (alpha, Ray 2.10+ / KubeRay 1.4+)
      upscalingMode: Default
      idleTimeoutSeconds: 60
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.55.1-py311-gpu
              resources:
                limits:   { cpu: "8",  memory: 32Gi }
                requests: { cpu: "8",  memory: 32Gi }
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 4                      # initial size; start at minReplicas
        minReplicas: 4
        maxReplicas: 16                  # autoscaler grows to here on pending demand
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.55.1-py311-gpu
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    rdma/rdma_shared_device_a: "1"   # GPUDirect RDMA into the pod
                    cpu: "48"
                    memory: 512Gi
                  requests:
                    nvidia.com/gpu: "8"
                    rdma/rdma_shared_device_a: "1"
                    cpu: "48"
                    memory: 512Gi
                env:
                  - { name: NCCL_IB_HCA, value: "mlx5" }
                  - { name: NCCL_NET_GDR_LEVEL, value: "SYS" }

kubectl apply -f rayjob.yaml
kubectl -n ml get rayjob rl-train -w               # JOB_STATUS -> SUCCEEDED
kubectl -n ml logs job/rl-train                    # entrypoint logs; the submitter Job shares the RayJob's name

RayService: HA OSS model serving (vLLM-backed)

serveConfigV2 carries the Ray Serve app graph; the operator does zero-downtime upgrades by standing up a new RayCluster and shifting traffic. See serving OSS models for the engine config.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: qwen3-serve
  namespace: ml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: app:app           # build_openai_app result, see cluster-ray
        route_prefix: /
  rayClusterConfig:
    rayVersion: "2.55.1"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.55.1-py311-gpu
              resources: { limits: { cpu: "8", memory: 32Gi } }
    workerGroupSpecs:
      - groupName: gpu-serve
        replicas: 2
        minReplicas: 2
        maxReplicas: 8                   # ceiling only takes effect with enableInTreeAutoscaling: true
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.55.1-py311-gpu
                resources:
                  limits: { nvidia.com/gpu: "2", cpu: "32", memory: 256Gi }

kubectl apply -f rayservice.yaml
kubectl -n ml get rayservice qwen3-serve            # wait for Running / serveStatus HEALTHY
kubectl -n ml port-forward svc/qwen3-serve-serve-svc 8000:8000
curl http://localhost:8000/v1/models                # OpenAI-compatible
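
The same smoke test can be scripted, for example in a post-deploy check. This is a minimal sketch using only the standard library; it assumes the endpoint returns the usual OpenAI-style list payload (`{"data": [{"id": ...}]}`). The helper name `list_model_ids` and its injectable `opener` parameter are hypothetical, added so the function can be exercised without a live cluster.

```python
import json
import urllib.request


def list_model_ids(base_url: str, opener=urllib.request.urlopen,
                   timeout: float = 10.0) -> list[str]:
    """Return model ids from an OpenAI-compatible GET /v1/models endpoint."""
    url = base_url.rstrip("/") + "/v1/models"
    with opener(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed response shape: {"object": "list", "data": [{"id": "..."}, ...]}
    return [model["id"] for model in payload.get("data", [])]
```

With the port-forward above running, `list_model_ids("http://localhost:8000")` should return the ids of the models the Serve app exposes; an empty list or a connection error means the application is not yet healthy.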

Integrate

Dashboard / metrics. Each Ray pod, head and workers, exports Prometheus metrics on :8080 (ray_* series). Scrape with a PodMonitor and wire alerts (telemetry, observability). Useful PromQL:

# pending tasks the autoscaler is reacting to
sum(ray_tasks{State="PENDING_NODE_ASSIGNMENT"})

# fraction of logical GPUs currently claimed by tasks/actors (used / total)
sum(ray_resources{Name="GPU", State="USED"})
/ sum(ray_resources{Name="GPU"})

Quota / gang scheduling. Ray pods are normal pods; place them under Kueue or schedule the worker group with Volcano for gang admission so a partial cluster does not deadlock.
Health gating. Node taints from GPU health gating keep Ray workers off degraded GPUs; when KubeRay recreates a lost worker pod, the Kubernetes scheduler places it on a healthy node.
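
The PromQL above can also be evaluated from a script, for instance to gate a rollout on pending work. The sketch below targets the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`) using only the standard library. The Prometheus base URL is whatever your monitoring stack exposes, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PENDING_TASKS = 'sum(ray_tasks{State="PENDING_NODE_ASSIGNMENT"})'


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query


def scalar_from_response(payload: dict) -> float:
    """Extract the single sample from an instant-vector response; 0.0 if empty."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return 0.0
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def pending_tasks(prometheus_url: str, timeout: float = 10.0) -> float:
    """Number of Ray tasks waiting for a node, as seen by Prometheus."""
    url = instant_query_url(prometheus_url, PENDING_TASKS)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return scalar_from_response(json.load(resp))
```

An empty result is treated as zero because `sum()` over no matching series returns no samples, which is the normal state of an idle cluster.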

Maintain: failure modes

Workers without RDMA → Ray Train NCCL silently falls back to TCP and throughput craters. Run nccl-tests inside a worker (fabric bringup, smoke tests) with NCCL_DEBUG=INFO and confirm the transport lines show NET/IB with GDRDMA rather than NET/Socket.
GCS is a single point of failure. Head-pod loss drops the cluster unless GCS fault tolerance (external Redis) is set. HA serving with RayService depends on it, and a long-lived RayCluster benefits equally: set gcsFaultToleranceOptions with a Redis backend.
Autoscaler not scaling. Start replicas at minReplicas and keep it within [minReplicas, maxReplicas]; the V2 autoscaler only grows on actual pending tasks/actors, not on requests/limits headroom. Check the autoscaler sidecar logs on the head pod (container name: autoscaler).
Image / Ray version skew. Head and worker images must share the same Ray version, and rayVersion must match the image. Mismatch yields GCS handshake failures.
Object-store spill. Large objects evicted to disk stall pipelines; size object_store_memory and watch spill metrics.
Bundle ≠ physical GPU order. Placement-group bundle index does not map to GPU rank; pin device affinity inside the worker, not by bundle index. See topology-aware scheduling.
SLO breach during serving. RayService traffic shift or pod loss surfaces as latency spikes; follow inference SLO breach. Throughput regressions on training: MFU regression. Track targets in the SLO/SLI catalog.
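
Several of these failure modes are visible in the manifest before anything is deployed. The sketch below lints a rayClusterSpec-shaped dict (for example, one parsed from the RayJob manifest) for version skew, replica bounds, and GPU workers with no RDMA resource. `lint_ray_cluster_spec` is a hypothetical helper, and the checks are deliberately naive: the image tag is taken as everything after the last colon, and the RDMA check assumes the resource name starts with `rdma/`.

```python
def _containers(group: dict) -> list:
    return group.get("template", {}).get("spec", {}).get("containers", [])


def lint_ray_cluster_spec(spec: dict, require_rdma: bool = True) -> list[str]:
    """Return warnings for a rayClusterSpec-shaped dict (hypothetical helper)."""
    warnings = []
    ray_version = spec.get("rayVersion", "")
    workers = spec.get("workerGroupSpecs", [])
    groups = [("head", spec.get("headGroupSpec", {}))]
    groups += [(g.get("groupName", "?"), g) for g in workers]

    # Image / Ray version skew: every image tag should start with rayVersion.
    for name, group in groups:
        for c in _containers(group):
            tag = c.get("image", "").rpartition(":")[2]
            if ray_version and not tag.startswith(ray_version):
                warnings.append(f"{name}: image tag {tag!r} != rayVersion {ray_version!r}")

    for g in workers:
        name = g.get("groupName", "?")
        # Autoscaler bounds: replicas should sit inside [minReplicas, maxReplicas].
        lo, hi, n = g.get("minReplicas"), g.get("maxReplicas"), g.get("replicas")
        if None not in (lo, hi, n) and not lo <= n <= hi:
            warnings.append(f"{name}: replicas {n} outside [{lo}, {hi}]")
        # RDMA: a GPU worker with no rdma/* limit risks NCCL falling back to TCP.
        for c in _containers(g):
            limits = c.get("resources", {}).get("limits", {})
            if require_rdma and "nvidia.com/gpu" in limits and not any(
                k.startswith("rdma/") for k in limits
            ):
                warnings.append(f"{name}: GPU worker requests no rdma/* resource")
    return warnings


# Deliberately broken example: skewed worker image, replicas below the
# minimum, and no RDMA resource on the GPU worker.
spec = {
    "rayVersion": "2.55.1",
    "headGroupSpec": {"template": {"spec": {"containers": [
        {"name": "ray-head", "image": "rayproject/ray:2.55.1-py311-gpu"}]}}},
    "workerGroupSpecs": [{
        "groupName": "gpu-workers",
        "replicas": 2, "minReplicas": 4, "maxReplicas": 16,
        "template": {"spec": {"containers": [{
            "name": "ray-worker",
            "image": "rayproject/ray:2.54.0-py311-gpu",
            "resources": {"limits": {"nvidia.com/gpu": "8"}},
        }]}},
    }],
}

for warning in lint_ray_cluster_spec(spec):
    print(warning)
```

Pass `require_rdma=False` for single-node serving groups, where no cross-node NCCL traffic is expected and the RDMA warning would be noise.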

References

KubeRay docs: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
KubeRay repo / releases: https://github.com/ray-project/kuberay/releases
RayCluster quickstart: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html
RayJob quickstart: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayjob-quick-start.html
RayService quickstart: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html
Autoscaling config (V2): https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html
Helm charts: https://github.com/ray-project/kuberay/tree/master/helm-chart · repo https://ray-project.github.io/kuberay-helm/
GCS fault tolerance: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html
Ray metrics: https://docs.ray.io/en/latest/cluster/metrics.html