
Container image supply-chain provenance for GPU platforms

Scope: ensuring the container images that run on your GPU nodes are exactly what you built and authorized. The chain from build to registry to node: digest pinning, signing (cosign/Sigstore), provenance attestations (SLSA) and SBOMs, admission-time verification that blocks unsigned or unauthorized images, vulnerability scanning, and registry/key controls. This is the deep page behind the "control image provenance" bullet in security & multi-tenancy; it verifies the image the way remote GPU verification verifies the hardware. Distinct from node image & config management (OS/golden images) and the container toolkit (GPU runtime).

What it is

A container image is the payload that runs on your GPUs; its supply chain (base image, dependencies, build pipeline, registry, pull) is the attack surface. Image provenance is the set of controls that make that chain verifiable end to end:

Digest pinning: reference images by content digest (image@sha256:…), never a mutable tag like :latest. The digest is the integrity check: the bytes cannot change underneath you.
Signing: cryptographically sign the image at build so a verifier can confirm it came from your pipeline. cosign (part of Sigstore) is the de-facto tool; keyless signing binds a signature to a CI identity via an OIDC certificate (Fulcio) and records it in a transparency log (Rekor), so there is no long-lived key to leak.
Provenance + SBOM: a SLSA provenance attestation records how and where the image was built; an SBOM (e.g. via syft, in SPDX/CycloneDX) records what is inside it. Both are attached to the image as signed attestations.
Admission-time verification: a cluster admission controller (Kyverno verifyImages, the Sigstore policy-controller, or Connaisseur) checks the signature/provenance of every image at deploy time and rejects anything unsigned, unverified, or from a disallowed registry. Verification without enforcement is theatre; this is the enforcement.
Scanning + registry control: vulnerability scanning (trivy/grype) in CI and at rest, a private registry with scoped pull credentials, and a policy that restricts which registries the cluster may pull from at all.
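
To make the digest-pinning point concrete, here is a minimal Python sketch of content addressing. The manifest bytes are invented for illustration; a real registry computes the digest over the actual image manifest, but the property is the same: the reference is a hash of the content.

```python
import hashlib


def digest_of(content: bytes) -> str:
    """Return a registry-style content digest for the given bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


# Hypothetical manifest bytes, invented for this illustration.
original = b'{"layers": ["base", "cuda", "app"]}'
tampered = b'{"layers": ["base", "cuda", "app", "backdoor"]}'

pinned = digest_of(original)

# The same bytes always give the same digest; any change gives a different
# one, so a reference pinned to `pinned` cannot resolve to the tampered bytes.
assert digest_of(original) == pinned
assert digest_of(tampered) != pinned
```

A tag, by contrast, is only a name the registry maps to a digest, and that mapping can be rewritten at any time.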

flowchart LR
  BUILD["CI build"] --> SIGN["cosign sign (keyless: Fulcio/Rekor)"]
  BUILD --> SBOM["SBOM (syft) + SLSA provenance"]
  SIGN --> REG["Private registry (digest-addressed)"]
  SBOM --> REG
  REG --> ADM{"Admission verify:<br/>signed? provenance ok?<br/>allowed registry?"}
  ADM -->|"no"| REJECT["Reject deploy"]
  ADM -->|"yes"| NODE["GPU node pulls by digest"]
  SCAN["trivy scan (CI + registry)"] -.-> REG
  KMS["Cloud KMS / keyless identity"] -.-> SIGN

Why it matters

On a GPU platform the cost of a bad image is unusually high: an unauthorized or tampered image is code execution on expensive accelerators with access to model weights, training data, and tenant workloads. Three concrete threats:

Tampering / registry compromise. A mutable :latest tag lets an attacker (or a compromised registry) swap the image bytes silently; the next pull runs their code. Digest pinning + signature verification closes this: the digest cannot change and an unsigned swap is rejected at admission.
Unauthorized images on shared GPUs. In a multi-tenant platform, admission verification ensures only images from your pipeline (or an allow-listed source) ever land on a GPU node, not whatever a tenant references.
Dependency / build-chain attacks. SBOM + SLSA provenance make "what is in this image and how was it built" auditable, so a poisoned dependency or an unexpected builder is detectable rather than invisible.

It is also a reproducibility and compliance lever: digest-pinned, attested images make a run reproducible and satisfy supply-chain requirements (SLSA, regulated environments) that increasingly gate enterprise and sovereign deployments.

When to use it (and when not)

Apply the full chain when:

The platform is multi-tenant, runs third-party/community images, or pulls images onto nodes that touch sensitive weights or data (security & multi-tenancy, split-plane platform).
You face compliance/provenance requirements (SLSA level targets, SBOM mandates, regulated or sovereign deployments).
Images run on infrastructure you do not fully control (rented/neocloud GPU nodes), where the registry-to-node path crosses trust boundaries.

Scale it down when:

It is a single-tenant lab running only your own images on owned nodes: digest pinning and CI scanning are still worth it, but cluster-wide keyless signing + admission enforcement may be more machinery than the threat warrants. Start with pinning + scanning; add signing/verification as tenancy or exposure grows.

Always do the minimum (pin digests, never :latest) regardless of scale; it is nearly free and prevents the most common silent-swap failure.

How to implement and operate

Reference commands (unexecuted). Confirm cosign/Sigstore, Kyverno, and SBOM tooling flags against current docs for your versions; signing and policy APIs evolve.

1. Pin every image by digest

Resolve tags to digests and reference the digest in manifests. This is the baseline integrity and reproducibility control.

docker pull nvcr.io/nvidia/pytorch:24.10-py3
docker inspect --format='{{index .RepoDigests 0}}' nvcr.io/nvidia/pytorch:24.10-py3
# -> nvcr.io/nvidia/pytorch@sha256:<digest>   # reference THIS in manifests, not the tag
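
Pinning only holds if nothing unpinned slips back in. One way to keep it honest is a small CI check over manifests. The following is a sketch, not a full YAML parser: it assumes image references appear on plain `image:` lines, and the manifest and digest below are placeholders.

```python
import re

# "image: <ref>" lines as they commonly appear in Kubernetes manifests.
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*[\"']?([^\s\"'#]+)", re.MULTILINE)
# A pinned reference ends in @sha256: followed by 64 hex characters.
PINNED = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(manifest_text: str) -> list[str]:
    """Return every image reference that is not pinned by digest."""
    refs = IMAGE_LINE.findall(manifest_text)
    return [ref for ref in refs if not PINNED.search(ref)]


# Hypothetical manifest; the digest is a placeholder of the right length.
manifest = """
containers:
  - name: trainer
    image: registry.example.com/train@sha256:{digest}
  - name: sidecar
    image: registry.example.com/logger:latest
""".format(digest="a" * 64)

# Lists the sidecar reference, which uses a mutable tag.
print(unpinned_images(manifest))
```

In CI, a non-empty result would fail the job, so an unpinned reference never reaches the cluster.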

2. Sign at build (keyless preferred)

Sign in CI with cosign; keyless signing binds the signature to the CI's OIDC identity (no long-lived key), logged in Rekor for transparency. If you use keys, hold them in a cloud KMS, never in the repo.

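# Keyless (CI with an OIDC token): identity is the workflow, recorded in the transparency log.
# cosign 2.x signs keyless by default; COSIGN_EXPERIMENTAL=1 was only needed on 1.x.
# --yes skips the interactive confirmation prompt in CI.
cosign sign --yes "$IMAGE@sha256:$DIGEST"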
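
A signature is only useful if something checks it against an expected identity. As a sketch of what a manual check after signing might look like, the helper below builds a `cosign verify` command line for a keyless signature. The flag names are those of cosign 2.x as far as we know; confirm them against your version. The helper name, image reference, and identity pattern are hypothetical.

```python
import re

DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def cosign_verify_argv(image_ref: str, issuer: str, identity_regexp: str) -> list[str]:
    """Build a `cosign verify` command line for a keyless signature.

    Refuses tag-only references so the check always targets exact bytes.
    """
    if not DIGEST_REF.match(image_ref):
        raise ValueError(f"not a digest-pinned reference: {image_ref}")
    return [
        "cosign", "verify",
        "--certificate-oidc-issuer", issuer,
        "--certificate-identity-regexp", identity_regexp,
        image_ref,
    ]


argv = cosign_verify_argv(
    "registry.example.com/train@sha256:" + "a" * 64,  # placeholder digest
    issuer="https://token.actions.githubusercontent.com",
    identity_regexp=r"^https://github\.com/your-org/",
)
print(" ".join(argv))
# To execute it for real: subprocess.run(argv, check=True)
```

Pinning both the issuer and the identity matters: a valid keyless signature from some other workflow is still a valid signature, just not yours.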

3. Attach an SBOM and provenance

Generate an SBOM and attach it (and a SLSA provenance attestation) as signed attestations so "what's inside" and "how it was built" travel with the image.

syft "$IMAGE@sha256:$DIGEST" -o spdx-json > sbom.spdx.json
# cosign 2.x attests keyless by default; COSIGN_EXPERIMENTAL=1 was only needed on 1.x.
cosign attest --yes --predicate sbom.spdx.json --type spdxjson "$IMAGE@sha256:$DIGEST"
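
The payoff of an SBOM comes when a CVE lands and you need to know which images contain the affected package. A minimal sketch of that lookup over an SPDX JSON document follows. The embedded SBOM is a trimmed, hypothetical example (real syft output carries many more fields); `packages`, `name`, and `versionInfo` are standard SPDX 2.x JSON field names.

```python
import json

# Trimmed, hypothetical SPDX JSON document.
sbom_json = """
{
  "spdxVersion": "SPDX-2.3",
  "name": "registry.example.com/train",
  "packages": [
    {"name": "openssl", "versionInfo": "3.0.2"},
    {"name": "numpy", "versionInfo": "1.26.4"}
  ]
}
"""


def find_package(sbom: dict, name: str) -> list[str]:
    """Return the versions of `name` recorded in an SPDX SBOM."""
    return [
        pkg.get("versionInfo", "unknown")
        for pkg in sbom.get("packages", [])
        if pkg.get("name") == name
    ]


sbom = json.loads(sbom_json)
print(find_package(sbom, "openssl"))  # versions of openssl in this image
print(find_package(sbom, "log4j"))    # empty list: package not present
```

Run across the SBOMs of every deployed digest, this turns "are we exposed?" into a query instead of an investigation.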

4. Verify at admission and reject on failure

Enforce in-cluster so an unsigned, unverified, or disallowed-registry image cannot be deployed. Kyverno's verifyImages (or the Sigstore policy-controller) mutates references to digests and blocks failures.

# Kyverno verifyImages (reference shape) — fail closed.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: require-signed-images }
spec:
  validationFailureAction: Enforce        # block, do not just audit
  rules:
    - name: verify-signature
      match: { any: [ { resources: { kinds: [ Pod ] } } ] }
      verifyImages:
        - imageReferences: [ "registry.example.com/*" ]
          attestors:
            - entries: [ { keyless: { issuer: "https://token.actions.githubusercontent.com", subject: "https://github.com/your-org/*" } } ]

Pair it with a rule that restricts allowed registries so the cluster cannot pull from anywhere else.
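
The decision an admission controller makes can be sketched in a few lines of Python. This models only the policy logic, assuming the cryptographic verification has already produced a signer identity (or none). The allow-list, type, and function names are hypothetical, and the sketch rejects tag references outright, which is stricter than rewriting them to digests.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional

ALLOWED_REGISTRIES = ("registry.example.com/",)  # hypothetical allow-list
TRUSTED_ISSUER = "https://token.actions.githubusercontent.com"
TRUSTED_SUBJECT = "https://github.com/your-org/*"


@dataclass
class Signature:
    issuer: str   # OIDC issuer from the signing certificate
    subject: str  # CI identity from the signing certificate


def admit(image_ref: str, signature: Optional[Signature]) -> tuple[bool, str]:
    """Return (allowed, reason); any failed check rejects (fail closed)."""
    if not image_ref.startswith(ALLOWED_REGISTRIES):
        return False, "registry not on allow-list"
    if "@sha256:" not in image_ref:
        return False, "image not pinned by digest"
    if signature is None:
        return False, "image is unsigned"
    if signature.issuer != TRUSTED_ISSUER:
        return False, "untrusted issuer"
    if not fnmatchcase(signature.subject, TRUSTED_SUBJECT):
        return False, "untrusted signer identity"
    return True, "verified"


ref = "registry.example.com/train@sha256:" + "a" * 64  # placeholder digest
good = Signature(TRUSTED_ISSUER, "https://github.com/your-org/train/.github/workflows/build.yml@refs/heads/main")

print(admit(ref, good))
print(admit(ref, None))
print(admit("docker.io/someone/miner:latest", good))
```

Note the shape: there is exactly one path to "allowed", and every other path returns a rejection with a reason that can be logged.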

5. Scan, and gate on findings

Run trivy/grype in CI and against the registry; fail the build on any finding at or above the severity your policy names (the command below gates on CRITICAL and HIGH). Scanning that never blocks is just a dashboard.

trivy image --severity CRITICAL,HIGH --exit-code 1 "$IMAGE@sha256:$DIGEST"
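
When the gate needs more logic than an exit code (for example, to log which findings blocked the build), the same decision can be made from a scanner's JSON report. The sketch below assumes a report shaped like Trivy's JSON output (`Results`, `Vulnerabilities`, `VulnerabilityID`, `Severity`); the embedded report and CVE identifiers are placeholders.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}  # mirrors --severity CRITICAL,HIGH above

# Trimmed, hypothetical report; the CVE IDs are placeholders.
report_json = """
{
  "Results": [
    {"Target": "os-packages", "Vulnerabilities": [
      {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
      {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
    ]},
    {"Target": "python-packages"}
  ]
}
"""


def blocking_findings(report: dict) -> list[str]:
    """Return the IDs of findings whose severity should fail the build."""
    ids = []
    for result in report.get("Results", []):
        # "Vulnerabilities" may be absent when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                ids.append(vuln["VulnerabilityID"])
    return ids


findings = blocking_findings(json.loads(report_json))
exit_code = 1 if findings else 0  # non-zero fails the CI job
print(findings, exit_code)
```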

6. Operate the chain

Rotate signing keys (or rely on keyless to avoid them). Monitor admission verification failures as a security signal in observability & monitoring. Keep a documented break-glass path for emergencies that is audited and time-boxed. Re-scan images at rest as new CVEs land. Use a private registry with scoped, rotatable pull credentials held in a KMS.
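
"Audited and time-boxed" can be made mechanical. One possible shape for a break-glass exemption record is sketched below; the field names, the four-hour limit, and the sample values are all hypothetical. An exemption applies to one image, must name an approver and a reason, and expires on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=4)  # hypothetical policy limit


@dataclass(frozen=True)
class Exemption:
    image_ref: str
    approved_by: str
    reason: str
    granted_at: datetime
    expires_at: datetime


def exemption_active(ex: Exemption, image_ref: str, now: datetime) -> bool:
    """True only for the named image, inside a bounded, unexpired window."""
    if ex.image_ref != image_ref:
        return False
    if not ex.approved_by or not ex.reason:
        return False  # no audit trail, no exemption
    if ex.expires_at - ex.granted_at > MAX_WINDOW:
        return False  # window longer than policy allows
    return ex.granted_at <= now < ex.expires_at


t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
ref = "registry.example.com/hotfix@sha256:" + "c" * 64  # placeholder digest
ex = Exemption(ref, "oncall-lead", "incident-123", t0, t0 + timedelta(hours=2))

print(exemption_active(ex, ref, t0 + timedelta(hours=1)))  # inside the window
print(exemption_active(ex, ref, t0 + timedelta(hours=3)))  # after expiry
```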

Failure modes

:latest / mutable tags. Non-reproducible and silently swappable; a registry compromise changes the running code on the next pull. Pin digests.
Signing without enforcement. Images are signed but admission never checks: no protection at all. Enforce verification cluster-wide and fail closed.
Unsigned base images. A signed top layer over an unverified base inherits the base's risk. Verify the whole chain, including bases.
Key compromise. A leaked signing key forges trust. Prefer keyless (OIDC + transparency log) or a KMS-held key; never commit keys.
No SBOM/provenance. When a CVE lands, you cannot tell which images are exposed, or whether an image came from your pipeline. Attest both.
Allow-any-registry. Without a registry allow-list, a tenant or typo pulls an arbitrary image onto a GPU. Restrict registries at admission.
Scan-but-don't-block. CVEs reported and ignored. Gate the build/admission on the severity policy.
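
The unsigned-base failure mode is easy to catch early. As a sketch, the check below flags `FROM` lines in a Dockerfile whose base image is not pinned by digest. It is a simple line scanner, not a full Dockerfile parser: it skips `scratch` and references to earlier build stages, and it would report an `ARG`-substituted base as unpinned. The Dockerfile and digest are placeholders.

```python
import re

PINNED = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_bases(dockerfile_text: str) -> list[str]:
    """Return base images in FROM lines that are not pinned by digest."""
    stages: set[str] = set()
    unpinned = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if not parts or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_stage_ref = image in stages  # refers to an earlier "AS <name>"
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2])
        if image == "scratch" or is_stage_ref:
            continue
        if not PINNED.search(image):
            unpinned.append(image)
    return unpinned


# Hypothetical Dockerfile; the digest is a placeholder of the right length.
dockerfile = f"""
FROM nvcr.io/nvidia/pytorch:24.10-py3 AS build
RUN pip install -r requirements.txt
FROM --platform=linux/amd64 registry.example.com/runtime@sha256:{'b' * 64}
COPY --from=build /app /app
"""

# Lists the first base image, which is referenced by tag.
print(unpinned_bases(dockerfile))
```

Pinning the base is the precondition for verifying it: only a digest-pinned base can be tied to a specific signature and SBOM.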

Open questions & validation

A drill: deploy a deliberately unsigned image and one with a tampered digest, and confirm admission rejects both; verification that fails open is worse than none.
Keyless vs. KMS-key signing for your trust model and audit needs; transparency-log retention.
SLSA level you actually target, and whether your build pipeline meets it (hermetic build, provenance completeness).
Coverage: every namespace and every node enforce the policy (no unguarded path onto a GPU), including system/DaemonSet images.
A break-glass procedure that is fast, audited, time-boxed, and tested.

References

Sigstore (cosign, Fulcio, Rekor — signing and transparency): https://www.sigstore.dev/
cosign (container signing/attestation): https://github.com/sigstore/cosign
SLSA (Supply-chain Levels for Software Artifacts — provenance): https://slsa.dev/
in-toto (attestation framework): https://in-toto.io/
Syft (SBOM generation, SPDX/CycloneDX): https://github.com/anchore/syft
Trivy (image vulnerability scanning): https://github.com/aquasecurity/trivy
Kyverno — verify images (admission-time signature/attestation checks): https://kyverno.io/docs/writing-policies/verify-images/
Sigstore policy-controller (admission enforcement): https://docs.sigstore.dev/policy-controller/overview/