Remote GPU verification: proving rented hardware

Scope: how to verify that a GPU you do not own (a rented neocloud instance, a node in a decentralized marketplace, a provider you cannot physically inspect) actually is the hardware it claims to be: the right model, with the full VRAM, at full clocks, not shared, not throttled, not spoofed. Covers why nvidia-smi self-report is not evidence, the timing-bound computational challenge (proof-of-GPU), memory-capacity and bandwidth proofs, throttle/contention detection, and the trust boundary versus confidential computing (which protects the workload) and health gating (which assumes you own the node).

What it is

Remote GPU verification is the adversarial counterpart to health gating. Health gating asks "is my GPU healthy?" on hardware you control and trust. Remote verification asks "is this GPU what the seller claims?" on hardware controlled by someone whose incentive may be to overstate it: bill an H100 price for an A100, expose 80 GB while 24 GB is already taken by another tenant, or sell the same physical GPU to several renters at once.

The core technique is a timing-bound computational challenge, often called proof-of-GPU. The verifier issues a random seed; the remote node must run a deterministic, compute-heavy kernel derived from that seed and return both the result and the wall-clock time. Because a node-reported time is itself a self-report, the verifier should also time the round trip on its own clock. The construction has three properties that together make cheating detectable:

Deterministic per hardware: the same seed on the same GPU model produces the same result, so a wrong answer means the work was not done as specified.
Sized to the claimed memory: the challenge allocates matrices scaled to the advertised VRAM (e.g. k resident N×N fp64 matrices with N ≈ sqrt(VRAM_bytes / (8·k)), where k = 3 covers A, B and their product), so a GPU with less memory than claimed cannot even hold the problem.
Timing-bound: a genuine GPU of the claimed class returns within a tight latency window; a weaker GPU, a CPU emulation, a shared/contended GPU, or a remote relay to different hardware is too slow and fails the bound.

A common kernel choice is a large GEMM (matrix multiply): it is the operation GPUs are built for, it saturates the tensor/FP units, and its runtime per GPU class is well characterised. The product is optionally passed through a nonlinear transform and a reduction so the result is a compact, checkable digest. The verifier knows the expected result and the expected time for each GPU model and checks both.
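Before any calibration exists, the expected runtime can be roughed out from the operation count: a dense N×N GEMM costs about 2N³ floating-point operations, so the time is that count divided by sustained throughput. The sketch below is illustrative only. The function names, the throughput figures, and the 25% slack are hypothetical placeholders, not vendor numbers; measured baselines should replace them.

```python
def gemm_flops(n: int) -> int:
    # Dense N x N times N x N: N^3 multiply-adds, counted as 2 * N^3 FLOPs.
    return 2 * n ** 3

def expected_seconds(n: int, sustained_flops: float) -> float:
    # Ignores the O(N^2) nonlinear transform and reduction, which are small next to the GEMM.
    return gemm_flops(n) / sustained_flops

def timing_bound(n: int, sustained_flops: float, slack: float = 0.25) -> float:
    # Upper limit the verifier would accept for the claimed class.
    return expected_seconds(n, sustained_flops) * (1.0 + slack)

# HYPOTHETICAL sustained fp64 rates, chosen only to show the gap between classes.
claimed_rate = 40e12
weaker_rate = 10e12
n = 59_825  # three fp64 matrices sized to 80 GiB

bound = timing_bound(n, claimed_rate)
assert expected_seconds(n, claimed_rate) <= bound   # claimed class fits the bound
assert expected_seconds(n, weaker_rate) > bound     # a 4x slower class does not
```

Because the cost grows as N³ while memory grows as N², sizing the problem to the claimed VRAM also stretches the runtime gap between GPU classes to seconds, which is much larger than typical network jitter.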

Why use it

On infrastructure you do not own, the seller controls every byte of telemetry you read. nvidia-smi, /proc, DCGM output, even a driver version string can be forged on a compromised or dishonest host. They are self-reports, not proofs. The economic incentive to lie is direct: GPU price differences across models are large, and a decentralized or neocloud marketplace that pays for capacity invites exactly this fraud. Verification turns "trust the dashboard" into "trust a measurement the seller cannot fake without actually owning the hardware."

The failure mode it prevents is paying for capacity you never receive:

Model spoofing: billing a top-tier GPU while running a cheaper one.
Memory overcommit: advertising full VRAM that is partly consumed by another tenant, so your job OOMs at a size that should fit.
Silent sharing / oversubscription: the same GPU sold to multiple renters; each sees full specs but contends for the SMs, and the timing challenge exposes the contention as latency.
Throttling: a GPU pinned to low clocks (power/thermal cap) that benchmarks far below its class.

This is the trust primitive under a build-vs-rent decision and under any overlay-stitched, multi-provider pool: admit a node to the pool only after it proves its hardware, then keep health-gating it like any other.

When to use it (and when not)

Use remote verification when:

You rent GPUs from parties you cannot physically audit (neoclouds, spot/marketplace capacity, decentralized pools) and pay by capability or by the hour.
You are building the marketplace/validator and must score or admit provider nodes honestly before settling payment.
A pool stitches nodes from several providers (overlay & mesh networking) and must gate admission on proven hardware.

Do not reach for it when:

You own and physically control the fleet. The seller and the buyer are the same party; there is no incentive to spoof. Use GPU health gating and DCGM diagnostics, which assume a trusted node and check health, not honesty.
The threat is protecting your weights/data from the host, not validating the host's hardware. That is confidential computing and device attestation: orthogonal, and complementary. CC attestation proves firmware identity cryptographically, while a timing challenge proves delivered capability. A strong marketplace uses both, attestation for identity and a benchmark for performance.
You only need a one-off sanity check you can run interactively. A quick microbenchmark (below) is then enough without a full challenge protocol.

Architecture

The protocol is a challenge-response loop between a trusted verifier and an untrusted remote node, scored against trusted reference numbers the verifier measured beforehand on hardware it controls.

Verifier (trusted): generates a fresh random seed per round, holds the expected digest and timing envelope per GPU model, and renders the accept/reject verdict.
Trusted baselines (per-model digest and timing): the calibration store. Without measured baselines there is nothing to compare against.
Remote node (untrusted): receives the seed and the claimed spec, runs the seeded challenge on its GPU, and returns the digest and elapsed time. It cannot forge a passing answer without actually owning the claimed hardware.
GPU under test: executes the seeded GEMM sized to the claimed VRAM. Insufficient memory fails allocation; insufficient speed fails the timing bound.
Verdict: accept only if the digest matches AND the elapsed time is inside the model bound. Either check alone is insufficient.

flowchart LR
  B["Trusted baselines<br/>(per-model digest + timing)"] --> V["Verifier"]
  V -->|"random seed + claimed spec"| N["Remote node (untrusted)"]
  N -->|"run seeded GEMM challenge<br/>sized to claimed VRAM"| GPU["GPU under test"]
  GPU -->|"result digest + elapsed time"| N
  N -->|"digest, t_elapsed"| V
  V --> C{"digest correct AND<br/>t_elapsed within model bound?"}
  C -->|"no"| FAIL["Reject: wrong/weaker/shared/spoofed GPU"]
  C -->|"yes"| PASS["Accept: hardware matches claim"]
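One round of the loop in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not a protocol spec: `send_challenge` stands in for whatever transport reaches the remote node, `expected_digest` stands in for a trusted recomputation (with a fresh seed per round the verifier cannot look up a stored answer, so it has to derive one on hardware it trusts), and `Baseline` is a hypothetical container for the calibrated bound. The elapsed time is taken on the verifier's clock, so it includes the network round trip.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Baseline:
    time_bound_s: float  # measured on trusted hardware, tolerance already included

def verify_round(
    send_challenge: Callable[[int, int], str],   # (seed, n) -> digest from the untrusted node
    expected_digest: Callable[[int, int], str],  # (seed, n) -> digest from a trusted reference
    baseline: Baseline,
    n: int,
) -> bool:
    seed = secrets.randbits(63)        # fresh and unpredictable every round
    t0 = time.monotonic()              # verifier-side clock, not the node's report
    digest = send_challenge(seed, n)
    elapsed = time.monotonic() - t0
    # Accept only if BOTH checks pass.
    return digest == expected_digest(seed, n) and elapsed <= baseline.time_bound_s
```

Timing on the verifier side means the bound has to absorb network latency, which is one reason to size the challenge so that compute time dominates the round trip.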

How to use it

Reference techniques below. Confirm tool flags and expected per-model numbers against current NVIDIA docs and your own measured baselines before trusting any threshold. The whole method rests on having trusted reference numbers per GPU model; measure them on hardware you trust first (see "How to maintain it" below).

Never trust self-report, measure it

Treat nvidia-smi and driver strings as claims to be checked, not evidence. The verifier's job is to produce a number the host cannot forge without the real GPU.

The timing-bound compute challenge (the core)

Issue a seed, demand a deterministic GEMM-based result sized to the claimed VRAM, and check both correctness and elapsed time against the expected envelope for the claimed model.

The block below is a reference template (needs CUDA + PyTorch, not executed here); a runnable numpy validation of the same core math follows it.

# Reference shape of a proof-of-GPU challenge: illustrative, not a protocol spec.
# Requires CUDA + PyTorch. The numpy block below validates the core math instead.
import time, torch

def gpu_challenge(seed: int, claimed_vram_bytes: int) -> tuple[str, float]:
    g = torch.Generator(device="cuda").manual_seed(seed)
    # Size the problem to the CLAIMED memory: a GPU with less cannot hold it.
    n = int((claimed_vram_bytes / 8 / 4) ** 0.5)        # fp64; each matrix ~1/4 of VRAM: A, B, C resident = ~3/4, rest is headroom
    a = torch.randn(n, n, generator=g, device="cuda", dtype=torch.float64)
    b = torch.randn(n, n, generator=g, device="cuda", dtype=torch.float64)
    torch.cuda.synchronize(); t0 = time.perf_counter()
    c = (a @ b).tanh_()                                 # compute-heavy; in-place tanh avoids a fourth N x N temporary
    digest = c.sum().item()                             # compact, checkable result
    torch.cuda.synchronize()
    return f"{digest:.6f}", time.perf_counter() - t0

# Verifier accepts iff digest == expected(seed, model) AND elapsed <= bound(model).

The same core math, validated with numpy only (deterministic seeded GEMM, VRAM sizing, digest-plus-timing verdict). The block runs under a stock python3 with numpy and asserts its results, including adversarial cases: a memory overclaim, a wrong digest, and a correct answer that arrives too late. It sizes for three resident matrices with no headroom, whereas the torch template above budgets a quarter of VRAM per matrix.

# numpy-only validation of the proof-of-GPU CORE MATH (no CUDA, no torch).
# Mirrors the torch reference: seeded GEMM -> nonlinear -> reduction -> digest,
# sized to CLAIMED VRAM, verified on BOTH digest correctness AND a timing bound.
import time
import numpy as np

GB = 1024 ** 3

def challenge_size(claimed_vram_bytes: int, dtype_bytes: int = 8, matrices: int = 3) -> int:
    # Largest N so that `matrices` NxN arrays at dtype_bytes each fit CLAIMED VRAM.
    return int((claimed_vram_bytes / dtype_bytes / matrices) ** 0.5)

def run_challenge(seed: int, n: int) -> float:
    # Deterministic per (seed, n): one seed on one GPU model yields one digest.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    c = np.tanh(a @ b)            # compute-heavy, GPU-shaped kernel
    return float(c.sum())        # compact, checkable digest

def verify(digest: float, elapsed: float, expected_digest: float,
           time_bound: float, tol: float = 1e-6) -> bool:
    # BOTH must hold: right answer AND inside the model's timing envelope.
    return abs(digest - expected_digest) <= tol and elapsed <= time_bound

# 1. Sizing scales to the CLAIMED memory; a smaller node cannot hold the problem.
n_claimed = challenge_size(80 * GB)
n_smaller = challenge_size(24 * GB)
assert n_claimed > n_smaller
honest_bytes = 3 * n_claimed ** 2 * 8
assert honest_bytes > 24 * GB          # a 24GB node billed as 80GB OOMs -> caught
assert honest_bytes <= 80 * GB         # fits the honest 80GB node

# 2. Determinism: same seed + same size => bit-identical digest (repeatable per HW).
assert run_challenge(1234, 64) == run_challenge(1234, 64)

# 3. Fresh seed each round => different digest (defeats precompute/replay).
assert run_challenge(1234, 64) != run_challenge(5678, 64)

# 4. Equivalence to a slow reference: the digest is fixed by the math, so a path that skips or approximates the work cannot match it.
rng = np.random.default_rng(42)
a = rng.standard_normal((16, 16))
b = rng.standard_normal((16, 16))
slow = 0.0
for i in range(16):
    for j in range(16):
        acc = 0.0
        for k in range(16):
            acc += a[i, k] * b[k, j]
        slow += np.tanh(acc)
fast = float(np.tanh(a @ b).sum())
assert abs(fast - slow) < 1e-9

# 5. Verifier needs BOTH the digest and the timing envelope.
t0 = time.perf_counter()
digest = run_challenge(2026, 128)
elapsed = time.perf_counter() - t0
assert elapsed > 0.0
assert verify(digest, elapsed, expected_digest=digest, time_bound=60.0)             # honest + fast
assert not verify(digest, elapsed=120.0, expected_digest=digest, time_bound=60.0)   # too slow: weaker/shared
assert not verify(digest + 1.0, elapsed=0.01, expected_digest=digest, time_bound=60.0)  # wrong digest: relay/spoof

print("proof-of-GPU core math validated:", n_claimed, "claimed-N,", round(digest, 3), "digest")

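A weaker GPU is too slow; a GPU with less VRAM than claimed fails on allocation; a shared GPU's elapsed time inflates under a co-tenant's load; a relay to different hardware adds round-trip latency and fails outright if it returns a wrong or approximated digest. A relay to a genuinely faster GPU can still pass both checks, which is why the challenge is paired with attestation. Randomise the seed every round so results cannot be precomputed or cached.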

Prove the memory is real and free

Capacity and availability are separate claims; verify both. Allocate close to the advertised VRAM and confirm it succeeds (catches overcommit), and exercise it to confirm it is not already occupied.

# Self-reported capacity: a CLAIM, to be checked against an allocation that must succeed.
nvidia-smi --query-gpu=name,memory.total,memory.used,clocks.sm,clocks.max.sm,power.draw --format=csv
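The allocate-and-exercise step can be sketched as a chunked fill-and-readback probe. The version below is a host-memory stand-in using numpy so the logic can be run anywhere; `probe_capacity` and its `xp` parameter are hypothetical names, and passing a NumPy-compatible GPU array module as `xp` is an assumption rather than something shown here. Every chunk is written with a seeded pattern and only checked once all chunks are resident, so memory that was never really backed shows up as a failed allocation or a wrong checksum.

```python
import numpy as np

def probe_capacity(target_bytes: int, chunk_bytes: int, seed: int, xp=np) -> bool:
    """Allocate ~target_bytes in chunks, write a seeded pattern, then read it all back."""
    words = chunk_bytes // 8                       # uint64 elements per chunk
    chunks = []
    try:
        for i in range(target_bytes // chunk_bytes):
            buf = xp.arange(words, dtype=xp.uint64)
            buf += xp.uint64(seed + i)             # chunk-specific, seed-dependent contents
            chunks.append(buf)                     # keep everything resident at once
    except MemoryError:
        return False                               # claimed capacity is not actually available
    for i, buf in enumerate(chunks):
        # Closed-form sum of (seed + i) + 0, 1, ..., words - 1, modulo 2^64.
        expected = (words * (seed + i) + words * (words - 1) // 2) % 2 ** 64
        if int(buf.sum(dtype=xp.uint64)) != expected:
            return False                           # contents lost or aliased
    return True

# Small host-side run: 1 MiB in 64 KiB chunks.
assert probe_capacity(1 << 20, 1 << 16, seed=7)
```

On a real node the target would be close to the advertised VRAM minus runtime overhead, and the probe would run before the timed challenge so an overcommitted GPU is rejected cheaply.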

Probe sustained bandwidth and clocks

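Memory bandwidth is a strong per-model fingerprint and hard to fake. NVIDIA's nvbandwidth measures host↔device and device↔device copy bandwidth. Host↔device figures are bounded by the PCIe or NVLink link rather than by HBM, so compare them against the claimed interconnect and compare on-device bandwidth against the model's memory spec; a result far below either signals a weaker or contended GPU. Read clocks.sm vs clocks.max.sm and watch for throttle reasons: a GPU pinned low will underperform its class regardless of what the name string says.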

Run the challenge repeatedly and watch the variance: a dedicated GPU returns stable times; an oversubscribed one shows latency spikes as co-tenants compete for SMs. Sustained-load runs also surface thermal/power throttling that a single quick probe misses.
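Repeated-run variance can be turned into a simple contention signal. The sketch below assumes the verifier already has a list of per-round elapsed times; the function names and both thresholds (`max_cv`, `max_spike`) are hypothetical and would need tuning against a dedicated GPU of the same class.

```python
import statistics

def contention_signals(times_s: list[float]) -> dict[str, float]:
    med = statistics.median(times_s)
    return {
        "median_s": med,
        # Coefficient of variation: spread relative to the mean.
        "cv": statistics.pstdev(times_s) / statistics.fmean(times_s),
        # Worst round relative to the typical round: catches isolated spikes.
        "worst_over_median": max(times_s) / med,
    }

def looks_contended(times_s: list[float], max_cv: float = 0.05,
                    max_spike: float = 1.25) -> bool:
    s = contention_signals(times_s)
    return s["cv"] > max_cv or s["worst_over_median"] > max_spike

steady = [1.00, 1.01, 0.99, 1.00, 1.02]   # dedicated GPU: tight cluster
spiky = [1.0, 1.0, 1.9, 1.0, 1.6]         # co-tenant competing for SMs
assert not looks_contended(steady)
assert looks_contended(spiky)
```

A slow upward trend across a sustained run, rather than isolated spikes, points at thermal or power throttling instead of sharing; the same list of times can be inspected for both.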

How to integrate it

Bind capability (the benchmark) to identity (cryptographic firmware attestation) so a passing challenge is tied to a specific attested device, not just some fast GPU somewhere. Pair the timing challenge with attestation where available; neither alone is sufficient against a determined adversary. Attestation proves which firmware, the challenge proves how fast: a relay that forwards the challenge to a genuinely faster GPU passes the timing check but fails attestation, and a device that attests but is throttled or shared passes attestation but fails the timing check.
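The admission rule this implies can be written as a small policy check. The sketch is a hypothetical shape only: `AttestationResult` and `ChallengeResult` are illustrative records, not an NVIDIA or marketplace API, and how the challenge response comes to carry a trustworthy device identity (for example, by being produced inside the attested environment) is left open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttestationResult:   # HYPOTHETICAL record of a verified attestation report
    verified: bool
    device_id: str
    nonce: str

@dataclass(frozen=True)
class ChallengeResult:     # HYPOTHETICAL record of one timing challenge
    passed: bool           # digest correct AND inside the timing bound
    device_id: str
    nonce: str

def admit(att: AttestationResult, ch: ChallengeResult, issued_nonce: str) -> bool:
    # Both proofs must refer to the nonce the verifier issued for this round...
    same_round = att.nonce == ch.nonce == issued_nonce
    # ...and to the same device, so a fast GPU elsewhere cannot stand in.
    same_device = att.device_id == ch.device_id
    return att.verified and ch.passed and same_round and same_device

ok_att = AttestationResult(True, "gpu-A", "n1")
assert admit(ok_att, ChallengeResult(True, "gpu-A", "n1"), "n1")
assert not admit(ok_att, ChallengeResult(True, "gpu-B", "n1"), "n1")   # relay to another device
assert not admit(ok_att, ChallengeResult(False, "gpu-A", "n1"), "n1")  # attested but throttled/shared
```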

Wire the verdict into the systems that already gate work: admission control for a multi-provider pool, the settlement path of a marketplace or validator, and the failure stream in observability & monitoring.

How to run it in production

Gate marketplace/pool admission on a passing challenge, then re-issue challenges periodically. Capability can change after admission (a co-tenant arrives, clocks get capped), so a one-time check at onboarding is not enough. Randomise the seed every round so results cannot be precomputed, cached, or replayed. Wire failures into observability & monitoring and treat a newly-failing node like any health-gate failure: drain it and stop settling payment. Set tolerance bands wide enough to avoid false rejects on legitimately busy nodes, yet tight enough to catch a class downgrade.

How to maintain it

The method rests on trusted reference numbers, so calibration is the maintenance burden. Measure per-model result and timing envelopes on hardware you trust, and re-measure them when they drift: driver and CUDA upgrades move kernel timings, and a new GPU class needs a fresh baseline before you can admit it. Keep the tolerance bands under review as the fleet and its baselines change, and validate any threshold against current NVIDIA docs rather than assuming last year's numbers still hold.
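Calibration and drift detection reduce to a few lines of arithmetic. The sketch below assumes a list of challenge timings measured on trusted hardware of one model; the function names, the 10% slack, and the 5% drift threshold are hypothetical defaults, not recommendations.

```python
import statistics

def timing_envelope(baseline_times_s: list[float], slack: float = 0.10) -> float:
    # Accept anything up to the slowest trusted run plus a tolerance band.
    return max(baseline_times_s) * (1.0 + slack)

def needs_recalibration(old_times_s: list[float], new_times_s: list[float],
                        max_drift: float = 0.05) -> bool:
    # Compare medians before and after a driver/CUDA change on the SAME trusted GPU.
    old_med = statistics.median(old_times_s)
    new_med = statistics.median(new_times_s)
    return abs(new_med - old_med) / old_med > max_drift

before = [10.0, 10.2, 9.9]
after_upgrade = [11.0, 11.1, 10.9]
assert abs(timing_envelope(before) - 11.22) < 1e-9
assert needs_recalibration(before, after_upgrade)      # ~10% shift: re-measure the envelope
assert not needs_recalibration(before, [10.1, 10.0, 10.2])
```

Storing the driver and CUDA versions alongside each baseline makes it clear which envelope applies to which node.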

How to scale it

Continuous re-verification has a cost: every challenge burns GPU seconds you could otherwise bill or use, so trade cadence against the fraud it prevents. Re-verify hot, high-value, or newly-suspect nodes more often and sample the rest. Run admission and periodic re-checks as background load across a multi-provider pool, and let observability & monitoring surface which nodes to re-challenge next. Cryptographic attestation, where the fleet supports it, is cheap to re-check and lets you re-run the expensive timing challenge less often on already-attested identities.
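The cadence trade-off can be made concrete with a small calculation. The sketch below is one possible policy under stated assumptions: the function names, the linear risk scaling, and the `attested_stretch` factor are all hypothetical, and `risk` is assumed to be a score in [0, 1] supplied by monitoring.

```python
def overhead_fraction(challenge_s: float, interval_s: float) -> float:
    # Share of GPU time spent on verification instead of billable work.
    return challenge_s / interval_s

def interval_for_budget(challenge_s: float, max_overhead: float) -> float:
    # Shortest re-check interval that stays inside an overhead budget.
    return challenge_s / max_overhead

def next_interval(base_interval_s: float, risk: float, attested: bool,
                  attested_stretch: float = 4.0) -> float:
    # Higher risk shortens the interval (down to 10% of base at risk = 1.0);
    # a cheaply re-checkable attested identity stretches it.
    interval = base_interval_s * (1.0 - 0.9 * risk)
    return interval * attested_stretch if attested else interval

# A 15 s challenge once an hour costs well under 1% of GPU time.
assert overhead_fraction(15.0, 3600.0) < 0.01
# Suspect nodes are re-challenged sooner than quiet ones; attested ones later.
assert next_interval(3600.0, 1.0, False) < next_interval(3600.0, 0.0, False)
assert next_interval(3600.0, 0.0, False) < next_interval(3600.0, 0.0, True)
```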

Failure modes

Trusting nvidia-smi/driver strings as proof. They are self-reported and forgeable on a dishonest host. Only an unforgeable measurement counts.
No timing bound. Checking the result but not the elapsed time lets a slow or emulated path pass; the time is half the proof.
Static/predictable seed. A fixed challenge can be precomputed and replayed; randomise per round.
No trusted reference numbers. Without measured per-model baselines for result and timing, there is nothing to compare against, so calibrate on trusted hardware first.
Single quick probe. Misses sharing and thermal throttling that only appear under sustained or repeated load.
Conflating identity with capability. Cryptographic attestation proves which firmware, not how fast; a timing challenge proves delivered performance, not identity. Use both for a strong guarantee.

Open questions & validation

Per-model result/timing envelopes (and their drift across driver/CUDA versions) for every GPU class you admit, measured rather than assumed.
Tolerance bands wide enough to avoid false rejects on legitimately busy nodes yet tight enough to catch a class downgrade.
Whether a challenge that resists a determined adversary (precompute, relay to a faster GPU, partial-VRAM tricks) needs unpredictable problem structure, not just an unpredictable seed.
Cost/cadence of continuous re-verification vs. the fraud it prevents.
Binding capability proofs to cryptographic attestation so identity and performance are verified together.

References

NVIDIA cuBLAS (GEMM reference; basis for compute challenges and per-model timing): https://docs.nvidia.com/cuda/cublas/
NVIDIA nvbandwidth (host↔device / device↔device bandwidth measurement): https://github.com/NVIDIA/nvbandwidth
NVIDIA DCGM (diagnostics, including dcgmi diag capability/health levels): https://docs.nvidia.com/datacenter/dcgm/latest/
gpu-burn (sustained-load FP stress to surface throttling/instability): https://github.com/wilicc/gpu-burn
nvidia-smi query reference (the self-reported fields a verifier must not trust blindly): https://developer.nvidia.com/nvidia-system-management-interface
NVIDIA Confidential Computing / attestation (cryptographic identity, complementary to capability proofs): https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/