Legate Sparse: distributed scipy.sparse on GPU clusters¶
Scope: Legate Sparse (Yadav et al., SC'23), the system that runs unmodified scipy.sparse programs distributed across clusters of CPUs and GPUs, and how it composes with cuNumeric (now cuPyNumeric) so one Python program mixes distributed sparse and dense array computation. This page covers the Legion runtime foundation (regions, partitions, tasks, mappers), the region-based sparse matrix representation and image partitioning, constraint-based parallelization, DISTAL-generated kernels, and the paper's measured results against SciPy, CuPy, and PETSc. It sits with the CUDA libraries and framework pages; the kernel-authoring alternatives are Triton and CUTLASS.
Claims and numbers are from the SC'23 paper (Summit, V100s, CUDA 11.0.2) and describe the published prototype; the projects have moved since (cuNumeric is cuPyNumeric, and the cuPyNumeric repo now advertises SciPy coverage), so verify current API state on the nv-legate repos (as of 2026-07). The
legate_sparsesnippet is a reference template, unexecuted. The numpy example is executed and asserted.
flowchart TB
PY["Unmodified Python: scipy.sparse + NumPy idioms"] --> LS["Legate Sparse (sparse ops)"]
PY --> CN["cuPyNumeric (dense ops)"]
LS --> CONS["Constraint-based task launches<br/>equals / image constraints on regions"]
CN --> CONS
CONS --> SOLVE["Runtime constraint solver:<br/>reuse existing partitions, repartition least data"]
SOLVE --> LEGION["Legion: dependence analysis, comms insertion"]
LEGION --> CPU["CPU tasks (OpenMP, DISTAL-generated)"]
LEGION --> GPU["GPU tasks (CUDA + cuSPARSE)"]
What it is¶
Legate Sparse is a distributed, accelerated drop-in replacement for the scipy.sparse module. Programs keep their idiomatic SciPy and NumPy form (the paper's running example is a power-iteration eigenvalue estimate written in 20 lines) and run unchanged from one CPU socket to 192 GPUs; the import block falls back to stock SciPy/NumPy when Legate is absent.1 It is built, like cuNumeric, as a dynamic translation onto the Legion runtime: arrays become Legion regions, operations become tasks with declared read/write/reduce access, and Legion extracts parallelism from the sequential task stream, inserting the communication that preserves sequential semantics. A separate mapper assigns tasks to processors and regions to memories, so machine-specific decisions stay out of the application.
The prototype supports the COO, CSR, CSC, and DIA formats and implements 176 of the estimated 492 functions in the scipy.sparse API (35%): 14 generated by the DISTAL sparse tensor algebra compiler, 156 ported from SciPy/CuPy implementations by expressing them over cuNumeric and existing Legate Sparse kernels, and 6 hand-written. Ported functionality includes the CG, CGS, BiCG, BiCGSTAB, and GMRES iterative solvers, Runge-Kutta integration, and eigensolvers.4
Why use it¶
- Zero-modification distribution. The alternatives require expertise the target user lacks: SciPy/CuPy plus MPI/NCCL or Dask/Ray means manual partitioning and communication; PETSc and Trilinos expose lower-level explicitly parallel APIs where the user reasons about distribution. Legate Sparse keeps the SciPy API and makes distribution implicit.
- Competitive with hand-tuned systems. On a conjugate-gradient solve of a 2-D Poisson problem, Legate Sparse reaches 85% of PETSc's performance on one GPU and 65% at 192 GPUs of Summit; the abstract-level claim is 65% of PETSc on up to 1280 CPU cores and 192 GPUs, for a drop-in SciPy replacement.5
- Sparse and dense compose. Legate Sparse and cuNumeric are implemented without knowledge of each other, yet share partitions and allocations at runtime; a matrix can be assembled from cuNumeric arrays, and the arrays backing a sparse matrix can be extracted and operated on.
- Beyond single-GPU memory. In the sparse matrix factorization benchmark (MovieLens with SDDMM), CuPy runs out of memory past the 25M-rating dataset; Legate Sparse scales the same Python to the 50M and 100M datasets with 6 and 12 GPUs, no code changes.6
When to use it (and when not)¶
- Use it for SciPy-sparse-shaped scientific and data workloads (iterative solvers, integrations, spectral methods, sparse ML) that have outgrown one node or one GPU but do not justify a PETSc port.
- Use it when a workload mixes dense and sparse phases; the composition with cuPyNumeric is the differentiator over single-library approaches.
- Do not expect the full API. The prototype covered 35% of
scipy.sparse; LIL and DOK assembly formats are explicitly out of scope (shared-memory assembly formats), and BSR was future work in the paper. Check current coverage before porting. - Not for small, latency-bound problems. Legate's task-launch and metadata overheads dominate when kernels are tiny: CuPy was 30% faster on the single-GPU multi-grid solver and 40% faster on the quantum simulation, and 2.8x faster on the smallest ML dataset, all attributed to many small task launches.5
- Maximum per-node performance still belongs to hand-tuned code. If the last 35% versus PETSc matters more than productivity, stay with PETSc/Trilinos.
Architecture¶
Regions and image partitioning. Sparse matrices are stored as packs of Legion regions rather than per-rank local matrices (the PETSc/Trilinos choice). CSR keeps three regions: pos, storing for each row a range tuple over the coordinate region (a small variation on the standard indptr array), plus crd and vals. The payoff of the tuple form is that Legion's dependent-partitioning image operation applies directly: partition pos row-wise, and the image of pos gives the matching partition of crd and vals; the image of crd into a dense vector x yields exactly the (possibly aliased) pieces of x each rank needs for SpMV, which is how scatter/gather communication patterns are expressed at a high level. The cost of the global-regions choice is that local pieces are not valid standalone sparse matrices, so a small reshaping penalty is paid before calling cuSPARSE.2
Constraint-based parallelization. Tasks do not name concrete partitions; they declare constraints: the row-split SpMV declares equals(y, pos), image(pos, crd), image(pos, vals), and image(crd, x). A runtime solver picks concrete partitions satisfying all constraints, preferring existing partitions and, among solutions, the one that repartitions the least data. This is what makes two mutually unaware libraries performance-composable: partitions created by cuNumeric are reused by Legate Sparse and vice versa, and new operations need not enumerate the partitioning strategies of every other operation.3
Composable mapping. Both libraries share mapper infrastructure: a per-node store of region allocations is queried before new allocations, and an allocation-coalescing heuristic merges overlapping sub-region views into one larger allocation. In the power-iteration example this reaches a steady state where each iteration moves only a one-element halo of x between GPUs; without coalescing, a full vector copy would repeat every iteration.3
Kernel generation. The 14 DISTAL-generated functions are the performance-critical tensor algebra (SpMV, SpMM, SDDMM); their generated code is 46% of all C++/CUDA in the system (2,854 of 6,135 lines) and 12% of the Python (697 of 5,748). GPU variants were hand-modified to call cuSPARSE where applicable, which the authors flag as the most error-prone step.4
How to use it¶
Reference template (unexecuted; the paper's Figure 1 pattern, current package names on the nv-legate repos, where the sparse module imports as legate_sparse):
# Reference template: falls back to SciPy/NumPy when Legate is not installed.
try:
import cupynumeric as np # cuNumeric was renamed cuPyNumeric
import legate_sparse as sp # legate-sparse module; verify current import path
except ImportError:
import numpy as np
import scipy.sparse as sp
A = sp.random(n, n, format="csr")
A = 0.5 * (A + A.T) + n * sp.eye(n) # positive semi-definite
x = np.random.rand(A.shape[0])
for _ in range(iters): # power iteration, unchanged SciPy idiom
x = A @ x
x /= np.linalg.norm(x)
result = np.dot(x.T, A @ x)
Launch is through the Legate driver, which owns process placement across nodes and GPUs (see the Legate docs for legate launcher flags); the same script runs on a laptop CPU and a multi-node Slurm allocation.
How to develop with it¶
The mechanics the runtime distributes are ordinary CSR mechanics, and they are worth validating independently when debugging numerical differences between backends. This block is executed and asserted: COO to CSR conversion sums duplicate coordinates, CSR SpMV matches a dense reference at machine precision, and the row-block distribution the paper uses (an aligned tiling of y and pos, with images onto crd, vals, and x) recombines exactly across 1, 3, and 4 simulated ranks; adversarial cases cover a guaranteed-empty row, 50 duplicate coordinates, an all-zero matrix, and a rejected dimension mismatch:
# sparse_mechanics.py - validated: COO to CSR conversion with duplicate summing,
# CSR SpMV against a dense reference, and the row-partitioned distribution Legate
# Sparse uses, recombined exactly. Mechanics validation in numpy; does not run Legate.
import numpy as np
def coo_to_csr(rows: np.ndarray, cols: np.ndarray, vals: np.ndarray,
shape: tuple[int, int]) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Sort by (row, col), sum duplicate coordinates, build indptr (pos)."""
assert rows.shape == cols.shape == vals.shape
assert rows.size == 0 or (rows.max() < shape[0] and cols.max() < shape[1])
order = np.lexsort((cols, rows))
r, c, v = rows[order], cols[order], vals[order]
first = np.ones(r.size, dtype=bool)
first[1:] = (r[1:] != r[:-1]) | (c[1:] != c[:-1])
group = np.cumsum(first) - 1
uv = np.zeros(int(first.sum()), dtype=vals.dtype)
np.add.at(uv, group, v)
ur, uc = r[first], c[first]
pos = np.zeros(shape[0] + 1, dtype=np.int64)
np.add.at(pos, ur + 1, 1)
return np.cumsum(pos), uc, uv
def spmv(pos: np.ndarray, crd: np.ndarray, vals: np.ndarray,
x: np.ndarray, n_rows: int) -> np.ndarray:
"""CSR SpMV, the row loop of the DISTAL-generated task in the paper."""
assert crd.size == 0 or crd.max() < x.size, "column index out of range for x"
y = np.zeros(n_rows, dtype=np.float64)
for i in range(n_rows):
lo, hi = pos[i], pos[i + 1]
y[i] = vals[lo:hi] @ x[crd[lo:hi]]
return y
def row_partitioned_spmv(pos: np.ndarray, crd: np.ndarray, vals: np.ndarray,
x: np.ndarray, n_rows: int, ranks: int) -> np.ndarray:
"""Row-block distribution: each rank owns a block of rows (an aligned tiling of
y and pos) plus the image of its crd block into x; results concatenate."""
bounds = np.linspace(0, n_rows, ranks + 1, dtype=np.int64)
parts = []
for r in range(ranks):
lo, hi = bounds[r], bounds[r + 1]
lpos = pos[lo:hi + 1] - pos[lo] # rebased local pos region
lcrd = crd[pos[lo]:pos[hi]] # image(pos, crd) for this block
lvals = vals[pos[lo]:pos[hi]] # image(pos, vals)
parts.append(spmv(lpos, lcrd, lvals, x, int(hi - lo)))
return np.concatenate(parts)
rng = np.random.default_rng(0)
n, m, nnz = 96, 80, 600
rows = rng.integers(0, n, nnz)
cols = rng.integers(0, m, nnz)
vals = rng.standard_normal(nnz)
# Adversarial: force duplicate coordinates and empty rows.
rows[:50] = rows[0]; cols[:50] = cols[0] # 50 duplicates of one coordinate
rows = np.where(rows == 7, 8, rows) # row 7 is guaranteed empty
pos, crd, uvals = coo_to_csr(rows, cols, vals, (n, m))
dense = np.zeros((n, m)); np.add.at(dense, (rows, cols), vals)
x = rng.standard_normal(m)
assert pos[7] == pos[8], "empty row must have an empty pos range"
assert np.isclose(dense[rows[0], cols[0]], vals[:50].sum() + sum(
v for r_, c_, v in zip(rows[50:], cols[50:], vals[50:])
if r_ == rows[0] and c_ == cols[0])), "duplicates must sum"
y = spmv(pos, crd, uvals, x, n)
assert np.allclose(y, dense @ x, atol=1e-12), np.abs(y - dense @ x).max()
for ranks in (1, 3, 4):
yd = row_partitioned_spmv(pos, crd, uvals, x, n, ranks)
assert np.allclose(yd, dense @ x, atol=1e-12), f"rank split {ranks} diverges"
zpos, zcrd, zvals = coo_to_csr(np.array([], dtype=np.int64),
np.array([], dtype=np.int64),
np.array([], dtype=np.float64), (4, 4))
assert np.array_equal(spmv(zpos, zcrd, zvals, np.ones(4), 4), np.zeros(4))
try:
spmv(pos, crd, uvals, np.ones(m - 1), n) # x too short for crd's image
raise SystemExit("dimension mismatch must be rejected")
except AssertionError:
pass
print("nnz after dedup:", uvals.size, "of", nnz)
print("serial == dense reference; 1/3/4-rank row splits == reference")
print("all sparse-mechanics assertions passed")
Output: nnz after dedup: 532 of 600, serial == dense reference; 1/3/4-rank row splits == reference, all sparse-mechanics assertions passed. When extending the library itself, the paper's development pattern is the guide: express new functions over existing cuNumeric and Legate Sparse operations first (156 of 176 functions were obtained this way, with no distributed programming), reach for the DISTAL compiler for performance-critical tensor algebra, and hand-write only sorts and auxiliary operations that neither path covers.
How to maintain it¶
- Track the rename and the merge of scope. cuNumeric is now cuPyNumeric (
nv-legate/cupynumeric, which as of 2026-07 advertises "NumPy and SciPy on Multi-Node Multi-GPU systems"), andnv-legate/legate-sparseremains a separate active repo; where sparse coverage lands over time may shift between the two. Re-check import paths and coverage tables on upgrade. - Pin the stack as a unit. Legate Sparse, cuPyNumeric, the Legate core, Legion, and the CUDA/cuSPARSE toolchain version together; the paper's builds pin GCC 9.3.0, CUDA 11.0.2, and GASNet 2022.9.0 for interconnect transport. Mixed versions surface as mapper or launch failures, not clean errors.
- Watch small-task overheads across releases. The known performance gaps versus CuPy (GMG, quantum simulation, small ML datasets) are Legate/Legion overheads on small tasks; the paper points to tracing and task fusion as fixes, and follow-up work on task and kernel fusion exists. Re-benchmark small-kernel workloads after upgrades rather than assuming the SC'23 ratios.
- Format conversions are part of the cost model. Unsupported format/operation combinations trigger conversions that can dominate runtime; log formats at library boundaries when profiling regressions.
Running it in production¶
- Right-size by memory first. The distribution model's clearest production win is capacity: workloads that OOM a single GPU scale by adding GPUs with zero code change (25M ratings on 2 GPUs, 100M on 12 in the paper's factorization benchmark). Treat GPU count as the memory knob before the throughput knob.6
- Expect communication-bound regimes. The quantum simulation weak-scales poorly past one node because its matrices induce near all-to-all exchanges of tens to hundreds of megabytes per processor pair, and its GPU version fell below CPU throughput at 16 GPUs (fewer nodes means less aggregate NIC bandwidth for the same traffic). Profile communication structure before promising GPU speedups on high-bandwidth sparsity patterns; the fabric budget is part of the application design.
- Plan headroom for the runtime. Legion reserves CPU cores for runtime work and GPU memory for itself and external CUDA libraries; the paper attributes both a CPU weak-scaling gap versus PETSc and a single-GPU capacity gap versus CuPy to these reservations. Do not size nodes to 100% utilization.
- Benchmark discipline. The paper's protocol (12 runs, drop fastest and slowest, average 10) is a reasonable default for iterative-solver benchmarks on shared clusters.
Failure modes¶
- Small tasks expose runtime overhead. Many tiny operations (multi-grid V-cycles, per-batch ML steps) shift the bottleneck from kernels to task launch and metadata management; consolidate steps or batch work per launch.
- OOM from halo growth. Aliased image partitions replicate the referenced pieces of dense vectors; wide-bandwidth matrices make halos grow with scale, and the paper's 64-GPU quantum run OOMed for exactly this reason. Wider matrices need more memory per processor at scale, not less.
- Silent CPU fallback. In a heterogeneous run, one operation without a GPU implementation forces data movement to the CPU and back; the paper names this as a first-order composition hazard. Confirm GPU coverage for the whole hot loop, not just the headline ops.
- Assuming PETSc parity. 65% of PETSc at scale is the measured contract for the CG solve; workloads dominated by dot-product-style all-reduces hit Legion collective overheads earlier than PETSc does.
- Treating partitioned pieces as matrices. Local pieces of the global regions are not valid standalone CSR matrices; interfacing external libraries requires the reshaping step the system performs internally, and custom tasks must do the same.
References¶
- Yadav, Lee, Elibol, Patti, Papadakis, Garland, Aiken, Kjolstad, Bauer, Legate Sparse: Distributed Sparse Computing in Python, SC'23 (DOI 10.1145/3581784.3607033): https://d1qx31qr3h6wln.cloudfront.net/publications/legate-sparse2023.pdf
- Legate Sparse repository (nv-legate): https://github.com/nv-legate/legate-sparse
- Legate core runtime repository: https://github.com/nv-legate/legate
- cuPyNumeric repository (formerly cuNumeric): https://github.com/nv-legate/cupynumeric
- Legate documentation (NVIDIA): https://docs.nvidia.com/legate/latest/
- Legion runtime system: https://legion.stanford.edu/
- Yadav, Aiken, Kjolstad, DISTAL: The Distributed Tensor Algebra Compiler (arXiv 2203.08069): https://arxiv.org/abs/2203.08069
- Yadav et al., Composing Distributed Computations Through Task and Kernel Fusion (follow-up addressing runtime overheads, arXiv 2406.18109): https://arxiv.org/abs/2406.18109
- scipy.sparse API reference: https://docs.scipy.org/doc/scipy/reference/sparse.html
Related: CUDA libraries · Frameworks (PyTorch/JAX/TensorRT) · Triton · CUTLASS · PyTorch custom CUDA extensions · Slurm · Distributed training · NCCL collectives
-
Legate Sparse (SC'23): transparent distribution of unmodified SciPy Sparse programs composing with cuNumeric; Legion regions/partitions/tasks/mappers as the substrate; static decisions (kernel specialization per format and processor) separated from dynamic ones (partition selection at runtime). ↩
-
Section 3: COO/CSR/CSC/DIA as packs of regions;
posstores per-row range tuples so Legion's image operation co-partitionspos/crd/valsand computes the referenced pieces of dense operands; global-region choice trades a cuSPARSE reshaping penalty for composability and direct Legion alignment. ↩ -
Sections 4.1-4.3: constraints for row-split SpMV are equals(y, pos), image(pos, crd), image(pos, vals), image(crd, x); the solver reuses existing partitions and minimizes repartitioned data; shared per-node allocation store plus coalescing heuristic; steady-state power iteration moves a one-element halo per iteration. ↩↩
-
Section 5: 176/492 functions (35%); 14 DISTAL-generated (2,854/6,135 C++/CUDA LOC, 697/5,748 Python LOC), 156 ported, 6 hand-written; CG/CGS/BiCG/BiCGSTAB/GMRES, Runge-Kutta, eigensolvers ported; BSR future work; LIL/DOK out of scope; hand-modifying generated CUDA to call cuSPARSE was the most error-prone step. ↩↩
-
Section 6 (Summit: dual-socket 40-core Power9, 6 V100 per node, NVLink 2.0, EDR InfiniBand): SpMV microbenchmark weak-scales perfectly; CG on 2-D Poisson reaches 85% of PETSc at 1 GPU and 65% at 192 GPUs (drop past 32 nodes traced to Legion all-reduce overheads); GMG (300 lines Python) trails CuPy by 30% on one GPU; quantum simulation (Rydberg arrays, 8th-order Runge-Kutta, exact 50-qubit wavefunctions) is communication-bound at scale and OOMs at 64 GPUs from halo growth. ↩↩
-
Section 6.2 (MovieLens factorization with bias, mini-batch SGD, DISTAL-generated SDDMM): CuPy 2.8x faster on 10M (197,156 vs 69,648 samples/s) but OOM beyond 25M; Legate Sparse 55,288 vs CuPy 28,590 samples/s on 25M, and runs 50M/100M on 6/12 GPUs; within 99.7% of SOTA prediction on 10M. ↩↩