GPU Acceleration

Simon Frost

Overview

Starsim.jl provides GPU acceleration through three package extensions, one each for Apple Metal, NVIDIA CUDA, and AMD ROCm. Disease dynamics (transmission, recovery, state transitions) run on the GPU, while structurally dynamic operations (network rewiring, demographics) remain on the CPU.

Architecture

The GPU extensions use a hybrid CPU/GPU approach:

GPU (backend vector type, e.g. MtlVector on Metal; Float32/UInt8):

  • Agent state arrays: susceptible, infected, recovered, exposed
  • Disease timing arrays: ti_infected, ti_recovered, ti_exposed
  • Transmission kernel (per-edge probability evaluation)
  • Recovery kernel (ti_recovered ≤ current_ti check)
  • SEIR exposure-to-infection transition
  • GPU-side result counting (sum() on MtlVector)

CPU (Vector, Float64/Bool):

  • Network edge lists (rebuilt each timestep for dynamic networks)
  • Recovery duration sampling (lognormal distribution)
  • People management (births, deaths, UID tracking)
  • SIS immunity waning (requires per-agent float arithmetic)

The GPU loop follows the same 16-step integration order as the CPU loop, so both paths apply the same sequence of disease updates.
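To make the per-edge probability evaluation concrete, here is a minimal CPU-only sketch in plain Julia. It is not the Starsim.jl kernel: the function name `transmit_sketch!`, the two-pass structure, and the use of `beta` as a direct per-edge probability are assumptions made for illustration. The arrays use the same UInt8 flag and Float32 time layout described above.

```julia
using Random

# Hypothetical helper, not the Starsim.jl kernel: evaluates each directed
# edge src[e] -> dst[e] once and infects the target with probability beta.
function transmit_sketch!(susceptible::Vector{UInt8}, infected::Vector{UInt8},
                          ti_infected::Vector{Float32},
                          src::Vector{Int}, dst::Vector{Int},
                          beta::Float32, current_ti::Float32, rng::AbstractRNG)
    newly = zeros(UInt8, length(infected))
    # Pass 1: one independent test per edge (a GPU kernel would map one thread to each edge).
    for e in eachindex(src, dst)
        s, d = src[e], dst[e]
        if infected[s] == 0x01 && susceptible[d] == 0x01 && rand(rng, Float32) < beta
            newly[d] = 0x01
        end
    end
    # Pass 2: apply the flags, so agents infected in this step cannot transmit within it.
    for i in eachindex(newly)
        if newly[i] == 0x01
            susceptible[i] = 0x00
            infected[i] = 0x01
            ti_infected[i] = current_ti
        end
    end
    return count(==(0x01), newly)
end

# Four agents; agent 1 starts infected. beta = 1 makes the outcome deterministic.
susceptible = UInt8[0, 1, 1, 1]
infected    = UInt8[1, 0, 0, 0]
ti_infected = Float32[0, Inf, Inf, Inf]
src = [1, 1, 2]
dst = [2, 3, 4]
n_new = transmit_sketch!(susceptible, infected, ti_infected, src, dst,
                         1.0f0, 1.0f0, MersenneTwister(1))
```

Agent 4 stays susceptible in this example because its only contact, agent 2, was not yet infected when the edges were evaluated.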

Supported backends

| Backend | Extension | GPU type | Package | Platform |
|---|---|---|---|---|
| Metal | StarsimMetalExt | MtlVector | Metal.jl | Apple Silicon (macOS) |
| CUDA | StarsimCUDAExt | CuVector | CUDA.jl | NVIDIA GPUs |
| ROCm | StarsimAMDGPUExt | ROCVector | AMDGPU.jl | AMD GPUs |

Extensions load automatically when the corresponding GPU package is imported alongside Starsim:

using Starsim, Metal   # Apple Silicon
using Starsim, CUDA    # NVIDIA
using Starsim, AMDGPU  # AMD

All three backends share the same API (run_gpu!, to_gpu, to_cpu, etc.) and implement identical algorithms. All use Float32/UInt8 on GPU for maximum performance.
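One common way to keep an algorithm identical across array backends in Julia is to write it as broadcasts over `AbstractVector` arguments, since broadcasting works on `Vector` as well as on GPU array types. The sketch below applies that idea to the recovery check (`ti_recovered ≤ current_ti`). Whether the Starsim.jl extensions are written this way is not stated here, and `recover_step!` is a hypothetical name, not part of the package API.

```julia
# Illustrative only: a backend-agnostic recovery step written with broadcasts.
function recover_step!(infected::AbstractVector{UInt8},
                       recovered::AbstractVector{UInt8},
                       ti_recovered::AbstractVector{Float32},
                       current_ti::Float32)
    # Agents that are infected and whose scheduled recovery time has arrived.
    done = (infected .== 0x01) .& (ti_recovered .<= current_ti)
    recovered .= ifelse.(done, 0x01, recovered)
    infected  .= ifelse.(done, 0x00, infected)
    return count(done)  # number of agents that recovered this step
end

# Plain CPU vectors here; the same Float32/UInt8 element types as the GPU state.
infected     = UInt8[1, 1, 0]
recovered    = UInt8[0, 0, 0]
ti_recovered = Float32[2, 5, Inf]
n_recovered  = recover_step!(infected, recovered, ti_recovered, 3.0f0)
```

Because the function only relies on broadcasting and `count`, the same call should also accept device vectors once a GPU package is loaded, though that is not exercised in this CPU-only example.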

Supported diseases

| Disease | GPU support | Notes |
|---|---|---|
| SIR | Yes | Full support including lognormal recovery |
| SIS | Yes | Immunity waning handled via CPU roundtrip |
| SEIR | Yes | E→I transition with recovery time on GPU |

Usage

using Starsim, Metal

# Simple — use run_gpu! directly
sim = Sim(
    n_agents = 100_000,
    diseases = [SIR(beta=0.05, dur_inf=10.0, init_prev=0.01)],
    networks = [RandomNet(n_contacts=10)],
    stop = 50.0, dt = 1.0,
)
Starsim.run_gpu!(sim; verbose=1, backend=:metal)

# Static network mode (edges generated once, cached on GPU)
sim2 = Sim(
    n_agents = 1_000_000,
    diseases = [SIR(beta=0.05, dur_inf=10.0, init_prev=0.01)],
    networks = [RandomNet(n_contacts=10)],
    stop = 50.0, dt = 1.0,
)
Starsim.run_gpu!(sim2; verbose=1, backend=:metal, cache_edges=true)

Performance

Benchmarks on Apple M2 Ultra (Metal GPU, 76 GPU cores):

Dynamic edges (default — edges regenerated each step):

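| Agents | CPU (M a-ts/s) | GPU (M a-ts/s) | Speedup |
|---|---|---|---|
| 1K | 12 | 0.2 | 0.02x |
| 10K | 12 | 1.6 | 0.13x |
| 100K | 12 | 4.2 | 0.36x |
| 500K | 9 | 5.1 | 0.55x |
| 1M | 9 | 5.2 | 0.60x |
| 5M | 6 | 4.8 | 0.79x |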

Cached edges (static network — single upload):

| Agents | CPU (M a-ts/s) | GPU (M a-ts/s) | Speedup |
|---|---|---|---|
| 10K | 12 | 2.0 | 0.16x |
| 100K | 11 | 6.7 | 0.59x |
| 500K | 9 | 8.0 | 0.91x |
| 1M | 9 | 8.2 | 0.94x |
| 5M | 6 | 7.8 | 1.28x |

With cached edges, the GPU overtakes the CPU somewhere between 1M agents (0.94x) and 5M agents (1.28x), the range where GPU parallelism starts to outweigh kernel launch overhead. With dynamic edges, the GPU stays slower than the CPU at every size tested. Julia's CPU code is highly optimized on Apple Silicon (native SIMD), so GPU acceleration has less impact than on platforms with weaker CPUs or discrete GPUs. The GPU path is most useful for:

  • Very large simulations (1M+ agents) with static networks
  • CUDA.jl / AMDGPU.jl on discrete GPUs (where dedicated VRAM and higher memory bandwidth should improve the crossover point)
  • Demonstrating the GPU-ready architecture
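The speedup column is the ratio of GPU to CPU throughput. As a quick check, the snippet below recomputes it from the cached-edges benchmark figures and locates the first tested size at which the GPU is faster. The throughput values are copied from the benchmark; the CPU figures are rounded to whole numbers there, so the recomputed ratios can differ from the published column in the second decimal place.

```julia
# Cached-edges throughput from the benchmark, in M a-ts/s.
agents = ["10K", "100K", "500K", "1M", "5M"]
cpu    = [12.0, 11.0, 9.0, 9.0, 6.0]
gpu    = [2.0, 6.7, 8.0, 8.2, 7.8]

# Speedup is GPU throughput divided by CPU throughput.
speedup = gpu ./ cpu

# Index of the first tested size where the GPU beats the CPU (speedup > 1).
first_faster = findfirst(>(1.0), speedup)

for (a, s) in zip(agents, speedup)
    println(rpad(a, 6), round(s; digits=2), "x")
end
println("GPU first faster at: ", agents[first_faster])
```

Only five sizes were measured, so this identifies the first tested size above 1x rather than the exact crossover point.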

Correctness: GPU vs CPU mean trajectory correlation r > 0.999 for all three disease types (SIR, SIS, SEIR) over 30 seeds.
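A mean-trajectory correlation of this kind can be computed by averaging each path's runs over seeds and taking the Pearson correlation of the two mean curves. The sketch below shows the calculation with the `Statistics` standard library. The trajectories are synthetic bell-shaped curves generated in the snippet, not Starsim.jl output, and the matrix layout (rows are timesteps, columns are seeds) is an assumption.

```julia
using Statistics

# Synthetic stand-ins for per-seed prevalence trajectories; NOT simulation results.
t = 0:50
curve(shift) = @. exp(-((t - 20 - shift)^2) / 100)

# 51 timesteps x 30 seeds for each path; the "GPU" curves are shifted slightly.
cpu_runs = reduce(hcat, [curve(0.1 * s) for s in 1:30])
gpu_runs = reduce(hcat, [curve(0.1 * s + 0.05) for s in 1:30])

# Average over seeds (columns), then correlate the two mean trajectories.
cpu_mean = vec(mean(cpu_runs; dims=2))
gpu_mean = vec(mean(gpu_runs; dims=2))
r = cor(cpu_mean, gpu_mean)
```

A high r shows that the two mean curves have the same shape; it does not by itself show that individual runs match seed for seed.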

API reference

| Function | Description |
|---|---|
| to_gpu(sim; backend=:auto) | Convert initialized Sim to GPUSim |
| to_cpu(gsim) | Copy GPU state back to CPU Sim |
| run_gpu!(sim; backend=:auto) | Full GPU simulation lifecycle |
| gpu_step_state!(gsim, :sir; current_ti=ti) | Recovery transitions on GPU |
| gpu_transmit!(gsim, :sir; current_ti=ti) | Transmission with edge upload |
| cache_edges!(gsim) | Upload edges once for static networks |
| gpu_transmit_cached!(gsim, :sir; current_ti=ti) | Transmission with cached edges |
| sync_to_gpu!(gsim) | Re-upload CPU state after CPU-side modifications |

Summary

Starsim.jl provides GPU acceleration for SIR, SIS, and SEIR disease models across three platforms: Apple Silicon (Metal.jl), NVIDIA (CUDA.jl), and AMD (AMDGPU.jl). All backends share the same API and implement identical algorithms. Metal benchmarks on an Apple M2 Ultra show the GPU overtaking the CPU only with cached edges, reaching 1.28x at 5M agents; discrete NVIDIA/AMD GPUs with dedicated VRAM are expected to show larger speedups at smaller agent counts. GPU and CPU mean trajectories agree closely (r > 0.999 over 30 seeds).