
Benchmarking AlphaGenome on NVIDIA GPUs: latency, memory, and feasibility across sequence lengths


Overview

AlphaGenome is a 450-million-parameter foundation model for genome biology released by Google DeepMind. It processes DNA sequences up to 1 Mb and predicts thousands of genomic tracks, from chromatin accessibility to 3D contact maps, at base-pair resolution. The official implementation is in JAX; a community PyTorch port makes the model accessible to the broader PyTorch ecosystem. We are contributors to that PyTorch port.

This post is a community reference: if you want to run AlphaGenome and are trying to figure out what fits on the GPU you have and how long it will take, this is where we share the numbers. We profile both the official JAX path (alphagenome-jax) and the community PyTorch port (alphagenome-pytorch, running in bf16) across seven NVIDIA GPUs under three workloads: inference, heads-only finetuning, and full-weights finetuning, each on real genomic data.

For background, see two companion posts: Fine-tuning AlphaGenome in native JAX/Haiku (the alphagenome-ft community package) and Porting AlphaGenome to PyTorch (the PyTorch port benchmarked here). DeepMind released the weights and JAX research code (alphagenome_research); the PyTorch port, alphagenome-ft, and this benchmarking effort are all community work. Our JAX finetuning runs are built on alphagenome_research directly, not through alphagenome-ft: alphagenome-ft wraps the same research code and should land on similar performance, but it is a separate package and not what these numbers measure.

Side note: what is “latency”? Throughout this post, latency means the wall-clock time for a single compiled model step on the GPU (one forward pass for inference, one forward + backward + optimizer step for finetuning). All latency numbers are reported in milliseconds (ms). Peak memory is the maximum GPU memory the step ever held, reported in gigabytes (GB).

Motivation

Researchers adapting AlphaGenome to their own tasks face a concrete planning question: “my lab has GPU X, what sequence lengths can I run, how long will each iteration take, and how much memory do I need to budget?” That question is hard to answer without running the model yourself, and the GPU landscape in academic clusters is diverse. Not every group has access to H200s or even A100s.

We ran controlled, reproducible benchmarks across the GPU tiers most commonly found in academic and cloud environments to make that question easier to answer. Both the official JAX implementation and the community PyTorch port are profiled so that the results are useful regardless of which path a lab is already using. Every number in this post comes from published, tracked benchmark runs that can be reproduced from our profiling repository.

Benchmark Setup

Models

  • alphagenome-jax: the official DeepMind JAX/Haiku implementation, compiled with @jax.jit. Finetuning uses code adapted from DeepMind’s alphagenome_research research-code release (the same backbone alphagenome-ft is built on), wrapped in a minimal custom training loop for these benchmarks
  • alphagenome-pytorch: the community PyTorch port from the Kundaje Lab, compiled with torch.compile and bf16 autocast; finetuning runs are a straightforward training loop built directly on that implementation
  • Borzoi is also benchmarked at its supported sequence lengths (262 kb and 524 kb) using borzoi-pytorch (pretrained weights johahi/borzoi-replicate-0), a third-party community PyTorch reimplementation by Johannes Hingerl (Gagneur lab, TU Munich) rather than the original Calico TensorFlow release, so that the Borzoi and AlphaGenome-PyTorch numbers share the same framework stack; see “Also published: Borzoi” below

GPUs

We test on seven NVIDIA GPUs spanning Turing, Ampere, Ada Lovelace, and Hopper architectures:

| GPU | Memory | Architecture | Compute capability |
|---|---|---|---|
| H200 | 141 GB | Hopper | 9.0 |
| H100 | 80 GB | Hopper | 9.0 |
| A100 | 80 GB | Ampere | 8.0 |
| L40S | 48 GB | Ada Lovelace | 8.9 |
| L40 | 48 GB | Ada Lovelace | 8.9 |
| A40 | 48 GB | Ampere | 8.6 |
| RTX 6000 | 24 GB | Turing | 7.5 |

Most benchmarks run on the University of Washington Hyak cluster; the H100 runs were collected on Stanford’s Marlowe cluster. Full hardware and software details are recorded in the profiling repo’s platform JSON descriptors.

Methodology

  • Batch size: 1
  • Warmup: 3 iterations, discarded
  • Timed: 10 iterations, median latency reported
  • Sequence lengths: 4 kb, 8 kb, 16 kb, 32 kb, 65 kb, 131 kb, 262 kb, 524 kb, and 1 Mb
  • Inference input: random one-hot DNA
  • Finetuning input: real GM12878 matched-ATAC data, using a single ATAC-seq track, in both heads-only and full-weights modes

Both implementations include full prediction heads in the timed path so that the numbers reflect end-to-end model cost. The exact alignment between the two is documented in our apples-to-apples audit.
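In code, this protocol is a small warmup-then-median harness. The sketch below is framework-agnostic and illustrative, not our actual harness: `step` stands in for one compiled model step, and real GPU timing additionally needs a device synchronization (e.g. `torch.cuda.synchronize()` or `jax.block_until_ready(...)`) before each clock read so that asynchronous kernel launches are fully counted.

```python
import time
from statistics import median

def benchmark(step, warmup=3, iters=10):
    """Median wall-clock latency of `step`, in milliseconds.

    `step` is a zero-argument callable wrapping one compiled model
    step (a forward pass, or a forward + backward + optimizer update).
    """
    for _ in range(warmup):            # warmup: absorbs compilation
        step()                         # and cache effects, discarded
    times_ms = []
    for _ in range(iters):             # timed iterations
        t0 = time.perf_counter()
        step()                         # on a GPU, synchronize here too
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return median(times_ms)            # median is robust to jitter
```

Every latency in the tables below follows the same 3-warmup / 10-timed / median protocol.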

Inference

Inference is the simplest workload: a single compiled forward pass through the full model including prediction heads, with no gradient computation.

Latency

At 1 Mb, inference latency spans roughly an order of magnitude across the tested GPUs, from about 200 ms on H200 to about 1.3-1.5 s on A40, with RTX 6000 unable to fit the full 1 Mb context. Scaling with sequence length is close to linear above about 131 kb: doubling the context roughly doubles latency. The two implementations produce similar numbers on most GPUs, with the JAX path running a bit faster on Hopper-class hardware; at shorter contexts (≤ 65 kb) the two paths are close to interchangeable.

| GPU | Framework | 131 kb | 262 kb | 524 kb | 1 Mb |
|---|---|---|---|---|---|
| H200 141 GB | alphagenome-jax | 26.8 | 46.1 | 91.2 | 197.6 |
| H200 141 GB | alphagenome-pytorch | 33.9 | 62.1 | 126.1 | 260.0 |
| H100 80 GB | alphagenome-jax | 30.2 | 51.0 | 99.0 | 217.2 |
| H100 80 GB | alphagenome-pytorch | 37.1 | 68.4 | 136.9 | 286.7 |
| A100 80 GB | alphagenome-jax | 83.3 | 152.8 | 297.3 | 681.1 |
| A100 80 GB | alphagenome-pytorch | 87.0 | 168.0 | 336.9 | 709.7 |
| L40S 48 GB | alphagenome-jax | 89.6 | 169.7 | 369.3 | 868.2 |
| L40S 48 GB | alphagenome-pytorch | 104.9 | 209.9 | 432.7 | 922.2 |
| L40 48 GB | alphagenome-jax | 113.2 | 212.5 | 453.2 | 1040.1 |
| L40 48 GB | alphagenome-pytorch | 127.6 | 258.9 | 531.6 | 1151.3 |
| A40 48 GB | alphagenome-jax | 158.9 | 298.7 | 601.7 | 1326.7 |
| A40 48 GB | alphagenome-pytorch | 179.7 | 339.9 | 704.0 | 1493.4 |
| RTX 6000 24 GB | alphagenome-pytorch | 1153.3 | 2298.6 | 4742.9 | — |

Values are median latency in milliseconds (ms); lower is better. Em-dash (—) marks configurations that ran out of memory.

The RTX 6000 is a special case: JAX bf16 is not supported on Turing (compute capability 7.5), so only the PyTorch path runs there. It is also the only tested card that cannot reach 1 Mb inference.
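Because scaling is close to linear above ~131 kb, latency at an unmeasured context can be extrapolated from any measured anchor. This is a planning rule of thumb, not a measurement; against the table above it typically lands within ~10-15%, since real scaling is mildly super-linear at the longest contexts.

```python
def estimate_latency_ms(anchor_len_bp, anchor_ms, target_len_bp):
    """Rule-of-thumb latency estimate assuming linear scaling in
    sequence length (valid roughly above 131 kb)."""
    return anchor_ms * (target_len_bp / anchor_len_bp)

# Example: A100 + JAX measured 152.8 ms at 262 kb (262,144 bp).
# The linear estimate for 1 Mb (1,048,576 bp) is ~611 ms; the
# measured value is 681 ms, so the estimate is ~10% optimistic.
est = estimate_latency_ms(262_144, 152.8, 1_048_576)
```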

Peak GPU memory

Inference memory is dominated by activations and scales close to linearly with sequence length above about 131 kb: doubling the context roughly doubles peak memory. At 1 Mb, inference fits in about 35-41 GB depending on the implementation, which is what makes 1 Mb inference feasible on every 48 GB-and-up card we tested. Below 131 kb, the PyTorch path has the smaller footprint; above 262 kb, the JAX path runs a bit leaner.

| GPU | Framework | 131 kb | 262 kb | 524 kb | 1 Mb |
|---|---|---|---|---|---|
| H200 141 GB | alphagenome-jax | 7.4 | 10.4 | 18.2 | 34.6 |
| H200 141 GB | alphagenome-pytorch | 6.6 | 11.5 | 21.2 | 40.8 |
| H100 80 GB | alphagenome-jax | 7.4 | 10.4 | 18.2 | 34.6 |
| H100 80 GB | alphagenome-pytorch | 6.6 | 11.5 | 21.2 | 40.8 |
| A100 80 GB | alphagenome-jax | 7.4 | 10.4 | 18.9 | 36.1 |
| A100 80 GB | alphagenome-pytorch | 6.6 | 11.4 | 21.2 | 40.7 |
| L40S 48 GB | alphagenome-jax | 8.1 | 11.5 | 18.9 | 35.8 |
| L40S 48 GB | alphagenome-pytorch | 6.6 | 11.4 | 21.2 | 40.7 |
| L40 48 GB | alphagenome-jax | 8.1 | 11.5 | 18.9 | 35.8 |
| L40 48 GB | alphagenome-pytorch | 6.6 | 11.4 | 21.2 | 40.7 |
| A40 48 GB | alphagenome-jax | 7.4 | 10.4 | 18.9 | 35.8 |
| A40 48 GB | alphagenome-pytorch | 6.6 | 11.4 | 21.2 | 40.7 |
| RTX 6000 24 GB | alphagenome-pytorch | 4.8 | 7.3 | 12.5 | — |

Values are peak GPU memory in gigabytes (GB); lower is better. Em-dash (—) marks configurations that ran out of memory.

Peak memory at a given sequence length is nearly identical across GPUs for the same implementation, which means the memory column of this table is a reasonable estimate for untested GPUs too.
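That observation makes the PyTorch column a usable feasibility check for cards we did not test. A sketch, with two assumptions of ours baked in: the kb labels denote power-of-two lengths (131 kb = 131,072 bp, and so on), and 2 GB of headroom covers the CUDA context and allocator fragmentation.

```python
# Worst-case peak inference memory (GB) from the alphagenome-pytorch
# rows in the table above, keyed by sequence length in base pairs.
PEAK_INFER_GB = {
    131_072: 6.6,
    262_144: 11.5,
    524_288: 21.2,
    1_048_576: 40.8,
}

def fits_inference(gpu_memory_gb, seq_len_bp, headroom_gb=2.0):
    """Rough check: does a bf16 inference pass at this length fit?"""
    return PEAK_INFER_GB[seq_len_bp] + headroom_gb <= gpu_memory_gb
```

Consistent with the tables: `fits_inference(48, 1_048_576)` is true for the 48 GB cards, while `fits_inference(24, 1_048_576)` is false, matching the RTX 6000's 524 kb ceiling.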

Figure: cross-GPU inference and finetuning scaling.

Heads-Only Finetuning

Heads-only finetuning freezes the AlphaGenome backbone and trains only the task-specific prediction head. This is the most memory-efficient way to adapt the model and is practical on every GPU we tested, up to the full 1 Mb context.

All finetuning benchmarks use a single GM12878 ATAC-seq track as the training target, a realistic setting for labs working with a specific assay in a specific cell type.
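Mechanically, heads-only mode is just parameter freezing: backbone parameters get `requires_grad = False`, and the optimizer sees only the head's parameters. A minimal PyTorch sketch with stand-in modules (`backbone` and `head` are tiny illustrative layers here, not the port's actual module names):

```python
import torch
from torch import nn

# Stand-in modules; in practice these would be the AlphaGenome
# backbone and a task-specific prediction head.
backbone = nn.Sequential(nn.Linear(4, 64), nn.ReLU())
head = nn.Linear(64, 1)

for p in backbone.parameters():      # freeze: no grads, no optimizer state
    p.requires_grad = False

optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

x, target = torch.randn(1, 4), torch.randn(1, 1)
pred = head(backbone(x))             # backbone runs forward-only
loss = nn.functional.mse_loss(pred, target)
loss.backward()                      # gradients flow only into the head
optimizer.step()
optimizer.zero_grad()
```

Freezing the backbone is what keeps both gradient and optimizer-state memory off the table in the peak-memory numbers below.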

Latency

Heads-only iterations are a little cheaper than a full inference pass: the backbone runs only forward, and only the prediction-head gradients are computed. At 1 Mb, expect roughly 160-205 ms on H200, 180-224 ms on H100, about 560 ms on A100, and 700-1200 ms on the 48 GB cards; the RTX 6000, which cannot reach 1 Mb, takes about 4.8 s at its 524 kb ceiling. The two implementations sit within a narrow band on most GPUs.

| GPU | Framework | 131 kb | 262 kb | 524 kb | 1 Mb |
|---|---|---|---|---|---|
| H200 141 GB | alphagenome-jax | 18.0 | 34.4 | 72.5 | 160.2 |
| H200 141 GB | alphagenome-pytorch | 26.9 | 48.4 | 94.8 | 204.0 |
| H100 80 GB | alphagenome-jax | 20.5 | 37.1 | 78.2 | 180.5 |
| H100 80 GB | alphagenome-pytorch | 29.7 | 53.1 | 105.1 | 223.5 |
| A100 80 GB | alphagenome-jax | 63.0 | 125.2 | 253.0 | 560.9 |
| A100 80 GB | alphagenome-pytorch | 67.2 | 127.9 | 253.2 | 565.6 |
| L40S 48 GB | alphagenome-jax | 69.1 | 139.9 | 309.6 | 760.5 |
| L40S 48 GB | alphagenome-pytorch | 75.1 | 149.6 | 312.3 | 684.1 |
| L40 48 GB | alphagenome-jax | 86.7 | 175.0 | 379.5 | 899.8 |
| L40 48 GB | alphagenome-pytorch | 91.5 | 183.8 | 382.9 | 839.0 |
| A40 48 GB | alphagenome-jax | 126.5 | 243.6 | 514.8 | 1155.2 |
| A40 48 GB | alphagenome-pytorch | 133.1 | 253.2 | 512.0 | 1167.9 |
| RTX 6000 24 GB | alphagenome-pytorch | 1138.4 | 2317.3 | 4789.5 | — |

Values are median per-step latency in milliseconds (ms), measured as a single forward + backward + optimizer step on the prediction head; lower is better. Em-dash (—) marks configurations that ran out of memory.

Peak GPU memory

Heads-only is by far the lightest finetuning mode. 1 Mb peaks at about 14-27 GB depending on implementation and GPU, so every card we tested, down to the 24 GB RTX 6000 at 524 kb, has headroom for this workload.

| GPU | Framework | 131 kb | 262 kb | 524 kb | 1 Mb |
|---|---|---|---|---|---|
| H200 141 GB | alphagenome-jax | 4.7 | 6.3 | 10.2 | 17.7 |
| H200 141 GB | alphagenome-pytorch | 3.2 | 4.8 | 8.0 | 14.5 |
| H100 80 GB | alphagenome-jax | 4.7 | 6.0 | 9.6 | 16.7 |
| H100 80 GB | alphagenome-pytorch | 3.2 | 4.8 | 8.0 | 14.5 |
| A100 80 GB | alphagenome-jax | 5.5 | 8.5 | 14.5 | 26.6 |
| A100 80 GB | alphagenome-pytorch | 3.1 | 4.8 | 8.0 | 14.5 |
| L40S 48 GB | alphagenome-jax | 4.9 | 6.0 | 9.6 | 16.7 |
| L40S 48 GB | alphagenome-pytorch | 3.1 | 4.8 | 8.0 | 14.5 |
| L40 48 GB | alphagenome-jax | 4.9 | 6.0 | 9.6 | 16.7 |
| L40 48 GB | alphagenome-pytorch | 3.1 | 4.8 | 8.0 | 14.5 |
| A40 48 GB | alphagenome-jax | 4.9 | 7.1 | 11.8 | 20.8 |
| A40 48 GB | alphagenome-pytorch | 3.1 | 4.8 | 8.0 | 14.5 |
| RTX 6000 24 GB | alphagenome-pytorch | 4.1 | 6.7 | 11.8 | — |

Values are peak GPU memory in gigabytes (GB) during a heads-only training step. Em-dash (—) marks configurations that ran out of memory.

Full-Weights Finetuning

Full-weights finetuning updates all 450 million parameters. This gives the model more capacity to adapt but comes at a steep memory cost: gradients and optimizer states for the entire backbone must fit in GPU memory.
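The sequence-length-independent part of that cost can be budgeted with back-of-the-envelope arithmetic. The byte counts below are one plausible mixed-precision layout (bf16 weights, fp32 gradients, fp32 Adam moments); they are our assumption for illustration, not either implementation's documented allocator behavior.

```python
def static_training_gb(n_params,
                       weight_bytes=2,   # bf16 weights (assumed)
                       grad_bytes=4,     # fp32 gradients (assumed)
                       optim_bytes=8):   # Adam: fp32 m and v (assumed)
    """Sequence-length-independent training memory, in GB (1e9 bytes)."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

# ~450M parameters -> ~6.3 GB before any activations, which is in the
# right ballpark for the full-weights tables starting around 13 GB at
# 131 kb: roughly half parameter state, half activations.
static_gb = static_training_gb(450e6)
```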

Latency

Full-weights iterations are the most expensive workload. The missing entries in the table are as important as the numbers because they mark where each (GPU, implementation) pair runs out of memory.

| GPU | Framework | 131 kb | 262 kb | 524 kb | 1 Mb |
|---|---|---|---|---|---|
| H200 141 GB | alphagenome-jax | 62.3 | 110.4 | 229.4 | 527.8 |
| H200 141 GB | alphagenome-pytorch | 86.9 | 145.2 | 286.6 | 587.0 |
| H100 80 GB | alphagenome-jax | 66.9 | 119.7 | 241.5 | 578.6 |
| H100 80 GB | alphagenome-pytorch | 98.2 | 169.8 | 337.9 | — |
| A100 80 GB | alphagenome-jax | 213.2 | 418.8 | 852.5 | — |
| A100 80 GB | alphagenome-pytorch | 210.4 | 378.3 | 773.7 | — |
| L40S 48 GB | alphagenome-jax | 220.5 | 444.9 | — | — |
| L40S 48 GB | alphagenome-pytorch | 291.7 | 543.9 | — | — |
| L40 48 GB | alphagenome-jax | 272.6 | 573.4 | — | — |
| L40 48 GB | alphagenome-pytorch | 342.5 | 644.8 | — | — |
| A40 48 GB | alphagenome-jax | 404.4 | 793.8 | — | — |
| A40 48 GB | alphagenome-pytorch | 467.6 | 853.2 | — | — |

Values are median per-step latency in milliseconds (ms) for a full forward + backward + optimizer step over all 450M parameters. Em-dash (—) marks (GPU, implementation) pairs that ran out of memory.

The practical feasibility picture is simple: 48 GB GPUs top out at 262 kb for full-weights finetuning, 80 GB cards reach 524 kb, and 1 Mb requires either H200 or an H100 using the JAX path.

Peak GPU memory

Full-weights peak memory roughly doubles with each doubling of sequence length above about 131 kb, mirroring the latency pattern. At 1 Mb, expect about 76-89 GB of peak memory.

| GPU | Framework | 131 kb | 262 kb | 524 kb | 1 Mb |
|---|---|---|---|---|---|
| H200 141 GB | alphagenome-jax | 12.4 | 19.6 | 38.4 | 86.5 |
| H200 141 GB | alphagenome-pytorch | 13.1 | 21.5 | 40.7 | 89.0 |
| H100 80 GB | alphagenome-jax | 12.4 | 19.7 | 39.0 | 76.1 |
| H100 80 GB | alphagenome-pytorch | 13.1 | 21.5 | 40.7 | — |
| A100 80 GB | alphagenome-jax | 13.1 | 20.8 | 40.6 | — |
| A100 80 GB | alphagenome-pytorch | 13.0 | 21.5 | 40.6 | — |
| L40S 48 GB | alphagenome-jax | 12.8 | 19.0 | — | — |
| L40S 48 GB | alphagenome-pytorch | 13.0 | 21.5 | — | — |
| L40 48 GB | alphagenome-jax | 12.8 | 20.2 | — | — |
| L40 48 GB | alphagenome-pytorch | 13.0 | 21.5 | — | — |
| A40 48 GB | alphagenome-jax | 13.1 | 18.9 | — | — |
| A40 48 GB | alphagenome-pytorch | 13.0 | 21.5 | — | — |

Values are peak GPU memory in gigabytes (GB) during a full-weights training step. Em-dash (—) marks (GPU, implementation) pairs that ran out of memory.

In short, full-weights finetuning of AlphaGenome at 1 Mb is a Hopper-class workload today, and even there the 80 GB tier is on the margin.

What Fits on Each GPU

| GPU | Inference | Heads-only finetune | Full-weights finetune |
|---|---|---|---|
| H200 141 GB | 1 Mb | 1 Mb | 1 Mb |
| H100 80 GB | 1 Mb | 1 Mb | 1 Mb (JAX only) / 524 kb (PyTorch) |
| A100 80 GB | 1 Mb | 1 Mb | 524 kb |
| L40S 48 GB | 1 Mb | 1 Mb | 262 kb |
| L40 48 GB | 1 Mb | 1 Mb | 262 kb |
| A40 48 GB | 1 Mb | 1 Mb | 262 kb |
| RTX 6000 24 GB | 524 kb (PyTorch only) | 524 kb (PyTorch only) | not tested |

A few practical notes:

  • Inference and heads-only finetuning are broadly accessible. Every card from 48 GB upward runs both at the full 1 Mb context; even the 24 GB RTX 6000 reaches 524 kb
  • Full-weights finetune is the memory bottleneck. If your task calls for full-weights updates and long contexts, plan for H200 access (or H100 with the JAX path) or stay at ≤ 524 kb
  • On RTX 6000 and other Turing-class cards, JAX bf16 is not supported, so only the PyTorch path runs
  • For most workflow decisions, ecosystem familiarity and training-infrastructure fit matter more than modest latency differences
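For scripting (e.g. picking a context length inside a cluster submission script), the feasibility table collapses into a lookup. Values are transcribed from the table above; the nested entry for H100 captures the framework split, and `None` marks the untested full-weights case.

```python
MB = 1_048_576  # "1 Mb" as a power-of-two base-pair count

# Maximum feasible context (bp) per GPU and workload, transcribed
# from the feasibility table above.
MAX_CONTEXT = {
    "H200":     {"infer": MB, "heads": MB, "full": MB},
    "H100":     {"infer": MB, "heads": MB,
                 "full": {"jax": MB, "pytorch": 524_288}},
    "A100":     {"infer": MB, "heads": MB, "full": 524_288},
    "L40S":     {"infer": MB, "heads": MB, "full": 262_144},
    "L40":      {"infer": MB, "heads": MB, "full": 262_144},
    "A40":      {"infer": MB, "heads": MB, "full": 262_144},
    "RTX 6000": {"infer": 524_288, "heads": 524_288, "full": None},
}
```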

Also published: Borzoi

The same benchmark harness also profiles Borzoi (via the third-party borzoi-pytorch community port by Johannes Hingerl) at its supported 262 kb and 524 kb contexts. Borzoi is a useful reference because it is a much smaller model (~135M parameters vs AlphaGenome’s 450M), so it helps calibrate how much of the AlphaGenome latency is “foundation-model size” rather than anything implementation-specific.

The figure below compares borzoi-pytorch inference against both AlphaGenome implementations across the seven GPU classes at Borzoi’s two supported sequence lengths. RTX 6000 is omitted from the plot because AG-JAX does not run on Turing and the AG-PyTorch fp32-fallback latency (~2-5 s) would compress every other bar into illegibility; its Borzoi/AG-PyTorch numbers remain in the tables below.

Figure: Borzoi vs AlphaGenome inference comparison.

The tables below give the same numbers in detail, using the same GPU + Framework row layout as the earlier inference tables.

Inference latency

| GPU | Framework | 262 kb | 524 kb |
|---|---|---|---|
| H200 141 GB | borzoi-pytorch | 16.0 | 33.6 |
| H200 141 GB | alphagenome-jax | 46.1 | 91.2 |
| H200 141 GB | alphagenome-pytorch | 62.1 | 126.1 |
| H100 80 GB | borzoi-pytorch | 17.9 | 38.6 |
| H100 80 GB | alphagenome-jax | 51.0 | 99.0 |
| H100 80 GB | alphagenome-pytorch | 68.4 | 136.9 |
| A100 80 GB | borzoi-pytorch | 36.7 | 79.2 |
| A100 80 GB | alphagenome-jax | 152.8 | 297.3 |
| A100 80 GB | alphagenome-pytorch | 168.0 | 336.9 |
| L40S 48 GB | borzoi-pytorch | 36.1 | 90.0 |
| L40S 48 GB | alphagenome-jax | 169.7 | 369.3 |
| L40S 48 GB | alphagenome-pytorch | 209.9 | 432.7 |
| L40 48 GB | borzoi-pytorch | 41.5 | 99.4 |
| L40 48 GB | alphagenome-jax | 212.5 | 453.2 |
| L40 48 GB | alphagenome-pytorch | 258.9 | 531.6 |
| A40 48 GB | borzoi-pytorch | 64.1 | 144.2 |
| A40 48 GB | alphagenome-jax | 298.7 | 601.7 |
| A40 48 GB | alphagenome-pytorch | 339.9 | 704.0 |
| RTX 6000 24 GB | borzoi-pytorch | 87.8 | — |
| RTX 6000 24 GB | alphagenome-jax | — | — |
| RTX 6000 24 GB | alphagenome-pytorch | 2298.6 | 4742.9 |

Values are median latency in milliseconds (ms); lower is better. Em-dash (—) marks configurations that did not run: AG-JAX bf16 is unsupported on Turing, and 524 kb Borzoi on RTX 6000 was not collected.

Inference peak memory

| GPU | Framework | 262 kb | 524 kb |
|---|---|---|---|
| H200 141 GB | borzoi-pytorch | 1.9 | 3.0 |
| H200 141 GB | alphagenome-jax | 10.4 | 18.2 |
| H200 141 GB | alphagenome-pytorch | 11.5 | 21.2 |
| H100 80 GB | borzoi-pytorch | 1.9 | 3.0 |
| H100 80 GB | alphagenome-jax | 10.4 | 18.2 |
| H100 80 GB | alphagenome-pytorch | 11.5 | 21.2 |
| A100 80 GB | borzoi-pytorch | 1.9 | 3.0 |
| A100 80 GB | alphagenome-jax | 10.4 | 18.9 |
| A100 80 GB | alphagenome-pytorch | 11.4 | 21.2 |
| L40S 48 GB | borzoi-pytorch | 1.9 | 3.0 |
| L40S 48 GB | alphagenome-jax | 11.5 | 18.9 |
| L40S 48 GB | alphagenome-pytorch | 11.4 | 21.2 |
| L40 48 GB | borzoi-pytorch | 1.9 | 3.0 |
| L40 48 GB | alphagenome-jax | 11.5 | 18.9 |
| L40 48 GB | alphagenome-pytorch | 11.4 | 21.2 |
| A40 48 GB | borzoi-pytorch | 1.9 | 3.0 |
| A40 48 GB | alphagenome-jax | 10.4 | 18.9 |
| A40 48 GB | alphagenome-pytorch | 11.4 | 21.2 |
| RTX 6000 24 GB | borzoi-pytorch | 1.9 | — |
| RTX 6000 24 GB | alphagenome-jax | — | — |
| RTX 6000 24 GB | alphagenome-pytorch | 7.3 | 12.5 |

Values are peak GPU memory in gigabytes (GB); lower is better.

Borzoi inference runs roughly 3-5× faster than AlphaGenome at the same sequence length and fits in a small fraction of the memory, which is expected given the ~3× parameter-count gap (Borzoi ~135M vs AlphaGenome’s 450M) and the different architectural choices. The gap widens on older GPUs (4-5× on Ampere/Ada) and narrows on Hopper (2.6-2.9×), suggesting AlphaGenome benefits more from the newer hardware’s memory bandwidth and tensor-core throughput than Borzoi does. Full Borzoi inference, heads-only finetune, and full-weights finetune numbers are available in the published CSVs for readers who want a second reference model at these sequence lengths.

Limitations

A few caveats worth flagging explicitly:

  • No TPU results. The JAX path is designed to run on TPUs and would likely see further speed and memory improvements there; we benchmark NVIDIA GPUs only because that is what the academic labs in our network actually have. If you do have TPU access, the numbers here should be read as a lower bound on what JAX can deliver
  • batch_size=1 only. We did not sweep batch sizes larger than one. On H100 and H200, there is likely headroom for batch_size>1 at shorter contexts; sweeping batch size on Hopper-class hardware is on our follow-up list
  • Single workload per mode. We report one inference workload and one finetune workload (matched-ATAC, GM12878); results may vary with other datasets or multi-track heads
  • Borzoi is the PyTorch port, not the original TF release; numbers from the TF implementation on the same GPUs may differ

Resources

Acknowledgements

Thanks to the Genomics x AI community and the Kundaje Lab at Stanford, where the AlphaGenome PyTorch port is developed. Benchmarks were run on the University of Washington Hyak cluster and Stanford’s Marlowe cluster (H100).


Benchmarks were collected April 6-14, 2026. Numbers reflect batch_size=1, median of 10 timed iterations after 3 warmup passes. Finetuning uses a single GM12878 ATAC-seq track. Full methodology and reproducibility instructions are in the profiling repository.

Cite this post:

Xinming Tu, Alejandro Buendia, Anshul Kundaje, Sara Mostafavi. "Benchmarking AlphaGenome on NVIDIA GPUs: latency, memory, and feasibility across sequence lengths." Genomics × AI Blog, 15 April 2026. https://genomicsxai.github.io/blogs/2026-005/.
