AlphaFold3 Scalability Tests

General Information

  • Version: AlphaFold 3.0.1
  • Cluster: Hyperion
  • CPU: Intel Xeon Gold 6342, 48 cores per node
  • Memory per node: 16 × 16 GB 3200 MHz DDR4 (total 256 GB)
  • Interconnect: InfiniBand 200 Gb/s
  • OS: Rocky Linux 8.10

Target audience: Users of AlphaFold3 on Hyperion (recommendations for requesting compute resources).

Methodology

Inputs: randomly generated protein sequences; each residue chosen independently (synthetic dataset). These sequences are sampled using typical amino-acid frequencies from natural proteins, include an initial M for longer sequences, and avoid excessive rare residues in short sequences. Sequence lengths tested: 10, 25, 50, 100, 150, 300, 600, 1000, 1500, 2000 residues.
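The sampling procedure described above can be sketched as follows. This is an illustrative reconstruction, not the script used for these tests: the exact amino-acid frequency table, the length threshold for forcing an initial M, and the `generate_sequence` helper are assumptions.

```python
import random

# Approximate amino-acid background frequencies in natural proteins
# (illustrative values; the benchmark's exact table may differ).
AA_FREQS = {
    "A": 0.083, "R": 0.055, "N": 0.041, "D": 0.055, "C": 0.014,
    "Q": 0.039, "E": 0.067, "G": 0.071, "H": 0.023, "I": 0.059,
    "L": 0.097, "K": 0.058, "M": 0.024, "F": 0.039, "P": 0.047,
    "S": 0.066, "T": 0.054, "W": 0.011, "Y": 0.029, "V": 0.069,
}

def generate_sequence(length, seed=None):
    """Sample a synthetic protein sequence, residue by residue."""
    rng = random.Random(seed)
    residues = rng.choices(list(AA_FREQS), weights=list(AA_FREQS.values()),
                           k=length)
    if length >= 100:          # longer sequences start with methionine
        residues[0] = "M"
    return "".join(residues)

seq = generate_sequence(300, seed=0)
```

The rule "avoid excessive rare residues in short sequences" is omitted here for brevity; it would amount to rejecting or resampling short draws that over-represent residues such as W or C.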

For each length we measured:

  • total wall time (HH:MM:SS), and
  • CPU efficiency as reported by the run.

Resource sweep: number of CPUs ∈ {1,2,4,8,16,32,48}; memory set to 32 GB and 64 GB in separate runs.

Speedup is computed relative to the 1-CPU time for the same protein length and memory allocation (speedup = T(1CPU)/T(NCPUs)).
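As a concrete check of this definition, the helper below parses HH:MM:SS wall times and computes the speedup; the example values are taken from the 300-residue / 32 GB rows of the results table. This is a sketch for readers reproducing the analysis, not part of the benchmark harness.

```python
def hms_to_seconds(hms):
    """Convert a 'HH:MM:SS' wall-time string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

def speedup(t_1cpu, t_ncpus):
    """speedup = T(1 CPU) / T(N CPUs), both given as 'HH:MM:SS'."""
    return hms_to_seconds(t_1cpu) / hms_to_seconds(t_ncpus)

# 300 residues, 32 GB (values from the results table below)
s8 = speedup("00:46:05", "00:08:54")   # speedup on 8 CPUs, ~5.2
parallel_eff = s8 / 8                  # parallel efficiency, ~0.65
```

Note that this parallel efficiency (speedup / N) is not the same metric as the "CPU efficiency" reported by the scheduler in the tables, although the two track each other closely.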

Runs were executed on a single Hyperion node. No GPUs were used for the MSA stage, which is CPU-intensive. For the structure-inference stage, two GPU types were used: RTX 3090 and A100-PCIE.

MSA Stage (CPU intensive)

In AlphaFold3 (and earlier versions), the MSA (Multiple Sequence Alignment) step is the most CPU-demanding and time-consuming part of the pipeline. MSA generation requires scanning your query sequence against massive sequence databases — often hundreds of gigabytes or even terabytes in size. Each search involves comparing your sequence to millions of potential homologs. These algorithms perform billions of short string alignments (with scoring matrices, gap penalties, etc.). This is highly parallelizable, but still compute-bound. GPUs are not efficient for this kind of workload.

Why does it become the bottleneck? While the deep-learning inference (structure prediction by the AlphaFold3 model) is GPU-accelerated and can take just a few minutes per target, MSA generation often takes tens of minutes to hours, even on many cores, because:

  • database search time scales roughly with sequence length × database size, and
  • the neural-network inference is comparatively constant-time once the MSA is ready.

Hence, even on HPC clusters, the MSA step dominates total wall time.
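The diminishing returns of adding cores to the MSA stage can be modeled with Amdahl's law. The fit below is illustrative (it is not how the benchmark was analyzed): the parallel fraction p is back-solved from a single measured speedup, using the 300-residue / 32 GB timings from the results table.

```python
def amdahl_speedup(n, p):
    """Amdahl's law: speedup on n CPUs given parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(n, s):
    """Invert Amdahl's law: infer p from a measured speedup s on n CPUs."""
    return (1.0 / s - 1.0) / (1.0 / n - 1.0)

# 300 residues, 32 GB: T(1) = 2765 s, T(8) = 534 s  ->  speedup ~5.2
p = parallel_fraction(8, 2765 / 534)   # ~0.92: most, but not all, parallel
predicted_48 = amdahl_speedup(48, p)   # model caps the gain at high core counts
```

Even with p ≈ 0.92, the residual serial fraction limits the achievable speedup, which is consistent with the efficiency collapse seen at 32–48 CPUs in the tables (the real runs fall short of even this bound, suggesting additional overheads such as I/O).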

Raw results

Results obtained for increasing residue count, CPU count, and memory allocation. Each table entry gives total wall time (HH:MM:SS) / CPU efficiency (%); rows are CPU counts, and the two columns are the 32 GB and 64 GB runs.

MSA step results
10 Residues 32GB 64GB
1cpu 00:26:57 / 96.7% 00:23:37 / 99.3%
2cpu 00:12:03 / 97.6% 00:14:08 / 93.0%
4cpu 00:08:10 / 75.3% 00:08:12 / 75.0%
8cpu 00:07:54 / 43.7% 00:10:19 / 37.4%
16cpu 00:07:06 / 23.9% 00:07:06 / 24.0%
32cpu 00:09:24 / 10.4% 00:07:20 / 11.8%
48cpu 00:08:44 / 7.0% 00:08:43 / 7.0%
25 Residues 32GB 64GB
1cpu 00:23:36 / 99.4% 00:23:33 / 99.6%
2cpu 00:12:04 / 97.6% 00:14:45 / 90.1%
4cpu 00:08:12 / 75.2% 00:11:21 / 62.7%
8cpu 00:07:59 / 43.3% 00:07:53 / 43.6%
16cpu 00:07:09 / 23.7% 00:09:49 / 19.3%
32cpu 00:07:09 / 12.0% 00:09:59 / 9.7%
48cpu 00:06:48 / 8.3% 00:09:11 / 6.8%
50 Residues 32GB 64GB
1cpu 00:28:14 / 97.5% 00:24:51 / 99.5%
2cpu 00:14:54 / 94.0% 00:12:41 / 98.3%
4cpu 00:08:19 / 77.7% 00:11:22 / 65.1%
8cpu 00:08:02 / 45.0% 00:08:02 / 44.7%
16cpu 00:09:14 / 21.5% 00:07:11 / 24.9%
32cpu 00:08:50 / 11.0% 00:07:15 / 12.4%
48cpu 00:09:38 / 6.8% 00:06:42 / 8.7%
100 Residues 32GB 64GB
1cpu 00:28:12 / 99.6% 00:32:13 / 97.1%
2cpu 00:14:14 / 99.1% 00:17:15 / 91.5%
4cpu 00:08:25 / 85.8% 00:08:22 / 86.3%
8cpu 00:10:11 / 44.1% 00:11:02 / 40.7%
16cpu 00:07:10 / 27.7% 00:09:39 / 23.0%
32cpu 00:07:11 / 13.9% 00:08:50 / 12.7%
48cpu 00:06:48 / 9.5% 00:06:56 / 9.5%
150 Residues 32GB 64GB
1cpu 00:36:19 / 97.4% 00:32:06 / 99.6%
2cpu 00:19:14 / 92.7% 00:16:16 / 98.6%
4cpu 00:09:10 / 89.0% 00:11:47 / 78.4%
8cpu 00:11:15 / 44.5% 00:08:16 / 53.6%
16cpu 00:09:42 / 25.4% 00:07:05 / 31.2%
32cpu 00:08:48 / 14.0% 00:07:12 / 15.5%
48cpu 00:07:00 / 10.5% 00:09:43 / 8.4%
300 Residues 32GB 64GB
1cpu 00:46:05 / 99.7% 00:46:05 / 99.7%
2cpu 00:23:25 / 98.5% 00:23:22 / 98.7%
4cpu 00:12:53 / 90.6% 00:12:56 / 90.2%
8cpu 00:08:54 / 69.0% 00:09:02 / 68.0%
16cpu 00:07:08 / 43.3% 00:07:11 / 43.3%
32cpu 00:06:59 / 22.2% 00:07:02 / 22.1%
48cpu 00:07:41 / 13.7% 00:07:16 / 14.3%
600 Residues 32GB 64GB
1cpu 01:18:33 / 99.7% 01:27:04 / 99.0%
2cpu 00:40:02 / 97.9% 00:44:48 / 96.4%
4cpu 00:20:19 / 97.7% 00:20:17 / 97.8%
8cpu 00:10:50 / 93.0% 00:13:58 / 80.4%
16cpu 00:07:54 / 64.4% 00:10:57 / 51.1%
32cpu 00:06:55 / 37.0% 00:06:58 / 36.8%
48cpu 00:09:05 / 20.5% 00:06:59 / 24.4%
1000 Residues 32GB 64GB
1cpu 02:02:08 / 99.7% 02:02:02 / 99.8%
2cpu 01:12:17 / 97.4% 01:04:59 / 98.6%
4cpu 00:31:10 / 97.6% 00:37:01 / 90.5%
8cpu 00:18:55 / 90.8% 00:16:01 / 96.4%
16cpu 00:11:02 / 77.2% 00:09:01 / 87.6%
32cpu 00:07:14 / 55.6% 00:07:17 / 55.4%
48cpu 00:08:12 / 31.9% 00:10:06 / 28.1%
1500 Residues 32GB 64GB
1cpu 02:54:09 / 99.7% 02:54:29 / 99.8%
2cpu 01:27:39 / 99.1% 01:37:49 / 97.4%
4cpu 00:44:16 / 97.7% 00:49:07 / 96.8%
8cpu 00:22:18 / 98.5% 00:25:44 / 94.3%
16cpu 00:12:50 / 87.6% 00:14:27 / 83.7%
32cpu 00:09:40 / 60.9% 00:09:42 / 60.6%
48cpu 00:09:03 / 41.1% 00:09:13 / 40.5%
2000 Residues 32GB 64GB
1cpu 04:03:56 / 99.5% 04:05:12 / 99.0%
2cpu 02:02:31 / 99.0% 01:51:17 / 99.4%
4cpu 01:02:03 / 98.3% 01:02:49 / 96.8%
8cpu 00:33:57 / 90.7% 00:28:26 / 98.3%
16cpu 00:18:51 / 81.5% 00:16:36 / 86.2%
32cpu 00:14:18 / 54.2% 00:14:09 / 54.9%
48cpu 00:13:25 / 38.4% 00:13:39 / 37.6%
Time vs CPU graphic

[Figure: wall time vs number of CPUs]

Efficiency vs CPU graphic

[Figure: CPU efficiency vs number of CPUs]

Speedup vs CPU graphic

[Figure: speedup vs number of CPUs]

Interpretation notes & practical recommendations for the MSA stage

For very small proteins (10–50 residues) you observe poor scaling beyond a modest number of CPUs: wall time often plateaus or even increases for high CPU counts (32/48). For those, use 1–8 CPUs depending on queue availability; giving many CPUs wastes resources and does not significantly reduce wall time.

For medium proteins (100–600 residues) scaling is good up to 8–16 CPUs; beyond that the benefit diminishes, though some lengths still improve (600 residues shows gains up to 32 CPUs in some runs). Recommended: 8–32 CPUs, and 64 GB if memory-bound; check against your particular dataset.

For large proteins (1000–2000 residues) you typically need 8–32 CPUs to get good runtime; the data show strong gains up to 32 CPUs. For very large inputs, prefer 64 GB and 16–32 CPUs as a starting point.

Memory: 64 GB sometimes gives slightly better times, but not consistently; the memory/performance tradeoff appears workload-dependent. If uncertain, request 64 GB for larger proteins (>600 residues).

I/O / DB downloads: AlphaFold needs database access. If many users run many parallel jobs, consider staging databases locally or on fast scratch to avoid network bottlenecks.

Queue requests for users: in job scripts, request a realistic CPU count (see the recommendations above), set a walltime based on the figures above (pick a typical runtime for your sequence length and add headroom), and request 64 GB for proteins >600 residues.
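The sizing guidance above can be condensed into a small helper for users scripting their submissions. The thresholds simply restate the recommendations in this section; the function name and exact cutoffs are illustrative, not an official policy.

```python
def recommend_resources(n_residues):
    """Suggested (CPUs, memory in GB) for the MSA stage on Hyperion,
    condensed from the benchmark recommendations above."""
    if n_residues <= 50:       # small: scaling stalls beyond a few CPUs
        return 4, 32
    if n_residues <= 600:      # medium: good scaling up to 8-16 CPUs
        return 16, 32
    # large (>600 residues): strong gains up to ~32 CPUs, prefer 64 GB
    return 32, 64

cpus, mem_gb = recommend_resources(1500)   # -> (32, 64)
```

The returned values can then be plugged into the scheduler request of your job script (e.g. CPU count and memory per job).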

Structure Inference Step (GPU intensive)

The Structure Inference phase of AlphaFold 3 corresponds to the actual 3D structure generation from the previously computed multiple sequence alignments (MSAs) and templates. This stage relies heavily on deep neural network inference, which involves executing a fixed sequence of matrix multiplications, convolutions, and attention operations. These operations are GPU-bound rather than CPU-bound.

In this benchmark, a 300-residue protein system was evaluated while varying the number of CPUs, memory allocation, and GPU type. The following table summarizes the results:

Inference stage with GPU
CPUs RTX 3090 (32 GB RAM) RTX 3090 (64 GB RAM) A100 (32 GB RAM) A100 (64 GB RAM)
1cpu 00:01:37 / 95.88% 00:01:37 / 97.94% 00:01:39 / 79.80% 00:01:45 / 76.19%
2cpu 00:01:37 / 49.48% 00:01:37 / 48.45% 00:01:41 / 40.10% 00:01:51 / 36.49%
4cpu 00:01:33 / 25.54% 00:01:36 / 24.74% 00:01:23 / 24.70% 00:01:50 / 18.64%
8cpu 00:01:38 / 12.50% 00:01:36 / 12.76% 00:01:13 / 13.01% 00:01:14 / 12.84%
16cpu 00:01:34 / 6.52% 00:01:36 / 6.51% 00:01:15 / 6.33% 00:01:24 / 6.18%
32cpu 00:01:36 / 3.39% 00:01:35 / 3.36% 00:01:13 / 3.42% 00:01:13 / 3.34%
48cpu 00:01:35 / 2.37% 00:01:36 / 2.34% 00:01:10 / 2.38% 00:01:12 / 2.37%

Analysis

  1. Lack of CPU scaling Increasing the number of CPUs does not reduce the total runtime. This is because the inference kernels (matrix multiplications and tensor contractions) are executed almost entirely on the GPU using CUDA libraries (e.g., cuBLAS, cuDNN). The CPU only handles lightweight orchestration tasks — data loading, batch preparation, and invoking GPU kernels — which represent a negligible fraction of the total runtime. Once the GPU is saturated with data, adding more CPU threads has no measurable impact.

  2. Limited GPU performance gains Comparing RTX 3090 and A100 GPUs shows only marginal improvements in execution time. Although the A100 has more CUDA cores and higher theoretical FLOPS, AlphaFold’s inference workload is memory bandwidth and latency-bound, not compute-bound. The neural network uses a fixed model graph that fits well within the 24–40 GB of VRAM, and the additional GPU compute capacity is underutilized because the batch size and model architecture remain constant. Therefore, both GPUs reach similar throughput levels.

  3. Impact of system memory (RAM) The system’s main memory (32 GB vs 64 GB) does not affect this stage either. Once the model weights and inputs are loaded into GPU memory, data transfers between host and device are minimal.
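The 1/N decay of the reported CPU efficiency follows directly from the constant wall time: roughly one core's worth of orchestration work is spread over N allocated cores, so efficiency ≈ eff(1 CPU) / N. A quick sanity check against the RTX 3090 / 32 GB column (a sketch, not part of the benchmark):

```python
def expected_efficiency(eff_1cpu, n_cpus):
    """If wall time stays constant (GPU-bound stage), the reported
    CPU efficiency decays as eff(1)/N."""
    return eff_1cpu / n_cpus

# RTX 3090 / 32 GB RAM column: eff(1 CPU) = 95.88 %
for n, measured in [(2, 49.48), (8, 12.50), (48, 2.37)]:
    predicted = expected_efficiency(95.88, n)
    print(f"{n:2d} CPUs: predicted {predicted:5.2f} %, measured {measured:5.2f} %")
```

The predictions (47.94 %, 11.99 %, 2.00 %) track the measured values closely, confirming that the extra cores sit idle rather than contributing to the inference.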

Inference conclusion

The Structure Inference step of AlphaFold 3 is fully dominated by GPU performance and shows no CPU scalability. Performance gains from using more CPUs or increasing system RAM are negligible. The choice of GPU affects runtime only slightly, with the A100 offering minor acceleration due to its larger bandwidth and memory size.

Final remarks

For performance optimization, allocate more CPU cores to the MSA step and use a GPU with sufficient VRAM for the inference stage. For larger proteins, prefer GPUs like the A100 (40–80GB) to avoid out-of-memory errors rather than for speed gains. Adding more CPUs or RAM does not improve inference performance.

Summary
  • MSA >> scales with CPUs and benefits from parallelization.
  • Structure Inference >> GPU-bound, minimal gains from more CPUs.
  • Larger GPUs (A100) are valuable mainly for memory capacity, not speed.