AlphaFold3¶
AlphaFold3 Terms of Use and Restrictions
Important Legal Notice: AlphaFold3 model parameters and outputs are subject to strict terms of use:
- Non-commercial use only: Exclusively available for non-commercial research by academic, non-profit, educational, journalistic, and governmental organizations
- No commercial activities: Cannot be used for any commercial purposes, including research on behalf of commercial organizations
- No model training: Outputs cannot be used to train other machine learning models for biomolecular structure prediction
- Model parameters are confidential: Do not share or publish the model weights outside your organization
- Clinical use prohibited: Predictions are for theoretical modeling only and must not be used for clinical purposes or medical advice
By using AlphaFold3 on this system, you agree to comply with the full Terms of Use. Violations may result in termination of access and legal consequences.
Contact your institution's legal department if you have questions about compliance, especially regarding collaborations with commercial entities.
Overview¶
AlphaFold3 represents a significant advancement in computational structural biology, extending beyond protein folding to predict the structures of entire biomolecular complexes. This third iteration from Google DeepMind can model proteins, nucleic acids (DNA and RNA), small molecules, ions, and their interactions with high accuracy.
On the Hyperion cluster, AlphaFold3 is deployed as an optimized Apptainer container, providing researchers with access to this technology while ensuring reproducibility and efficient resource utilization.
Key Capabilities
- Multi-molecular complexes: Model proteins, DNA, RNA, and ligands together
- Post-translational modifications: Include phosphorylation, methylation, and other PTMs
- Covalent interactions: Specify bonds between proteins and ligands
- High accuracy: Achieves atomic-level precision for well-folded domains
Getting started¶
Environment modules¶
The AlphaFold3 installation on Hyperion uses the Lmod module system. Loading the module automatically configures all necessary paths and dependencies.
module load AlphaFold/3.0.1
This command sets up your environment with:
- The AlphaFold3 container path
- Access to model weights (neural network parameters)
- Reference databases (UniRef90, PDB, and RNA sequences)
- Optimized wrapper scripts for simplified execution
- Automatic Apptainer module loading
Module environment variables¶
| Variable | Description |
|---|---|
AF3_CONTAINER | Path to the AlphaFold3 Apptainer container |
AF3_WEIGHTS | Location of neural network model parameters |
AF3_DB | Reference sequence databases for MSA generation |
PATH | Updated to include AlphaFold3 wrapper scripts |
Available commands¶
| Command | Description |
|---|---|
alphafold3-run | Main prediction wrapper that handles container execution |
alphafold3-check | Validates installation and environment setup |
alphafold3-help | Displays comprehensive usage information |
Verifying the environment¶
Before running your first prediction, verify that everything is properly configured:
module load AlphaFold/3.0.1
alphafold3-check
Expected output
======================================
AlphaFold3 Installation Check
======================================
Environment Variables:
----------------------
AF3_CONTAINER: <location of the container file>
AF3_WEIGHTS: <location of neural network parameters>
AF3_DB: <location of reference sequence database>
Component Status:
-----------------
✓ Container found (2.8G)
✓ Model weights found (1.1G)
✓ Database directory found (10 files)
Key Database Files:
✓ uniref90_2022_05.fa
✓ pdb_seqres_2022_09_28.fasta
✓ bfd-first_non_consensus_sequences.fasta
GPU Status:
-----------
✓ GPU detected: NVIDIA A100-SXM4-80GB, 81920 MiB
CUDA_VISIBLE_DEVICES: 0
SLURM Context:
--------------
Job ID: 3209670
Partition: preemption
QOS: regular
CPUs: 1
Memory: 204800
Container Test:
---------------
Testing container execution...
Python OK
✓ Container execution successful
======================================
Check complete
Quick start guide¶
Interactive session¶
For development and testing, interactive sessions provide immediate feedback:
# Request GPU resources for 2 hours
srun --gres=gpu:1 --mem=100G --partition=general --qos=regular --pty bash
# Load the AlphaFold3 module
module load AlphaFold/3.0.1
# Run your prediction
alphafold3-run input.json output_dir/
Interactive mode is ideal for:
- Testing input files before batch submission
- Debugging failed predictions
- Small proteins requiring quick results
- Learning the system
Batch submission¶
For production runs, submit jobs through SLURM:
Basic batch script
#!/bin/bash
#SBATCH --job-name=af3_prediction
#SBATCH --partition=general
#SBATCH --qos=regular
#SBATCH --gres=gpu:1
#SBATCH --constraint=a100-pcie
#SBATCH --cpus-per-task=48
#SBATCH --mem=100GB
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module load AlphaFold/3.0.1
# Your input file and output directory
alphafold3-run complex_input.json results_${SLURM_JOB_ID}/
Input format specification¶
AlphaFold3 uses JSON files to define molecular systems for prediction. This section covers the essential input format requirements.
Basic structure¶
Every AlphaFold3 input requires five mandatory fields:
{
"name": "my_prediction",
"sequences": [
{
"protein": {
"id": ["A"],
"sequence": "MAFSAEDVLK..."
}
}
],
"modelSeeds": [1, 2, 3],
"dialect": "alphafold3",
"version": 1
}
| Field | Type | Description |
|---|---|---|
name | string | Job identifier (alphanumeric, hyphens, underscores) |
sequences | array | List of molecular entities |
modelSeeds | array | Random seeds for generating multiple models (e.g., [1, 2, 3]) |
dialect | string | Must be exactly "alphafold3" |
version | integer | Format version: 1, 2, 3, or 4 |
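The required fields can also be generated programmatically. The sketch below builds a minimal input dict using only the mandatory fields listed above; the helper name `make_af3_input` is our own, not part of the AlphaFold3 tooling:

```python
import json

def make_af3_input(name, sequence, chain_ids=("A",), seeds=(1, 2, 3)):
    """Build a minimal AlphaFold3 input dict with the five mandatory fields."""
    return {
        "name": name,
        "sequences": [{"protein": {"id": list(chain_ids), "sequence": sequence}}],
        "modelSeeds": list(seeds),
        "dialect": "alphafold3",  # must be exactly this string
        "version": 1,             # integer, not a string
    }

with open("input.json", "w") as fh:
    json.dump(make_af3_input("my_prediction", "MAFSAEDVLK"), fh, indent=2)
```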
Molecular entity types¶
Proteins¶
{
"protein": {
"id": ["A"],
"sequence": "MAFSAEDVLKEYDRRRM..."
}
}
- `id`: Chain identifier(s). Use `["A", "B"]` for homodimers
- `sequence`: Standard single-letter amino acid codes
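For example, a homodimer is expressed as a single protein entry with two chain IDs rather than two separate entries:

```json
{
  "protein": {
    "id": ["A", "B"],
    "sequence": "MAFSAEDVLKEYDRRRM..."
  }
}
```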
Nucleic acids¶
{
"rna": {
"id": ["R"],
"sequence": "AUGCAUGC"
}
}
{
"dna": {
"id": ["D"],
"sequence": "ATGCATGC"
}
}
Ligands¶
{
"ligand": {
"id": ["L"],
"smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
}
}
Recommended for custom molecules. Generate SMILES from MOL2/SDF using OpenBabel or RDKit.
{
"ligand": {
"id": ["L"],
"ccdCodes": ["ATP"]
}
}
For standard PDB ligands. Find codes at RCSB PDB Chemical Component Dictionary.
Array requirement for CCD codes
The ccdCodes field must always be an array: ["ATP"] not "ATP"
Ions¶
{
"ion": {
"id": ["I"],
"ccdCodes": ["MG"]
}
}
Common ion codes: MG (magnesium), ZN (zinc), CA (calcium), FE (iron)
Complete examples¶
Protein-ligand complex (GLUT1 + glucose)
{
"name": "GLUT1_glucose_complex",
"sequences": [
{
"protein": {
"id": ["A"],
"sequence": "MEPSDKDKKE..."
}
},
{
"ligand": {
"id": ["B"],
"smiles": "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O"
}
}
],
"modelSeeds": [1, 2, 3],
"dialect": "alphafold3",
"version": 1
}
Protein-DNA complex with metal ion
{
"name": "zinc_finger_DNA",
"sequences": [
{
"protein": {
"id": ["A"],
"sequence": "MTCPECGE..."
}
},
{
"dna": {
"id": ["D"],
"sequence": "GCGTGGGCG"
}
},
{
"ion": {
"id": ["Z"],
"ccdCodes": ["ZN"]
}
}
],
"modelSeeds": [1, 2, 3, 4, 5],
"dialect": "alphafold3",
"version": 1
}
Format conversion¶
MOL2 and SDF formats are not directly supported. Convert to SMILES:
obabel input.mol2 -O output.smi
from rdkit import Chem

# From MOL2 (RDKit returns None if the file cannot be parsed)
mol = Chem.MolFromMol2File('input.mol2')
if mol is None:
    raise ValueError("RDKit could not parse input.mol2")
print(Chem.MolToSmiles(mol))

# From SDF (a file may contain several molecules)
for mol in Chem.SDMolSupplier('input.sdf'):
    if mol is not None:
        print(Chem.MolToSmiles(mol))
Advanced features¶
For advanced use cases, consult the official documentation:
AlphaFold3 Input Format Documentation →
Advanced topics include:
- Post-translational modifications (phosphorylation, glycosylation)
- Covalent bond specifications between entities
- Custom MSA and template provision
- User-defined chemical components
- Chain stoichiometry and multiple copies
Understanding resource requirements¶
GPU selection strategy¶
AlphaFold3's memory requirements depend on the size of your molecular system. The Hyperion cluster provides two GPU types, each suited for different scales:
Small systems
GPU: RTX 3090
System RAM: 32-48 GB
Expected runtime: 2-4 hours
#SBATCH --gres=gpu:1
#SBATCH --constraint=rtx3090
#SBATCH --mem=32G
Medium systems
GPU: RTX 3090
System RAM: 48 GB
Expected runtime: 4-8 hours
#SBATCH --gres=gpu:1
#SBATCH --constraint=rtx3090
#SBATCH --mem=48G
Large systems
GPU: A100
System RAM: 64 GB
Expected runtime: 8-12 hours
#SBATCH --gres=gpu:1
#SBATCH --constraint=a100-pcie
#SBATCH --mem=64G
Very large systems
GPU: A100
System RAM: 128 GB
Expected runtime: 12-24 hours
#SBATCH --gres=gpu:1
#SBATCH --constraint=a100-pcie
#SBATCH --mem=128G
Token calculation¶
Understanding token counts is crucial for resource planning:
| Component | Tokens |
|---|---|
| Proteins | 1 token per amino acid residue |
| DNA/RNA | 1 token per nucleotide |
| Small molecules/ligands | 1 token per heavy atom (non-hydrogen) |
| Metal ions | 1 token each |
| Modified residues | Tokenized as individual atoms |
Token calculation example
A protein complex with:
- 500 amino acid residues = 500 tokens
- ATP ligand (31 heavy atoms) = 31 tokens
- Mg²⁺ ion = 1 token
Total: 532 tokens → Use RTX 3090 with 32GB RAM
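The arithmetic above is simple enough to script. This small helper (our own, not part of the AlphaFold3 tooling) applies the per-component rules from the table:

```python
def estimate_tokens(protein_residues=0, nucleotides=0, ligand_heavy_atoms=0, ions=0):
    """Estimate AlphaFold3 token count: 1 token per residue, nucleotide,
    ligand heavy atom, and ion."""
    return protein_residues + nucleotides + ligand_heavy_atoms + ions

# The worked example above: 500 residues + ATP (31 heavy atoms) + one Mg ion
print(estimate_tokens(protein_residues=500, ligand_heavy_atoms=31, ions=1))  # 532
```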
Memory considerations¶
AlphaFold3 uses memory in two critical phases:
- MSA Generation (CPU-intensive): Searches sequence databases to find evolutionary related sequences
- Structure Inference (GPU-intensive): Predicts 3D coordinates using the neural network
Quadratic scaling
GPU memory requirement scales roughly quadratically with sequence length due to attention mechanisms in the neural network architecture.
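As a back-of-the-envelope illustration of what quadratic scaling implies for planning (a sketch, not a measured formula): doubling the token count roughly quadruples the attention memory.

```python
def relative_attention_memory(tokens_before, tokens_after):
    """Relative attention-memory cost between two system sizes,
    assuming the roughly O(n^2) scaling described above."""
    return (tokens_after / tokens_before) ** 2

# Going from 500 to 2000 tokens: 4x the size, ~16x the attention memory
print(relative_attention_memory(500, 2000))  # 16.0
```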
Job submission templates¶
Standard single job¶
Standard batch script
#!/bin/bash
#SBATCH --job-name=af3_protein
#SBATCH --partition=general
#SBATCH --qos=regular
#SBATCH --gres=gpu:1
#SBATCH --constraint=a100-pcie
#SBATCH --cpus-per-task=48
#SBATCH --mem=100GB
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
# Create output directories
mkdir -p logs
# Load AlphaFold3
module load AlphaFold/3.0.1
# Define paths
INPUT_JSON="input.json"
OUTPUT_DIR="af3_output_${SLURM_JOB_ID}"
# Run prediction
alphafold3-run ${INPUT_JSON} ${OUTPUT_DIR}
# Report completion
if [ $? -eq 0 ]; then
echo "Successfully completed at $(date)"
echo "Results saved to: ${OUTPUT_DIR}"
else
echo "Prediction failed - check error logs"
fi
Array jobs for high-throughput screening¶
Process multiple structures efficiently with array jobs:
Array job batch script
#!/bin/bash
#SBATCH --job-name=af3_array
#SBATCH --partition=general
#SBATCH --qos=regular
#SBATCH --gres=gpu:1
#SBATCH --constraint=a100-pcie
#SBATCH --cpus-per-task=48
#SBATCH --mem=100GB
#SBATCH --array=1-10%2 # Process 10 structures, max 2 concurrent
#SBATCH --output=logs/array_%A_%a.out
#SBATCH --error=logs/array_%A_%a.err
module load AlphaFold/3.0.1
# Directory structure for array jobs
INPUT_DIR="inputs"
OUTPUT_DIR="outputs"
# Each array task processes one input file
INPUT_FILE="${INPUT_DIR}/input_${SLURM_ARRAY_TASK_ID}.json"
JOB_OUTPUT="${OUTPUT_DIR}/job_${SLURM_ARRAY_TASK_ID}"
mkdir -p ${JOB_OUTPUT}
echo "Processing structure ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_COUNT}"
alphafold3-run ${INPUT_FILE} ${JOB_OUTPUT}
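The array script above expects numbered files `inputs/input_1.json`, `inputs/input_2.json`, and so on. A staging sketch like the following can generate them; the `sequences_by_name` dict is a placeholder for your own data source:

```python
import json
from pathlib import Path

# Placeholder data source: map each job name to its protein sequence
sequences_by_name = {
    "target_1": "MAFSAEDVLK",
    "target_2": "MTCPECGE",
}

input_dir = Path("inputs")
input_dir.mkdir(exist_ok=True)

# Write one numbered input file per array task
for i, (name, seq) in enumerate(sequences_by_name.items(), start=1):
    job = {
        "name": name,
        "sequences": [{"protein": {"id": ["A"], "sequence": seq}}],
        "modelSeeds": [1, 2, 3],
        "dialect": "alphafold3",
        "version": 1,
    }
    (input_dir / f"input_{i}.json").write_text(json.dumps(job, indent=2))
```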
Understanding the wrapper script¶
How the wrapper works¶
The alphafold3-run wrapper script abstracts the complexity of container execution. A call such as:
alphafold3-run input.json output/
expands to approximately:
apptainer exec --nv \
--bind ${AF3_WEIGHTS}:/root/models:ro \
--bind ${AF3_DB}:/root/public_databases:ro \
--bind $(pwd):/work \
--pwd /work \
${AF3_CONTAINER} \
python /app/alphafold/run_alphafold.py \
--json_path=input.json \
--output_dir=output \
--db_dir=/root/public_databases \
--model_dir=/root/models
The wrapper handles:
- Container GPU enablement (`--nv` flag)
- Read-only mounting of databases and weights
- Working directory binding
- Path translations between host and container
Direct container usage¶
For advanced users needing custom parameters:
Direct container execution with custom options
module load AlphaFold/3.0.1
# Direct execution with custom options
apptainer exec --nv \
--bind ${AF3_WEIGHTS}:/root/models:ro \
--bind ${AF3_DB}:/root/public_databases:ro \
--bind $(pwd):/work \
--pwd /work \
${AF3_CONTAINER} \
python /app/alphafold/run_alphafold.py \
--json_path=input.json \
--output_dir=output \
--db_dir=/root/public_databases \
--model_dir=/root/models \
--num_diffusion_samples=5 \
--flash_attention_implementation=triton \
--buckets=256,512,768,1024,1280,1536,2048
Performance optimization¶
Automatic GPU configuration¶
Automatic optimization
The AlphaFold/3.0.1 module automatically detects your GPU type and configures optimal memory settings. These environment variables are set based on your hardware allocation.
Manual environment variables¶
Container environment variables
To override automatic settings, prefix variables with APPTAINERENV_ so they are passed into the container:
# Conservative memory usage for 24GB VRAM
export APPTAINERENV_XLA_PYTHON_CLIENT_PREALLOCATE=false
export APPTAINERENV_XLA_PYTHON_CLIENT_MEM_FRACTION=0.85
export APPTAINERENV_TF_FORCE_GPU_ALLOW_GROWTH=true
These settings prevent memory exhaustion on smaller GPUs by:
- Disabling memory preallocation
- Limiting usage to 85% of VRAM
- Allowing gradual memory growth
# Maximum performance for 40/80GB VRAM
export APPTAINERENV_XLA_PYTHON_CLIENT_PREALLOCATE=true
export APPTAINERENV_XLA_PYTHON_CLIENT_MEM_FRACTION=0.95
# For very large systems
export APPTAINERENV_CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
These settings maximize performance by:
- Preallocating GPU memory
- Using 95% of available VRAM
- Enabling unified memory for oversized models
JAX compilation cache¶
Reduce startup time significantly by caching compiled kernels:
# Add to your SLURM script
export JAX_COMPILATION_CACHE_DIR=/scratch/$USER/.jax_cache
mkdir -p $JAX_COMPILATION_CACHE_DIR
Cache benefits
- First run: ~10-15 minutes compilation time
- Subsequent runs: Skip compilation, save time
- Persistence: Cache persists across jobs
- Sharing: Shared cache possible for research groups
Optimization strategies¶
Pre-compilation strategy
Run a minimal test case first to populate the JAX cache, then use the cached compilation for production runs. This saves 10-15 minutes on each subsequent run.
MSA reuse
Save MSA results from similar sequences to skip redundant database searches. This can reduce CPU time by 4-8 hours for related proteins.
Batch processing
- Group proteins of similar size together
- Optimize GPU utilization
- Reduce overall queue wait times
Off-peak submission
Submit large jobs during night hours (22:00-06:00) for better resource availability and faster queue movement.
Output structure and files¶
Directory organization¶
AlphaFold3 creates a structured output directory with all prediction results:
output_dir/
├── <job_name>_model_0.cif # Structure from seed 1
├── <job_name>_model_1.cif # Structure from seed 2
├── <job_name>_model_2.cif # Structure from seed 3
├── <job_name>_confidences.json # Confidence metrics
├── <job_name>_summary_confidences.json # Prediction summary
└── <job_name>_data.json # Complete prediction data
File descriptions¶
| File Type | Description | Usage |
|---|---|---|
.cif files | Macromolecular Crystallographic Information File (mmCIF) structures | Viewable in PyMOL or ChimeraX for 3D visualization |
_confidences.json | Contains pLDDT scores and PAE matrices | Quality assessment and confidence evaluation |
_summary_confidences.json | Metadata about the prediction run | Run parameters and summary statistics |
_data.json | Comprehensive prediction data | All intermediate results and detailed information |
Storage planning¶
| System Type | Size | Output Size | Storage Recommendation |
|---|---|---|---|
| Small protein | 500 residues | ~50 MB | Standard quota sufficient |
| Medium complex | 2000 tokens | ~200 MB | Consider compression |
| Large assembly | 5000 tokens | ~500 MB | Use scratch, then archive |
Troubleshooting guide¶
Common issues and solutions¶
Out of Memory Error
Error Message:
RuntimeError: Resource exhausted: Out of memory
Solutions:
1. Switch to a larger GPU:
   #SBATCH --constraint=a100-pcie  # Instead of rtx3090
2. Increase system memory:
   #SBATCH --mem=128GB  # Increase from default
3. Reduce prediction seeds:
   "modelSeeds": [1]  // Instead of [1,2,3,4,5]
4. Use a multi-stage workflow (see templates above)
Process Killed by SLURM
Error Message:
slurmstepd: error: Detected 1 oom-kill event(s)
Cause: System RAM exhaustion (not GPU memory)
Solution: Increase memory allocation:
#SBATCH --mem=200GB # Generous allocation for large systems
GPU Not Found
Error Message:
RuntimeError: No CUDA GPUs are available
Checklist:
1. Verify the GPU request in SLURM:
   #SBATCH --gres=gpu:1
2. Check that the module is loaded:
   module list  # Should show AlphaFold/3.0.1
3. Confirm you are on a GPU partition:
   #SBATCH --partition=gpu  # or 'general' with a GPU request
Invalid JSON Format
Error Message:
ValueError: AlphaFold 3 input JSON must contain `dialect` and `version` fields
Solution: Ensure your JSON includes all required fields:
{
"name": "my_protein",
"sequences": [...],
"modelSeeds": [1, 2, 3],
"dialect": "alphafold3", // Must be exactly "alphafold3"
"version": 1 // Integer, not string
}
Validation command:
python -m json.tool input.json # Checks JSON syntax
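Beyond JSON syntax, the AlphaFold3-specific requirements can be checked with a short script. This is a sketch of our own that covers only the mandatory fields documented above:

```python
import json

REQUIRED = {"name", "sequences", "modelSeeds", "dialect", "version"}

def validate_af3_input(path):
    """Check that an input file parses as JSON and carries the five mandatory fields."""
    with open(path) as fh:
        data = json.load(fh)  # raises json.JSONDecodeError on syntax errors
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if data["dialect"] != "alphafold3":
        raise ValueError('dialect must be exactly "alphafold3"')
    if not isinstance(data["version"], int):
        raise ValueError("version must be an integer, not a string")
    return data
```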
Data management best practices¶
Efficient scratch usage¶
The /scratch filesystem provides fast temporary storage ideal for AlphaFold3 runs:
# Recommended workflow
SCRATCH_DIR="/scratch/$USER/af3_runs/${SLURM_JOB_ID}"
mkdir -p ${SCRATCH_DIR}
cd ${SCRATCH_DIR}
# Run prediction in scratch
alphafold3-run input.json output/
# Move only essential results to permanent storage
rsync -av output/*.cif output/*_confidence*.json /home/$USER/af3_results/
# Cleanup scratch
rm -rf ${SCRATCH_DIR}
Storage guidelines¶
Storage quotas
- Home directory: Limited quota, avoid storing large outputs
- Scratch: Fast but temporary, should be cleaned periodically
Always compress completed predictions and remove intermediate files.
Quick reference¶
Essential information card¶
Quick Reference Card
Module Load:
module load AlphaFold/3.0.1
Basic Run:
alphafold3-run input.json output/
Token Calculation:
- Protein: 1 token/residue
- DNA/RNA: 1 token/nucleotide
- Ligand: 1 token/heavy atom
- Ion: 1 token
JSON Required Fields:
{
"name": "job_name",
"sequences": [...],
"modelSeeds": [1,2,3],
"dialect": "alphafold3",
"version": 1
}
GPU Selection:
- <2000 tokens: RTX 3090
- ≥2000 tokens: A100
Common Fixes:
- Out of memory → Use A100
- Process killed → Increase `--mem`
- No GPU → Add `--gres=gpu:1`
- Invalid JSON → Check the `dialect` field
Confidence interpretation¶
Understanding AlphaFold3's confidence metrics is crucial for interpreting results:
| pLDDT Score | Confidence Level | Interpretation |
|---|---|---|
| >90 | Very high | Atomic-level accuracy expected |
| 70-90 | Confident | Backbone reliable, sidechains approximate |
| 50-70 | Low | Overall topology likely correct |
| <50 | Very low | Potentially disordered region |
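These bands are easy to apply programmatically. A minimal sketch follows; note that the exact key layout inside `<job_name>_confidences.json` varies by release, so the file-reading part is left as a hedged comment rather than asserted:

```python
def plddt_band(score):
    """Map a pLDDT score onto the confidence bands in the table above."""
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

# Hypothetical usage (key name "atom_plddts" is an assumption, check your release):
# import json
# with open("my_prediction_confidences.json") as fh:
#     scores = json.load(fh)["atom_plddts"]
# print(sum(plddt_band(s) == "very high" for s in scores) / len(scores))

print(plddt_band(92.5))  # very high
print(plddt_band(64.0))  # low
```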
Additional resources¶
External links¶
- AlphaFold3 Paper - Original Nature publication
- GitHub Repository - Source code and updates
- Input Format Documentation - Complete JSON specification
- AlphaFold Database - Pre-computed structures
- ChimeraX Tutorials - Visualization guides