108 changes: 108 additions & 0 deletions H100-COMPATIBILITY.md
@@ -0,0 +1,108 @@
# H100 GPU Compatibility Guide for DeepMind Research

This document provides solutions for running DeepMind research projects on NVIDIA H100 GPUs (Hopper architecture).

## Problem

H100 GPUs (compute capability 9.0) need CUDA 11.8 or later, but the older TensorFlow releases pinned by many DeepMind research projects were built against earlier CUDA and cuDNN versions.

**H100 GPU Requirements:**
- CUDA 11.8+ (minimum)
- cuDNN 8.6+ (minimum)
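- TensorFlow 2.10+ (the version this guide recommends; note that official TensorFlow wheels are built against CUDA 11.8 and cuDNN 8.6 only from 2.12 onward)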
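
A minimal sketch of checking an installed stack against the CUDA and cuDNN minimums listed above. The helper names `parse_version` and `meets_minimum` are illustrative and not part of this repository. `tf.sysconfig.get_build_info()` reports the versions a TensorFlow wheel was built against; it may report only a major version for cuDNN, in which case the comparison below conservatively fails.

```python
import re

# Minimums taken from the requirements list above.
MIN_CUDA = "11.8"
MIN_CUDNN = "8.6"


def parse_version(text):
    """Turn '11.8' or '2.10.0rc1' into a tuple of ints, e.g. (2, 10, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (component-wise compare)."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed; nothing to check.")
    else:
        info = tf.sysconfig.get_build_info()
        # CPU-only builds have no CUDA keys, hence the empty-string default.
        cuda = str(info.get("cuda_version", ""))
        cudnn = str(info.get("cudnn_version", ""))
        print("CUDA", cuda or "n/a", "ok:", meets_minimum(cuda, MIN_CUDA))
        print("cuDNN", cudnn or "n/a", "ok:", meets_minimum(cudnn, MIN_CUDNN))
```

The build info describes what the wheel was compiled against, not what the driver supports, so it complements rather than replaces a check with `nvidia-smi`.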

## Compatibility Matrix

| Project | Current TF Version | H100 Compatible | Recommended Version | Status |
|---------|-------------------|-----------------|---------------------|---------|
| Enformer | 2.5.0 | ⚠️ Limited | 2.10.0+ | Partial support |
| Sketchy | 2.0.0 | ❌ No | 2.10.0+ | Needs upgrade |
| Transporter | 1.13.1 | ❌ No | 2.10.0+ | Major upgrade needed |
| ScratchGAN | 1.15 | ❌ No | 2.10.0+ | Major upgrade needed |

## Solutions

### Option 1: Upgrade to H100-Compatible TensorFlow (Recommended)

For projects that can be upgraded, we provide updated requirements files with H100-compatible versions.

#### Enformer H100 Support
```bash
# Use the new H100-compatible requirements
pip install -r requirements-h100.txt
```

### Option 2: Docker Container with Proper CUDA Setup

For projects requiring specific TensorFlow versions, use containerized environments:

```dockerfile
# Example for TensorFlow 2.5.0 with H100 support
# Docker Hub tags for nvidia/cuda include the patch version (11.8.0, not 11.8)
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
# Install compatible TensorFlow version...
```

### Option 3: Environment-Specific Installation

Use conda environments with specific CUDA versions:

```bash
# Create H100 environment
conda create -n h100-env python=3.8
conda activate h100-env
conda install cudatoolkit=11.8 cudnn=8.6
pip install tensorflow==2.10.0 # H100 compatible
```

## Project-Specific Solutions

### Enformer Project

The Enformer project can be upgraded to TensorFlow 2.10+ for full H100 support:

1. **Use updated requirements**: `requirements-h100.txt`
2. **Verify compatibility**: Run the included test script, `test_h100_compatibility.py`
3. **Performance benefits**: Up to ~2-3x faster training on H100, depending on the baseline GPU and precision settings

### Legacy Projects (TF 1.x)

For projects using TensorFlow 1.x:

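1. **Containerization**: Use Docker with a TF 1.x build compiled against CUDA 11.8+ (for example, NVIDIA's maintained TF 1.15 builds); stock TF 1.x wheels target CUDA 10.x, which does not support the H100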
2. **CPU fallback**: Run on CPU for development/testing
3. **Migration guide**: Follow TF 2.x migration path
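
The CPU fallback in step 2 can be sketched as follows. Setting `CUDA_VISIBLE_DEVICES` to `-1` before TensorFlow is imported hides all CUDA devices, which works for both TF 1.x and TF 2.x. The helper name `force_cpu` is hypothetical.

```python
import os


def force_cpu(environ=os.environ):
    """Hide all CUDA devices from frameworks imported afterwards."""
    environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return environ


if __name__ == "__main__":
    force_cpu()
    # Import TensorFlow only after this point; it will then see no GPUs
    # and place all ops on the CPU.
    print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

The variable must be set before the first TensorFlow import in the process, because devices are enumerated when the CUDA runtime initializes.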

## Testing H100 Compatibility

Use the provided test script to verify GPU compatibility:

```bash
python test_h100_compatibility.py
```
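
The contents of `test_h100_compatibility.py` are not shown in this change. As an independent illustration, not a copy of that script, the following sketch shows one way to detect whether a visible GPU is Hopper-class. It assumes TensorFlow 2.4 or later, where `tf.config.experimental.get_device_details` returns a `compute_capability` tuple; the H100 reports `(9, 0)`. The helper names are hypothetical.

```python
HOPPER_COMPUTE_CAPABILITY = (9, 0)  # H100


def is_hopper_or_newer(compute_capability):
    """True for compute capability 9.0 and above."""
    return tuple(compute_capability) >= HOPPER_COMPUTE_CAPABILITY


def report_gpus():
    """Return (device_name, compute_capability, is_hopper) per visible GPU."""
    # Imported lazily so is_hopper_or_newer works without TensorFlow.
    import tensorflow as tf

    rows = []
    for gpu in tf.config.list_physical_devices("GPU"):
        details = tf.config.experimental.get_device_details(gpu)
        capability = details.get("compute_capability")
        name = details.get("device_name", gpu.name)
        hopper = capability is not None and is_hopper_or_newer(capability)
        rows.append((name, capability, hopper))
    return rows


if __name__ == "__main__":
    try:
        for row in report_gpus():
            print(row)
    except ImportError:
        print("TensorFlow is not installed.")
```

An empty result means TensorFlow found no usable GPU, which on an H100 machine usually points to a driver or CUDA library mismatch rather than missing hardware.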

## Performance Optimization

Once H100 compatibility is achieved:

1. **Mixed precision**: Enable automatic mixed precision (AMP)
2. **Memory optimization**: Use gradient checkpointing
3. **Batch size tuning**: Leverage H100's 80GB memory
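
Step 1 above can be sketched with the Keras mixed-precision API, assuming TensorFlow 2.4 or later. The helper `pick_policy` is hypothetical, and the `(9, 0)` compute capability is hard-coded for an H100 rather than detected.

```python
def pick_policy(compute_capability):
    """Choose a Keras mixed-precision policy name for a GPU.

    float16 mixed precision is accelerated on compute capability 7.0 and
    newer (V100 is 7.0, H100 is 9.0); older GPUs keep plain float32.
    """
    if tuple(compute_capability) >= (7, 0):
        return "mixed_float16"
    return "float32"


if __name__ == "__main__":
    policy = pick_policy((9, 0))  # hard-coded here for an H100
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow not installed; would have used", policy)
    else:
        # Available as tf.keras.mixed_precision in TensorFlow 2.4 and later.
        tf.keras.mixed_precision.set_global_policy(policy)
        print("Global Keras policy set to", policy)
```

The policy should be set before any layers are built. With `mixed_float16`, it is common to keep the final output layer in float32 for numerical stability.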

## Migration Timeline

- **Phase 1**: Update Enformer and recent projects (TF 2.x)
- **Phase 2**: Containerize legacy projects (TF 1.x)
- **Phase 3**: Full migration for actively maintained projects

## Contributing

When adding new research code:
- Use TensorFlow 2.10+ for H100 compatibility
- Include both standard and H100-optimized requirements
- Test on both V100 and H100 if available
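
One way to enforce the first two points is a small check that a requirements file pins TensorFlow at 2.10 or later. This is a sketch only: `pinned_version` and `tf_pin_is_h100_ready` are hypothetical helpers, and the parser handles simple `package==version` lines rather than the full requirements syntax.

```python
import re


def pinned_version(requirements_text, package):
    """Return the '==' pin for `package`, or None if it is not pinned."""
    pattern = re.compile(
        r"^\s*" + re.escape(package) + r"\s*==\s*([^\s#]+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return match.group(1) if match else None


def tf_pin_is_h100_ready(requirements_text, minimum=(2, 10)):
    """True if tensorflow is pinned to a release at or above `minimum`."""
    pin = pinned_version(requirements_text, "tensorflow")
    if pin is None:
        return False
    numbers = tuple(int(n) for n in re.findall(r"\d+", pin))
    return numbers >= minimum


if __name__ == "__main__":
    sample = "tensorflow==2.10.1\ntensorflow-hub==0.13.0\n"
    print(pinned_version(sample, "tensorflow"))
    print(tf_pin_is_h100_ready(sample))
```

A check like this could run in CI against both the standard and the H100 requirements files.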

## References

- [NVIDIA TensorFlow User Guide](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/)
- [TensorFlow GPU Support Guide](https://www.tensorflow.org/install/gpu)
- [CUDA Compatibility Matrix](https://docs.nvidia.com/deploy/cuda-compatibility/)
22 changes: 21 additions & 1 deletion enformer/README.md
@@ -21,11 +21,31 @@ Requirements:
* kipoiseq (0.5.2)
* numpy (1.19.5)
* pandas (1.2.3)
- * tensoflow (2.4.1)
+ * tensorflow (2.5.0)
* tensorflow-hub (0.11.0)

See `requirements.txt`.

### NVIDIA H100 GPU Support

For H100 GPU compatibility, use the updated requirements:

```shell
pip install -r requirements-h100.txt
```

**H100 Requirements:**
* TensorFlow 2.10+ (`requirements-h100.txt` pins 2.10.1)
* CUDA 11.8+
* cuDNN 8.6+

Test H100 compatibility:
```shell
python ../test_h100_compatibility.py
```

See [H100-COMPATIBILITY.md](../H100-COMPATIBILITY.md) for detailed setup instructions.

To run the unit test:

```shell
23 changes: 23 additions & 0 deletions enformer/requirements-h100.txt
@@ -0,0 +1,23 @@
# H100-compatible requirements for Enformer
# This file provides TensorFlow and dependency versions compatible with NVIDIA H100 GPUs

# Core ML framework - H100 compatible version
tensorflow==2.10.1
tensorflow-hub==0.13.0

# DeepMind Sonnet - compatible version
dm-sonnet==2.0.1

# Scientific computing
numpy==1.21.6
pandas==1.5.3

# Bioinformatics
kipoiseq==0.5.2

# Data loading
tensorflow-io-gcs-filesystem==0.29.0 # GCS filesystem support for reading gs:// data

# Optional: GPU monitoring
gputil>=1.4.0 # Reports GPU load and memory usage
2 changes: 1 addition & 1 deletion gated_linear_networks/requirements.txt
@@ -1,5 +1,5 @@
absl-py==0.10.0
- aiohttp==3.6.2
+ aiohttp==3.12.14
astunparse==1.6.3
async-timeout==3.0.1
attrs==20.2.0