
Run large language models in a heterogeneous decentralized environment with offloading.
The rapid rise of generative AI has boosted demand for large language model (LLM) inference and fine-tuning services. While proprietary models are still favored, advances in open-source LLMs have made them competitive. However, high costs and limited GPU resources hinder their deployment. This work introduces BloomBee, a decentralized offline serving system that leverages idle GPU resources to provide cost-effective access to LLMs.
BloomBee relies on global GPU sharing, which includes many consumer-grade GPUs. If your GPU can only hold a small portion of a large language model, such as Llama 3.1 (405B), you can connect to a network of servers that each load different parts of the model, and request inference or fine-tuning services from that network.
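Once a swarm is up (see the steps below), a client runs only a small local part of the model and streams hidden states through the workers. Here is a minimal client sketch, assuming BloomBee exposes a Petals-style `AutoDistributedModelForCausalLM` (the import path, class name, and `initial_peers` argument are assumptions carried over from Petals, which BloomBee builds on):

```python
# Hypothetical client sketch: the bloombee import and class name are
# assumed to mirror Petals' API, which BloomBee is built on.
import torch
from transformers import AutoTokenizer
from bloombee import AutoDistributedModelForCausalLM  # assumed Petals-style API

# Multiaddr of your DHT backbone node (see "How to use BloomBee" below).
INITIAL_PEERS = ["/ip4/10.0.4.215/tcp/31340/p2p/QmZtZJwF8G2qspQxEVxXfipV4fR7EgpfnkXdbbzaEooaVf"]

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
# Embeddings run locally; the transformer blocks run on remote workers.
model = AutoDistributedModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", initial_peers=INITIAL_PEERS, torch_dtype=torch.float32
)

inputs = tokenizer("A decentralized swarm can", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```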
## Installation

### From PyPI

```bash
pip install bloombee
```

### From Source

```bash
git clone https://github.com/ai-decentralized/BloomBee.git
cd BloomBee
```

Create and activate an environment (either one):

```bash
# Using venv
python3 -m venv bloombee-venv && source bloombee-venv/bin/activate

# OR using conda (recommended)
conda create -n bloombee python=3.10.16 && conda activate bloombee
```

Then install:

```bash
pip install -e .
```

## How to use BloomBee (Try now in Colab)
### 1. Start the DHT main node
```bash
python -m bloombee.cli.run_dht --host_maddrs /ip4/0.0.0.0/tcp/31340 --identity_path bootstrapp1.id
```

After running, you will see output similar to:

```
[INFO] Running a DHT instance. To connect other peers to this one, use:
--initial_peers /ip4/10.0.4.215/tcp/31340/p2p/QmZtZJwF8G2qspQxEVxXfipV4fR7EgpfnkXdbbzaEooaVf
```
Copy your own full address (including the `/p2p/...` part). Each DHT node generates a unique Peer ID, so do not copy the example above. You can provide this address as `--initial_peers` to connect workers or other backbone servers.
💡 Tip: If you want your swarm to be accessible outside of your local network, ensure you have a public IP address or set up port forwarding correctly.
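If you want to verify connectivity from Python, you can join the swarm directly with Hivemind, the DHT library BloomBee builds on (see Acknowledgements). This sketch uses Hivemind's public `DHT` class; replace the example address with your own:

```python
# Sanity-check sketch: join the backbone DHT as a lightweight client using
# hivemind, the decentralized deep learning library BloomBee builds on.
import hivemind

INITIAL_PEERS = [
    "/ip4/10.0.4.215/tcp/31340/p2p/QmZtZJwF8G2qspQxEVxXfipV4fR7EgpfnkXdbbzaEooaVf"
]

# client_mode=True joins the DHT without serving storage to other peers.
dht = hivemind.DHT(initial_peers=INITIAL_PEERS, client_mode=True, start=True)
print("Connected; visible multiaddrs:", dht.get_visible_maddrs())
dht.shutdown()
```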
### 2. Start the workers

Set your main server address (replace with your actual output from step 1):

```bash
export BBSERVER=/ip4/10.0.4.215/tcp/31340/p2p/QmZtZJwF8G2qspQxEVxXfipV4fR7EgpfnkXdbbzaEooaVf
```

Activate the BloomBee environment on each worker (you can reuse the environment created in From Source).
Each worker should be started in a separate terminal (or on a separate node) after activating its environment.
Start the first worker, which will hold 16 blocks (i.e., 16 transformer layers):
```bash
python -m bloombee.cli.run_server huggyllama/llama-7b \
    --initial_peers $BBSERVER --num_blocks 16 --identity_path bootstrap_1.id
```

Start the second worker in another activated terminal:
```bash
python -m bloombee.cli.run_server huggyllama/llama-7b \
    --initial_peers $BBSERVER --num_blocks 16 --identity_path bootstrap_2.id
```

If you encounter network issues (e.g., connection resets), verify the worker IP configurations in the relevant config files.
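Each worker above serves 16 blocks, and `huggyllama/llama-7b` has 32 transformer blocks in total, so the two workers together cover the whole model. You can confirm the block count from the model config using plain `transformers` (no BloomBee-specific call involved):

```python
# Confirm how many transformer blocks the workers must cover in total.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("huggyllama/llama-7b")
print(config.num_hidden_layers)  # 32 -> two workers x 16 blocks cover it
```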
Optional: If bitsandbytes causes a CUDA version error:
```bash
cd ~
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes && python setup.py install
```

Ensure your CUDA library path matches your environment.
## Benchmarks

Inference:

```bash
cd BloomBee/
python benchmarks/benchmark_inference.py --model huggyllama/llama-7b --initial_peers $BBSERVER --torch_dtype float32 --seq_len 128
```
Training:

```bash
cd BloomBee/
python benchmarks/benchmark_training.py --model huggyllama/llama-7b --initial_peers $BBSERVER --torch_dtype float32 --n_steps 20 --batch_size 32 --seq_len 128
```
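The inference benchmark reports end-to-end generation speed over a fixed sequence length. If you want to time your own client calls, a minimal tokens-per-second harness could look like the sketch below (`generate_fn` is a hypothetical stand-in for whatever client call you use; the exact metrics printed by `benchmark_inference.py` may differ):

```python
# Minimal throughput harness: times an arbitrary generation call and
# reports tokens per second. generate_fn is a hypothetical stand-in.
import time

def measure_throughput(generate_fn, n_tokens: int = 128, warmup: int = 1) -> float:
    for _ in range(warmup):          # warm up caches / lazy initialization
        generate_fn(max_new_tokens=8)
    start = time.perf_counter()
    generate_fn(max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```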
BloomBee is built upon a few popular libraries:
- Hivemind - A PyTorch library for decentralized deep learning across the Internet.
- FlexLLMGen - An offloading-based system for running large models on weak GPUs.
- Petals - A library for decentralized LLM inference and fine-tuning without offloading.