
Commit 9f81989

jrbourbeau and jgerh authored
docs: Fix code highlighting typo in databricks doc (#1114)
Signed-off-by: James Bourbeau <jbourbeau@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
1 parent 4c16d73 commit 9f81989

File tree

1 file changed: +5 -4 lines


docs/guides/llm/databricks.md

Lines changed: 5 additions & 4 deletions
@@ -1,6 +1,6 @@
 # Model Training on Databricks

-Databricks is a widely-used platform for managing data, models, applications, and compute on the cloud. This guide shows how to use Automodel for scalable, performant model training on Databricks.
+Databricks is a widely used platform for managing data, models, applications, and compute on the cloud. This guide shows how to use Automodel for scalable, performant model training on Databricks.

 The specific example here fine-tunes a [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) model using the [SQuAD dataset](https://huggingface.co/datasets/rajpurkar/squad) from Hugging Face, but any Automodel functionality (for example, {doc}`model pre-training <pretraining>`, {doc}`VLMs </model-coverage/vlm>`, {doc}`other supported models </model-coverage/overview>`) can also be run on Databricks.

@@ -9,7 +9,7 @@ The specific example here fine-tunes a [Llama-3.2-1B](https://huggingface.co/met
 Let’s start by [provisioning](https://docs.databricks.com/aws/en/compute/configure) a Databricks classic compute cluster with the following setup:

 - Databricks runtime: [18.0 LTS (Machine Learning version)](https://docs.databricks.com/aws/en/release-notes/runtime/18.0ml)
-- Worker instance type: `g6e.12xlarge` on AWS (4x L40S GPU per node)
+- Worker instance type: `g6e.12xlarge` on AWS (4x L40S GPUs per node)
 - Number of workers: 2
 - Global [environment variable](https://docs.databricks.com/aws/en/compute/configure#environment-variables): `GLOO_SOCKET_IFNAME=eth0` (see [this](https://docs.databricks.com/aws/en/machine-learning/train-model/distributed-training/spark-pytorch-distributor#gloo-failure-runtimeerror-connection-refused) for details)
 - Cluster-scoped [init script](https://docs.databricks.com/aws/en/init-scripts/cluster-scoped):
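
The provisioning list in this hunk maps naturally onto a Databricks Clusters API request body. The sketch below is only an illustration of that mapping, not something from the guide: the exact `spark_version` string for the 18.0 LTS ML GPU runtime, the cluster name, and the init-script path are all assumptions.

```python
# Hedged sketch: a Clusters API request body mirroring the bullet list above.
# The spark_version string, cluster name, and init-script path are assumptions.
cluster_spec = {
    "cluster_name": "automodel-finetune",             # arbitrary name
    "spark_version": "18.0.x-gpu-ml-scala2.13",       # assumed ID for 18.0 LTS ML (GPU)
    "node_type_id": "g6e.12xlarge",                   # 4x L40S GPUs per worker
    "num_workers": 2,
    "spark_env_vars": {"GLOO_SOCKET_IFNAME": "eth0"}, # see the Gloo note above
    "init_scripts": [
        {"workspace": {"destination": "/Users/<you>/automodel-init.sh"}}  # placeholder path
    ],
}
```
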
@@ -71,8 +71,9 @@ hf_token = getpass("HF token: ")
 ```
 ```bash
 !hf auth login --token {hf_token}
+```

-### Single-node
+### Single-Node

 To run fine-tuning, we’ll use the `finetune.py` script from the Automodel repository and our config file.
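
The added line in this hunk closes the fenced `bash` block around the `!hf auth login` cell (the "code highlighting typo" from the commit title). For readers reproducing the notebook flow, a minimal Python-only sketch is below; it assumes `huggingface_hub` is available, and the `torchrun` command in the comment is only a guess at the single-node launch, with script and config paths as placeholders.

```python
# Minimal sketch of the login step, assuming a Databricks notebook with
# huggingface_hub installed. Paths and flags in the comment are placeholders.
from getpass import getpass
from huggingface_hub import login

hf_token = getpass("HF token: ")
login(token=hf_token)  # programmatic equivalent of `hf auth login --token ...`

# A single-node, 4-GPU launch of the Automodel script might look like
# (hypothetical paths/flags, not taken from the guide):
#   torchrun --nproc-per-node=4 finetune.py --config llama3_2_1b_squad.yaml
```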

@@ -126,7 +127,7 @@ Multi-GPU, single-node utilization of ~95% during model training.
 :::


-### Multi-node
+### Multi-Node

 To scale further to multi-node training, we need to submit training jobs to all instances in our Databricks cluster.
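
This hunk only recapitalizes the heading, but for context on "submit training jobs to all instances": a common mechanism on Databricks is `TorchDistributor` from `pyspark.ml.torch.distributor`, the same distributor referenced by the Gloo troubleshooting link earlier. Whether the guide itself uses it is not visible in this diff; the sketch below is a hedged illustration, and the script and config paths are placeholders.

```python
# Hedged sketch: fan a PyTorch training script out across the cluster with
# TorchDistributor. 2 workers x 4 L40S GPUs = 8 processes in total.
from pyspark.ml.torch.distributor import TorchDistributor

distributor = TorchDistributor(num_processes=8, local_mode=False, use_gpu=True)
distributor.run(
    "/Workspace/Repos/<you>/Automodel/finetune.py",  # placeholder script path
    "--config", "/Workspace/Repos/<you>/Automodel/llama3_2_1b_squad.yaml",  # placeholder config
)
```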
