 config.arch_compat_overrides()
 config.no_graphs = True
 model = ExLlamaV2(config)
-model.load_tp(progress = True)
+
+# Load the model in tensor-parallel mode. With no gpu_split specified, the model will attempt to split across
+# all visible devices according to the currently available VRAM on each. expect_cache_tokens is necessary for
+# balancing the split, in case the GPUs are of uneven sizes, or if the number of GPUs doesn't divide the number
+# of KV heads in the model
+#
+# The cache type for a TP model is always ExLlamaV2Cache_TP and should be allocated after the model. To use a
+# quantized cache, add a `base = ExLlamaV2Cache_Q6` etc. argument to the cache constructor. It's advisable
+# to also add `expect_cache_base = ExLlamaV2Cache_Q6` to load_tp() as well so the size can be correctly
+# accounted for when splitting the model.
+
+model.load_tp(progress = True, expect_cache_tokens = 16384)
 cache = ExLlamaV2Cache_TP(model, max_seq_len = 16384)
+
+# After loading the model, all other functions should work the same
+
 print("Loading tokenizer...")
 tokenizer = ExLlamaV2Tokenizer(config)
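The comments above describe an optional quantized cache for the TP model. A minimal sketch of that variant, assuming ExLlamaV2Cache_Q6 is importable from the exllamav2 package alongside ExLlamaV2Cache_TP, and that load_tp() accepts expect_cache_base as the comment describes:

# Assumption: ExLlamaV2Cache_Q6 is exported from the exllamav2 package
from exllamav2 import ExLlamaV2Cache_Q6, ExLlamaV2Cache_TP

# Budget for a Q6-quantized cache when computing the tensor-parallel split
model.load_tp(progress = True, expect_cache_tokens = 16384, expect_cache_base = ExLlamaV2Cache_Q6)

# Allocate the TP cache with the quantized base type; everything downstream works the same
cache = ExLlamaV2Cache_TP(model, base = ExLlamaV2Cache_Q6, max_seq_len = 16384)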