
Commit 82d2fba

Merge pull request #4 from JamesDConley/new_models
New models
2 parents 4ef0662 + 6a0f7da commit 82d2fba

File tree

13 files changed: 232 additions, 29 deletions

README.md

Lines changed: 37 additions & 11 deletions
@@ -1,12 +1,10 @@
# What is GLaDOS?
- GLaDOS is an open source/permissively licensed 20B model tuned to provide an open-source experience _similar_ _to_ ChatGPT.
- This repo includes the model itself and a basic web server to chat with it.
+ GLaDOS is a family of large language models tuned to provide an open-source experience _similar_ _to_ ChatGPT.
+ This repo includes the models and a basic web server to chat with them.

## Motivation
- Similar models exist but often utilize LLAMA which is only available under a noncommercial license. GLaDOS avoids this by utilizing EleutherAI's/togethercomputers apache 2.0 licensed base models and CC0 data.
+ Similar models exist but often utilize LLaMA, which is only available under a noncommercial license. GLaDOS avoids this by utilizing EleutherAI's/togethercomputer's Apache 2.0 licensed base models and CC0 data.
Additionally, GLaDOS is designed to be run fully standalone, so you don't need to worry about your information being collected by a third party.

## Quickstart
@@ -27,10 +25,33 @@ Then, from inside this container run
```
python src/run_server.py
```
- or
+ This will run the server with the default settings of the 7B RedPajama-based GLaDOS model.
+ To run a different model you can pass the model path. For example,
+ ```
+ python src/run_server.py --model models/glados_together_20b
+ ```
+ will run the 20 billion parameter GPT-NeoX-based model.
+
+ The various model options are listed below.
+
+ ## Model Options
+ Each model is fine-tuned with LoRA on the GLaDOS dataset to produce conversational, GitHub-flavored markdown.\
+ Bigger models require more video memory to run, but also perform better.\
+ The default model is redpajama7b_base.
+
+ NOTE: To run the starcoder model you need to pass a token to src/run_server.py in order to download the model.
+ Ex.
```
- accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=1 src/run_server.py
+ python src/run_server.py --model models/glados_starcoder --token <YOUR TOKEN HERE>
```
+
+ | Model Path | Base Model | Parameters | License | Strengths |
+ | ----- | --- | --- | --- | --- |
+ | models/glados_together_20b | togethercomputer/GPT-NeoXT-Chat-Base-20B | 20 billion | Apache 2.0 | Best overall performance |
+ | models/glados_redpajama7b_base (default) | togethercomputer/RedPajama-INCITE-Base-7B-v0.1 | 6.9 billion | Apache 2.0 | Most resource-efficient with good performance |
+ | models/glados_starcoder | bigcode/starcoder | 15.5 billion | BigCode OpenRAIL-M v1 | Best performance on code and related tasks |
+ | models/neox_20b_full (deprecated) | togethercomputer/GPT-NeoXT-Chat-Base-20B | 20 billion | Apache 2.0 | Old version of glados_together_20b |
+
Once the model comes online it will be available at localhost:5950 and will print a URL you can open in your browser.

The first time the model runs it will download the base model, which is `togethercomputer/GPT-NeoXT-Chat-Base-20B`.
@@ -42,7 +63,10 @@ If you want to leave the server running you can build the container inside tmux,
## License
Apache 2.0 License, see LICENSE.md

- ## Examples
+ Note that the starcoder base model uses an OpenRAIL license, and usage of the starcoder-based model may be subject to that.
+ See https://huggingface.co/bigcode/starcoder for more details. The gist of it is that usage for certain 'unethical' use cases is not allowed.
+
+ ## Examples (Old)
Basic Code Generation (Emphasis on basic)
![code example](images/code_generation_example.png)

@@ -53,9 +77,11 @@ Brainstorming
![brainstorming example](images/mystery.png)

## Resource Requirements
- The current version of GLaDOS uses an FP16 model with ~20B parameters. This is runnable in just under 48GB of VRAM by modifying the generation options in run_server to use a beam width of 1. I am running this with two A6000's nvlinked together and so the default settings run on multiGPU.
+ The default model is based on RedPajama 7B and can run on 24GB Nvidia graphics cards. Short sequences may also be possible on 16GB cards, but this is untested and not recommended.
+
+ Other models currently require more video memory; testing and hosting are done on 48GB A6000 GPUs.

- It should be possible to use GPTQ to reduce the memory requirements to ~16GB so that the model can be run on consumer grade graphics cards.
+ It is possible to use GPTQ to reduce memory use by roughly 4x, but there is no timeline for this.

## Misc QnA

@@ -72,7 +98,7 @@ Q : How does the model handle formatting?
A : GLaDOS uses a slight variation on GitHub-flavored markdown to create lists, tables, and code blocks. Extra tags are added by the webserver to prettify the code blocks and tweak other small things.


# Acknowledgements:

Big thanks to EleutherAI for GPT-NeoX, togethercomputer for GPT-NeoXT-Chat-Base, and ShareGPT/RyokoAI for the ShareGPT data!
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "base_model_name_or_path": "togethercomputer/RedPajama-INCITE-Base-7B-v0.1",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "target_modules": [
    "query_key_value"
  ],
  "task_type": "CAUSAL_LM"
}
16 MB binary file (not shown)
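For context, the JSON above is a standard PEFT LoRA adapter config pointing at the RedPajama 7B base model. As a minimal, hypothetical sketch (not part of this commit) of how such an adapter is typically loaded for inference; the adapter directory and prompt are placeholders:

```python
# Illustrative sketch only: shows how an adapter saved with a config like the
# one above is usually consumed. Paths and prompt are placeholders, not repo code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "togethercomputer/RedPajama-INCITE-Base-7B-v0.1"  # from "base_model_name_or_path"
adapter_dir = "models/glados_redpajama7b_base"              # assumed folder holding this config + adapter weights

# Load the frozen base model in fp16, then wrap it with the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir).to("cuda").eval()  # assumes a GPU is available

tokenizer = AutoTokenizer.from_pretrained(base_id)
inputs = tokenizer("Hello, GLaDOS.", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```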
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
{
  "base_model_name_or_path": "bigcode/starcoder",
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "target_modules": [
    "c_attn",
    "c_proj"
  ],
  "task_type": "CAUSAL_LM"
}
67.8 MB binary file (not shown)
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "base_model_name_or_path": "togethercomputer/GPT-NeoXT-Chat-Base-20B",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": true,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "target_modules": [
    "query_key_value"
  ],
  "task_type": "CAUSAL_LM"
}
33 MB binary file (not shown)
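Unlike the other two adapter configs, this 20B config sets "merge_weights": true; in older peft releases that flag appears to govern whether the low-rank update is folded into the base weights at eval time. For reference only, the same folding can be done explicitly with current peft via merge_and_unload(); the sketch below is illustrative, not repo code, and the output path is a placeholder:

```python
# Hypothetical sketch (not from this repo): fold a LoRA adapter into its base model.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "models/glados_together_20b")

# merge_and_unload() adds the scaled low-rank update (lora_alpha / r * B @ A)
# into each targeted weight and returns the plain base model, so later loads
# no longer need the adapter applied separately.
merged = model.merge_and_unload()
merged.save_pretrained("models/glados_together_20b_merged")  # placeholder output path
```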

src/get_args.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
import argparse

def get_args():
    """Get arguments for running GLaDOS

    Returns:
        argparse.Namespace: args object with member variables for each option
    """
    parser = argparse.ArgumentParser(description='Get model choice and token')
    parser.add_argument('--model', default='models/glados_redpajama7b_base', help='Path to the model to run')
    parser.add_argument('--token', default=None, help='Huggingface token required for starcoder model download')
    parser.add_argument('--multi_gpu', action="store_true", default=False, help='If passed will distribute model across multiple GPUs')
    args = parser.parse_args()
    return args
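get_args() only defines the CLI surface; the wiring into the server lives in run_server.py, which this commit does not show. A rough sketch of how the arguments plausibly flow into the GLaDOS class (module paths and variable names here are assumptions):

```python
# Illustrative only: how src/run_server.py might consume get_args().
# The import paths and everything outside the GLaDOS constructor are assumed.
from get_args import get_args
from glados import GLaDOS

args = get_args()
glados = GLaDOS(
    args.model,                # e.g. models/glados_redpajama7b_base
    token=args.token,          # Hugging Face token, needed for the starcoder base model
    multi_gpu=args.multi_gpu,  # spread the model across available GPUs
)
```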

src/glados.py

Lines changed: 14 additions & 8 deletions
@@ -25,7 +25,7 @@
logger = logging.getLogger(__name__)

class GLaDOS:
-     def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cache_dir="models/hface_cache", use_deepspeed=False, int8=False, max_length=2048, multi_gpu=False):
+     def __init__(self, path, stop_phrase="User :\n", device="cuda", half=True, cache_dir="models/hface_cache", use_deepspeed=False, int8=False, max_length=2048, multi_gpu=False, token=None, better_transformer=False):
        """AI is creating summary for __init__

        Args:

@@ -45,21 +45,24 @@ def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cac
        # TODO : Make int8 work
        if int8:
            # THIS IS NOT TESTED
-             model = AutoModelForCausalLM.from_pretrained(base_model_path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True)
+             model = AutoModelForCausalLM.from_pretrained(base_model_path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True, use_auth_token=token)
            # Less than half!
            device = None
-             model = PeftModel.from_pretrained(model, path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True)
+             model = PeftModel.from_pretrained(model, path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True, use_auth_token=token)

        # TODO : Make multi_gpu work (It used to work, when did it break?)
        elif multi_gpu:
-             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16)
+             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, use_auth_token=token)
            # Model should already be half
            half=True
            # Device map will be set automatically above, setting another device map break it
-             model = PeftModel.from_pretrained(model, path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16)
+             model = PeftModel.from_pretrained(model, path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, use_auth_token=token)
        else:
            # TODO : Create custom device map to load on single GPU without using intermediate
-             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, torch_dtype=torch.float16)
+             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, torch_dtype=torch.float16, use_auth_token=token)
+             if better_transformer:
+                 logger.info("Converting model to better transformer model for speedup...")
+                 model = model.to_bettertransformer()
            model = PeftModel.from_pretrained(model, path, cache_dir=cache_dir)
        # TODO : Does this do anything? Model should already be fp16. Would be nice to remove another argument from the long list
        if half:

@@ -68,9 +71,12 @@ def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cac
        if device is not None:
            model.to(device)

        # Make sure it's in eval mode
        model.eval()

        # Bookkeeping
        self.device = device
        self.base_model_path = base_model_path

@@ -80,7 +86,7 @@ def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cac
        self.model = model

        # Setup tokenizer
-         self.tokenizer = AutoTokenizer.from_pretrained(base_model_path, truncation_side="left")
+         self.tokenizer = AutoTokenizer.from_pretrained(base_model_path, truncation_side="left", use_auth_token=token, cache_dir=cache_dir)
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Ban the model from generating certain phrases

@@ -133,7 +139,7 @@ def run_model(self, text, kwargs=None):
        base_kwargs = {
            "num_beams" : 16,
            "stopping_criteria" : self.stop_token_seqs,
-             "max_new_tokens" : 256,
+             "max_new_tokens" : 1024,
            "pad_token_id" : self.tokenizer.eos_token_id,
            "bad_words_ids" : self.bad_token_seqs,
            "no_repeat_ngram_size" : 12,

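Summarizing the glados.py changes: half-precision becomes the default, a Hugging Face token is threaded through every from_pretrained call, an optional BetterTransformer conversion is available on the single-GPU path, and run_model's default generation budget grows to 1024 new tokens. A hypothetical usage sketch follows; the prompt format, how kwargs are merged, and the return type are assumptions not shown in this diff:

```python
# Rough sketch of using the updated GLaDOS class; not part of the commit.
from glados import GLaDOS

glados = GLaDOS(
    "models/glados_starcoder",
    token="hf_...",             # placeholder token for the starcoder base model download (per the README)
    better_transformer=True,    # opt into the to_bettertransformer() fast path added above
)

# kwargs are presumably merged over base_kwargs (the merge itself is not shown in this hunk),
# so callers could trade generation quality for speed like this.
reply = glados.run_model("User :\nWrite a haiku about portals.\n", kwargs={"num_beams": 4, "max_new_tokens": 128})
print(reply)
```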
src/md_utils.py

Lines changed: 33 additions & 4 deletions
@@ -1,5 +1,9 @@
+ import logging
import re
import pandoc
+
+ logger = logging.getLogger(__name__)
+
def fix_lines(base_md):
    """Doubles newlines outside of code blocks to fix formatting issue from model training code.
@@ -16,13 +20,38 @@ def fix_lines(base_md):
        if i % 2 == 0:
            sec = replace_newline_with_br(sec)
        fixed_sections.append(sec)
-     return "```".join(fixed_sections)
+
+     updated_md = "```".join(fixed_sections)
+     logger.debug(f"Original markdown : {base_md}")
+     logger.debug(f"Updated markdown : {updated_md}")
+     return updated_md

- def replace_newline_with_br(text):
+
+ # TODO : Simplify this function
+ # Alternately train the model to output breaks on it's own
+ def identify_break_points(text):
    replace_spots = []
-     for i, char in enumerate(text.strip()):
-         if char == "\n" and (i > 0 and text[i-1] != "\n") and (i < len(text) - 1 and text[i+1] != "\n"):
+     line_so_far = ""
+     skippable = False
+     for i, char in enumerate(text):
+         if char == "\n" and \
+            (i > 0 and text[i-1] != "\n") and \
+            (i < len(text) - 1 and text[i+1] != "\n") and \
+            "|" not in line_so_far and \
+            not skippable:
            replace_spots.append(i)
+         if char != "\n":
+             line_so_far += char
+             stripped = line_so_far.strip()
+             if len(stripped) > 0 and (not stripped[0].isalpha()):
+                 skippable = True
+         else:
+             line_so_far = ""
+             skippable = False
+     return replace_spots
+
+ def replace_newline_with_br(text):
+     replace_spots = identify_break_points(text)
    replace_spots.reverse()
    for i in replace_spots:
        text = text[:i] + "<br>\n" + text[i+1:]
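To make the new break-point logic concrete, here is a hypothetical demo (not a test from the repo): single newlines inside ordinary prose get flagged for a <br>, while table rows (any line containing "|") and lines that do not start with a letter, such as list items, are skipped.

```python
# Hypothetical demo of the updated md_utils helpers; not part of the commit.
from md_utils import identify_break_points, replace_newline_with_br

text = "First line\nSecond line\n\n| a | b |\n| 1 | 2 |\n- bullet item\nstill the bullet\n"

print(identify_break_points(text))    # indices of the single newlines eligible for <br>
print(replace_newline_with_br(text))  # same text with those newlines turned into "<br>\n"
```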

0 commit comments