
Commit 82d2fba

Merge pull request #4 from JamesDConley/new_models
New models
2 parents 4ef0662 + 6a0f7da commit 82d2fba

File tree

13 files changed: 232 additions, 29 deletions

README.md

Lines changed: 37 additions & 11 deletions
@@ -1,12 +1,10 @@
# What is GLaDOS?
- GLaDOS is an open source/permissively licensed 20B model tuned to provide an open-source experience _similar_ _to_ ChatGPT.
- This repo includes the model itself and a basic web server to chat with it.
+ GLaDOS is a family of large language models tuned to provide an open-source experience _similar_ _to_ ChatGPT.
+ This repo includes the models and a basic web server to chat with them.

## Motivation
- Similar models exist but often utilize LLAMA which is only available under a noncommercial license. GLaDOS avoids this by utilizing EleutherAI's/togethercomputers apache 2.0 licensed base models and CC0 data.
+ Similar models exist but often utilize LLaMA, which is only available under a noncommercial license. GLaDOS avoids this by utilizing EleutherAI's/togethercomputer's Apache 2.0 licensed base models and CC0 data.
Additionally, GLaDOS is designed to be run fully standalone, so you don't need to worry about your information being collected by a third party.

## Quickstart
@@ -27,10 +25,33 @@ Then, from inside this container run
```
python src/run_server.py
```
- or
+ This will run the server with the default settings of the 7B RedPajama-based GLaDOS model.
+ To run a different model you can pass the model path. For example,
+ ```
+ python src/run_server.py --model models/glados_together_20b
+ ```
+ will run the 20 billion parameter GPT-NeoX-based model.
+
+ The various model options are listed below.
+
+ ## Model Options
+ Each model is fine-tuned with LoRA on the GLaDOS dataset to produce conversational, GitHub-flavored markdown.\
+ Bigger models require more video memory to run, but also perform better.\
+ The default model is redpajama7b_base.
+
+ NOTE: To run the starcoder model you need to pass a token to src/run_server.py in order to download the model.
+ Ex.
```
- accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=1 src/run_server.py
+ python src/run_server.py --model models/glados_starcoder --token <YOUR TOKEN HERE>
```
+
+ | Model Path | Base Model | Parameters | License | Strengths |
+ | ----- | --- | --- | --- | --- |
+ | models/glados_together_20b | togethercomputer/GPT-NeoXT-Chat-Base-20B | 20 billion | Apache 2.0 | Best overall performance |
+ | models/glados_redpajama7b_base (default) | togethercomputer/RedPajama-INCITE-Base-7B-v0.1 | 6.9 billion | Apache 2.0 | Most resource-efficient with good performance |
+ | models/glados_starcoder | bigcode/starcoder | 15.5 billion | BigCode OpenRAIL-M v1 | Best performance on code and related tasks |
+ | models/neox_20b_full (deprecated) | togethercomputer/GPT-NeoXT-Chat-Base-20B | 20 billion | Apache 2.0 | Old version of glados_together_20b |
+
Once the model comes online it will be available at localhost:5950 and will print a URL you can open in your browser.

The first time the model runs it will download the base model, which is `togethercomputer/GPT-NeoXT-Chat-Base-20B`.
@@ -42,7 +63,10 @@ If you want to leave the server running you can build the container inside tmux,
## License
Apache 2.0 License, see LICENSE.md

- ## Examples
+ Note that the starcoder base model uses an OpenRAIL license, and usage of the starcoder-based model may be subject to that.
+ See https://huggingface.co/bigcode/starcoder for more details. The gist of it is that usage for certain 'unethical' use cases is not allowed.
+
+ ## Examples (Old)
Basic Code Generation (Emphasis on basic)
![code example](images/code_generation_example.png)

@@ -53,9 +77,11 @@ Brainstorming
![brainstorming example](images/mystery.png)

## Resource Requirements
- The current version of GLaDOS uses an FP16 model with ~20B parameters. This is runnable in just under 48GB of VRAM by modifying the generation options in run_server to use a beam width of 1. I am running this with two A6000's nvlinked together and so the default settings run on multiGPU.
+ The default model is based on RedPajama 7B and can run on 24GB Nvidia graphics cards. Short sequences may also be possible on 16GB cards, but this is untested and not recommended.
+
+ Other models currently require more video memory; testing and hosting are done on 48GB A6000 GPUs.

- It should be possible to use GPTQ to reduce the memory requirements to ~16GB so that the model can be run on consumer grade graphics cards.
+ It is possible to use GPTQ to reduce memory use by roughly 4x, but there is no timeline for this.

## Misc QnA

@@ -72,7 +98,7 @@ Q : How does the model handle formatting?
A : GLaDOS uses a slight variation on GitHub-flavored markdown to create lists, tables, and code blocks. Extra tags are added by the webserver to prettify the code blocks and tweak other small things.


# Acknowledgements:

Big thanks to EleutherAI for GPT-NeoX, togethercomputer for GPT-NeoXT-Chat-Base, and ShareGPT/RyokoAI for the ShareGPT data!
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "base_model_name_or_path": "togethercomputer/RedPajama-INCITE-Base-7B-v0.1",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "target_modules": [
    "query_key_value"
  ],
  "task_type": "CAUSAL_LM"
}
16 MB binary file (not shown)
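For context, the JSON above is a standard PEFT LoRA adapter config pointing at the RedPajama 7B base model. As a minimal, hypothetical sketch (not part of this commit) of how such an adapter is typically loaded for inference; the adapter directory and prompt are placeholders:

```python
# Illustrative sketch only: shows how an adapter saved with a config like the
# one above is usually consumed. Paths and prompt are placeholders, not repo code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "togethercomputer/RedPajama-INCITE-Base-7B-v0.1"  # from "base_model_name_or_path"
adapter_dir = "models/glados_redpajama7b_base"              # assumed folder holding this config + adapter weights

# Load the frozen base model in fp16, then wrap it with the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir).to("cuda").eval()  # assumes a GPU is available

tokenizer = AutoTokenizer.from_pretrained(base_id)
inputs = tokenizer("Hello, GLaDOS.", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```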
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
{
  "base_model_name_or_path": "bigcode/starcoder",
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "target_modules": [
    "c_attn",
    "c_proj"
  ],
  "task_type": "CAUSAL_LM"
}
67.8 MB binary file (not shown)
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "base_model_name_or_path": "togethercomputer/GPT-NeoXT-Chat-Base-20B",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": true,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "target_modules": [
    "query_key_value"
  ],
  "task_type": "CAUSAL_LM"
}
33 MB binary file (not shown)
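Unlike the other two adapter configs, this 20B config sets "merge_weights": true; in older peft releases that flag appears to govern whether the low-rank update is folded into the base weights at eval time. For reference only, the same folding can be done explicitly with current peft via merge_and_unload(); the sketch below is illustrative, not repo code, and the output path is a placeholder:

```python
# Hypothetical sketch (not from this repo): fold a LoRA adapter into its base model.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "models/glados_together_20b")

# merge_and_unload() adds the scaled low-rank update (lora_alpha / r * B @ A)
# into each targeted weight and returns the plain base model, so later loads
# no longer need the adapter applied separately.
merged = model.merge_and_unload()
merged.save_pretrained("models/glados_together_20b_merged")  # placeholder output path
```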

src/get_args.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
import argparse

def get_args():
    """Get arguments for running GLaDOS

    Returns:
        argparse.Namespace: args object with member variables for each option
    """
    parser = argparse.ArgumentParser(description='Get model choice and token')
    parser.add_argument('--model', default='models/glados_redpajama7b_base', help='Path to the model to run')
    parser.add_argument('--token', default=None, help='Huggingface token required for starcoder model download')
    parser.add_argument('--multi_gpu', action="store_true", default=False, help='If passed will distribute model across multiple GPUs')
    args = parser.parse_args()
    return args
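get_args() only defines the CLI surface; the wiring into the server lives in run_server.py, which this commit does not show. A rough sketch of how the arguments plausibly flow into the GLaDOS class (module paths and variable names here are assumptions):

```python
# Illustrative only: how src/run_server.py might consume get_args().
# The import paths and everything outside the GLaDOS constructor are assumed.
from get_args import get_args
from glados import GLaDOS

args = get_args()
glados = GLaDOS(
    args.model,                # e.g. models/glados_redpajama7b_base
    token=args.token,          # Hugging Face token, needed for the starcoder base model
    multi_gpu=args.multi_gpu,  # spread the model across available GPUs
)
```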

src/glados.py

Lines changed: 14 additions & 8 deletions
@@ -25,7 +25,7 @@
logger = logging.getLogger(__name__)

class GLaDOS:
-     def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cache_dir="models/hface_cache", use_deepspeed=False, int8=False, max_length=2048, multi_gpu=False):
+     def __init__(self, path, stop_phrase="User :\n", device="cuda", half=True, cache_dir="models/hface_cache", use_deepspeed=False, int8=False, max_length=2048, multi_gpu=False, token=None, better_transformer=False):
        """AI is creating summary for __init__

        Args:

@@ -45,21 +45,24 @@ def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cac
        # TODO : Make int8 work
        if int8:
            # THIS IS NOT TESTED
-             model = AutoModelForCausalLM.from_pretrained(base_model_path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True)
+             model = AutoModelForCausalLM.from_pretrained(base_model_path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True, use_auth_token=token)
            # Less than half!
            device = None
-             model = PeftModel.from_pretrained(model, path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True)
+             model = PeftModel.from_pretrained(model, path, return_dict=True, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, load_in_8bit=True, use_auth_token=token)

        # TODO : Make multi_gpu work (It used to work, when did it break?)
        elif multi_gpu:
-             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16)
+             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, use_auth_token=token)
            # Model should already be half
            half=True
            # Device map will be set automatically above, setting another device map break it
-             model = PeftModel.from_pretrained(model, path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16)
+             model = PeftModel.from_pretrained(model, path, cache_dir=cache_dir, device_map="auto", torch_dtype=torch.float16, use_auth_token=token)
        else:
            # TODO : Create custom device map to load on single GPU without using intermediate
-             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, torch_dtype=torch.float16)
+             model = AutoModelForCausalLM.from_pretrained(base_model_path, cache_dir=cache_dir, torch_dtype=torch.float16, use_auth_token=token)
+             if better_transformer:
+                 logger.info("Converting model to better transformer model for speedup...")
+                 model = model.to_bettertransformer()
            model = PeftModel.from_pretrained(model, path, cache_dir=cache_dir)
        # TODO : Does this do anything? Model should already be fp16. Would be nice to remove another argument from the long list
        if half:

@@ -68,9 +71,12 @@ def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cac
        if device is not None:
            model.to(device)

        # Make sure it's in eval mode
        model.eval()

        # Bookkeeping
        self.device = device
        self.base_model_path = base_model_path

@@ -80,7 +86,7 @@ def __init__(self, path, stop_phrase="User :\n", device="cuda", half=False, cac
        self.model = model

        # Setup tokenizer
-         self.tokenizer = AutoTokenizer.from_pretrained(base_model_path, truncation_side="left")
+         self.tokenizer = AutoTokenizer.from_pretrained(base_model_path, truncation_side="left", use_auth_token=token, cache_dir=cache_dir)
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Ban the model from generating certain phrases

@@ -133,7 +139,7 @@ def run_model(self, text, kwargs=None):
        base_kwargs = {
            "num_beams" : 16,
            "stopping_criteria" : self.stop_token_seqs,
-             "max_new_tokens" : 256,
+             "max_new_tokens" : 1024,
            "pad_token_id" : self.tokenizer.eos_token_id,
            "bad_words_ids" : self.bad_token_seqs,
            "no_repeat_ngram_size" : 12,

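Summarizing the glados.py changes: half-precision becomes the default, a Hugging Face token is threaded through every from_pretrained call, an optional BetterTransformer conversion is available on the single-GPU path, and run_model's default generation budget grows to 1024 new tokens. A hypothetical usage sketch follows; the prompt format, how kwargs are merged, and the return type are assumptions not shown in this diff:

```python
# Rough sketch of using the updated GLaDOS class; not part of the commit.
from glados import GLaDOS

glados = GLaDOS(
    "models/glados_starcoder",
    token="hf_...",             # placeholder token for the starcoder base model download (per the README)
    better_transformer=True,    # opt into the to_bettertransformer() fast path added above
)

# kwargs are presumably merged over base_kwargs (the merge itself is not shown in this hunk),
# so callers could trade generation quality for speed like this.
reply = glados.run_model("User :\nWrite a haiku about portals.\n", kwargs={"num_beams": 4, "max_new_tokens": 128})
print(reply)
```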
src/md_utils.py

Lines changed: 33 additions & 4 deletions
@@ -1,5 +1,9 @@
+ import logging
import re
import pandoc
+
+ logger = logging.getLogger(__name__)
+
def fix_lines(base_md):
    """Doubles newlines outside of code blocks to fix formatting issue from model training code.
@@ -16,13 +20,38 @@ def fix_lines(base_md):
        if i % 2 == 0:
            sec = replace_newline_with_br(sec)
        fixed_sections.append(sec)
-     return "```".join(fixed_sections)
+
+     updated_md = "```".join(fixed_sections)
+     logger.debug(f"Original markdown : {base_md}")
+     logger.debug(f"Updated markdown : {updated_md}")
+     return updated_md

- def replace_newline_with_br(text):
+
+ # TODO : Simplify this function
+ # Alternately train the model to output breaks on it's own
+ def identify_break_points(text):
    replace_spots = []
-     for i, char in enumerate(text.strip()):
-         if char == "\n" and (i > 0 and text[i-1] != "\n") and (i < len(text) - 1 and text[i+1] != "\n"):
+     line_so_far = ""
+     skippable = False
+     for i, char in enumerate(text):
+         if char == "\n" and \
+            (i > 0 and text[i-1] != "\n") and \
+            (i < len(text) - 1 and text[i+1] != "\n") and \
+            "|" not in line_so_far and \
+            not skippable:
            replace_spots.append(i)
+         if char != "\n":
+             line_so_far += char
+             stripped = line_so_far.strip()
+             if len(stripped) > 0 and (not stripped[0].isalpha()):
+                 skippable = True
+         else:
+             line_so_far = ""
+             skippable = False
+     return replace_spots
+
+ def replace_newline_with_br(text):
+     replace_spots = identify_break_points(text)
    replace_spots.reverse()
    for i in replace_spots:
        text = text[:i] + "<br>\n" + text[i+1:]
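To make the new break-point logic concrete, here is a hypothetical demo (not a test from the repo): single newlines inside ordinary prose get flagged for a <br>, while table rows (any line containing "|") and lines that do not start with a letter, such as list items, are skipped.

```python
# Hypothetical demo of the updated md_utils helpers; not part of the commit.
from md_utils import identify_break_points, replace_newline_with_br

text = "First line\nSecond line\n\n| a | b |\n| 1 | 2 |\n- bullet item\nstill the bullet\n"

print(identify_break_points(text))    # indices of the single newlines eligible for <br>
print(replace_newline_with_br(text))  # same text with those newlines turned into "<br>\n"
```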

0 commit comments