163 changes: 163 additions & 0 deletions json-generation/README.md
@@ -0,0 +1,163 @@
This is an implementation of a JSON mode for small LLMs, combining Hermes 2 Pro, a fine-tuned Mistral 7B, with Jsonformer.

Hermes 2 Pro is fine-tuned from Mistral's 7B-v0.1 model on a newly developed Function Calling and JSON Mode dataset from Nous Research, so it performs better on both function calling and general structured-data tasks. We chose Hermes 2 Pro over the base Mistral 7B for its strong performance on structured JSON output: it scores 84% on an evaluation created in partnership with Fireworks.AI. More information about the model and its development can be found on its Hugging Face card: https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B

To further mitigate the risk of hallucination, we use the open-source library Jsonformer (https://github.com/1rgs/jsonformer/?tab=readme-ov-file). Jsonformer is a wrapper around Hugging Face models that fills in the _fixed_ tokens during generation, delegating only the content tokens to the language model. As a result, the generated JSON is always syntactically correct, since the model never has an opportunity to hallucinate structure, and generation is efficient, because only the content tokens need to be generated rather than an entire JSON string. By wrapping Hermes with Jsonformer, we aim to rule out malformed or invalid JSON while improving performance and speed on content-token generation.
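
At its core the integration is small. Here is a minimal sketch of wrapping a Hugging Face model with Jsonformer (the schema is a toy placeholder; the actual integration lives in `model/model.py` below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_id = "NousResearch/Hermes-2-Pro-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy schema for illustration
schema = {
    "type": "object",
    "properties": {
        "make": {"type": "string"},
        "year": {"type": "number"},
    },
}

# Jsonformer emits the fixed structural tokens ({, ", :, ...) itself and only
# asks the model to generate the value tokens, so the result is always valid.
jsonformer = Jsonformer(model, tokenizer, schema, "Generate an example car")
result = jsonformer()  # returns a Python dict conforming to the schema
print(result)
```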

The modifications I made to the Model class are the addition of a `schema` parameter, which lets the user specify the desired JSON schema for generation, and a `latency_metrics` dictionary that records latency-related metrics for the model: prefill time, time to first token, time per output token, and total generation time.
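
As a usage sketch (hypothetical wiring; the real entry points are defined in `model/model.py` below, and `car_schema` stands for the schema shown later in this README):

```python
# Hypothetical caller-side usage of the modified Model class.
model = Model(secrets={"hf_access_token": "hf_..."})  # placeholder token
model.load()

# preprocess() merges caller-supplied generation arguments with defaults
request = model.preprocess(
    {"messages": [{"role": "user", "content": "Generate an example car"}]}
)

# predict() takes the desired JSON schema and returns a dict matching it
output = model.predict(schema=car_schema, request=request)

# the new latency_metrics dictionary is populated during predict()
metrics = model.get_latency_metrics()
print(metrics["prefill_time"], metrics["time_to_first_token"])
```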

Although the model currently uses an LLM fine-tuned for constrained decoding, wrapping the model in Jsonformer makes it possible to swap in other models for domain-specific tasks (e.g., a JSON of medical information). As such, it should be quite easy to generalize, with the default model selected to optimize performance across a broad set of domains.
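
For instance, here is a rough sketch of what such a domain swap could look like; the checkpoint name and schema below are purely hypothetical placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

# Hypothetical domain-tuned checkpoint; Jsonformer only needs a Hugging Face
# causal LM and its tokenizer, so any compatible model can be dropped in.
model_id = "example-org/medical-json-7b"  # placeholder, not a real model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

medical_schema = {
    "type": "object",
    "properties": {
        "diagnosis": {"type": "string"},
        "medications": {"type": "array", "items": {"type": "string"}},
    },
}

record = Jsonformer(model, tokenizer, medical_schema,
                    "Generate an example medical record")()
print(record)
```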

A preliminary assessment of this model against the baseline model, Mistral-7B-v0.1, showed immensely promising results. Given the following schema,
```json
{
  "type": "object",
  "properties": {
    "car": {
      "type": "object",
      "properties": {
        "make": {"type": "string"},
        "model": {"type": "string"},
        "year": {"type": "number"},
        "colors": {
          "type": "array",
          "items": {"type": "string"}
        },
        "features": {
          "type": "object",
          "properties": {
            "audio": {
              "type": "object",
              "properties": {
                "brand": {"type": "string"},
                "speakers": {"type": "number"},
                "hasBluetooth": {"type": "boolean"}
              }
            },
            "safety": {
              "type": "object",
              "properties": {
                "airbags": {"type": "number"},
                "parkingSensors": {"type": "boolean"},
                "laneAssist": {"type": "boolean"}
              }
            },
            "performance": {
              "type": "object",
              "properties": {
                "engine": {"type": "string"},
                "horsepower": {"type": "number"},
                "topSpeed": {"type": "number"}
              }
            }
          }
        }
      }
    },
    "owner": {
      "type": "object",
      "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {"type": "number"}
      }
    }
  }
}
```
the models were asked to generate an example car. Hermes 2 Pro wrapped in Jsonformer successfully generated an example in **1min 4s ± 267 ms per loop** (mean ± std. dev. of 7 runs, 1 loop each):
```json
{
  "car": {
    "make": "Toyota",
    "model": "Corolla",
    "year": 2020.5,
    "colors": [
      "white",
      "silver",
      "gray",
      "blue",
      "black",
      "red",
      "green",
      "yellow",
      "orange",
      "purple"
    ],
    "features": {
      "audio": {
        "brand": "JBL",
        "speakers": 12.123,
        "hasBluetooth": true
      },
      "safety": {
        "airbags": 7.8989,
        "parkingSensors": true,
        "laneAssist": true
      },
      "performance": {
        "engine": "4-Cylinder Turbocharged E",
        "horsepower": 184.42,
        "topSpeed": 145.02
      }
    }
  },
  "owner": {
    "firstName": "John",
    "lastName": "Doe",
    "age": 38.456
  }
}
```

Mistral, on the other hand, failed to generate a valid example (it produced a fabricated accident report instead) and took **3min 18s ± 75.3 ms per loop** (mean ± std. dev. of 7 runs, 1 loop each):

```
Car Accident Report

Date: [Insert Date]
Time: [Insert Time]
Location: [Insert Address]

Driver 1:
Name: [Insert Name]
Age: [Insert Age]
Gender: [Insert Gender]
Address: [Insert Address]
Phone: [Insert Phone Number]

Driver 2:
Name: [Insert Name]
Age: [Insert Age]
Gender: [Insert Gender]
Address: [Insert Address]
Phone: [Insert Phone Number]

Vehicle 1:
Make: [Insert Make]
Model: [Insert Model]
Year: [Insert Year]
Color: [Insert Color]
License Plate Number: [Insert License Plate Number]

Vehicle 2:
Make: [Insert Make]
Model: [Insert Model]
Year: [Insert Year]
Color: [Insert Color]
License Plate Number: [Insert License Plate Number]

Accident Summary:

On [Insert Date] at [Insert Time], a car accident occurred at [Insert Address]. The accident involved two vehicles, a [Insert Make] [Insert Model] [Insert Year] [Insert Color] with license plate number [Insert License Plate Number], driven by [Insert Name], and a [Insert Make] [Insert Model] [Insert Year] [Insert Color] with license plate number [Insert License Plate Number], driven by [Insert Name].

The accident occurred when Driver 1, who was traveling northbound on [Insert Road], failed to yield the right of way to Driver 2, who was traveling eastbound on [Insert Road]. The two vehicles collided at the intersection of [Insert Road] and [Insert Road], causing damage to both vehicles.

There were no injuries reported as a result of the accident.

Witnesses to the accident include [Insert Witness 1 Name], [Insert Witness 2 Name], and [Insert Witness 3 Name].

The investigation into the accident is ongoing.
```

This model is both more accurate and more efficient than its base. The gains come from two sources: the fine-tuning, which lets the model handle and understand JSON more effectively, and Jsonformer's constrained decoding, which enforces a separation of concerns between schema structure and generated content.
36 changes: 36 additions & 0 deletions json-generation/config.yaml
@@ -0,0 +1,36 @@
description: Mistral 7B, optimized for chat! Compatible with OpenAI Client
environment_variables: {}
external_package_dirs: []
model_cache:
- allow_patterns:
  - '*.json'
  - '*.safetensors'
  - '*.model'
  repo_id: NousResearch/Hermes-2-Pro-Mistral-7B
model_metadata:
  example_model_input:
    messages:
    - content: What is the mistral wind?
      role: user
    model: NousResearch/Hermes-2-Pro-Mistral-7B
  repo_id: NousResearch/Hermes-2-Pro-Mistral-7B
  pretty_name: Hermes 2 Pro - Mistral 7B
  tags:
  - text-generation
  - openai-compatible
model_name: Hermes 2 Pro - Mistral 7B
python_version: py311
requirements:
- accelerate
- transformers
- torch
- sentencepiece
- protobuf
- jsonformer
resources:
  accelerator: A10G
  memory: 25Gi
  use_gpu: true
secrets:
  hf_access_token: "ENTER HF ACCESS TOKEN HERE"
system_packages: []
109 changes: 109 additions & 0 deletions json-generation/model/model.py
@@ -0,0 +1,109 @@
import time
from threading import Thread

import torch
from transformers import GenerationConfig, TextIteratorStreamer, pipeline
from jsonformer.main import Jsonformer

class Model:
    def __init__(self, **kwargs):
        self._repo_id = "NousResearch/Hermes-2-Pro-Mistral-7B"
        self._hf_access_token = kwargs["secrets"]["hf_access_token"]
        self._latency_metrics = dict()

    def get_latency_metrics(self):
        return self._latency_metrics

    def load(self):
        self._model = pipeline(
            "text-generation",
            model=self._repo_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            token=self._hf_access_token,
        )


    def preprocess(self, request: dict):
        # Default generation arguments; any key present in the request
        # overrides the corresponding default.
        generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._model.tokenizer.eos_token_id,
            "pad_token_id": self._model.tokenizer.pad_token_id,
            "return_full_text": False,
        }

        request["generate_args"] = {
            k: request.get(k, generate_args[k]) for k in generate_args.keys()
        }

        return request

    def stream(self, text_inputs: list, generation_args: dict):
        streamer = TextIteratorStreamer(self._model.tokenizer)
        generation_config = GenerationConfig(**generation_args)
        generation_kwargs = {
            "text_inputs": text_inputs,
            "generation_config": generation_config,
            "return_dict_in_generate": True,
            "output_scores": True,
            "max_new_tokens": generation_args["max_new_tokens"],
            "streamer": streamer,
        }

        with torch.no_grad():
            # Begin generation in a separate thread
            thread = Thread(target=self._model, kwargs=generation_kwargs)
            thread.start()

            # Yield generated text as it becomes available
            def inner():
                for text in streamer:
                    yield text
                thread.join()

            return inner()

    def predict(
        self,
        schema: dict,
        request: dict,
        prompt: str = "Generate an example for the provided schema",
    ):
        start_time = time.time()

        stream = request.pop("stream", False)  # reserved for the stream() path
        messages = request.pop("messages")
        # Jsonformer drives decoding itself, so these defaults go unused here
        generation_args = request.pop("generate_args")

        # Measure prefill time by applying the chat template to the messages.
        # (The template arguments were elided in the original; tokenize and
        # add_generation_prompt are assumed here.)
        prefill_start = time.time()
        model_inputs = self._model.tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True
        )
        prefill_time = time.time() - prefill_start

        # Jsonformer expects the underlying model, not the pipeline wrapper
        jsonformer = Jsonformer(
            model=self._model.model,
            tokenizer=self._model.tokenizer,
            json_schema=schema,
            prompt=prompt,
        )

        generation_start = time.time()
        output = jsonformer()  # returns a Python dict matching the schema
        # jsonformer() blocks until generation finishes, so this records the
        # full generation time as a proxy for time to first token
        first_token_time = time.time() - generation_start

        # Rough token count: whitespace-split words of the serialized output
        total_tokens = len(str(output).split())
        total_time = time.time() - start_time
        tpot = (total_time - first_token_time) / total_tokens if total_tokens > 0 else 0

        self._latency_metrics = {
            "prefill_time": prefill_time,
            "time_to_first_token": first_token_time,
            "time_per_output_token": tpot,
            "total_generation_time": total_time,
        }

        if len(output) > 0:
            return output

        raise Exception("No results returned from model")