163 changes: 163 additions & 0 deletions json-generation/README.md
@@ -0,0 +1,163 @@
This is an implementation of a JSON mode for small LLMs, combining Hermes 2 Pro, a fine-tuned Mistral 7B, with Jsonformer.

Hermes 2 Pro is fine-tuned from Mistral's 7B-v0.1 model on a newly developed Function Calling and JSON Mode dataset from Nous Research, so it performs better on both function calling and general structured-data tasks. We chose Hermes 2 Pro over the base Mistral 7B for its strong performance on structured JSON output: it scores 84% on an evaluation created in partnership with Fireworks.AI. More information about the model and its development can be found on its Hugging Face card: https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B

To further mitigate the risk of hallucination, we use the open-source library Jsonformer (https://github.com/1rgs/jsonformer/?tab=readme-ov-file). Jsonformer is a wrapper around Hugging Face models that fills in the _fixed_ tokens during generation, delegating only the content tokens to the language model. As a result, the generated JSON is always syntactically correct, since the model never has an opportunity to hallucinate structure, and generation is efficient, because only the content tokens need to be generated rather than an entire JSON string. By wrapping Hermes with Jsonformer, we aim to rule out malformed or invalid JSON while improving performance and speed on content-token generation.
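
At its core the integration is small. Here is a minimal sketch of wrapping a Hugging Face model with Jsonformer (the schema is a toy placeholder; the actual integration lives in `model/model.py` below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_id = "NousResearch/Hermes-2-Pro-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy schema for illustration
schema = {
    "type": "object",
    "properties": {
        "make": {"type": "string"},
        "year": {"type": "number"},
    },
}

# Jsonformer emits the fixed structural tokens ({, ", :, ...) itself and only
# asks the model to generate the value tokens, so the result is always valid.
jsonformer = Jsonformer(model, tokenizer, schema, "Generate an example car")
result = jsonformer()  # returns a Python dict conforming to the schema
print(result)
```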

The modifications I made to the Model class are the addition of a `schema` parameter, which lets the user specify the desired JSON schema for generation, and a `latency_metrics` dictionary that records latency-related metrics for the model: prefill time, time to first token, time per output token, and total generation time.
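
As a usage sketch (hypothetical wiring; the real entry points are defined in `model/model.py` below, and `car_schema` stands for the schema shown later in this README):

```python
# Hypothetical caller-side usage of the modified Model class.
model = Model(secrets={"hf_access_token": "hf_..."})  # placeholder token
model.load()

# preprocess() merges caller-supplied generation arguments with defaults
request = model.preprocess(
    {"messages": [{"role": "user", "content": "Generate an example car"}]}
)

# predict() takes the desired JSON schema and returns a dict matching it
output = model.predict(schema=car_schema, request=request)

# the new latency_metrics dictionary is populated during predict()
metrics = model.get_latency_metrics()
print(metrics["prefill_time"], metrics["time_to_first_token"])
```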

Although the model currently uses an LLM fine-tuned for constrained decoding, wrapping the model in Jsonformer makes it possible to swap in other models for domain-specific tasks (e.g., a JSON of medical information). As such, it should be quite easy to generalize, with the default model selected to optimize performance across a broad set of domains.
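
For instance, here is a rough sketch of what such a domain swap could look like; the checkpoint name and schema below are purely hypothetical placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

# Hypothetical domain-tuned checkpoint; Jsonformer only needs a Hugging Face
# causal LM and its tokenizer, so any compatible model can be dropped in.
model_id = "example-org/medical-json-7b"  # placeholder, not a real model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

medical_schema = {
    "type": "object",
    "properties": {
        "diagnosis": {"type": "string"},
        "medications": {"type": "array", "items": {"type": "string"}},
    },
}

record = Jsonformer(model, tokenizer, medical_schema,
                    "Generate an example medical record")()
print(record)
```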

A preliminary assessment of this model against the baseline model, Mistral-7B-v0.1, showed immensely promising results. Given the following schema,
```json
{
  "type": "object",
  "properties": {
    "car": {
      "type": "object",
      "properties": {
        "make": {"type": "string"},
        "model": {"type": "string"},
        "year": {"type": "number"},
        "colors": {
          "type": "array",
          "items": {"type": "string"}
        },
        "features": {
          "type": "object",
          "properties": {
            "audio": {
              "type": "object",
              "properties": {
                "brand": {"type": "string"},
                "speakers": {"type": "number"},
                "hasBluetooth": {"type": "boolean"}
              }
            },
            "safety": {
              "type": "object",
              "properties": {
                "airbags": {"type": "number"},
                "parkingSensors": {"type": "boolean"},
                "laneAssist": {"type": "boolean"}
              }
            },
            "performance": {
              "type": "object",
              "properties": {
                "engine": {"type": "string"},
                "horsepower": {"type": "number"},
                "topSpeed": {"type": "number"}
              }
            }
          }
        }
      }
    },
    "owner": {
      "type": "object",
      "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {"type": "number"}
      }
    }
  }
}
```
the models were asked to generate an example car. Hermes 2 Pro wrapped in Jsonformer successfully generated an example in **1min 4s ± 267 ms per loop** (mean ± std. dev. of 7 runs, 1 loop each):
```json
{
  "car": {
    "make": "Toyota",
    "model": "Corolla",
    "year": 2020.5,
    "colors": [
      "white",
      "silver",
      "gray",
      "blue",
      "black",
      "red",
      "green",
      "yellow",
      "orange",
      "purple"
    ],
    "features": {
      "audio": {
        "brand": "JBL",
        "speakers": 12.123,
        "hasBluetooth": true
      },
      "safety": {
        "airbags": 7.8989,
        "parkingSensors": true,
        "laneAssist": true
      },
      "performance": {
        "engine": "4-Cylinder Turbocharged E",
        "horsepower": 184.42,
        "topSpeed": 145.02
      }
    }
  },
  "owner": {
    "firstName": "John",
    "lastName": "Doe",
    "age": 38.456
  }
}
```

Mistral, on the other hand, failed to generate a valid example (it produced a fabricated accident report instead) and took **3min 18s ± 75.3 ms per loop** (mean ± std. dev. of 7 runs, 1 loop each):

```
Car Accident Report

Date: [Insert Date]
Time: [Insert Time]
Location: [Insert Address]

Driver 1:
Name: [Insert Name]
Age: [Insert Age]
Gender: [Insert Gender]
Address: [Insert Address]
Phone: [Insert Phone Number]

Driver 2:
Name: [Insert Name]
Age: [Insert Age]
Gender: [Insert Gender]
Address: [Insert Address]
Phone: [Insert Phone Number]

Vehicle 1:
Make: [Insert Make]
Model: [Insert Model]
Year: [Insert Year]
Color: [Insert Color]
License Plate Number: [Insert License Plate Number]

Vehicle 2:
Make: [Insert Make]
Model: [Insert Model]
Year: [Insert Year]
Color: [Insert Color]
License Plate Number: [Insert License Plate Number]

Accident Summary:

On [Insert Date] at [Insert Time], a car accident occurred at [Insert Address]. The accident involved two vehicles, a [Insert Make] [Insert Model] [Insert Year] [Insert Color] with license plate number [Insert License Plate Number], driven by [Insert Name], and a [Insert Make] [Insert Model] [Insert Year] [Insert Color] with license plate number [Insert License Plate Number], driven by [Insert Name].

The accident occurred when Driver 1, who was traveling northbound on [Insert Road], failed to yield the right of way to Driver 2, who was traveling eastbound on [Insert Road]. The two vehicles collided at the intersection of [Insert Road] and [Insert Road], causing damage to both vehicles.

There were no injuries reported as a result of the accident.

Witnesses to the accident include [Insert Witness 1 Name], [Insert Witness 2 Name], and [Insert Witness 3 Name].

The investigation into the accident is ongoing.
```

This model is both more accurate and more efficient than its base. The gains come from two sources: the fine-tuning, which lets the model handle and understand JSON more effectively, and Jsonformer's constrained decoding, which enforces a separation of concerns between schema structure and generated content.
36 changes: 36 additions & 0 deletions json-generation/config.yaml
@@ -0,0 +1,36 @@
description: Mistral 7B, optimized for chat! Compatible with OpenAI Client
environment_variables: {}
external_package_dirs: []
model_cache:
- allow_patterns:
  - '*.json'
  - '*.safetensors'
  - '*.model'
  repo_id: NousResearch/Hermes-2-Pro-Mistral-7B
model_metadata:
  example_model_input:
    messages:
    - content: What is the mistral wind?
      role: user
    model: NousResearch/Hermes-2-Pro-Mistral-7B
  repo_id: NousResearch/Hermes-2-Pro-Mistral-7B
  pretty_name: Hermes 2 Pro - Mistral 7B
  tags:
  - text-generation
  - openai-compatible
model_name: Hermes 2 Pro - Mistral 7B
python_version: py311
requirements:
- accelerate
- transformers
- torch
- sentencepiece
- protobuf
- jsonformer
resources:
  accelerator: A10G
  memory: 25Gi
  use_gpu: true
secrets:
  hf_access_token: "ENTER HF ACCESS TOKEN HERE"
system_packages: []
109 changes: 109 additions & 0 deletions json-generation/model/model.py
@@ -0,0 +1,109 @@
import time
from threading import Thread

import torch
from transformers import GenerationConfig, TextIteratorStreamer, pipeline
from jsonformer.main import Jsonformer

class Model:
    def __init__(self, **kwargs):
        self._repo_id = "NousResearch/Hermes-2-Pro-Mistral-7B"
        self._hf_access_token = kwargs["secrets"]["hf_access_token"]
        self._latency_metrics = dict()

    def get_latency_metrics(self):
        return self._latency_metrics

    def load(self):
        self._model = pipeline(
            "text-generation",
            model=self._repo_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            token=self._hf_access_token,
        )


    def preprocess(self, request: dict):
        # Default generation arguments; any key present in the request
        # overrides the corresponding default.
        generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._model.tokenizer.eos_token_id,
            "pad_token_id": self._model.tokenizer.pad_token_id,
            "return_full_text": False,
        }

        request["generate_args"] = {
            k: request.get(k, generate_args[k]) for k in generate_args.keys()
        }

        return request

    def stream(self, text_inputs: list, generation_args: dict):
        streamer = TextIteratorStreamer(self._model.tokenizer)
        generation_config = GenerationConfig(**generation_args)
        generation_kwargs = {
            "text_inputs": text_inputs,
            "generation_config": generation_config,
            "return_dict_in_generate": True,
            "output_scores": True,
            "max_new_tokens": generation_args["max_new_tokens"],
            "streamer": streamer,
        }

        with torch.no_grad():
            # Begin generation in a separate thread
            thread = Thread(target=self._model, kwargs=generation_kwargs)
            thread.start()

            # Yield generated text as it becomes available
            def inner():
                for text in streamer:
                    yield text
                thread.join()

            return inner()

    def predict(
        self,
        schema: dict,
        request: dict,
        prompt: str = "Generate an example for the provided schema",
    ):
        start_time = time.time()

        stream = request.pop("stream", False)  # reserved for the stream() path
        messages = request.pop("messages")
        # Jsonformer drives decoding itself, so these defaults go unused here
        generation_args = request.pop("generate_args")

        # Measure prefill time by applying the chat template to the messages.
        # (The template arguments were elided in the original; tokenize and
        # add_generation_prompt are assumed here.)
        prefill_start = time.time()
        model_inputs = self._model.tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True
        )
        prefill_time = time.time() - prefill_start

        # Jsonformer expects the underlying model, not the pipeline wrapper
        jsonformer = Jsonformer(
            model=self._model.model,
            tokenizer=self._model.tokenizer,
            json_schema=schema,
            prompt=prompt,
        )

        generation_start = time.time()
        output = jsonformer()  # returns a Python dict matching the schema
        # jsonformer() blocks until generation finishes, so this records the
        # full generation time as a proxy for time to first token
        first_token_time = time.time() - generation_start

        # Rough token count: whitespace-split words of the serialized output
        total_tokens = len(str(output).split())
        total_time = time.time() - start_time
        tpot = (total_time - first_token_time) / total_tokens if total_tokens > 0 else 0

        self._latency_metrics = {
            "prefill_time": prefill_time,
            "time_to_first_token": first_token_time,
            "time_per_output_token": tpot,
            "total_generation_time": total_time,
        }

        if len(output) > 0:
            return output

        raise Exception("No results returned from model")