Configuring Custom Models
Warning
Here is a good example of a bad model. (Something's wrong.)
We will now walk through the configuration of a downloaded model; this is required for it to (possibly) work.
Models found on Huggingface or anywhere else are "unsupported", so you should follow this guide before asking for help.
Whether you "Sideload" or "Download" a custom model, you must configure it to work properly.
- We will refer to a "Download" as any model that you find using the "Add Models" feature.
- A "Sideload" is any model you get somewhere else and then put in the models directory.
In this example, we use the "Search" feature of GPT4All.

Typing the name of a custom model will search HuggingFace and return results.
- A custom model is one that is not provided in the default models list by GPT4All.
- Any time you use the "search" feature, you will get a list of custom models.

Click "More info can be found HERE.", which in this example brings you to huggingface.

Here, you find the information that you need to configure the model. (This model may be outdated, it may have been a failed experiment, it may not yet be compatible with GPT4All, it may be dangerous, it may also be GREAT!)
- You need to know the Prompt Template.
- You need to know the maximum context (128k in this example).
- You need to know whether there are known problems. Check the model's Community tab and look.

Maybe this won't affect you, but it's a good place to find out.
So next, let's find that template... Hopefully the model authors were kind and included it.

This could be a good, helpful template. Hopefully it works. Keep in mind:
- The model authors may not have tested their own model.
- The model authors may not have bothered to change their model's configuration files from fine-tuning to inference workflows.
- Even if they show you a template, it may be wrong.
- Each model has its own tokens and its own syntax.
- Models are trained with these tokens, and you must use them for the model to work.
- The model uploader may not understand this either and can end up providing a broken model or a mismatched template.
Apart from the model card, there are three files that could hold relevant information for running the model.
- config.json
- tokenizer_config.json
- generation_config.json
Check config.json to find the capabilities of the model, such as the maximum context length.
Check tokenizer_config.json to find the original chat template; this is especially useful if the model author failed to provide one in the model card.
Check all three files if you want to quantize the model yourself; you need to cross-check whether the model uses the proper beginning-of-string (bos) and end-of-string (eos) tokens.
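If you prefer to check these files with a script, here is a minimal sketch. The key names (such as max_position_embeddings and chat_template) are common Hugging Face conventions but not guaranteed for every model, and the directory path is only an example:

```python
import json
from pathlib import Path

# Folder you downloaded from Hugging Face (illustrative path; adjust to your model).
model_dir = Path("~/models/Phi-3-mini-128k-instruct").expanduser()

def load(name):
    """Read one of the config files if it exists, otherwise return an empty dict."""
    path = model_dir / name
    return json.loads(path.read_text()) if path.exists() else {}

config = load("config.json")
tokenizer_config = load("tokenizer_config.json")
generation_config = load("generation_config.json")

# Maximum context the model was trained for ("max_position_embeddings" is a
# common key name, but not every architecture uses it).
print("max context:", config.get("max_position_embeddings"))

# Original chat template, if the author shipped one.
print("chat template:", tokenizer_config.get("chat_template"))

# bos/eos tokens -- cross-check these if you plan to quantize the model yourself.
print("bos:", tokenizer_config.get("bos_token"), "/", config.get("bos_token_id"))
print("eos:", tokenizer_config.get("eos_token"), "/", config.get("eos_token_id"))
print("generation eos id(s):", generation_config.get("eos_token_id"))
```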

Important
Chat templates must be followed on a per-model basis. Every model is different.
You can imagine them to be like magic spells.
Your magic won't work if you say the wrong word. It won't work if you say it at the wrong place or time.
At this step, we need to combine the chat template that we found in the model card (or in the tokenizer_config.json) with a special syntax that is compatible with the GPT4All-Chat application (The format shown in the above screenshot is only an example).
Special tokens like <|user|> tell the model that the user is about to talk; <|end|> tells the LLM that this turn is done and it should continue on.
- We use %1 as a placeholder for the content of the user's prompt.
- We use %2 as a placeholder for the content of the model's response.
That example prompt, which should (in theory) be compatible with GPT4All, will look like this for you:
<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
%1<|end|>
<|assistant|>
%2<|end|>
You can see how the template will inject the stuff you type where the %1 goes. You can stuff something fun in there if you want to... (Now the chat knows my name!)
<|user|>
3Simplex:%1<|end|>
<|assistant|>
%2<|end|>
The system prompt will define the behavior of the model when you chat. You can say "Talk like a pirate, and be sure to keep your bird quiet!"
The prompt template will tell the model what is happening and when.
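If you have Python and the transformers library installed, one way to sanity-check the template you wrote is to let the model's own tokenizer render a test conversation and compare the special tokens it emits with your GPT4All template. The model id below is only an example; substitute the one you downloaded:

```python
from transformers import AutoTokenizer

# Illustrative model id; use the repository of the model you downloaded.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]

# Renders the conversation with the chat_template from tokenizer_config.json,
# including the model's own special tokens such as <|user|> and <|end|>.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```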

The default settings are a good, safe place to start and provide good output for most models. For instance, you can't blow up your RAM with only 2048 context, and you can always increase it to whatever the model supports.
This is the maximum context that you will use with the model. Context is roughly the sum of the tokens in the system prompt + chat template + user prompts + model responses + tokens that were added to the model's context via retrieval-augmented generation (RAG), which is the LocalDocs feature. You need to keep the context length within two safe margins:
- Your system can only use so much memory. Using more than you have will cause severe slowdowns or even crashes.
- Your model is only capable of what it was trained for. Using more than that will give trash answers and gibberish.

Since we are talking about computer terminology here, 1k = 1024, not 1000. So the 128k advertised by the phi3 model translates to 1024 x 128 = 131072.
I will use 4096, which is 4k, for the response. I like allowing for a great response but want to stop the model at that point. (Maybe you want it longer? Try 8192.)
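As a rough sanity check on these numbers, the arithmetic looks like this. This is only a sketch; the variable names and the chosen values are illustrative, and real token counts depend on your prompts:

```python
# What phi3 advertises as "128k" context, in tokens.
model_max_context = 128 * 1024        # 131072

context_length = 8192                 # the Context Length you set in GPT4All
max_response   = 4096                 # the Max Length of a single response

# Stay within what the model was trained for...
assert context_length <= model_max_context, "model was not trained for this much context"
# ...and leave room for the system prompt, chat template, history and LocalDocs (RAG).
assert max_response < context_length, "no room left for your prompts"

print("tokens left for prompts, history and RAG:",
      context_length - max_response)
```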
This is one that you need to think about if you have a small GPU or a big model.

This will be set to load all layers on the GPU. You may need to use fewer to get the model to work for you.
These settings are model independent. They are only for the GPT4All environment. You can play with them all you like.

The rest of these are special settings that take more training and experience to learn. They don't need to be changed most of the time.
You should now have a fully configured model. I hope it works for you!
More Advanced Topics:
- The model is now configured but still doesn't work.
- Explain how the tokens work in the templates.