Discussion around design choices for mlx_lm/server.py
Model-Specific Requirements
Some model families require special configuration to work properly. To properly support Dynamic Model Loading, we need a way to inject extra parameters for models with special requirements, for example a flag such as trust_remote_code or a custom eos_token for Qwen.
There should be logic in ModelProvider.load() that maps a model to the custom parameters it needs, so they can be injected when the model is loaded dynamically.
Environment Variables
| Variable | Example / Description |
| --- | --- |
| MLX_MODEL | mlx-community/Qwen2.5-Coder-32B-Instruct-8bit |
| TRUST_REMOTE_CODE | True |
| EOS_TOKEN | "eos_token" |
| EXTRA_ARGS | Additional command-line arguments |
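For illustration only, a minimal sketch of how TRUST_REMOTE_CODE and EOS_TOKEN could be translated into tokenizer parameters; the env_tokenizer_config helper below is an assumption, not existing mlx_lm code:

```python
import os


def env_tokenizer_config():
    """Hypothetical helper: map the environment variables above to tokenizer parameters."""
    config = {}
    if os.environ.get("TRUST_REMOTE_CODE", "").lower() in ("1", "true", "yes"):
        config["trust_remote_code"] = True
    if os.environ.get("EOS_TOKEN"):
        config["eos_token"] = os.environ["EOS_TOKEN"]
    return config
```

The resulting parameters could then be merged into the tokenizer_config built in ModelProvider.load(), as in the pseudocode below.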
Implementation Pseudo Code
```python
# model_config_mapping.py
"""Configuration for specific model families to apply appropriate tokenizer settings."""

# Dictionary of regex patterns matching model names to required tokenizer parameters
MODEL_FAMILY_CONFIGS = {
    # Qwen models require trust_remote_code and a specific eos_token
    r"(qwen|Qwen)": {
        "trust_remote_code": True,
        "eos_token": "<|endoftext|>",
    },
    # Plamo models require trust_remote_code
    r"plamo": {
        "trust_remote_code": True,
    },
    # Internlm models
    r"internlm": {
        "trust_remote_code": True,
    },
    # Yi models
    r"yi": {
        "trust_remote_code": True,
    },
    # Add more model families as needed
}
```
```python
# server.py - ModelProvider
def load(self, model_path, adapter_path=None, draft_model_path=None, tokenizer_params=None):
    if self.model_key == (model_path, adapter_path, draft_model_path):
        return self.model, self.tokenizer

    # Remove the old model if it exists.
    self.model = None
    self.tokenizer = None
    self.model_key = None
    self.draft_model = None

    # Start with base tokenizer_config from CLI args
    tokenizer_config = {
        "trust_remote_code": True if self.cli_args.trust_remote_code else None
    }
    if self.cli_args.chat_template:
        tokenizer_config["chat_template"] = self.cli_args.chat_template

    # Determine the actual model name to check for model-specific configs
    actual_model_name = self.cli_args.model if model_path == "default_model" else model_path

    # Apply model-specific config based on model name pattern
    if actual_model_name:
        model_specific_config = get_model_specific_config(actual_model_name)
        for key, value in model_specific_config.items():
            tokenizer_config[key] = value

    # Allow request-specific params to override defaults (highest priority)
    if tokenizer_params:
        tokenizer_config.update(tokenizer_params)

    # Clean up None values
    tokenizer_config = {k: v for k, v in tokenizer_config.items() if v is not None}
    ...
```
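The pseudocode above calls a get_model_specific_config() helper that is not defined anywhere. A minimal sketch of one possible implementation, assuming it lives in model_config_mapping.py and merges the parameters of every regex pattern that matches the model name:

```python
# model_config_mapping.py (continued) - hypothetical helper used by ModelProvider.load()
import re


def get_model_specific_config(model_name: str) -> dict:
    """Merge the tokenizer parameters of every model family whose pattern matches."""
    config = {}
    for pattern, params in MODEL_FAMILY_CONFIGS.items():
        if re.search(pattern, model_name):
            config.update(params)
    return config
```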
Draft model
Using a draft model for speculative decoding to improve efficiency and performance is a good idea, but it may be less useful on less powerful devices such as laptops.
Option 1: Run Draft and Main model in parallel.
Pros: Improves performance and efficiency and reduces the load on the main model, on powerful systems.
Cons: Requires roughly 30% more memory on average, which could lead to OOM on most systems.
Option 2: Run Sequentially - Load/Unload Draft and Main model.
Pros: Memory friendly
Cons: Defeats the purpose of the initial objective; high memory churn and slow performance.
Suggestion: Make the draft model optional and disabled by default. Implement resource tracking to ensure the system has enough resources before attempting to enable it, as sketched below.
```python
import argparse
import logging

import psutil


def get_available_memory_gb():
    return psutil.virtual_memory().available / (1024 * 1024 * 1024)


# server.py - ModelProvider.load with a memory check before enabling the draft model
def load(self, model_path, adapter_path=None, draft_model_path=None):
    # Check available memory before loading
    available_memory = get_available_memory_gb()
    estimated_model_size = 5  # Example: 5GB per model, adjust based on actual model size

    # Add warning about resource usage
    if draft_model_path and (available_memory < (estimated_model_size * 2)):
        logging.warning(
            f"Insufficient memory for draft model (available: {available_memory:.1f}GB). "
            "Disabling speculative decoding."
        )
        draft_model_path = None
    elif available_memory < 32:
        logging.warning(
            "Using a draft model may significantly increase memory usage. "
            "Consider disabling it on systems with limited resources."
        )


def main():
    parser = argparse.ArgumentParser(description="MLX Http Server.")
    parser.add_argument(
        "--draft-model",
        type=str,
        help=(
            "A model to be used for speculative decoding. "
            "WARNING: Using a draft model doubles memory usage."
        ),
        default=None,
    )
```
Using Additional Arguments
Use the EXTRA_ARGS environment variable to pass any additional arguments to mlx_lm.server:
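For example, with a hypothetical wrapper script (start_server.py below is an assumption, not part of mlx_lm) that expands EXTRA_ARGS before invoking the server:

```python
# start_server.py - hypothetical launcher that forwards MLX_MODEL and EXTRA_ARGS
# from the environment to mlx_lm.server.
import os
import shlex
import subprocess
import sys

cmd = [sys.executable, "-m", "mlx_lm.server", "--model", os.environ["MLX_MODEL"]]
cmd += shlex.split(os.environ.get("EXTRA_ARGS", ""))  # e.g. EXTRA_ARGS="--port 8081"
subprocess.run(cmd, check=True)
```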