Discussion around design choices for mlx_lm/server.py
Model-Specific Requirements
Some model families require special configuration to work properly. To properly support Dynamic Model Loading, we need a way to inject extra parameters for models with special requirements, for example a flag such as trust_remote_code or a custom eos_token for Qwen.
There should be logic in ModelProvider.load() that maps a model to the custom parameters it needs, so they can be injected when the model is loaded dynamically.
Environment Variables
| Variable | Example / Description |
| --- | --- |
| MLX_MODEL | mlx-community/Qwen2.5-Coder-32B-Instruct-8bit |
| TRUST_REMOTE_CODE | True |
| EOS_TOKEN | "eos_token" |
| EXTRA_ARGS | Additional command-line arguments |
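For illustration only, a minimal sketch of how TRUST_REMOTE_CODE and EOS_TOKEN could be translated into tokenizer parameters; the env_tokenizer_config helper below is an assumption, not existing mlx_lm code:

```python
import os


def env_tokenizer_config():
    """Hypothetical helper: map the environment variables above to tokenizer parameters."""
    config = {}
    if os.environ.get("TRUST_REMOTE_CODE", "").lower() in ("1", "true", "yes"):
        config["trust_remote_code"] = True
    if os.environ.get("EOS_TOKEN"):
        config["eos_token"] = os.environ["EOS_TOKEN"]
    return config
```

The resulting parameters could then be merged into the tokenizer_config built in ModelProvider.load(), as in the pseudocode below.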
Implementation Pseudo Code
```python
# model_config_mapping.py
"""Configuration for specific model families to apply appropriate tokenizer settings."""

# Dictionary of regex patterns matching model names to required tokenizer parameters
MODEL_FAMILY_CONFIGS = {
    # Qwen models require trust_remote_code and a specific eos_token
    r"(qwen|Qwen)": {
        "trust_remote_code": True,
        "eos_token": "<|endoftext|>",
    },
    # Plamo models require trust_remote_code
    r"plamo": {
        "trust_remote_code": True,
    },
    # Internlm models
    r"internlm": {
        "trust_remote_code": True,
    },
    # Yi models
    r"yi": {
        "trust_remote_code": True,
    },
    # Add more model families as needed
}
```
```python
# server.py - ModelProvider
def load(self, model_path, adapter_path=None, draft_model_path=None, tokenizer_params=None):
    if self.model_key == (model_path, adapter_path, draft_model_path):
        return self.model, self.tokenizer

    # Remove the old model if it exists.
    self.model = None
    self.tokenizer = None
    self.model_key = None
    self.draft_model = None

    # Start with base tokenizer_config from CLI args
    tokenizer_config = {
        "trust_remote_code": True if self.cli_args.trust_remote_code else None
    }
    if self.cli_args.chat_template:
        tokenizer_config["chat_template"] = self.cli_args.chat_template

    # Determine the actual model name to check for model-specific configs
    actual_model_name = self.cli_args.model if model_path == "default_model" else model_path

    # Apply model-specific config based on model name pattern
    if actual_model_name:
        model_specific_config = get_model_specific_config(actual_model_name)
        for key, value in model_specific_config.items():
            tokenizer_config[key] = value

    # Allow request-specific params to override defaults (highest priority)
    if tokenizer_params:
        tokenizer_config.update(tokenizer_params)

    # Clean up None values
    tokenizer_config = {k: v for k, v in tokenizer_config.items() if v is not None}
    ...
```
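The pseudocode above calls a get_model_specific_config() helper that is not defined anywhere. A minimal sketch of one possible implementation, assuming it lives in model_config_mapping.py and merges the parameters of every regex pattern that matches the model name:

```python
# model_config_mapping.py (continued) - hypothetical helper used by ModelProvider.load()
import re


def get_model_specific_config(model_name: str) -> dict:
    """Merge the tokenizer parameters of every model family whose pattern matches."""
    config = {}
    for pattern, params in MODEL_FAMILY_CONFIGS.items():
        if re.search(pattern, model_name):
            config.update(params)
    return config
```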
Draft model
Using a draft model for speculative decoding to improve efficiency and performance is a good idea, but it may be less useful on less powerful devices such as laptops.
Option 1: Run Draft and Main model in parallel.
Pros: Improves performance and efficiency and reduces the load on the main model, on powerful systems.
Cons: Requires roughly 30% more memory on average, which could lead to OOM on most systems.
Option 2: Run Sequentially - Load/Unload Draft and Main model.
Pros: Memory friendly
Cons: Defeats the purpose of the initial objective; high memory churn and slow performance.
Suggestion: Make the draft model optional and disabled by default. Implement resource tracking to ensure the system has enough resources before attempting to enable it, as sketched below.
```python
import argparse
import logging

import psutil


def get_available_memory_gb():
    return psutil.virtual_memory().available / (1024 * 1024 * 1024)


# server.py - ModelProvider.load with a memory check before enabling the draft model
def load(self, model_path, adapter_path=None, draft_model_path=None):
    # Check available memory before loading
    available_memory = get_available_memory_gb()
    estimated_model_size = 5  # Example: 5GB per model, adjust based on actual model size

    # Add warning about resource usage
    if draft_model_path and (available_memory < (estimated_model_size * 2)):
        logging.warning(
            f"Insufficient memory for draft model (available: {available_memory:.1f}GB). "
            "Disabling speculative decoding."
        )
        draft_model_path = None
    elif available_memory < 32:
        logging.warning(
            "Using a draft model may significantly increase memory usage. "
            "Consider disabling it on systems with limited resources."
        )


def main():
    parser = argparse.ArgumentParser(description="MLX Http Server.")
    parser.add_argument(
        "--draft-model",
        type=str,
        help=(
            "A model to be used for speculative decoding. "
            "WARNING: Using a draft model doubles memory usage."
        ),
        default=None,
    )
```
Using Additional Arguments
Use the EXTRA_ARGS environment variable to pass any additional arguments to mlx_lm.server:
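For example, with a hypothetical wrapper script (start_server.py below is an assumption, not part of mlx_lm) that expands EXTRA_ARGS before invoking the server:

```python
# start_server.py - hypothetical launcher that forwards MLX_MODEL and EXTRA_ARGS
# from the environment to mlx_lm.server.
import os
import shlex
import subprocess
import sys

cmd = [sys.executable, "-m", "mlx_lm.server", "--model", os.environ["MLX_MODEL"]]
cmd += shlex.split(os.environ.get("EXTRA_ARGS", ""))  # e.g. EXTRA_ARGS="--port 8081"
subprocess.run(cmd, check=True)
```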