AI adaptive load balancing: intelligent customer service scenario optimization #2617
Replies: 2 comments
### 1. Core strategy idea

Multi-dimensional weight calculation: dynamically compute a load weight for each backend node, taking both resource consumption and processing time into account.

### 2. Key metric weight formula

Node weight = α × queue load factor + β × KV cache factor + γ × LoRA affinity factor + δ × model adaptation factor + ε × historical response time factor

### 3. Detailed design

#### 3.1 Queue load factor optimization

Improvement points:
#### 3.2 KV cache utilization optimization

Improvement points:
#### 3.3 LoRA adapter-aware scheduling

Improvement points:
#### 3.4 Model-specific optimization

Improvement points:
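Taken together, the factors above plug into the section 2 formula. A minimal Python sketch, assuming each factor is pre-normalized to [0, 1] with lower meaning less loaded; the coefficient defaults are illustrative placeholders, not values from this proposal:

```python
from dataclasses import dataclass

@dataclass
class Factors:
    queue_load: float     # pending requests relative to node capacity
    kv_cache: float       # KV cache pressure (1 - free headroom)
    lora_affinity: float  # 0.0 if the requested LoRA adapter is already loaded
    model_fit: float      # mismatch between the request and the node's model
    hist_latency: float   # normalized historical response time

def node_weight(f: Factors,
                alpha: float = 0.30, beta: float = 0.25, gamma: float = 0.20,
                delta: float = 0.10, epsilon: float = 0.15) -> float:
    # weight = α·queue + β·kv_cache + γ·lora + δ·model + ε·latency;
    # a lower weight marks a more attractive node.
    return (alpha * f.queue_load + beta * f.kv_cache +
            gamma * f.lora_affinity + delta * f.model_fit +
            epsilon * f.hist_latency)
```

With the five coefficients summing to 1, the weight itself stays in [0, 1], which keeps nodes comparable even on heterogeneous hardware.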
### 4. Implementation architecture

#### 4.1 Configuration structure

#### 4.2 Plugin interface implementation

Needs to implement:

#### 4.3 Metrics collection enhancement

New metrics:
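One possible shape for the 4.1 configuration structure; the field names and values here are illustrative assumptions, not the plugin's actual schema:

```yaml
loadBalancer:
  type: ai-adaptive
  weights:                 # α..ε from the section 2 formula; should sum to 1
    queueLoad: 0.30
    kvCache: 0.25
    loraAffinity: 0.20
    modelFit: 0.10
    historicalLatency: 0.15
  metrics:
    scrapeInterval: 1s     # how often backend metrics are refreshed
    latencyWindow: 60s     # sliding window for historical response time
```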
### 5. Core algorithm flow
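End to end, the flow would be: collect per-node metrics, normalize them into the five factors, compute the weighted sum, and route to the lowest-weight node. A hedged Python sketch — the factor names and coefficient values are assumptions, not the proposal's final design:

```python
# Illustrative α..ε coefficients; a real deployment would load these
# from the load-balancer configuration.
WEIGHTS = {"queue": 0.30, "kv_cache": 0.25, "lora": 0.20,
           "model": 0.10, "latency": 0.15}

def pick_node(nodes: dict) -> str:
    """Return the name of the node with the lowest weighted load.

    `nodes` maps node name -> dict of normalized [0, 1] factor values,
    lower meaning less loaded / better affinity.
    """
    def weight(factors: dict) -> float:
        return sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)
    return min(nodes, key=lambda name: weight(nodes[name]))

# node-b holds the session's KV cache and the requested LoRA adapter,
# so it wins despite a somewhat deeper request queue.
nodes = {
    "node-a": {"queue": 0.2, "kv_cache": 0.9, "lora": 1.0,
               "model": 0.1, "latency": 0.3},
    "node-b": {"queue": 0.4, "kv_cache": 0.1, "lora": 0.0,
               "model": 0.1, "latency": 0.3},
}
```

This is also where the cache-affinity behavior comes from: a node that already holds a conversation's KV cache scores lower on that factor and keeps attracting that session's requests.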
### 6. Expected results
Thinking about whether this feature is really necessary.

**Business scenario: intelligent customer service system**

Suppose a user operates the intelligent customer service system of an e-commerce platform with the following characteristics:

**Actual configuration example**

Based on this scenario, we configure the AI Adaptive policy:
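A configuration skewed toward queue load and KV cache, matching the per-parameter priorities below, might look like this; the schema and the specific numbers are illustrative assumptions:

```yaml
loadBalancer:
  type: ai-adaptive
  weights:
    queueLoad: 0.35          # waiting time drives satisfaction
    kvCache: 0.30            # conversations are context-heavy
    loraAffinity: 0.20       # avoid frequent adapter switching
    modelFit: 0.05           # models differ little in this scenario
    historicalLatency: 0.10  # real-time load outweighs history
```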
**Business meaning of each parameter**

- Queue load factor: in customer service scenarios, user waiting time directly affects satisfaction.
- KV cache factor: customer service conversations have contextual continuity, so cache hits significantly improve efficiency.
- LoRA affinity factor: avoids frequent model switching and keeps the service stable.
- Model adaptation factor: performance differences between customer service models are relatively small.
- Historical response time factor: real-time load matters more than historical performance.
**Actual runtime behavior**

- Scenario 1: peak period (10 a.m.)
- Scenario 2: specialized question consultation
- Scenario 3: continuous dialogue

**Configuration tuning suggestions**