@ggerganov

ref #6414

test-backend-ops was failing MUL_MAT tests with Q4_0 and Q8_0 due to incorrect wdata reads:

make tests && ./tests/test-backend-ops -o MUL_MAT -b CPU
  • Fix wdata offset when ggml_blck_size(vec_dot_type) > 1
  • GGML_USE_LLAMAFILE is defined by the build system (Make + CMake)
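
The first item concerns how a byte offset into wdata is derived when the dot-product type is block-quantized. The following is a minimal sketch of that arithmetic, not the actual patch: `quant_type`, `row_size`, `row_offset`, and `naive_offset` are hypothetical names introduced here for illustration, and the layouts (32 elements per block, 34 bytes for Q8_0 and 18 bytes for Q4_0) are assumptions taken from ggml's block definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block-quantized type description. */
typedef struct {
    size_t blck_size; /* elements packed into one block */
    size_t type_size; /* bytes occupied by one block    */
} quant_type;

/* Assumed layouts: 32 elements per block for both types. */
static const quant_type Q8_0_LAYOUT = { 32, 34 };
static const quant_type Q4_0_LAYOUT = { 32, 18 };

/* Bytes occupied by n elements; n must be a whole number of blocks. */
static size_t row_size(quant_type t, size_t n) {
    assert(n % t.blck_size == 0);
    return t.type_size * n / t.blck_size;
}

/* Byte offset of row r in a packed buffer whose rows hold ne elements. */
static size_t row_offset(quant_type t, size_t ne, size_t r) {
    return r * row_size(t, ne);
}

/* The kind of mistake that only shows up when blck_size > 1:
 * treating type_size as bytes per element instead of bytes per block. */
static size_t naive_offset(quant_type t, size_t ne, size_t r) {
    return r * ne * t.type_size;
}
```

With these assumed layouts, a 64-element Q8_0 row occupies 68 bytes, so row 3 starts at byte 204, while the per-element formula gives 6528. The two formulas agree only when the block size is 1, which would explain why a test with F32 inputs could pass while Q4_0 and Q8_0 fail.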

@ggerganov ggerganov merged commit 666867b into master Apr 16, 2024
@ggerganov ggerganov deleted the gg/fix-sgemm branch April 16, 2024 20:50
@github-actions

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 447 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10600.89ms p(95)=27375.91ms fails=, finish reason: stop=389 truncated=58
  • Prompt processing (pp): avg=118.7tk/s p(95)=553.68tk/s
  • Token generation (tg): avg=23.72tk/s p(95)=36.95tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/fix-sgemm commit=42b5d17c32a49bdeae0b97608b67884c85019c36

prompt_tokens_seconds

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 447 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1713301053 --> 1713301683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 493.75, 493.75, 493.75, 493.75, 493.75, 474.61, 474.61, 474.61, 474.61, 474.61, 490.15, 490.15, 490.15, 490.15, 490.15, 502.32, 502.32, 502.32, 502.32, 502.32, 522.17, 522.17, 522.17, 522.17, 522.17, 537.87, 537.87, 537.87, 537.87, 537.87, 540.68, 540.68, 540.68, 540.68, 540.68, 546.19, 546.19, 546.19, 546.19, 546.19, 563.07, 563.07, 563.07, 563.07, 563.07, 566.96, 566.96, 566.96, 566.96, 566.96, 569.34, 569.34, 569.34, 569.34, 569.34, 579.63, 579.63, 579.63, 579.63, 579.63, 576.82, 576.82, 576.82, 576.82, 576.82, 591.98, 591.98, 591.98, 591.98, 591.98, 616.72, 616.72, 616.72, 616.72, 616.72, 625.14, 625.14, 625.14, 625.14, 625.14, 634.62, 634.62, 634.62, 634.62, 634.62, 563.76, 563.76, 563.76, 563.76, 563.76, 549.36, 549.36, 549.36, 549.36, 549.36, 554.07, 554.07, 554.07, 554.07, 554.07, 554.6, 554.6, 554.6, 554.6, 554.6, 569.38, 569.38, 569.38, 569.38, 569.38, 573.56, 573.56, 573.56, 573.56, 573.56, 576.39, 576.39, 576.39, 576.39, 576.39, 576.49, 576.49, 576.49, 576.49, 576.49, 582.73, 582.73, 582.73, 582.73, 582.73, 583.88, 583.88, 583.88, 583.88, 583.88, 588.59, 588.59, 588.59, 588.59, 588.59, 604.78, 604.78, 604.78, 604.78, 604.78, 604.1, 604.1, 604.1, 604.1, 604.1, 608.05, 608.05, 608.05, 608.05, 608.05, 610.86, 610.86, 610.86, 610.86, 610.86, 620.78, 620.78, 620.78, 620.78, 620.78, 619.25, 619.25, 619.25, 619.25, 619.25, 620.57, 620.57, 620.57, 620.57, 620.57, 621.99, 621.99, 621.99, 621.99, 621.99, 626.68, 626.68, 626.68, 626.68, 626.68, 628.87, 628.87, 628.87, 628.87, 628.87, 628.56, 628.56, 628.56, 628.56, 628.56, 629.76, 629.76, 629.76, 629.76, 629.76, 632.99, 632.99, 632.99, 632.99, 632.99, 641.4, 641.4, 641.4, 641.4, 641.4, 647.46, 647.46, 647.46, 647.46, 647.46, 644.8, 644.8, 644.8, 644.8, 644.8, 644.83, 644.83, 644.83, 644.83, 644.83, 644.66, 644.66, 644.66, 644.66, 644.66, 645.34, 645.34, 645.34, 645.34, 645.34, 648.51, 648.51, 648.51, 648.51, 648.51, 652.26, 652.26, 652.26, 652.26, 652.26, 
650.22, 650.22, 650.22, 650.22, 650.22, 643.31, 643.31, 643.31, 643.31, 643.31, 642.98, 642.98, 642.98, 642.98, 642.98, 642.67, 642.67, 642.67, 642.67, 642.67, 640.86, 640.86, 640.86, 640.86, 640.86, 640.16, 640.16, 640.16, 640.16, 640.16, 639.47, 639.47, 639.47, 639.47, 639.47, 638.95, 638.95, 638.95, 638.95, 638.95, 645.02, 645.02, 645.02, 645.02, 645.02, 645.48, 645.48, 645.48, 645.48, 645.48, 648.38, 648.38, 648.38, 648.38, 648.38, 645.04, 645.04, 645.04, 645.04, 645.04, 645.04]
                    
predicted_tokens_seconds
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 447 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1713301053 --> 1713301683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 38.91, 38.91, 38.91, 38.91, 38.91, 35.21, 35.21, 35.21, 35.21, 35.21, 24.52, 24.52, 24.52, 24.52, 24.52, 24.16, 24.16, 24.16, 24.16, 24.16, 24.13, 24.13, 24.13, 24.13, 24.13, 23.45, 23.45, 23.45, 23.45, 23.45, 23.74, 23.74, 23.74, 23.74, 23.74, 24.2, 24.2, 24.2, 24.2, 24.2, 25.05, 25.05, 25.05, 25.05, 25.05, 25.42, 25.42, 25.42, 25.42, 25.42, 25.4, 25.4, 25.4, 25.4, 25.4, 25.23, 25.23, 25.23, 25.23, 25.23, 24.65, 24.65, 24.65, 24.65, 24.65, 24.12, 24.12, 24.12, 24.12, 24.12, 24.02, 24.02, 24.02, 24.02, 24.02, 23.78, 23.78, 23.78, 23.78, 23.78, 23.53, 23.53, 23.53, 23.53, 23.53, 23.32, 23.32, 23.32, 23.32, 23.32, 22.51, 22.51, 22.51, 22.51, 22.51, 22.55, 22.55, 22.55, 22.55, 22.55, 22.62, 22.62, 22.62, 22.62, 22.62, 22.69, 22.69, 22.69, 22.69, 22.69, 22.45, 22.45, 22.45, 22.45, 22.45, 22.44, 22.44, 22.44, 22.44, 22.44, 22.29, 22.29, 22.29, 22.29, 22.29, 22.02, 22.02, 22.02, 22.02, 22.02, 21.85, 21.85, 21.85, 21.85, 21.85, 21.99, 21.99, 21.99, 21.99, 21.99, 22.09, 22.09, 22.09, 22.09, 22.09, 21.91, 21.91, 21.91, 21.91, 21.91, 21.98, 21.98, 21.98, 21.98, 21.98, 22.14, 22.14, 22.14, 22.14, 22.14, 22.2, 22.2, 22.2, 22.2, 22.2, 22.08, 22.08, 22.08, 22.08, 22.08, 22.07, 22.07, 22.07, 22.07, 22.07, 22.26, 22.26, 22.26, 22.26, 22.26, 22.38, 22.38, 22.38, 22.38, 22.38, 22.47, 22.47, 22.47, 22.47, 22.47, 22.51, 22.51, 22.51, 22.51, 22.51, 22.57, 22.57, 22.57, 22.57, 22.57, 22.68, 22.68, 22.68, 22.68, 22.68, 22.63, 22.63, 22.63, 22.63, 22.63, 22.6, 22.6, 22.6, 22.6, 22.6, 22.55, 22.55, 22.55, 22.55, 22.55, 22.4, 22.4, 22.4, 22.4, 22.4, 22.47, 22.47, 22.47, 22.47, 22.47, 22.59, 22.59, 22.59, 22.59, 22.59, 22.72, 22.72, 22.72, 22.72, 22.72, 22.8, 22.8, 22.8, 22.8, 22.8, 22.81, 22.81, 22.81, 22.81, 22.81, 22.75, 22.75, 22.75, 22.75, 22.75, 22.67, 22.67, 22.67, 22.67, 22.67, 22.55, 22.55, 22.55, 22.55, 22.55, 21.97, 21.97, 21.97, 21.97, 21.97, 21.96, 21.96, 21.96, 21.96, 21.96, 21.26, 21.26, 21.26, 21.26, 21.26, 21.06, 
21.06, 21.06, 21.06, 21.06, 21.09, 21.09, 21.09, 21.09, 21.09, 21.15, 21.15, 21.15, 21.15, 21.15, 21.18, 21.18, 21.18, 21.18, 21.18, 21.26, 21.26, 21.26, 21.26, 21.26, 21.37]
                    

Details

kv_cache_usage_ratio

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 447 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1713301053 --> 1713301683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.17, 0.17, 0.17, 0.17, 0.17, 0.32, 0.32, 0.32, 0.32, 0.32, 0.18, 0.18, 0.18, 0.18, 0.18, 0.23, 0.23, 0.23, 0.23, 0.23, 0.19, 0.19, 0.19, 0.19, 0.19, 0.14, 0.14, 0.14, 0.14, 0.14, 0.16, 0.16, 0.16, 0.16, 0.16, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.15, 0.15, 0.15, 0.15, 0.15, 0.19, 0.19, 0.19, 0.19, 0.19, 0.27, 0.27, 0.27, 0.27, 0.27, 0.23, 0.23, 0.23, 0.23, 0.23, 0.28, 0.28, 0.28, 0.28, 0.28, 0.26, 0.26, 0.26, 0.26, 0.26, 0.19, 0.19, 0.19, 0.19, 0.19, 0.22, 0.22, 0.22, 0.22, 0.22, 0.39, 0.39, 0.39, 0.39, 0.39, 0.15, 0.15, 0.15, 0.15, 0.15, 0.18, 0.18, 0.18, 0.18, 0.18, 0.15, 0.15, 0.15, 0.15, 0.15, 0.25, 0.25, 0.25, 0.25, 0.25, 0.24, 0.24, 0.24, 0.24, 0.24, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.19, 0.19, 0.19, 0.19, 0.19, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14, 0.14, 0.23, 0.23, 0.23, 0.23, 0.23, 0.14, 0.14, 0.14, 0.14, 0.14, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.12, 0.12, 0.12, 0.12, 0.31, 0.31, 0.31, 0.31, 0.31, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.13, 0.13, 0.13, 0.17, 0.17, 0.17, 0.17, 0.17, 0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.13, 0.13, 0.22, 0.22, 0.22, 0.22, 0.22, 0.17, 0.17, 0.17, 0.17, 0.17, 0.29, 0.29, 0.29, 0.29, 0.29, 0.1, 0.1, 0.1, 0.1, 0.1, 0.11, 0.11, 0.11, 0.11, 0.11, 0.13, 0.13, 0.13, 0.13, 0.13, 0.09, 0.09, 0.09, 0.09, 0.09, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4, 0.45, 0.45, 0.45, 0.45, 0.45, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.46, 0.46, 0.46, 0.46, 0.46, 0.31, 0.31, 0.31, 0.31, 0.31, 0.13, 0.13, 0.13, 0.13, 0.13, 0.16, 0.16, 0.16, 0.16, 0.16, 0.1, 0.1, 0.1, 0.1, 0.1, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.23]
                    
requests_processing
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 447 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1713301053 --> 1713301683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 1.0]
                    

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024