UPSTREAM PR #16988: CUDA: fix crash on uneven context #71
Conversation
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Analysis
Based on the analysis of key performance-critical functions in LLaMA.cpp, the following function groups show no measurable performance changes between versions:
Core Inference Functions
Model Loading Functions
Memory Management Functions
Batch Processing Functions

Key Performance Indicator Impact Assessment
1. Tokens Per Second: No impact
2. Power Consumption: Minimal impact
3. Quantization Efficiency: No impact
4. Memory Usage: No impact
5. Batch Processing: No impact

CUDA Kernel Selection Changes Analysis
The PR introduces stride validation improvements in CUDA kernel selection:
Modified Functions
Performance Implications
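Not the actual upstream change, but a minimal sketch of what such a stride check can look like, assuming ggml-style tensor metadata (ne[] element counts, nb[] byte strides); the struct and function names here are hypothetical:

```cpp
// Illustrative sketch only, not the upstream code: a stride check of the kind
// described above, using ggml-style ne[] (element counts) and nb[] (byte strides).
#include <cstdint>
#include <cstdio>

struct tensor_view {
    int64_t ne[4]; // number of elements per dimension
    int64_t nb[4]; // byte stride per dimension
};

// The fast kernel assumes rows are packed back to back; if the row stride
// is larger than ne[0] * element size (e.g. a padded KV cache), fall back.
static bool can_use_contiguous_kernel(const tensor_view & t, int64_t type_size) {
    return t.nb[1] == t.ne[0] * type_size;
}

int main() {
    //                     ne[4]             nb[4] (bytes)
    tensor_view packed = {{128, 64, 1, 1}, {4, 128 * 4, 128 * 64 * 4, 128 * 64 * 4}};
    tensor_view padded = {{100, 64, 1, 1}, {4, 128 * 4, 128 * 64 * 4, 128 * 64 * 4}}; // rows padded to 128 elements

    std::printf("packed rows -> fast kernel: %s\n", can_use_contiguous_kernel(packed, 4) ? "yes" : "no");
    std::printf("padded rows -> fast kernel: %s\n", can_use_contiguous_kernel(padded, 4) ? "yes" : "no");
    return 0;
}
```

Comparing nb[1] against ne[0] times the element size is the cheapest way to distinguish packed from padded rows before choosing a kernel, which is the class of check the PR adds to the selection logic.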
Action Items for Performance Optimization
Code-Level Optimizations
Build-Level Optimizations

Conclusion
The version comparison shows stable performance across all critical LLaMA.cpp functions. The CUDA kernel selection improvements enhance stability without measurable performance impact on core inference operations. The minimal power consumption changes reflect compiler optimization differences rather than functional modifications. No action is required for performance preservation, but monitoring CUDA workload performance is recommended to validate the stride validation improvements.
Force-pushed from e5cc811 to 7c48209
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary
This analysis examines PR #71, which implements CUDA kernel selection fixes for uneven context sizes, comparing the two versions.

Key Findings
Performance Impact:
Power Consumption Analysis:
Flame Graph Analysis:
CFG Comparison:
Code Review Insights:
Critical Assessment:
Actionable Recommendations:
Force-pushed from b1ace60 to bff7103
Force-pushed from 7c48209 to 41735c2
Force-pushed from 94381d7 to 0eeb29b
Force-pushed from 47d1dc9 to 297c352
Mirrored from ggml-org/llama.cpp#16988
Fixes ggml-org/llama.cpp#16976.
The problem is that the CUDA kernel selection logic does not check strides, so it tries to run kernels on data whose strides don't fit. The tests don't detect this because the strides are always constructed as 2*ne00.
@ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size; I think it would make sense to add one.
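One possible shape for such a warning, as a purely illustrative sketch (the helper name and the exact message are assumptions, not part of this PR): emit a one-time message when the KV cache row stride is padded beyond the packed row size, since that is the case that requires the stride-aware kernel path.

```cpp
// Illustrative sketch, not the actual llama.cpp/ggml code: warn once when the
// KV cache rows are padded, i.e. the byte stride of a row exceeds the packed
// row size, which is what forces the slower stride-aware kernel path.
#include <cstdint>
#include <cstdio>

static void warn_if_kv_rows_padded(int64_t ne00, int64_t nb1, int64_t type_size) {
    static bool warned = false;
    if (!warned && nb1 != ne00 * type_size) {
        std::fprintf(stderr,
                "warning: KV cache row stride (%lld bytes) != packed row size (%lld bytes); "
                "falling back to slower non-contiguous kernels\n",
                (long long) nb1, (long long) (ne00 * type_size));
        warned = true;
    }
}

int main() {
    // Example: a row length that leaves padding, assuming 2-byte (f16) cells
    // and rows padded out to 1024 elements.
    warn_if_kv_rows_padded(/*ne00=*/1000, /*nb1=*/1024 * 2, /*type_size=*/2);
    return 0;
}
```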