Replies: 4 comments 1 reply
-
Total tokens processed: 851
So I've added a small test to my own inference engine in FreePascal to check how often each expert is used. Over 4-5 messages (asking for a Rust function, some basic arithmetic, a "hello", and a FreePascal function), these were the top 20 most used experts. We can see the distribution is not uniform: some experts definitely receive more activations than others, and those are the ones that should be placed in VRAM.
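The counting itself is just a per-expert histogram updated from the router's top-k selection each token. A minimal Python sketch (the layer/expert IDs and the number of routed experts are illustrative, not tied to any specific engine):

```python
from collections import Counter

# Histogram of (layer, expert) activations, updated once per token.
expert_counts = Counter()

def record_activations(layer_id, selected_experts):
    """selected_experts: the expert indices the router chose for this
    token in this layer (e.g. the top-k of the gating scores)."""
    for e in selected_experts:
        expert_counts[(layer_id, e)] += 1

# Simulated usage: a few tokens routed through layer 0.
record_activations(0, [3, 17, 42, 99])
record_activations(0, [3, 17, 8, 61])
record_activations(0, [3, 42, 8, 5])

# Top-20 most used experts across all layers.
top20 = expert_counts.most_common(20)
print(top20[0])  # most activated (layer, expert) pair with its count
```

The histogram can then be dumped at the end of a session to drive placement decisions on the next load.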
-
I wonder whether the same technique could be used to decide which experts should be loaded from disk into RAM when the system doesn't have enough memory. I think that would be even more useful, since it would allow those huge MoEs to run on consumer hardware.
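One way to sketch that: given per-expert usage counts and a RAM budget, greedily pin the hottest experts and stream the rest from disk on demand. This is just an illustration (the per-expert size and the counts are made-up numbers):

```python
def pick_resident_experts(counts, bytes_per_expert, ram_budget_bytes):
    """Greedily select the most used experts that fit in the RAM budget.
    counts: dict mapping expert id -> activation count.
    Returns the set of expert ids to keep resident in RAM;
    everything else would be loaded from disk when routed to."""
    resident = set()
    used = 0
    for expert in sorted(counts, key=counts.get, reverse=True):
        if used + bytes_per_expert > ram_budget_bytes:
            break
        resident.add(expert)
        used += bytes_per_expert
    return resident

# Hypothetical: 8 experts of 100 MB each, budget for only 3 of them.
counts = {0: 500, 1: 20, 2: 350, 3: 5, 4: 900, 5: 60, 6: 10, 7: 2}
resident = pick_resident_experts(counts, 100 * 2**20, 3 * 100 * 2**20)
print(sorted(resident))  # the three hottest experts: [0, 2, 4]
```

Whether this wins in practice depends on how skewed the routing really is: every miss pays a full disk read of that expert's weights.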
-
About my optimization idea: this used only 25 GB of RAM (versus the 35 GB Qwen3 30B-A3B Q8_0 would otherwise need), and again the expert usage is not uniform, so I bet we can extract much more speed from MoE models.
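To make the memory side concrete, a back-of-the-envelope estimate: if expert weights dominate the model and the shared (non-expert) weights are always resident, keeping only a fraction of experts loaded scales the footprint roughly linearly. All figures here are illustrative assumptions, not measurements from Qwen3 30B-A3B:

```python
def estimated_footprint_gb(total_gb, expert_fraction, resident_fraction):
    """Rough footprint when only `resident_fraction` of expert weights
    stay loaded and shared weights are always resident.
    All inputs are illustrative assumptions, not measured values."""
    shared = total_gb * (1.0 - expert_fraction)
    experts = total_gb * expert_fraction * resident_fraction
    return shared + experts

# Toy numbers: a 35 GB model where ~90% of weights are experts;
# keeping ~70% of expert weights resident lands around 25 GB.
print(estimated_footprint_gb(35.0, 0.9, 0.7))
```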
-
So, I've discussed this with some people, and I think the overhead of moving experts around, plus the added complexity, might not be worth it. Also, the idea of running bigger models with reduced memory requirements may not be feasible, because bigger models would require large amounts of data to be loaded on demand, which could make them unusably slow.
-
Hello Guys,
So I was wondering: moving MoE layers to the CPU improves speed, but I think we may still have a non-optimal placement of experts across CPU/GPU.
My idea is to keep track of the most used experts during normal usage, so that on the next load we can move those experts into VRAM. I know that reorganizing during normal usage wouldn't help, since constantly moving weights between RAM and VRAM would hurt speed, so the idea is to only track during usage and reorganize at load time.
Any thoughts on this?
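The load-time reorganization could be sketched like this: persist the usage histogram at shutdown, then on the next start use it to decide which experts go to VRAM before any token is generated. The file name and VRAM slot count are hypothetical:

```python
import json
import os

STATS_FILE = "expert_usage.json"  # hypothetical persistence path

def save_stats(counts):
    """Persist expert usage counts at shutdown."""
    with open(STATS_FILE, "w") as f:
        json.dump({str(k): v for k, v in counts.items()}, f)

def plan_vram_placement(vram_slots):
    """On the next load: read the saved histogram and return the expert
    ids to place in VRAM. No weights move at runtime."""
    if not os.path.exists(STATS_FILE):
        return []  # first run: no history yet, fall back to default layout
    with open(STATS_FILE) as f:
        counts = {int(k): v for k, v in json.load(f).items()}
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:vram_slots]

# Previous session measured these activation counts:
save_stats({0: 120, 1: 5, 2: 480, 3: 77})
print(plan_vram_placement(2))  # → [2, 0]
```

Since placement only changes between runs, the runtime cost is zero; the open question is how stable the usage distribution is across different workloads.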