Replies: 4 comments 1 reply
-
Total tokens processed: 851
So I've added a small test to my own inference engine in FreePascal to check how often each expert is used. Over 4-5 messages (asking for a Rust function, some basic arithmetic, a "hello", and a FreePascal function), these were the top 20 most used experts. We can see the distribution is not uniform: some experts definitely receive more activations than others, and those are the ones that should be placed in VRAM.
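The counting itself is just a per-expert histogram updated from the router's top-k selection each token. A minimal Python sketch (the layer/expert IDs and the number of routed experts are illustrative, not tied to any specific engine):

```python
from collections import Counter

# Histogram of (layer, expert) activations, updated once per token.
expert_counts = Counter()

def record_activations(layer_id, selected_experts):
    """selected_experts: the expert indices the router chose for this
    token in this layer (e.g. the top-k of the gating scores)."""
    for e in selected_experts:
        expert_counts[(layer_id, e)] += 1

# Simulated usage: a few tokens routed through layer 0.
record_activations(0, [3, 17, 42, 99])
record_activations(0, [3, 17, 8, 61])
record_activations(0, [3, 42, 8, 5])

# Top-20 most used experts across all layers.
top20 = expert_counts.most_common(20)
print(top20[0])  # most activated (layer, expert) pair with its count
```

The histogram can then be dumped at the end of a session to drive placement decisions on the next load.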
-
I wonder whether the same technique could be used to decide which experts should be loaded from disk into RAM when the system doesn't have enough memory. I think that would be even more useful, since it would allow those huge MoEs to run on consumer hardware.
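One way to sketch that: given per-expert usage counts and a RAM budget, greedily pin the hottest experts and stream the rest from disk on demand. This is just an illustration (the per-expert size and the counts are made-up numbers):

```python
def pick_resident_experts(counts, bytes_per_expert, ram_budget_bytes):
    """Greedily select the most used experts that fit in the RAM budget.
    counts: dict mapping expert id -> activation count.
    Returns the set of expert ids to keep resident in RAM;
    everything else would be loaded from disk when routed to."""
    resident = set()
    used = 0
    for expert in sorted(counts, key=counts.get, reverse=True):
        if used + bytes_per_expert > ram_budget_bytes:
            break
        resident.add(expert)
        used += bytes_per_expert
    return resident

# Hypothetical: 8 experts of 100 MB each, budget for only 3 of them.
counts = {0: 500, 1: 20, 2: 350, 3: 5, 4: 900, 5: 60, 6: 10, 7: 2}
resident = pick_resident_experts(counts, 100 * 2**20, 3 * 100 * 2**20)
print(sorted(resident))  # the three hottest experts: [0, 2, 4]
```

Whether this wins in practice depends on how skewed the routing really is: every miss pays a full disk read of that expert's weights.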
-
About my optimization idea: this used only 25 GB of RAM (versus the 35 GB Qwen3 30B-A3B Q8_0 would otherwise need), and again the expert usage is not uniform, so I bet we can extract much more speed from MoE models.
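To make the memory side concrete, a back-of-the-envelope estimate: if expert weights dominate the model and the shared (non-expert) weights are always resident, keeping only a fraction of experts loaded scales the footprint roughly linearly. All figures here are illustrative assumptions, not measurements from Qwen3 30B-A3B:

```python
def estimated_footprint_gb(total_gb, expert_fraction, resident_fraction):
    """Rough footprint when only `resident_fraction` of expert weights
    stay loaded and shared weights are always resident.
    All inputs are illustrative assumptions, not measured values."""
    shared = total_gb * (1.0 - expert_fraction)
    experts = total_gb * expert_fraction * resident_fraction
    return shared + experts

# Toy numbers: a 35 GB model where ~90% of weights are experts;
# keeping ~70% of expert weights resident lands around 25 GB.
print(estimated_footprint_gb(35.0, 0.9, 0.7))
```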
-
So, I've discussed this with some people, and I think the overhead of moving experts around, plus the added complexity, might not be worth it. Also, the idea of running bigger models with reduced memory requirements may not be feasible, because bigger models would require large amounts of data to be loaded on demand, which could make them unusably slow.
-
Hello Guys,
So I was wondering: moving MoE layers to the CPU improves speed, but I think we may still have a non-optimal placement of experts across CPU/GPU.
My idea is to keep track of the most used experts during normal usage, so that on the next load we can move those experts into VRAM. I know that reorganizing during normal usage wouldn't help, since constantly moving weights between RAM and VRAM would hurt speed, so the idea is to only track during usage and reorganize at load time.
Any thoughts on this?
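The load-time reorganization could be sketched like this: persist the usage histogram at shutdown, then on the next start use it to decide which experts go to VRAM before any token is generated. The file name and VRAM slot count are hypothetical:

```python
import json
import os

STATS_FILE = "expert_usage.json"  # hypothetical persistence path

def save_stats(counts):
    """Persist expert usage counts at shutdown."""
    with open(STATS_FILE, "w") as f:
        json.dump({str(k): v for k, v in counts.items()}, f)

def plan_vram_placement(vram_slots):
    """On the next load: read the saved histogram and return the expert
    ids to place in VRAM. No weights move at runtime."""
    if not os.path.exists(STATS_FILE):
        return []  # first run: no history yet, fall back to default layout
    with open(STATS_FILE) as f:
        counts = {int(k): v for k, v in json.load(f).items()}
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:vram_slots]

# Previous session measured these activation counts:
save_stats({0: 120, 1: 5, 2: 480, 3: 77})
print(plan_vram_placement(2))  # → [2, 0]
```

Since placement only changes between runs, the runtime cost is zero; the open question is how stable the usage distribution is across different workloads.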