Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not report the performance obtained with this initialization alone.
If the model is tested with the gate parameters taken directly from K-Means initialization, without any fine-tuning, how is the PPL (perplexity) affected? Could you share any results or insights on this?
Thanks again for your time.