Measuring only PCC can't reveal fine-grained accuracy problems in LLMs. We need to test our accuracy on an actual dataset.
My suggestion is to use the same dataset and metrics as the tt-metal models, so that we can easily compare both perf and accuracy. They use top-1 and top-5 percentages over a text corpus.
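
To illustrate the point, here is a minimal pure-Python sketch of how PCC can stay very high while top-1 accuracy collapses. Everything in it is an assumption made for illustration: the helper names `pcc` and `topk_accuracy` are hypothetical, the logits are toy values, and the targets stand in for whatever reference tokens the real evaluation would use. It is not the tt-metal implementation.

```python
import math


def pcc(a, b):
    """Pearson correlation coefficient between two flat sequences of floats."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def topk_accuracy(logits, targets, k):
    """Fraction of positions whose target token is among the k largest logits."""
    hits = 0
    for row, target in zip(logits, targets):
        # Indices of the k highest-scoring tokens at this position.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in topk
    return hits / len(targets)


# Toy data (assumed): two positions, vocabulary of six tokens.
ref_logits = [
    [5.0, 4.9, 0.0, -1.0, -2.0, -3.0],
    [0.0, 1.0, 3.0, 2.9, -1.0, -2.0],
]
# Same logits with a 0.1 perturbation that swaps the two leading tokens.
dev_logits = [
    [4.9, 5.0, 0.0, -1.0, -2.0, -3.0],
    [0.0, 1.0, 2.9, 3.0, -1.0, -2.0],
]
targets = [0, 2]  # reference tokens for each position

flat_ref = [x for row in ref_logits for x in row]
flat_dev = [x for row in dev_logits for x in row]

pcc_value = pcc(flat_ref, flat_dev)
ref_top1 = topk_accuracy(ref_logits, targets, k=1)
dev_top1 = topk_accuracy(dev_logits, targets, k=1)
dev_top5 = topk_accuracy(dev_logits, targets, k=5)

print(f"PCC={pcc_value:.5f} ref top-1={ref_top1:.0%} "
      f"dev top-1={dev_top1:.0%} dev top-5={dev_top5:.0%}")
```

In this toy case the PCC between the two sets of logits is above 0.999, yet top-1 drops from 100% to 0% while top-5 stays at 100%. That is the kind of regression a PCC threshold alone would let through.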