Ideally, we want the exported LLMs to pass 3 tests: 1. prompt processing 2. token generation 3. multi-turn conversation