This does indeed greatly improve the prompting, although a single question may not be very representative of the whole approach. To measure the suggested solutions properly, should we create a test dataset of questions and evaluate the results we get from each prompt?
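
To make the idea concrete, here is a minimal sketch of what such an evaluation could look like. Everything in it is an assumption for illustration: the `TEST_QUESTIONS` data, the `evaluate_prompts` helper, the `ask_model` callable standing in for whatever model call we actually use, and the simple keyword-match scoring are all hypothetical, not part of the existing code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical test set: (question, keyword expected in a good answer).
TEST_QUESTIONS: List[Tuple[str, str]] = [
    ("What is the capital of France?", "paris"),
    ("How many days are in a week?", "seven"),
]


def evaluate_prompts(
    prompts: Dict[str, str],
    ask_model: Callable[[str], str],
    dataset: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return, per prompt name, the fraction of questions whose answer
    contains the expected keyword (case-insensitive)."""
    scores: Dict[str, float] = {}
    for name, template in prompts.items():
        hits = 0
        for question, expected in dataset:
            # Each template is assumed to contain a {question} placeholder.
            answer = ask_model(template.format(question=question))
            if expected.lower() in answer.lower():
                hits += 1
        scores[name] = hits / len(dataset) if dataset else 0.0
    return scores
```

Keyword matching is only a placeholder metric; with a real dataset we would probably want something stricter, but even this would let us compare prompts on more than one question.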