Identify evaluation tasks

Let's see if we can find evaluation tasks that we can use to gauge the performance of the identified models reliably. Ideally, the tasks should be scoped to the medical or biological domain (term recognition, for example), but we should first see what is available.
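To make "gauge performance" concrete for a term recognition task, here is a minimal sketch of exact-match precision, recall, and F1 over predicted term spans. Everything in it is an assumption for illustration: the `(start, end, label)` span format, the `span_prf` helper name, and the example labels are hypothetical and not tied to any particular dataset or model we have picked.

```python
from typing import Iterable, Tuple

# Hypothetical span format: (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between gold and predicted spans."""
    gold_set, pred_set = set(gold), set(pred)
    # A prediction counts as correct only if offsets and label all match.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: one of two predicted terms matches the gold annotation.
gold_spans = [(0, 7, "DISEASE"), (12, 19, "GENE")]
pred_spans = [(0, 7, "DISEASE"), (20, 25, "GENE")]
precision, recall, f1 = span_prf(gold_spans, pred_spans)
```

Exact matching is the strictest choice; a task we find may define partial-overlap or label-agnostic scoring instead, in which case its official scorer should take precedence over this sketch.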