We need documentation of the workflow for generating the "ground truth" BibTeX records.
What is the input? Which scripts do we need to run? Etc.
As for the "source": for journal articles with DOIs, is the DOI the only thing we need? I.e., can a DOI serve as a unique primary key from which we can crawl?
For articles without DOIs, do we need additional information to serve as the input to crawling?
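
To make the DOI case concrete, here is a minimal sketch of what "DOI as the only input" could look like. It is not the project's actual workflow, just an illustration. It assumes the public doi.org resolver's content negotiation (asking for `application/x-bibtex`), which Crossref and DataCite support but other registration agencies may not. The function names (`normalize_doi`, `build_bibtex_request`, `fetch_bibtex`) and the choice to lowercase DOIs are hypothetical, and `10.1000/xyz123` is a placeholder DOI.

```python
import urllib.parse
import urllib.request

# Assumption: the public doi.org resolver, queried with content negotiation.
DOI_RESOLVER = "https://doi.org/"

# Common ways a DOI shows up in source data (assumed, not exhaustive).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip resolver/"doi:" prefixes and lowercase (DOIs are case-insensitive)."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().lower()


def build_bibtex_request(raw_doi: str) -> urllib.request.Request:
    """Build (but do not send) a request asking the resolver for BibTeX."""
    doi = normalize_doi(raw_doi)
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(raw_doi: str, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text (needs network)."""
    request = build_bibtex_request(raw_doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With something like this, the input for DOI-bearing articles could be a plain list of DOIs, one per line, and `fetch_bibtex` would be called once per entry. Whether the returned record is good enough to count as "ground truth" is a separate question, and articles without DOIs would still need another kind of key (for example title plus authors plus year).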