One pattern I see frequently is developers checking in thousands of lines of code copied from other projects or third-party libraries. It would be interesting to:
A. Detect this
B. Cluster developers (or commits) by it
C. See what it correlates with (I suspect it correlates with negative output and poor code).
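
For step A, here is a rough sketch of one possible heuristic, not a tested detector. It assumes the history has been exported with `git log --numstat --pretty=format:'commit %H'`, and it flags a commit when it adds many lines and most of them land under vendor-like directories. The directory names, both thresholds, and all function names are my own assumptions and would need tuning per repository. Renamed paths in the numstat output are not handled specially.

```python
from dataclasses import dataclass, field

# Assumed heuristics (not from any standard): tune these per repository.
VENDOR_DIR_NAMES = {"vendor", "third_party", "thirdparty", "node_modules", "external", "deps"}
MIN_ADDED_LINES = 1000   # "thousands of lines" cutoff, chosen arbitrarily
MIN_VENDOR_SHARE = 0.8   # fraction of added lines that must sit in vendor-like dirs


@dataclass
class CommitStats:
    sha: str
    added_by_path: dict = field(default_factory=dict)  # path -> lines added


def parse_numstat(log_text):
    """Parse the text produced by: git log --numstat --pretty=format:'commit %H'."""
    commits = []
    current = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            current = CommitStats(sha=line.split()[1])
            commits.append(current)
        elif line.strip() and current is not None:
            parts = line.split("\t")
            if len(parts) != 3:
                continue  # not an "added<TAB>deleted<TAB>path" line
            added, _deleted, path = parts
            if added == "-":
                continue  # git prints "-" for binary files
            current.added_by_path[path] = current.added_by_path.get(path, 0) + int(added)
    return commits


def is_vendor_path(path):
    """True if any directory component of the path looks like a vendor directory."""
    directories = path.split("/")[:-1]
    return any(part.lower() in VENDOR_DIR_NAMES for part in directories)


def looks_like_vendor_dump(commit):
    """Flag large commits whose added lines are mostly in vendor-like directories."""
    total = sum(commit.added_by_path.values())
    if total < MIN_ADDED_LINES:
        return False
    vendor = sum(n for p, n in commit.added_by_path.items() if is_vendor_path(p))
    return vendor / total >= MIN_VENDOR_SHARE
```

Counting flagged commits per author would give one feature to cluster on in step B. This only catches code dropped into conventionally named directories; code pasted straight into the main source tree would need something else, such as comparing file contents against known upstream projects.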