Generally, the license contained in a source code file is either a short license itself or a large block of license text, which makes it difficult for information retrieval (IR) and similarity-finding algorithms to classify it efficiently.
Please suggest how this should be resolved before other IR algorithms are implemented.
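
To make the problem concrete, here is a minimal sketch of how a length mismatch can depress a similarity score. It assumes a simple token-set Jaccard measure, which is only a stand-in and not necessarily what the project uses. The two sample strings are made-up placeholders, not real license text, and the names `tokens`, `jaccard`, `SHORT_HEADER` and `FULL_TEXT` are hypothetical.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical placeholder texts (NOT real license wording).
SHORT_HEADER = "Licensed under the Example License version 1"
FULL_TEXT = (
    "Example License version 1. Permission is granted to use, copy, "
    "modify and distribute this software subject to the following "
    "conditions. The above notice shall be included in all copies. "
    "The software is provided as is, without warranty of any kind."
)

if __name__ == "__main__":
    # The short header names the same license, yet it shares only a small
    # fraction of the tokens found in the long text, so the score is low.
    print("short vs full:", round(jaccard(SHORT_HEADER, FULL_TEXT), 3))
    print("full vs full: ", jaccard(FULL_TEXT, FULL_TEXT))
```

With a symmetric measure like this, a short notice that refers to a license scores poorly against the full text of that same license, because the union is dominated by the longer text.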