|
| 1 | +# MathComp Corpus |
| 2 | + |
| 3 | +A corpus of code for the [Coq proof assistant](https://coq.inria.fr) along with several |
| 4 | +machine-readable representations, derived from verification |
| 5 | +projects related to the [Mathematical Components][mathcomp-website] (MathComp) library. |
| 6 | +The corpus can be used, e.g., in machine learning and data mining applications. |
| 7 | +Machine-readable representations were created using the [SerAPI library][serapi-website]. |
| 8 | + |
| 9 | +[mathcomp-website]: https://math-comp.github.io |
| 10 | +[serapi-website]: https://github.com/ejgallego/coq-serapi |
| 11 | + |
| 12 | +<b>We are actively updating this repository and will make it ready by |
| 13 | +the end of May. Stay tuned!</b> |
| 14 | + |
| 15 | +## Obtaining the corpus |
| 16 | + |
| 17 | +The latest released version of the corpus can be [downloaded][download-link] |
| 18 | +from GitHub. The archive includes both the source files written in Coq and files containing |
| 19 | +machine-readable representations; the latter were created using [Coq 8.10.2][coq-8102] |
| 20 | +and [SerAPI 0.7.1][serapi-071], via the SerAPI programs `sercomp`, `sertok`, and `sername`. |
| 21 | + |
| 22 | +[download-link]: https://github.com/EngineeringSoftware/mathcomp-corpus/releases |
| 23 | +[coq-8102]: https://github.com/coq/coq/releases/tag/V8.10.2 |
| 24 | +[serapi-071]: https://github.com/ejgallego/coq-serapi/releases |
| 25 | + |
| 26 | +## Corpus contents |
| 27 | + |
| 28 | +The latest release of the corpus, based on MathComp 1.9.0, consists of 449 source files |
| 29 | +from 21 Coq projects --- in total over 297k lines of code (LOC). |
| 30 | + |
| 31 | +| Project | Revision SHA | No. files | LOC | |
| 32 | +|:---------------------------------- |:------------:|-----------:|-------:| |
| 33 | +| [finmap][finmap] | 27642a8 | 4 | 6451 | |
| 34 | +| [fourcolor][fourcolor] | 0851d49 | 60 | 37138 | |
| 35 | +| [math-comp][math-comp] | 748d716 | 89 | 84713 | |
| 36 | +| [odd-order][odd-order] | ca602a4 | 34 | 36125 | |
| 37 | +| [analysis][analysis] | 9e5fe1d | 17 | 11739 | |
| 38 | +| [bigenough][bigenough] | 5f79a32 | 1 | 78 | |
| 39 | +| [elliptic-curves][elliptic-curves] | 631af89 | 18 | 9596 | |
| 40 | +| [grobner][grobner] | dfa54f9 | 1 | 1330 | |
| 41 | +| [multinomials][multinomials] | 691d795 | 5 | 7363 | |
| 42 | +| [real-closed][real-closed] | 495a1fa | 10 | 8925 | |
| 43 | +| [robot][robot] | b341ad1 | 13 | 11130 | |
| 44 | +| [two-square][two-square] | 1c09aca | 2 | 1721 | |
| 45 | +| [bits][bits] | 3cd298a | 10 | 4041 | |
| 46 | +| [comp-dec-pdl][comp-dec-pdl] | c1f813b | 16 | 4419 | |
| 47 | +| [disel][disel] | e8aa80c | 20 | 4473 | |
| 48 | +| [fcsl-pcm][fcsl-pcm] | eef4503 | 12 | 5789 | |
| 49 | +| [games][games] | 3d3bd31 | 12 | 4953 | |
| 50 | +| [monae][monae] | 9d0e461 | 18 | 6655 | |
| 51 | +| [reglang][reglang] | da333e0 | 12 | 3033 | |
| 52 | +| [toychain][toychain] | 97bd697 | 14 | 5275 | |
| 53 | +| [infotheo][infotheo] | 6c17242 | 81 | 42295 | |
| 54 | + |
| 55 | +A [research paper][arxiv-paper] describes the corpus in more detail |
| 56 | +and provides additional statistics. |
| 57 | + |
| 58 | +[finmap]: https://github.com/math-comp/finmap |
| 59 | +[fourcolor]: https://github.com/math-comp/fourcolor |
| 60 | +[math-comp]: https://github.com/math-comp/math-comp |
| 61 | +[odd-order]: https://github.com/math-comp/odd-order |
| 62 | +[analysis]: https://github.com/math-comp/analysis |
| 63 | +[bigenough]: https://github.com/math-comp/bigenough |
| 64 | +[elliptic-curves]: https://github.com/strub/elliptic-curves-ssr |
| 65 | +[grobner]: https://github.com/thery/grobner |
| 66 | +[multinomials]: https://github.com/math-comp/multinomials |
| 67 | +[real-closed]: https://github.com/math-comp/real-closed |
| 68 | +[robot]: https://github.com/affeldt-aist/coq-robot |
| 69 | +[two-square]: https://github.com/thery/twoSquare |
| 70 | +[bits]: https://github.com/coq-community/coq-bits |
| 71 | +[comp-dec-pdl]: https://github.com/palmskog/comp-dec-pdl |
| 72 | +[disel]: https://github.com/DistributedComponents/disel |
| 73 | +[fcsl-pcm]: https://github.com/imdea-software/fcsl-pcm |
| 74 | +[games]: https://github.com/gstew5/games |
| 75 | +[monae]: https://github.com/palmskog/monae |
| 76 | +[reglang]: https://github.com/palmskog/coq-reglang |
| 77 | +[toychain]: https://github.com/certichain/toychain |
| 78 | +[infotheo]: https://github.com/palmskog/infotheo |
| 79 | + |
| 80 | +[arxiv-paper]: https://arxiv.org/abs/2004.07761 |
| 81 | + |
| 82 | +## Applications |
| 83 | + |
| 84 | +The corpus was used to train and evaluate the tool |
| 85 | +[Roosterize][roosterize-website], which suggests lemma |
| 86 | +names in Coq code using neural networks. |
| 87 | + |
| 88 | +If you have used the corpus in a research project, please cite |
| 89 | +the corresponding research paper in any related publication: |
| 90 | +```bibtex |
| 91 | +@inproceedings{NieETAL20Roosterize, |
| 92 | + author = {Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos}, |
| 93 | + title = {Deep Generation of {Coq} Lemma Names Using Elaborated Terms}, |
| 94 | + booktitle = {International Joint Conference on Automated Reasoning}, |
| 95 | + pages = {To appear}, |
| 96 | + year = {2020}, |
| 97 | +} |
| 98 | +``` |
| 99 | + |
| 100 | +[roosterize-website]: https://github.com/EngineeringSoftware/roosterize |
| 101 | + |
| 102 | +## Example data |
| 103 | + |
| 104 | +The Coq lemma sentence |
| 105 | +```coq |
| 106 | +Lemma mg_eq_proof L1 L2 (N1 : mgClassifier L1) : L1 =i L2 -> nerode L2 N1. |
| 107 | +``` |
| 108 | +is serialized into the following tokens: |
| 109 | +```lisp |
| 110 | +(Sentence((IDENT Lemma)(IDENT mg_eq_proof)(IDENT L1)(IDENT L2) |
| 111 | + (KEYWORD"(")(IDENT N1)(KEYWORD :)(IDENT mgClassifier) |
| 112 | + (IDENT L1)(KEYWORD")")(KEYWORD :)(IDENT L1)(KEYWORD =i)(IDENT L2) |
| 113 | + (KEYWORD ->)(IDENT nerode)(IDENT L2)(IDENT N1)(KEYWORD .))) |
| 114 | +``` |
| 115 | +and the corresponding parsed syntax: |
| 116 | +```lisp |
| 117 | +(VernacExpr()(VernacStartTheoremProof Lemma (Id mg_eq_proof) |
| 118 | + (((CLocalAssum(Name(Id L1))(CHole()IntroAnonymous())) |
| 119 | + (CLocalAssum(Name(Id L2))(CHole()IntroAnonymous())) |
| 120 | + (CLocalAssum(Name(Id N1)) |
| 121 | + (CApp(CRef(Ser_Qualid(DirPath())(Id mgClassifier)))(CRef(Ser_Qualid(DirPath())(Id L1)))))) |
| 122 | + (CNotation(InConstrEntrySomeLevel"_ -> _") |
| 123 | + (CNotation(InConstrEntrySomeLevel"_ =i _") |
| 124 | + (CRef(Ser_Qualid(DirPath())(Id L1)))(CRef(Ser_Qualid(DirPath())(Id L2)))) |
| 125 | + (CApp(CRef(Ser_Qualid(DirPath())(Id nerode))) |
| 126 | + (CRef(Ser_Qualid(DirPath())(Id L2)))(CRef(Ser_Qualid(DirPath())(Id N1)))))))) |
| 127 | +``` |
| 128 | +and finally, the corresponding elaborated term: |
| 129 | +```lisp |
| 130 | +(Prod (Name (Id char)) ... (Prod (Name (Id L1)) ... |
| 131 | + (Prod (Name (Id L2)) ... (Prod (Name (Id N1)) ... |
| 132 | + (Prod Anonymous (App (Ref (DirPath ((Id ssrbool) (Id ssr) (Id Coq))) (Id eq_mem)) ... |
| 133 | + (Var (Id L1)) ... (Var (Id L2))) |
| 134 | + (App (Ref (DirPath ((Id myhill_nerode) (Id RegLang))) (Id nerode)) ... |
| 135 | + (Var (Id L2)) ... (Var (Id N1)))))))) |
| 136 | +``` |
| 137 | + |
| 138 | +## Authors |
| 139 | + |
| 140 | +- [Pengyu Nie](https://cozy.ece.utexas.edu/~pynie/) |
| 141 | +- [Karl Palmskog](https://setoid.com) |
| 142 | +- [Emilio Jesús Gallego Arias](https://www.irif.fr/~gallego/) |
| 143 | +- [Junyi Jessy Li](http://jessyli.com) |
| 144 | +- [Milos Gligoric](http://users.ece.utexas.edu/~gligoric/) |
0 commit comments