Skip to content

Commit f8a6d69

Browse files
committed
initial commit with README
0 parents  commit f8a6d69

File tree

1 file changed

+144
-0
lines changed

1 file changed

+144
-0
lines changed

README.md

Lines changed: 144 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,144 @@
1+
# MathComp Corpus
2+
3+
A corpus of code for the [Coq proof assistant](https://coq.inria.fr) along with several
4+
machine-readable representations, derived from verification
5+
projects related to the [Mathematical Components][mathcomp-website] (MathComp) library.
6+
The corpus can be used, e.g., in machine learning and data mining applications.
7+
Machine-readable representations were created using the [SerAPI library][serapi-website].
8+
9+
[mathcomp-website]: https://math-comp.github.io
10+
[serapi-website]: https://github.com/ejgallego/coq-serapi
11+
12+
<b>We are actively updating this repository and will make it ready by
13+
the end of May. Stay tuned!</b>
14+
15+
## Obtaining the corpus
16+
17+
The latest released version of the corpus can be [downloaded][download-link]
18+
from GitHub. The archive includes both the source files written in Coq and files containing
19+
machine-readable representations; the latter were created using [Coq 8.10.2][coq-8102]
20+
and [SerAPI 0.7.1][serapi-071], via the SerAPI programs `sercomp`, `sertok`, and `sername`.
21+
22+
[download-link]: https://github.com/EngineeringSoftware/mathcomp-corpus/releases
23+
[coq-8102]: https://github.com/coq/coq/releases/tag/V8.10.2
24+
[serapi-071]: https://github.com/ejgallego/coq-serapi/releases
25+
26+
## Corpus contents
27+
28+
The latest release of the corpus, based on MathComp 1.9.0, consists of 449 source files
29+
from 21 Coq projects --- in total over 297k lines of code (LOC).
30+
31+
| Project | Revision SHA | No. files | LOC |
32+
|:---------------------------------- |:------------:|-----------:|-------:|
33+
| [finmap][finmap] | 27642a8 | 4 | 6451 |
34+
| [fourcolor][fourcolor] | 0851d49 | 60 | 37138 |
35+
| [math-comp][math-comp] | 748d716 | 89 | 84713 |
36+
| [odd-order][odd-order] | ca602a4 | 34 | 36125 |
37+
| [analysis][analysis] | 9e5fe1d | 17 | 11739 |
38+
| [bigenough][bigenough] | 5f79a32 | 1 | 78 |
39+
| [elliptic-curves][elliptic-curves] | 631af89 | 18 | 9596 |
40+
| [grobner][grobner] | dfa54f9 | 1 | 1330 |
41+
| [multinomials][multinomials] | 691d795 | 5 | 7363 |
42+
| [real-closed][real-closed] | 495a1fa | 10 | 8925 |
43+
| [robot][robot] | b341ad1 | 13 | 11130 |
44+
| [two-square][two-square] | 1c09aca | 2 | 1721 |
45+
| [bits][bits] | 3cd298a | 10 | 4041 |
46+
| [comp-dec-pdl][comp-dec-pdl] | c1f813b | 16 | 4419 |
47+
| [disel][disel] | e8aa80c | 20 | 4473 |
48+
| [fcsl-pcm][fcsl-pcm] | eef4503 | 12 | 5789 |
49+
| [games][games] | 3d3bd31 | 12 | 4953 |
50+
| [monae][monae] | 9d0e461 | 18 | 6655 |
51+
| [reglang][reglang] | da333e0 | 12 | 3033 |
52+
| [toychain][toychain] | 97bd697 | 14 | 5275 |
53+
| [infotheo][infotheo] | 6c17242 | 81 | 42295 |
54+
55+
A [research paper][arxiv-paper] describes the corpus in more detail
56+
and provides additional statistics.
57+
58+
[finmap]: https://github.com/math-comp/finmap
59+
[fourcolor]: https://github.com/math-comp/fourcolor
60+
[math-comp]: https://github.com/math-comp/math-comp
61+
[odd-order]: https://github.com/math-comp/odd-order
62+
[analysis]: https://github.com/math-comp/analysis
63+
[bigenough]: https://github.com/math-comp/bigenough
64+
[elliptic-curves]: https://github.com/strub/elliptic-curves-ssr
65+
[grobner]: https://github.com/thery/grobner
66+
[multinomials]: https://github.com/math-comp/multinomials
67+
[real-closed]: https://github.com/math-comp/real-closed
68+
[robot]: https://github.com/affeldt-aist/coq-robot
69+
[two-square]: https://github.com/thery/twoSquare
70+
[bits]: https://github.com/coq-community/coq-bits
71+
[comp-dec-pdl]: https://github.com/palmskog/comp-dec-pdl
72+
[disel]: https://github.com/DistributedComponents/disel
73+
[fcsl-pcm]: https://github.com/imdea-software/fcsl-pcm
74+
[games]: https://github.com/gstew5/games
75+
[monae]: https://github.com/palmskog/monae
76+
[reglang]: https://github.com/palmskog/coq-reglang
77+
[toychain]: https://github.com/certichain/toychain
78+
[infotheo]: https://github.com/palmskog/infotheo
79+
80+
[arxiv-paper]: https://arxiv.org/abs/2004.07761
81+
82+
## Applications
83+
84+
The corpus was used to train and evaluate the tool
85+
[Roosterize][roosterize-website], which suggests lemma
86+
names in Coq code using neural networks.
87+
88+
If you have used the corpus in a research project, please cite
89+
the corresponding research paper in any related publication:
90+
```bibtex
91+
@inproceedings{NieETAL20Roosterize,
92+
author = {Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos},
93+
title = {Deep Generation of {Coq} Lemma Names Using Elaborated Terms},
94+
booktitle = {International Joint Conference on Automated Reasoning},
95+
pages = {To appear},
96+
year = {2020},
97+
}
98+
```
99+
100+
[roosterize-website]: https://github.com/EngineeringSoftware/roosterize
101+
102+
## Example data
103+
104+
The Coq lemma sentence
105+
```coq
106+
Lemma mg_eq_proof L1 L2 (N1 : mgClassifier L1) : L1 =i L2 -> nerode L2 N1.
107+
```
108+
is serialized into the following tokens:
109+
```lisp
110+
(Sentence((IDENT Lemma)(IDENT mg_eq_proof)(IDENT L1)(IDENT L2)
111+
(KEYWORD"(")(IDENT N1)(KEYWORD :)(IDENT mgClassifier)
112+
(IDENT L1)(KEYWORD")")(KEYWORD :)(IDENT L1)(KEYWORD =i)(IDENT L2)
113+
(KEYWORD ->)(IDENT nerode)(IDENT L2)(IDENT N1)(KEYWORD .)))
114+
```
115+
and the corresponding parsed syntax:
116+
```lisp
117+
(VernacExpr()(VernacStartTheoremProof Lemma (Id mg_eq_proof)
118+
(((CLocalAssum(Name(Id L1))(CHole()IntroAnonymous()))
119+
(CLocalAssum(Name(Id L2))(CHole()IntroAnonymous()))
120+
(CLocalAssum(Name(Id N1))
121+
(CApp(CRef(Ser_Qualid(DirPath())(Id mgClassifier)))(CRef(Ser_Qualid(DirPath())(Id L1))))))
122+
(CNotation(InConstrEntrySomeLevel"_ -> _")
123+
(CNotation(InConstrEntrySomeLevel"_ =i _")
124+
(CRef(Ser_Qualid(DirPath())(Id L1)))(CRef(Ser_Qualid(DirPath())(Id L2))))
125+
(CApp(CRef(Ser_Qualid(DirPath())(Id nerode)))
126+
(CRef(Ser_Qualid(DirPath())(Id L2)))(CRef(Ser_Qualid(DirPath())(Id N1))))))))
127+
```
128+
and finally, the corresponding elaborated term:
129+
```lisp
130+
(Prod (Name (Id char)) ... (Prod (Name (Id L1)) ...
131+
(Prod (Name (Id L2)) ... (Prod (Name (Id N1)) ...
132+
(Prod Anonymous (App (Ref (DirPath ((Id ssrbool) (Id ssr) (Id Coq))) (Id eq_mem)) ...
133+
(Var (Id L1)) ... (Var (Id L2)))
134+
(App (Ref (DirPath ((Id myhill_nerode) (Id RegLang))) (Id nerode)) ...
135+
(Var (Id L2)) ... (Var (Id N1))))))))
136+
```
137+
138+
## Authors
139+
140+
- [Pengyu Nie](https://cozy.ece.utexas.edu/~pynie/)
141+
- [Karl Palmskog](https://setoid.com)
142+
- [Emilio Jesús Gallego Arias](https://www.irif.fr/~gallego/)
143+
- [Junyi Jessy Li](http://jessyli.com)
144+
- [Milos Gligoric](http://users.ece.utexas.edu/~gligoric/)

0 commit comments

Comments
 (0)