@@ -4,26 +4,32 @@ A corpus of code for the [Coq proof assistant](https://coq.inria.fr) along with
44machine-readable representations, derived from verification
55projects related to the [ Mathematical Components] [ math-comp-website ]  (MathComp) library.
66The corpus can be used, e.g., in machine learning and data mining applications.
7- Machine-readable representations were created using the [ SerAPI library] [ serapi-website ] .
7+ Machine-readable representations are in the form of [ S-expressions] [ sexp-link ]  (sexps),
8+ and were created using the [ SerAPI library] [ serapi-website ] .
89
910[ math-comp-website ] : https://math-comp.github.io 
1011[ serapi-website ] : https://github.com/ejgallego/coq-serapi 
12+ [ sexp-link ] : https://en.wikipedia.org/wiki/S-expression 
1113
1214## Obtaining the corpus  
1315
1416The latest released version of the corpus can be [ downloaded] [ download-link ] 
15- from GitHub. The archive includes both the source files written in Coq and files containing
16- machine-readable representations; the latter were created using [ Coq 8.10.2] [ coq-8102 ] 
17- and [ SerAPI 0.7.1] [ serapi-071 ] , via the SerAPI programs ` sercomp ` , ` sertok ` , and ` sername ` .
17+ from GitHub. The archive includes both the Coq source files and corresponding
18+ machine-readable representations.
1819
1920[ download-link ] : https://github.com/EngineeringSoftware/math-comp-corpus/releases 
20- [ coq-8102 ] : https://github.com/coq/coq/releases/tag/V8.10.2 
21- [ serapi-071 ] : https://github.com/ejgallego/coq-serapi/releases 
2221
2322## Corpus contents  
2423
25- The latest release of the corpus, based on MathComp 1.9.0, consists of 449
26- source files from 21 Coq projects --- in total over 297k lines of code (LOC).
24+ The latest corpus release is based on [ Coq 8.10.2] [ coq-8102 ]  and [ MathComp 1.9.0] [ mathcomp-190 ] ,
25+ and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC).
26+ For each source file (e.g., ` theory.v ` ), there are two corresponding files with lists of
27+ sexps, organized at the Coq sentence level, for tokens (` theory.tok.sexp ` )
28+ and abstract syntax trees (` theory.ast.sexp ` ). Moreover, three sexp representations
29+ are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and
30+ elaborated term. All machine-readable representations were created using [ SerAPI 0.7.1] [ serapi-071 ] ,
31+ via the SerAPI-bundled programs ` sercomp ` , ` sertok ` , and ` sername ` .
32+ 
2733A [ research paper] [ arxiv-paper ]  describes the corpus in more detail
2834and provides additional statistics. The corpus is divided into three tiers based
2935on how well projects conform to the [ MathComp conventions] [ math-comp-contrib ] .
@@ -56,12 +62,11 @@ The structure of the corpus is as follows:
5662
5763|  File/directory                 |  Contents                                                                               | 
5864| :-------------------------------| :---------------------------------------------------------------------------------------| 
59- |  ` projects-standalone-8.10. yml `  |  List of projects, along with their URL, SHA, build command, installation command, etc. | 
65+ |  ` projects. yml `                   |  List of projects, along with their URL, SHA, build command, installation command, etc. | 
6066|  ` raw-files `                     |  Project source files and their machine-readable representations.                       | 
61- |  ` lemmas `                        |  Lemmas extracted from  the corpus.                                                       | 
62- |  ` lemmas-filtered `               |  Subset of lemmas obeying restrictions on the maximum sizes of their  elaborated terms.  | 
67+ |  ` lemmas `                        |  Lemmas for all projects in  the corpus and their machine-readable statements.            | 
68+ |  ` lemmas-filtered `               |  Subset of lemmas obeying restrictions on the maximum sizes of elaborated terms.         | 
6369|  ` definitions `                   |  Definitions extracted from the corpus.                                                 | 
64- |  ` data-indexes `                  |  Data splitting used in the research paper.                                             | 
6570
6671[ finmap ] : https://github.com/math-comp/finmap 
6772[ fourcolor ] : https://github.com/math-comp/fourcolor 
@@ -92,6 +97,9 @@ The structure of the corpus is as follows:
9297[ apache2 ] : https://spdx.org/licenses/Apache-2.0.html 
9398[ bsd2 ] : https://spdx.org/licenses/BSD-2-Clause.html 
9499
100+ [ coq-8102 ] : https://github.com/coq/coq/releases/tag/V8.10.2 
101+ [ mathcomp-190 ] : https://github.com/math-comp/math-comp/releases/tag/mathcomp-1.9.0 
102+ [ serapi-071 ] : https://github.com/ejgallego/coq-serapi/releases/tag/8.10.0%2B0.7.1 
95103[ arxiv-paper ] : https://arxiv.org/abs/2004.07761 
96104[ math-comp-contrib ] : https://github.com/math-comp/math-comp/blob/mathcomp-1.9.0/CONTRIBUTING.md 
97105
0 commit comments