@@ -4,26 +4,32 @@ A corpus of code for the [Coq proof assistant](https://coq.inria.fr) along with
44machine-readable representations, derived from verification
55projects related to the [ Mathematical Components] [ math-comp-website ] (MathComp) library.
66The corpus can be used, e.g., in machine learning and data mining applications.
7- Machine-readable representations were created using the [ SerAPI library] [ serapi-website ] .
7+ Machine-readable representations are in the form of [ S-expressions] [ sexp-link ] (sexps),
8+ and were created using the [ SerAPI library] [ serapi-website ] .
89
910[ math-comp-website ] : https://math-comp.github.io
1011[ serapi-website ] : https://github.com/ejgallego/coq-serapi
12+ [ sexp-link ] : https://en.wikipedia.org/wiki/S-expression
1113
1214## Obtaining the corpus
1315
1416The latest released version of the corpus can be [ downloaded] [ download-link ]
15- from GitHub. The archive includes both the source files written in Coq and files containing
16- machine-readable representations; the latter were created using [ Coq 8.10.2] [ coq-8102 ]
17- and [ SerAPI 0.7.1] [ serapi-071 ] , via the SerAPI programs ` sercomp ` , ` sertok ` , and ` sername ` .
17+ from GitHub. The archive includes both the Coq source files and corresponding
18+ machine-readable representations.
1819
1920[ download-link ] : https://github.com/EngineeringSoftware/math-comp-corpus/releases
20- [ coq-8102 ] : https://github.com/coq/coq/releases/tag/V8.10.2
21- [ serapi-071 ] : https://github.com/ejgallego/coq-serapi/releases
2221
2322## Corpus contents
2423
25- The latest release of the corpus, based on MathComp 1.9.0, consists of 449
26- source files from 21 Coq projects --- in total over 297k lines of code (LOC).
24+ The latest corpus release is based on [ Coq 8.10.2] [ coq-8102 ] and [ MathComp 1.9.0] [ mathcomp-190 ] ,
25+ and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC).
26+ For each source file (e.g., ` theory.v ` ), there are two corresponding files with lists of
27+ sexps, organized at the Coq sentence level, for tokens (` theory.tok.sexp ` )
28+ and abstract syntax trees (` theory.ast.sexp ` ). Moreover, three sexp representations
29+ are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and
30+ elaborated term. All machine-readable representations were created using [ SerAPI 0.7.1] [ serapi-071 ] ,
31+ via the SerAPI-bundled programs ` sercomp ` , ` sertok ` , and ` sername ` .
32+
2733A [ research paper] [ arxiv-paper ] describes the corpus in more detail
2834and provides additional statistics. The corpus is divided into three tiers based
2935on how well projects conform to the [ MathComp conventions] [ math-comp-contrib ] .
@@ -56,12 +62,11 @@ The structure of the corpus is as follows:
5662
5763| File/directory | Contents |
5864| :-------------------------------| :---------------------------------------------------------------------------------------|
59- | ` projects-standalone-8.10. yml ` | List of projects, along with their URL, SHA, build command, installation command, etc. |
65+ | ` projects. yml ` | List of projects, along with their URL, SHA, build command, installation command, etc. |
6066| ` raw-files ` | Project source files and their machine-readable representations. |
61- | ` lemmas ` | Lemmas extracted from the corpus. |
62- | ` lemmas-filtered ` | Subset of lemmas obeying restrictions on the maximum sizes of their elaborated terms. |
67+ | ` lemmas ` | Lemmas for all projects in the corpus and their machine-readable statements. |
68+ | ` lemmas-filtered ` | Subset of lemmas obeying restrictions on the maximum sizes of elaborated terms. |
6369| ` definitions ` | Definitions extracted from the corpus. |
64- | ` data-indexes ` | Data splitting used in the research paper. |
6570
6671[ finmap ] : https://github.com/math-comp/finmap
6772[ fourcolor ] : https://github.com/math-comp/fourcolor
@@ -92,6 +97,9 @@ The structure of the corpus is as follows:
9297[ apache2 ] : https://spdx.org/licenses/Apache-2.0.html
9398[ bsd2 ] : https://spdx.org/licenses/BSD-2-Clause.html
9499
100+ [ coq-8102 ] : https://github.com/coq/coq/releases/tag/V8.10.2
101+ [ mathcomp-190 ] : https://github.com/math-comp/math-comp/releases/tag/mathcomp-1.9.0
102+ [ serapi-071 ] : https://github.com/ejgallego/coq-serapi/releases/tag/8.10.0%2B0.7.1
95103[ arxiv-paper ] : https://arxiv.org/abs/2004.07761
96104[ math-comp-contrib ] : https://github.com/math-comp/math-comp/blob/mathcomp-1.9.0/CONTRIBUTING.md
97105
0 commit comments