# APICoder - CodeGenAPI

Official repository for our paper ["When Language Model Meets Private Library"]().

---

## Overview

APIRetriever retrieves useful APIs for a programming problem, and APICoder then generates code that solves the problem with these APIs. We adopt the most straightforward approach for APICoder: prepending a set of API information to the context, where each piece of API information takes the form `name(signature):description`. This mimics how programmers properly learn APIs before writing code with them.
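
To make the prompt format concrete, here is a minimal sketch of how such a context might be assembled; the API entries and the problem below are hypothetical illustrations, not taken from our data:

```python
# A minimal sketch of prepending API information to the context.
# The API entries and the problem are hypothetical examples.
api_infos = [
    "load_dataset(name: str) -> Dataset: load a dataset by its registered name.",
    "fit_classifier(model, dataset, epochs: int = 10) -> Model: fit a classifier on the dataset.",
]
problem = "# Load the 'iris' dataset and train a classifier on it."

# Each entry follows the `name(signature):description` form and is placed
# in front of the programming problem before code generation.
prompt = "\n".join(api_infos) + "\n" + problem
print(prompt)
```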

<img src=https://s3.bmp.ovh/imgs/2022/09/27/3691aaf9d0421991.png width=650 />

Figure 1: The training process of CodeGenAPI.

## Project Directory
```shell
├── CodeGenAPI
│   ├── APICoder
│   │   ├── get_api_info_by_name.py
│   │   ├── get_lib_comment_for_eval.py
│   ├── apex
│   ├── eval_baseline.py
│   ├── eval_private.py
│   ├── nl2code
│   ├── requirements.txt
│   ├── run_generating_codes.sh    # Entry script for CodeGenAPI inference; generates many code snippets for each programming problem.
│   ├── run_evaluating_codes.sh    # Entry script for evaluating the generated code snippets and outputting the final results (pass@k).
│   ├── run_private.py
│   ├── run_private.sh             # Entry script for CodeGenAPI training.
│   └── scripts
│       ├── encode_private_data.py
│       ├── extract_api.py
│       ├── file_utils.py
│       ├── get_comments_from_evallibs.py
│       ├── get_libs_info_from_code.py
│       ├── make_human_in_the_loop_test_corpus.py
│       ├── multiprocessing_utils.py
│       ├── pycode_visitor.py
│       ├── requirements.txt
│       ├── run_details_apis.sh             # Extracts API information (name, signature, description, etc.) from the crawled API documentation of 35 libraries.
│       ├── run_encode_private_data.sh      # Encodes the private data.
│       ├── run_extract_apis.sh             # Crawls the API documentation of 31 off-the-shelf public libraries.
│       └── run_extract_details_from_apis.py
```

## Quickstart

This section covers environment setup, data preparation, model inference, and model training.

### Preparation

1. Configure your runtime environment:

```shell
$ cd PrivateLibrary/CodeGenAPI
$ pip install -r requirements.txt
```
In addition, if you would like to use FP16 mixed precision to speed up training, you need to install the apex library:
```shell
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir ./
```

2. Prepare the pre-trained models.

Download the pre-trained checkpoint (e.g., `CodeGenAPI-110M`) from Google Drive and place it in the corresponding folder (e.g., `CodeGenAPI/models/CodeGenAPI-110M`).

3. Update the scripts according to your local paths:

- Update `run_private.sh`.
- Update `run_generating_codes.sh`.
- Update `run_evaluating_codes.sh`.

### Use CodeGenAPI or other models

First, generate multiple code snippets for each programming problem (`run_generating_codes.sh`); then, evaluate the generated snippets (`run_evaluating_codes.sh`):

```shell
$ bash run_generating_codes.sh
$ bash run_evaluating_codes.sh
```

### Train CodeGenAPI

Train CodeGenAPI on the large-scale code corpus with the following command:

```shell
$ bash run_private.sh
```

## Experiments

In the inference phase, we set the `temperature` to one of `[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]`, the number of samples (`NUM_SAMPLES`) to `200`, the maximum number of generated tokens (`MAX_TOKENS`) to `100`, and `top_p` to `0.9`. The best number across these hyper-parameters is reported.
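
As a rough illustration, sampling with these hyper-parameters might look like the sketch below. It assumes the released checkpoint can be loaded with HuggingFace `transformers` and uses the local model folder from the Preparation section; batching the 200 samples is omitted for brevity:

```python
# A minimal sampling sketch, assuming a transformers-compatible checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "CodeGenAPI/models/CodeGenAPI-110M"  # adjust to your local path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,           # one value from the sweep [0.1, ..., 1.0]
    top_p=0.9,
    max_new_tokens=100,        # MAX_TOKENS
    num_return_sequences=10,   # up to NUM_SAMPLES=200, batched in practice
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```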

Here are the main results:

After running these extensive experiments, we draw the following observations and insights.

> (1) Prompting a set of API information is useful for the private-library-oriented code generation task.

> (2) Which API prompting strategy is best among Perfect, Top-$N$, and Human? In general, Perfect, Human, and Top-$N$ yield progressively decreasing benefits. However, Top-$N$ occasionally outperforms Perfect, since noise exists when training the model. We also observe that Top-$1$ and Top-$2$ usually work better than Top-$3$ and Top-$5$, because the latter introduce more noisy APIs.

> (3) Our continually pre-trained model is better at invoking APIs than its base model, and thus can further elevate the performance of code generation for private libraries in the majority of scenarios.

> (4) APIRetriever is capable of retrieving useful APIs.

> (5) Involving a human in the loop can further boost performance.

> (6) As $k$ in pass@$k$ grows larger, the gain brought by adding API information becomes larger.

> (7) Generating code that invokes private libraries is much more challenging than for public ones: even large models fail to do so if we do not prompt any APIs.

For more details, please refer to our paper.

## Citation

If you find our work useful, please cite the paper:
```
@inproceedings{APICoder,
  title={When Language Model Meets Private Library},
  author={Zan, Daoguang and Chen, Bei and Lin, Zeqi and Guan, Bei and Wang, Yongji and Lou, Jian-Guang},
  booktitle={Findings of EMNLP},
  year={2022}
}
```