|
1 | | -# codegen-on-oss |
| 1 | +# Overview |
2 | 2 |
|
3 | | -[](https://img.shields.io/github/v/release/clee-codegen/codegen-on-oss) |
4 | | -[](https://github.com/clee-codegen/codegen-on-oss/actions/workflows/main.yml?query=branch%3Amain) |
5 | | -[](https://codecov.io/gh/clee-codegen/codegen-on-oss) |
6 | | -[](https://img.shields.io/github/commit-activity/m/clee-codegen/codegen-on-oss) |
7 | | -[](https://img.shields.io/github/license/clee-codegen/codegen-on-oss) |
| 3 | +The **Codegen on OSS** package provides a modular pipeline that: |
8 | 4 |
|
9 | | -Testing codegen parsing on popular OSS repositories |
| 5 | +- **Collects repository URLs** from different sources (e.g., CSV files or GitHub searches). |
| 6 | +- **Parses repositories** using the codegen tool. |
| 7 | +- **Profiles performance** and logs metrics for each parsing run. |
| 8 | +- **Logs errors** to help pinpoint parsing failures or performance bottlenecks. |
10 | 9 |
|
11 | | -- **Github repository**: <https://github.com/clee-codegen/codegen-on-oss/> |
12 | | -- **Documentation** <https://clee-codegen.github.io/codegen-on-oss/> |
| 10 | +______________________________________________________________________ |
13 | 11 |
|
14 | | -### Set Up Your Development Environment |
| 12 | +## Package Structure |
15 | 13 |
|
16 | | -install the environment and the pre-commit hooks with |
| 14 | +The package is composed of several modules: |
17 | 15 |
|
18 | | -```bash |
19 | | -make install |
20 | | -``` |
| 16 | +- `sources` |
| 17 | + |
| 18 | + - Defines the repository source classes and their settings. All settings are configurable via environment variables.
| 19 | + |
| 20 | + - GitHub Source
| 21 | + |
| 22 | + ```python |
| 23 | + class GithubSettings(SourceSettings): |
| 24 | + language: Literal["python", "typescript"] = "python" |
| 25 | + heuristic: Literal[ |
| 26 | + "stars", |
| 27 | + "forks", |
| 28 | + "updated", |
| 29 | + # "watchers", |
| 30 | + # "contributors", |
| 31 | + # "commit_activity", |
| 32 | + # "issues", |
| 33 | + # "dependency", |
| 34 | + ] = "stars" |
| 35 | + github_token: str | None = None |
| 36 | + ``` |
| 37 | + |
| 38 | + - The three options available now are the three supported by the GitHub API.
| 39 | + - Future work: additional options will require different strategies.
| 40 | + |
| 41 | + - CSV Source |
| 42 | + |
| 43 | + - Simply reads repository URLs from a CSV file.
| 44 | + |
| 45 | +- `cache` |
| 46 | + |
| 47 | + - Currently only specifies the cache directory, which is used for caching git repositories pulled by the pipeline. Use `--force-pull` to re-pull from the remote.
| 48 | + |
| 49 | +- `cli` |
| 50 | + |
| 51 | + - Built with Click, the CLI provides two main commands: |
| 52 | + - `run-one`: Parses a single repository specified by URL. |
| 53 | + - `run`: Iterates over repositories obtained from a selected source and parses each one. |
| 54 | + |
| 55 | +- `metrics`
| 56 | + |
| 57 | + - Provides profiling tools to measure performance during the parse: |
| 58 | + - `MetricsProfiler`: A context manager that creates a profiling session. |
| 59 | + - `MetricsProfile`: Represents a "span" or a "run" of a specific repository. Records step-by-step metrics (clock duration, CPU time, memory usage) and writes them to the CSV file specified by `--output-path`.
| 60 | + |
| 61 | +- `parser`
| 62 | + |
| 63 | + Contains the `CodegenParser` class that orchestrates the parsing process: |
21 | 64 |
|
22 | | -This will also generate your `uv.lock` file |
| 65 | + - Clones the repository (or forces a pull if specified). |
| 66 | + - Initializes a `Codebase` (from the codegen tool). |
| 67 | + - Runs post-initialization validation. |
| 68 | + - Integrates with the `MetricsProfiler` to log measurements at key steps. |
| 69 | + |
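The interplay between the parser and the profiler described above can be sketched with a generator-based context manager. This is a minimal illustration, not the package's implementation: `SketchProfile`, its `measure` method, and the reduced column set are assumptions made for the example, and memory tracking is left out to keep it short.

```python
import csv
import io
import time
from contextlib import contextmanager

FIELDS = ["profile_name", "step", "delta_time", "cumulative_time", "cpu_time", "error"]


class SketchProfile:
    """Hypothetical stand-in for one profiled run of a single repository."""

    def __init__(self, name):
        self.name = name
        self.start = time.perf_counter()
        self.last = self.start
        self.rows = []

    def measure(self, step, error=None):
        """Record wall-clock and CPU time for the step that just finished."""
        now = time.perf_counter()
        self.rows.append({
            "profile_name": self.name,
            "step": step,
            "delta_time": now - self.last,
            "cumulative_time": now - self.start,
            "cpu_time": time.process_time(),
            "error": error,
        })
        self.last = now


@contextmanager
def start_profiler(name, writer):
    """Yield a profile, then always write its rows plus a TOTAL row."""
    profile = SketchProfile(name)
    error = None
    try:
        yield profile
    except Exception as exc:
        error = str(exc)
        raise
    finally:
        profile.last = profile.start  # make the TOTAL row span the whole run
        profile.measure("TOTAL", error=error)
        writer.writerows(profile.rows)


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()

with start_profiler("example/repo", writer) as profile:
    sum(range(100_000))  # placeholder for the real parsing work
    profile.measure("codebase_init")
    profile.measure("post_init_validation")

print(buffer.getvalue())
```

Writing the rows in a `finally` clause means a failed run still produces a `TOTAL` row carrying the error, which matches the shape of the example output later in this document.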
| 70 | +______________________________________________________________________ |
| 71 | + |
| 72 | +## Getting Started |
| 73 | + |
| 74 | +1. **Configure the Repository Source** |
| 75 | + |
| 76 | + Decide whether you want to read from a CSV file or query GitHub: |
| 77 | + |
| 78 | + - For CSV, ensure that your CSV file (default: `input.csv`) exists, with repository URLs in the first column (`repo_url`) and commit hashes in the second column (`commit_hash`), which may be left empty.
| 79 | + - For GitHub, configure your desired settings (e.g., `language`, `heuristic`, and optionally a GitHub token) via environment variables (`GITHUB_` prefix).
| 80 | + |
| 81 | +1. **Run the Parser** |
| 82 | + |
| 83 | + Use the CLI to start parsing: |
| 84 | + |
| 85 | + - To parse one repository: |
| 86 | + |
| 87 | + ```bash |
| 88 | + uv run cgparse run-one --help |
| 89 | + ``` |
| 90 | + |
| 91 | + - To parse multiple repositories from a source: |
| 92 | + |
| 93 | + ```bash |
| 94 | + uv run cgparse run --help |
| 95 | + ``` |
| 96 | + |
| 97 | +1. **Review Metrics and Logs** |
| 98 | + |
| 99 | + After parsing, check the metrics CSV (default: `metrics.csv`) to review performance measurements per repository. Error logs are written to the specified error output file (default: `errors.log`).
| 100 | + |
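To make the environment-variable configuration from step 1 concrete, here is a small standard-library sketch. It assumes each variable name is the `GITHUB_` prefix followed by the upper-cased field name; that naming rule, the `GithubSettingsSketch` class, and the `load_github_settings` helper are illustrative assumptions, so check the package's settings classes for the names it actually reads.

```python
import os
from dataclasses import dataclass, fields
from typing import Optional

# Allowed values mirror the Literal types shown in the settings class above.
ALLOWED = {
    "language": ("python", "typescript"),
    "heuristic": ("stars", "forks", "updated"),
}


@dataclass(frozen=True)
class GithubSettingsSketch:
    """Illustrative stand-in; not the package's GithubSettings class."""

    language: str = "python"
    heuristic: str = "stars"
    github_token: Optional[str] = None


def load_github_settings(environ=None, prefix="GITHUB_"):
    """Build settings from prefixed variables, falling back to the defaults."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(GithubSettingsSketch):
        key = prefix + field.name.upper()  # assumed naming rule
        if key in environ:
            overrides[field.name] = environ[key]
    settings = GithubSettingsSketch(**overrides)
    for name, allowed in ALLOWED.items():
        value = getattr(settings, name)
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
    return settings


# A plain dict stands in for the process environment here.
settings = load_github_settings({"GITHUB_LANGUAGE": "typescript"})
print(settings)
```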
| 101 | +______________________________________________________________________ |
23 | 102 |
|
24 | | -### pre-commit hooks |
| 103 | +## Modal Integration for Cloud Parsing |
25 | 104 |
|
26 | | -```bash |
27 | | -uv run pre-commit run -a |
| 105 | +```shell |
| 106 | +$ uv run modal run modal_run.py |
28 | 107 | ``` |
29 | 108 |
|
| 109 | +By default, the parser runs against the `input.csv` file tracked in this repository.
| 110 | + |
| 111 | +### Modal Configuration |
| 112 | + |
| 113 | +- **Compute Resources**: Allocates 4 CPUs and 16 GB of memory.
| 114 | +- **Secrets & Volumes**: Uses secrets (for bucket credentials) and mounts a volume for caching repositories.
| 115 | +- **Image Setup**: Builds on a Debian slim image with Python 3.12 and installs the required packages (`uv` and `git`).
| 116 | +- **Environment Configuration**: Environment variables (e.g., GitHub settings) are injected at runtime. |
| 117 | + |
| 118 | +The function `parse_repo_on_modal` performs the following steps: |
| 119 | + |
| 120 | +1. **Environment Setup**: Updates environment variables and configures logging using Loguru. |
| 121 | +1. **Source Initialization**: Creates a repository source based on the provided type (e.g., GitHub). |
| 122 | +1. **Metrics Profiling**: Instantiates `MetricsProfiler` to capture and log performance data. |
| 123 | +1. **Repository Parsing**: Iterates over repository URLs and parses each using the `CodegenParser`. |
| 124 | +1. **Error Handling**: Logs any exceptions encountered during parsing. |
| 125 | +1. **Result Upload**: Uses the `BucketStore` class to upload the configuration, logs, and metrics to an S3 bucket. |
| 126 | + |
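The parsing and error-handling steps above boil down to a loop that logs a failure and moves on to the next repository. The sketch below shows only that pattern: `parse_all` and `fake_parse` are hypothetical names, the standard `logging` module stands in for Loguru, and no Modal or bucket calls are involved.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("parse_sketch")


def parse_all(repo_urls, parse_one):
    """Parse each URL; log failures and continue with the next repository."""
    failures = {}
    for url in repo_urls:
        logger.info("Parsing repository %s", url)
        try:
            parse_one(url)
        except Exception as exc:
            # logger.exception records the traceback alongside the message.
            logger.exception("Repository: %s", url)
            failures[url] = str(exc)
    return failures


def fake_parse(url):
    """Hypothetical parser that rejects one URL to exercise the error path."""
    if url.endswith("broken.git"):
        raise RuntimeError("LOW_IMPORT_RESOLUTION_RATE")


failures = parse_all(
    ["https://example.com/ok.git", "https://example.com/broken.git"],
    fake_parse,
)
print(failures)
```

Catching the exception per repository keeps one bad repository from aborting the whole batch, which is why the example log further down continues with the next clone after an error.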
| 127 | +### Bucket Storage |
| 128 | + |
| 129 | +**Bucket (public):** [codegen-oss-parse](https://s3.amazonaws.com/codegen-oss-parse/) |
| 130 | + |
| 131 | +The results of each run are saved under a prefix built from the version of the `codegen` library installed for the run and the source type the run used. Within this prefix:
| 132 | + |
| 133 | +- Source Settings |
| 134 | + - `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/config.json` |
| 135 | +- Metrics |
| 136 | + - `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/metrics.csv` |
| 137 | +- Logs |
| 138 | + - `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/output.logs` |
| 139 | + |
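The URL layout above can be expressed as a small helper. This is only a sketch of the naming scheme: `result_urls` is a hypothetical function, and the version string `0.1.0` is a placeholder rather than a real release.

```python
BUCKET_URL = "https://s3.amazonaws.com/codegen-oss-parse"

# File names taken from the layout listed above.
ARTIFACTS = {
    "config": "config.json",
    "metrics": "metrics.csv",
    "logs": "output.logs",
}


def result_urls(version, source):
    """Return the artifact URLs for one codegen version and source type."""
    prefix = f"{BUCKET_URL}/{version}/{source}"
    return {name: f"{prefix}/{filename}" for name, filename in ARTIFACTS.items()}


for name, url in result_urls("0.1.0", "csv").items():
    print(f"{name}: {url}")
```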
30 | 140 | ______________________________________________________________________ |
31 | 141 |
|
32 | | -Repository initiated with [fpgmaas/cookiecutter-uv](https://github.com/fpgmaas/cookiecutter-uv). |
| 142 | +## Extensibility |
| 143 | + |
| 144 | +**Adding New Sources:** |
| 145 | + |
| 146 | +You can define additional repository sources by subclassing `RepoSource` and providing a corresponding settings class. Make sure to set the `source_type` and register your new source by following the pattern established in `CSVInputSource` or `GithubSource`. |
| 147 | + |
| 148 | +**Improving Testing:** |
| 149 | + |
| 150 | +The detailed metrics collected can help you understand where parsing failures occur or where performance lags. Use these insights to improve error handling and optimize the codegen parsing logic. |
| 151 | + |
| 152 | +**Containerization and Automation:** |
| 153 | + |
| 154 | +There is a Dockerfile that can be used to create an image capable of running the parse tests. Runtime environment variables can be used to configure the run and output. |
| 155 | + |
| 156 | +**Input & Configuration:**
| 157 | +
| 158 | +Future work: explore a better CLI for providing options to the Modal run.
| 159 | + |
| 160 | +______________________________________________________________________ |
| 161 | + |
| 162 | +## Example Log Output |
| 163 | + |
| 164 | +```shell |
| 165 | +[codegen-on-oss*] codegen/codegen-on-oss/$ uv run cgparse run --source csv |
| 166 | + 21:32:36 INFO Cloning repository https://github.com/JohnSnowLabs/spark-nlp.git |
| 167 | + 21:36:57 INFO { |
| 168 | + "profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git", |
| 169 | + "step": "codebase_init", |
| 170 | + "delta_time": 7.186550649999845, |
| 171 | + "cumulative_time": 7.186550649999845, |
| 172 | + "cpu_time": 180.3553702, |
| 173 | + "memory_usage": 567525376, |
| 174 | + "memory_delta": 317095936, |
| 175 | + "error": null |
| 176 | +} |
| 177 | + 21:36:58 INFO { |
| 178 | + "profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git", |
| 179 | + "step": "post_init_validation", |
| 180 | + "delta_time": 0.5465090990001045, |
| 181 | + "cumulative_time": 7.733059748999949, |
| 182 | + "cpu_time": 180.9174761, |
| 183 | + "memory_usage": 569249792, |
| 184 | + "memory_delta": 1724416, |
| 185 | + "error": null |
| 186 | +} |
| 187 | + 21:36:58 ERROR Repository: https://github.com/JohnSnowLabs/spark-nlp.git |
| 188 | +Traceback (most recent call last): |
| 189 | + |
| 190 | + File "/home/codegen/codegen/codegen-on-oss/.venv/bin/cgparse", line 10, in <module> |
| 191 | + sys.exit(cli()) |
| 192 | + │ │ └ <Group cli> |
| 193 | + │ └ <built-in function exit> |
| 194 | + └ <module 'sys' (built-in)> |
| 195 | + File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1161, in __call__ |
| 196 | + return self.main(*args, **kwargs) |
| 197 | + │ │ │ └ {} |
| 198 | + │ │ └ () |
| 199 | + │ └ <function BaseCommand.main at 0x7f4665c15120> |
| 200 | + └ <Group cli> |
| 201 | + File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1082, in main |
| 202 | + rv = self.invoke(ctx) |
| 203 | + │ │ └ <click.core.Context object at 0x7f4665f3c9e0> |
| 204 | + │ └ <function MultiCommand.invoke at 0x7f4665c16340> |
| 205 | + └ <Group cli> |
| 206 | + File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1697, in invoke |
| 207 | + return _process_result(sub_ctx.command.invoke(sub_ctx)) |
| 208 | + │ │ │ │ └ <click.core.Context object at 0x7f4665989b80> |
| 209 | + │ │ │ └ <function Command.invoke at 0x7f4665c15d00> |
| 210 | + │ │ └ <Command run> |
| 211 | + │ └ <click.core.Context object at 0x7f4665989b80> |
| 212 | + └ <function MultiCommand.invoke.<locals>._process_result at 0x7f466597fb00> |
| 213 | + File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1443, in invoke |
| 214 | + return ctx.invoke(self.callback, **ctx.params) |
| 215 | + │ │ │ │ │ └ {'source': 'csv', 'output_path': 'metrics.csv', 'error_output_path': 'errors.log', 'cache_dir': PosixPath('/home/.cache... |
| 216 | + │ │ │ │ └ <click.core.Context object at 0x7f4665989b80> |
| 217 | + │ │ │ └ <function run at 0x7f466145eac0> |
| 218 | + │ │ └ <Command run> |
| 219 | + │ └ <function Context.invoke at 0x7f4665c14680> |
| 220 | + └ <click.core.Context object at 0x7f4665989b80> |
| 221 | + File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 788, in invoke |
| 222 | + return __callback(*args, **kwargs) |
| 223 | + │ └ {'source': 'csv', 'output_path': 'metrics.csv', 'error_output_path': 'errors.log', 'cache_dir': PosixPath('/home/.cache... |
| 224 | + └ () |
| 225 | + |
| 226 | + File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/cli.py", line 121, in run |
| 227 | + parser.parse(repo_url) |
| 228 | + │ │ └ 'https://github.com/JohnSnowLabs/spark-nlp.git' |
| 229 | + │ └ <function CodegenParser.parse at 0x7f4664b014e0> |
| 230 | + └ <codegen_on_oss.parser.CodegenParser object at 0x7f46612def30> |
| 231 | + |
| 232 | + File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/parser.py", line 52, in parse |
| 233 | + with self.metrics_profiler.start_profiler( |
| 234 | + │ │ └ <function MetricsProfiler.start_profiler at 0x7f466577d760> |
| 235 | + │ └ <codegen_on_oss.metrics.MetricsProfiler object at 0x7f465e6c2e70> |
| 236 | + └ <codegen_on_oss.parser.CodegenParser object at 0x7f46612def30> |
| 237 | + |
| 238 | + File "/home/.local/share/uv/python/cpython-3.12.6-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 158, in __exit__ |
| 239 | + self.gen.throw(value) |
| 240 | + │ │ │ └ ParseRunError(<PostInitValidationStatus.LOW_IMPORT_RESOLUTION_RATE: 'LOW_IMPORT_RESOLUTION_RATE'>) |
| 241 | + │ │ └ <method 'throw' of 'generator' objects> |
| 242 | + │ └ <generator object MetricsProfiler.start_profiler at 0x7f4660478740> |
| 243 | + └ <contextlib._GeneratorContextManager object at 0x7f46657849e0> |
| 244 | + |
| 245 | +> File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/metrics.py", line 41, in start_profiler |
| 246 | + yield profile |
| 247 | + └ <codegen_on_oss.metrics.MetricsProfile object at 0x7f4665784a10> |
| 248 | + |
| 249 | + File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/parser.py", line 64, in parse |
| 250 | + raise ParseRunError(validation_status) |
| 251 | + │ └ <PostInitValidationStatus.LOW_IMPORT_RESOLUTION_RATE: 'LOW_IMPORT_RESOLUTION_RATE'> |
| 252 | + └ <class 'codegen_on_oss.parser.ParseRunError'> |
| 253 | + |
| 254 | +codegen_on_oss.parser.ParseRunError: LOW_IMPORT_RESOLUTION_RATE |
| 255 | + 21:36:58 INFO { |
| 256 | + "profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git", |
| 257 | + "step": "TOTAL", |
| 258 | + "delta_time": 7.740976418000173, |
| 259 | + "cumulative_time": 7.740976418000173, |
| 260 | + "cpu_time": 180.9221699, |
| 261 | + "memory_usage": 569249792, |
| 262 | + "memory_delta": 0, |
| 263 | + "error": "LOW_IMPORT_RESOLUTION_RATE" |
| 264 | +} |
| 265 | + 21:36:58 INFO Cloning repository https://github.com/Lightning-AI/lightning.git |
| 266 | + 21:37:53 INFO { |
| 267 | + "profile_name": "https://github.com/Lightning-AI/lightning.git", |
| 268 | + "step": "codebase_init", |
| 269 | + "delta_time": 24.256577352999557, |
| 270 | + "cumulative_time": 24.256577352999557, |
| 271 | + "cpu_time": 211.3604081, |
| 272 | + "memory_usage": 1535971328, |
| 273 | + "memory_delta": 966184960, |
| 274 | + "error": null |
| 275 | +} |
| 276 | + 21:37:53 INFO { |
| 277 | + "profile_name": "https://github.com/Lightning-AI/lightning.git", |
| 278 | + "step": "post_init_validation", |
| 279 | + "delta_time": 0.137609629000508, |
| 280 | + "cumulative_time": 24.394186982000065, |
| 281 | + "cpu_time": 211.5082702, |
| 282 | + "memory_usage": 1536241664, |
| 283 | + "memory_delta": 270336, |
| 284 | + "error": null |
| 285 | +} |
| 286 | + 21:37:53 INFO { |
| 287 | + "profile_name": "https://github.com/Lightning-AI/lightning.git", |
| 288 | + "step": "TOTAL", |
| 289 | + "delta_time": 24.394700584999555, |
| 290 | + "cumulative_time": 24.394700584999555, |
| 291 | + "cpu_time": 211.5088282, |
| 292 | + "memory_usage": 1536241664, |
| 293 | + "memory_delta": 0, |
| 294 | + "error": null |
| 295 | +} |
| 296 | +``` |
| 297 | +
|
| 298 | +## Example Metrics Output |
| 299 | +
|
| 300 | +| profile_name | step | delta_time | cumulative_time | cpu_time | memory_usage | memory_delta | error | |
| 301 | +| ---------------------- | -------------------- | ------------------ | ------------------ | ----------- | ------------ | ------------ | -------------------------- | |
| 302 | +| JohnSnowLabs/spark-nlp | codebase_init | 7.186550649999845 | 7.186550649999845 | 180.3553702 | 567525376 | 317095936 | | |
| 303 | +| JohnSnowLabs/spark-nlp | post_init_validation | 0.5465090990001045 | 7.733059748999949 | 180.9174761 | 569249792 | 1724416 | | |
| 304 | +| JohnSnowLabs/spark-nlp | TOTAL | 7.740976418000173 | 7.740976418000173 | 180.9221699 | 569249792 | 0 | LOW_IMPORT_RESOLUTION_RATE | |
| 305 | +| Lightning-AI/lightning | codebase_init | 24.256577352999557 | 24.256577352999557 | 211.3604081 | 1535971328 | 966184960 | | |
| 306 | +| Lightning-AI/lightning | post_init_validation | 0.137609629000508 | 24.394186982000065 | 211.5082702 | 1536241664 | 270336 | | |
| 307 | +| Lightning-AI/lightning | TOTAL | 24.394700584999555 | 24.394700584999555 | 211.5088282 | 1536241664 | 0 | | |
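
A metrics file with the columns shown above can be summarized with the standard `csv` module. The sketch below reads only the `TOTAL` rows and reports duration, memory, and error status per repository; `summarize` is a hypothetical helper, the inline sample reuses two rows from the table, and treating `memory_usage` as a byte count is an assumption.

```python
import csv
import io

SAMPLE = """\
profile_name,step,delta_time,cumulative_time,cpu_time,memory_usage,memory_delta,error
JohnSnowLabs/spark-nlp,TOTAL,7.740976418000173,7.740976418000173,180.9221699,569249792,0,LOW_IMPORT_RESOLUTION_RATE
Lightning-AI/lightning,TOTAL,24.394700584999555,24.394700584999555,211.5088282,1536241664,0,
"""


def summarize(csv_file):
    """Map each repository to the figures from its TOTAL row."""
    summary = {}
    for row in csv.DictReader(csv_file):
        if row["step"] != "TOTAL":
            continue
        summary[row["profile_name"]] = {
            "seconds": float(row["cumulative_time"]),
            "memory_mib": int(row["memory_usage"]) / 2**20,  # assumes bytes
            "error": row["error"] or None,
        }
    return summary


summary = summarize(io.StringIO(SAMPLE))
for repo, stats in summary.items():
    status = stats["error"] or "ok"
    print(f"{repo}: {stats['seconds']:.1f}s, {stats['memory_mib']:.0f} MiB, {status}")
```

To run it against a real file, pass an open file handle (for example `open("metrics.csv", newline="")`) instead of the in-memory sample.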