Commit acfe3dc

add: README
1 parent 94155a2 commit acfe3dc

codegen-on-oss/README.md

Lines changed: 294 additions & 19 deletions
@@ -1,32 +1,307 @@
1-
# codegen-on-oss
1+
# Overview
22

3-
[![Release](https://img.shields.io/github/v/release/clee-codegen/codegen-on-oss)](https://img.shields.io/github/v/release/clee-codegen/codegen-on-oss)
4-
[![Build status](https://img.shields.io/github/actions/workflow/status/clee-codegen/codegen-on-oss/main.yml?branch=main)](https://github.com/clee-codegen/codegen-on-oss/actions/workflows/main.yml?query=branch%3Amain)
5-
[![codecov](https://codecov.io/gh/clee-codegen/codegen-on-oss/branch/main/graph/badge.svg)](https://codecov.io/gh/clee-codegen/codegen-on-oss)
6-
[![Commit activity](https://img.shields.io/github/commit-activity/m/clee-codegen/codegen-on-oss)](https://img.shields.io/github/commit-activity/m/clee-codegen/codegen-on-oss)
7-
[![License](https://img.shields.io/github/license/clee-codegen/codegen-on-oss)](https://img.shields.io/github/license/clee-codegen/codegen-on-oss)
3+
The **Codegen on OSS** package provides a modular pipeline that:
84

9-
Testing codegen parsing on popular OSS repositories
5+
- **Collects repository URLs** from different sources (e.g., CSV files or GitHub searches).
6+
- **Parses repositories** using the codegen tool.
7+
- **Profiles performance** and logs metrics for each parsing run.
8+
- **Logs errors** to help pinpoint parsing failures or performance bottlenecks.
109

11-
- **Github repository**: <https://github.com/clee-codegen/codegen-on-oss/>
12-
- **Documentation** <https://clee-codegen.github.io/codegen-on-oss/>
10+
______________________________________________________________________
1311

14-
### Set Up Your Development Environment
12+
## Package Structure
1513

16-
install the environment and the pre-commit hooks with
14+
The package is composed of several modules:
1715

18-
```bash
19-
make install
20-
```
16+
- `sources`
17+
18+
- Defines the repository source classes and their settings. All settings are configurable via environment variables.
19+
20+
- GitHub Source
21+
22+
```python
from typing import Literal


# SourceSettings is the package's base settings class.
class GithubSettings(SourceSettings):
    language: Literal["python", "typescript"] = "python"
    heuristic: Literal[
        "stars",
        "forks",
        "updated",
        # "watchers",
        # "contributors",
        # "commit_activity",
        # "issues",
        # "dependency",
    ] = "stars"
    github_token: str | None = None
```
37+
38+
- The three options currently available (`stars`, `forks`, `updated`) are the ones supported by the GitHub API.
39+
- Future work: additional options will require different strategies.
40+
41+
- CSV Source
42+
43+
- Reads repository URLs from a CSV file.
44+
45+
- `cache`
46+
47+
- Currently only specifies the cache directory, which is used for caching git repositories pulled by the pipeline. Use `--force-pull` to re-pull from the remote.
48+
49+
- `cli`
50+
51+
- Built with Click, the CLI provides two main commands:
52+
- `run-one`: Parses a single repository specified by URL.
53+
- `run`: Iterates over repositories obtained from a selected source and parses each one.
54+
55+
- `metrics`
56+
57+
- Provides profiling tools to measure performance during the parse:
58+
- `MetricsProfiler`: A context manager that creates a profiling session.
59+
- `MetricsProfile`: Represents a "span" or "run" for a specific repository. It records step-by-step metrics (clock duration, CPU time, memory usage) and writes them to the CSV file specified by `--output-path`.
60+
61+
- `parser`
62+
63+
Contains the `CodegenParser` class that orchestrates the parsing process:
2164

22-
This will also generate your `uv.lock` file
65+
- Clones the repository (or forces a pull if specified).
66+
- Initializes a `Codebase` (from the codegen tool).
67+
- Runs post-initialization validation.
68+
- Integrates with the `MetricsProfiler` to log measurements at key steps.
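
The `sources` module above states that settings are configurable via environment variables. As a rough illustration of that idea only, the sketch below loads `GithubSettings`-like fields from `GITHUB_`-prefixed variables with the standard library. It is not the package's `SourceSettings` implementation: the class `GithubSettingsSketch`, the function `load_github_settings`, and the exact variable names (prefix plus field name) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ALLOWED_LANGUAGES = ("python", "typescript")
ALLOWED_HEURISTICS = ("stars", "forks", "updated")


@dataclass(frozen=True)
class GithubSettingsSketch:
    """Hypothetical stand-in mirroring the fields of GithubSettings."""

    language: str = "python"
    heuristic: str = "stars"
    github_token: Optional[str] = None


def load_github_settings(env: Mapping[str, str] = os.environ) -> GithubSettingsSketch:
    """Build settings from GITHUB_-prefixed variables, falling back to defaults.

    Assumption: each variable is named "GITHUB_" + the upper-cased field name.
    """
    language = env.get("GITHUB_LANGUAGE", "python")
    heuristic = env.get("GITHUB_HEURISTIC", "stars")
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if heuristic not in ALLOWED_HEURISTICS:
        raise ValueError(f"unsupported heuristic: {heuristic!r}")
    return GithubSettingsSketch(language, heuristic, env.get("GITHUB_GITHUB_TOKEN"))


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_github_settings({"GITHUB_LANGUAGE": "typescript", "GITHUB_HEURISTIC": "forks"})
```

Rejecting unknown values up front mirrors the `Literal` constraints on the settings class, so a misspelled heuristic fails before any repository is fetched.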
69+
70+
______________________________________________________________________
71+
72+
## Getting Started
73+
74+
1. **Configure the Repository Source**
75+
76+
Decide whether you want to read from a CSV file or query GitHub:
77+
78+
- For CSV, ensure that your CSV file (default: `input.csv`) exists and contains repository URLs in the first column (`repo_url`) and an optional commit hash (`commit_hash`, may be empty) in the second column.
79+
- For GitHub, configure the desired settings (e.g., `language`, `heuristic`, and optionally a GitHub token) via environment variables with the `GITHUB_` prefix.
80+
81+
1. **Run the Parser**
82+
83+
Use the CLI to start parsing:
84+
85+
- To parse one repository:
86+
87+
```bash
88+
uv run cgparse run-one --help
89+
```
90+
91+
- To parse multiple repositories from a source:
92+
93+
```bash
94+
uv run cgparse run --help
95+
```
96+
97+
1. **Review Metrics and Logs**
98+
99+
After parsing, check the metrics CSV (default: `metrics.csv`) to review performance measurements per repository. Error logs are written to the specified error output file (default: `errors.log`).
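
The CSV layout from step 1 (repository URL first, optional commit hash second) can be read with a few lines of standard-library code. This is a minimal sketch, not the package's `CSVInputSource`: the function `read_repo_rows` is hypothetical, the header row is assumed to be optional, and `abc123` is a placeholder rather than a real commit.

```python
import csv
import io
from typing import Iterator, Optional, TextIO, Tuple


def read_repo_rows(handle: TextIO) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (repo_url, commit_hash) pairs; an empty hash becomes None."""
    for row in csv.reader(handle):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if row[0].strip() == "repo_url":
            continue  # assumed optional header row
        commit = row[1].strip() if len(row) > 1 else ""
        yield row[0].strip(), (commit or None)


# In-memory stand-in for input.csv.
sample = io.StringIO(
    "repo_url,commit_hash\n"
    "https://github.com/JohnSnowLabs/spark-nlp.git,\n"
    "https://github.com/Lightning-AI/lightning.git,abc123\n"
)
rows = list(read_repo_rows(sample))
```

Mapping an empty second column to `None` lets the caller distinguish "use the default branch head" from "check out this commit" without string comparisons.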
100+
101+
______________________________________________________________________
23102

24-
### pre-commit hooks
103+
## Modal Integration for Cloud Parsing
25104

26-
```bash
27-
uv run pre-commit run -a
105+
```shell
106+
$ uv run modal run modal_run.py
28107
```
29108

109+
By default, the parser runs against the `input.csv` tracked in this repository.
110+
111+
### Modal Configuration
112+
113+
- **Compute Resources**: Allocates 4 CPUs and 16GB of memory.
114+
- **Secrets & Volumes**: Uses secrets (for bucket credentials) and mounts a volume for caching repositories.
115+
- **Image Setup**: Builds on a Debian slim image with Python 3.12 and installs the required packages (`uv` and `git`).
116+
- **Environment Configuration**: Environment variables (e.g., GitHub settings) are injected at runtime.
117+
118+
The function `parse_repo_on_modal` performs the following steps:
119+
120+
1. **Environment Setup**: Updates environment variables and configures logging using Loguru.
121+
1. **Source Initialization**: Creates a repository source based on the provided type (e.g., GitHub).
122+
1. **Metrics Profiling**: Instantiates `MetricsProfiler` to capture and log performance data.
123+
1. **Repository Parsing**: Iterates over repository URLs and parses each using the `CodegenParser`.
124+
1. **Error Handling**: Logs any exceptions encountered during parsing.
125+
1. **Result Upload**: Uses the `BucketStore` class to upload the configuration, logs, and metrics to an S3 bucket.
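
The parse-and-log portion of the steps above (repository parsing and error handling) can be sketched as a loop that records failures and moves on to the next repository. This is an illustration under stated assumptions: `parse_all` and the `parse` callable are hypothetical, and the standard `logging` module stands in for Loguru.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("parse_sketch")


def parse_all(
    repo_urls: Iterable[str],
    parse: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Parse each repository, logging failures instead of aborting the run."""
    failures: List[Tuple[str, str]] = []
    for url in repo_urls:
        try:
            parse(url)
        except Exception as exc:  # one bad repository should not stop the batch
            logger.exception("Repository: %s", url)
            failures.append((url, str(exc)))
    return failures


def fake_parse(url: str) -> None:
    """Hypothetical parser that fails for one repository."""
    if "spark-nlp" in url:
        raise RuntimeError("LOW_IMPORT_RESOLUTION_RATE")


failures = parse_all(
    [
        "https://github.com/JohnSnowLabs/spark-nlp.git",
        "https://github.com/Lightning-AI/lightning.git",
    ],
    fake_parse,
)
```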
126+
127+
### Bucket Storage
128+
129+
**Bucket (public):** [codegen-oss-parse](https://s3.amazonaws.com/codegen-oss-parse/)
130+
131+
The results of each run are saved under a prefix made up of the version of the `codegen` library that the run installed and the source type it was run with. Within this prefix:
132+
133+
- Source Settings
134+
- `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/config.json`
135+
- Metrics
136+
- `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/metrics.csv`
137+
- Logs
138+
- `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/output.logs`
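
The URL templates above can be filled in programmatically. The helper below is a small sketch; the function name `result_urls` is hypothetical, and the version string `0.1.0` is a placeholder rather than a known published run.

```python
BUCKET_URL = "https://s3.amazonaws.com/codegen-oss-parse"


def result_urls(version: str, source: str) -> dict[str, str]:
    """Return the config, metrics, and log URLs for one run prefix."""
    prefix = f"{BUCKET_URL}/{version}/{source}"
    return {
        "config": f"{prefix}/config.json",
        "metrics": f"{prefix}/metrics.csv",
        "logs": f"{prefix}/output.logs",
    }


urls = result_urls("0.1.0", "csv")  # placeholder version, "csv" source type
```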
139+
30140
______________________________________________________________________
31141

32-
Repository initiated with [fpgmaas/cookiecutter-uv](https://github.com/fpgmaas/cookiecutter-uv).
142+
## Extensibility
143+
144+
**Adding New Sources:**
145+
146+
You can define additional repository sources by subclassing `RepoSource` and providing a corresponding settings class. Make sure to set the `source_type` and register your new source by following the pattern established in `CSVInputSource` or `GithubSource`.
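
One common way to implement "set a `source_type` and register the source" is a registry that subclasses join automatically. The sketch below shows that general pattern only; `RepoSourceSketch`, `StaticListSource`, and the registry mechanism are assumptions, and the real `RepoSource` base class may work differently.

```python
from typing import ClassVar, Dict, Iterable, Iterator, Optional, Tuple, Type


class RepoSourceSketch:
    """Hypothetical base class; subclasses register under their source_type."""

    source_type: ClassVar[str]
    registry: ClassVar[Dict[str, Type["RepoSourceSketch"]]] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Runs once per subclass definition, after its class body is evaluated.
        RepoSourceSketch.registry[cls.source_type] = cls

    def __iter__(self) -> Iterator[Tuple[str, Optional[str]]]:
        raise NotImplementedError


class StaticListSource(RepoSourceSketch):
    """Example source that yields a fixed list of URLs with no pinned commit."""

    source_type = "static"

    def __init__(self, urls: Iterable[str]) -> None:
        self.urls = list(urls)

    def __iter__(self) -> Iterator[Tuple[str, Optional[str]]]:
        for url in self.urls:
            yield url, None


# Look the class up by its source_type, as a CLI option might.
source_cls = RepoSourceSketch.registry["static"]
repos = list(source_cls(["https://github.com/Lightning-AI/lightning.git"]))
```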
147+
148+
**Improving Testing:**
149+
150+
The detailed metrics collected can help you understand where parsing failures occur or where performance lags. Use these insights to improve error handling and optimize the codegen parsing logic.
151+
152+
**Containerization and Automation:**
153+
154+
There is a Dockerfile that can be used to create an image capable of running the parse tests. Runtime environment variables can be used to configure the run and output.
155+
156+
**Input & Configuration:**
157+
158+
Explore a better CLI for providing options to the Modal run.
159+
160+
______________________________________________________________________
161+
162+
## Example Log Output
163+
164+
```shell
165+
[codegen-on-oss*] codegen/codegen-on-oss/$ uv run cgparse run --source csv
166+
21:32:36 INFO Cloning repository https://github.com/JohnSnowLabs/spark-nlp.git
167+
21:36:57 INFO {
168+
"profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git",
169+
"step": "codebase_init",
170+
"delta_time": 7.186550649999845,
171+
"cumulative_time": 7.186550649999845,
172+
"cpu_time": 180.3553702,
173+
"memory_usage": 567525376,
174+
"memory_delta": 317095936,
175+
"error": null
176+
}
177+
21:36:58 INFO {
178+
"profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git",
179+
"step": "post_init_validation",
180+
"delta_time": 0.5465090990001045,
181+
"cumulative_time": 7.733059748999949,
182+
"cpu_time": 180.9174761,
183+
"memory_usage": 569249792,
184+
"memory_delta": 1724416,
185+
"error": null
186+
}
187+
21:36:58 ERROR Repository: https://github.com/JohnSnowLabs/spark-nlp.git
188+
Traceback (most recent call last):
189+
190+
File "/home/codegen/codegen/codegen-on-oss/.venv/bin/cgparse", line 10, in <module>
191+
sys.exit(cli())
192+
│ │ └ <Group cli>
193+
│ └ <built-in function exit>
194+
<module 'sys' (built-in)>
195+
File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
196+
return self.main(*args, **kwargs)
197+
│ │ │ └ {}
198+
│ │ ()
199+
│ └ <function BaseCommand.main at 0x7f4665c15120>
200+
<Group cli>
201+
File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1082, in main
202+
rv = self.invoke(ctx)
203+
│ │ └ <click.core.Context object at 0x7f4665f3c9e0>
204+
│ └ <function MultiCommand.invoke at 0x7f4665c16340>
205+
<Group cli>
206+
File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1697, in invoke
207+
return _process_result(sub_ctx.command.invoke(sub_ctx))
208+
│ │ │ │ └ <click.core.Context object at 0x7f4665989b80>
209+
│ │ │ └ <function Command.invoke at 0x7f4665c15d00>
210+
│ │ └ <Command run>
211+
│ └ <click.core.Context object at 0x7f4665989b80>
212+
<function MultiCommand.invoke.<locals>._process_result at 0x7f466597fb00>
213+
File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
214+
return ctx.invoke(self.callback, **ctx.params)
215+
│ │ │ │ │ └ {'source': 'csv', 'output_path': 'metrics.csv', 'error_output_path': 'errors.log', 'cache_dir': PosixPath('/home/.cache...
216+
│ │ │ │ └ <click.core.Context object at 0x7f4665989b80>
217+
│ │ │ └ <function run at 0x7f466145eac0>
218+
│ │ └ <Command run>
219+
│ └ <function Context.invoke at 0x7f4665c14680>
220+
└ <click.core.Context object at 0x7f4665989b80>
221+
File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 788, in invoke
222+
return __callback(*args, **kwargs)
223+
│ └ {'source': 'csv', 'output_path': 'metrics.csv', 'error_output_path': 'errors.log', 'cache_dir': PosixPath('/home/.cache...
224+
()
225+
226+
File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/cli.py", line 121, in run
227+
parser.parse(repo_url)
228+
│ │ └ 'https://github.com/JohnSnowLabs/spark-nlp.git'
229+
│ └ <function CodegenParser.parse at 0x7f4664b014e0>
230+
<codegen_on_oss.parser.CodegenParser object at 0x7f46612def30>
231+
232+
File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/parser.py", line 52, in parse
233+
with self.metrics_profiler.start_profiler(
234+
│ │ └ <function MetricsProfiler.start_profiler at 0x7f466577d760>
235+
│ └ <codegen_on_oss.metrics.MetricsProfiler object at 0x7f465e6c2e70>
236+
<codegen_on_oss.parser.CodegenParser object at 0x7f46612def30>
237+
238+
File "/home/.local/share/uv/python/cpython-3.12.6-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 158, in __exit__
239+
self.gen.throw(value)
240+
│ │ │ └ ParseRunError(<PostInitValidationStatus.LOW_IMPORT_RESOLUTION_RATE: 'LOW_IMPORT_RESOLUTION_RATE'>)
241+
│ │ └ <method 'throw' of 'generator' objects>
242+
│ └ <generator object MetricsProfiler.start_profiler at 0x7f4660478740>
243+
<contextlib._GeneratorContextManager object at 0x7f46657849e0>
244+
245+
> File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/metrics.py", line 41, in start_profiler
246+
yield profile
247+
<codegen_on_oss.metrics.MetricsProfile object at 0x7f4665784a10>
248+
249+
File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/parser.py", line 64, in parse
250+
raise ParseRunError(validation_status)
251+
│ └ <PostInitValidationStatus.LOW_IMPORT_RESOLUTION_RATE: 'LOW_IMPORT_RESOLUTION_RATE'>
252+
<class 'codegen_on_oss.parser.ParseRunError'>
253+
254+
codegen_on_oss.parser.ParseRunError: LOW_IMPORT_RESOLUTION_RATE
255+
21:36:58 INFO {
256+
"profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git",
257+
"step": "TOTAL",
258+
"delta_time": 7.740976418000173,
259+
"cumulative_time": 7.740976418000173,
260+
"cpu_time": 180.9221699,
261+
"memory_usage": 569249792,
262+
"memory_delta": 0,
263+
"error": "LOW_IMPORT_RESOLUTION_RATE"
264+
}
265+
21:36:58 INFO Cloning repository https://github.com/Lightning-AI/lightning.git
266+
21:37:53 INFO {
267+
"profile_name": "https://github.com/Lightning-AI/lightning.git",
268+
"step": "codebase_init",
269+
"delta_time": 24.256577352999557,
270+
"cumulative_time": 24.256577352999557,
271+
"cpu_time": 211.3604081,
272+
"memory_usage": 1535971328,
273+
"memory_delta": 966184960,
274+
"error": null
275+
}
276+
21:37:53 INFO {
277+
"profile_name": "https://github.com/Lightning-AI/lightning.git",
278+
"step": "post_init_validation",
279+
"delta_time": 0.137609629000508,
280+
"cumulative_time": 24.394186982000065,
281+
"cpu_time": 211.5082702,
282+
"memory_usage": 1536241664,
283+
"memory_delta": 270336,
284+
"error": null
285+
}
286+
21:37:53 INFO {
287+
"profile_name": "https://github.com/Lightning-AI/lightning.git",
288+
"step": "TOTAL",
289+
"delta_time": 24.394700584999555,
290+
"cumulative_time": 24.394700584999555,
291+
"cpu_time": 211.5088282,
292+
"memory_usage": 1536241664,
293+
"memory_delta": 0,
294+
"error": null
295+
}
296+
```
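
The log records above have the shape a step-wise profiler would emit: one row per step plus a `TOTAL` row that carries any error. The sketch below reproduces that shape with a generator-based context manager. It is not the package's `MetricsProfiler`: the names `ProfileSketch` and `start_profiler_sketch` are hypothetical, memory fields are omitted for portability, and swallowing the exception after recording it is an assumption based on the run continuing to the next repository in the log.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class ProfileSketch:
    """Hypothetical stand-in for MetricsProfile: one run of one repository."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[Dict[str, object]] = []
        self._start = time.perf_counter()
        self._last = self._start

    def _record(self, step: str, now: float, delta: float, error: Optional[str]) -> None:
        self.rows.append({
            "profile_name": self.name,
            "step": step,
            "delta_time": delta,
            "cumulative_time": now - self._start,
            "cpu_time": time.process_time(),
            "error": error,
        })

    def measure(self, step: str) -> None:
        """Record the time elapsed since the previous measurement."""
        now = time.perf_counter()
        self._record(step, now, now - self._last, None)
        self._last = now

    def finish(self, error: Optional[str]) -> None:
        """Record the TOTAL row; its delta covers the whole run."""
        now = time.perf_counter()
        self._record("TOTAL", now, now - self._start, error)


@contextmanager
def start_profiler_sketch(name: str) -> Iterator[ProfileSketch]:
    profile = ProfileSketch(name)
    error: Optional[str] = None
    try:
        yield profile
    except Exception as exc:
        error = str(exc)  # recorded on the TOTAL row instead of re-raised
    finally:
        profile.finish(error)


with start_profiler_sketch("example/repo") as profile:
    profile.measure("codebase_init")
    raise RuntimeError("LOW_IMPORT_RESOLUTION_RATE")
```

Because the generator handles the exception and then finishes normally, the `with` block suppresses it, so a caller looping over repositories simply proceeds to the next one.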
297+
298+
## Example Metrics Output
299+
300+
| profile_name | step | delta_time | cumulative_time | cpu_time | memory_usage | memory_delta | error |
| ---------------------- | -------------------- | ------------------ | ------------------ | ----------- | ------------ | ------------ | -------------------------- |
| JohnSnowLabs/spark-nlp | codebase_init | 7.186550649999845 | 7.186550649999845 | 180.3553702 | 567525376 | 317095936 | |
| JohnSnowLabs/spark-nlp | post_init_validation | 0.5465090990001045 | 7.733059748999949 | 180.9174761 | 569249792 | 1724416 | |
| JohnSnowLabs/spark-nlp | TOTAL | 7.740976418000173 | 7.740976418000173 | 180.9221699 | 569249792 | 0 | LOW_IMPORT_RESOLUTION_RATE |
| Lightning-AI/lightning | codebase_init | 24.256577352999557 | 24.256577352999557 | 211.3604081 | 1535971328 | 966184960 | |
| Lightning-AI/lightning | post_init_validation | 0.137609629000508 | 24.394186982000065 | 211.5082702 | 1536241664 | 270336 | |
| Lightning-AI/lightning | TOTAL | 24.394700584999555 | 24.394700584999555 | 211.5088282 | 1536241664 | 0 | |
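
A metrics file with the columns shown above can be summarized with the standard `csv` module. This is a minimal sketch that assumes `metrics.csv` starts with a header row using exactly these column names; the function `summarize_totals` is hypothetical, and the sample rows are copied from the table.

```python
import csv
import io
from typing import Dict, Optional, TextIO, Tuple


def summarize_totals(handle: TextIO) -> Dict[str, Tuple[float, Optional[str]]]:
    """Map each profile_name to (total delta_time, error or None) from its TOTAL row."""
    summary: Dict[str, Tuple[float, Optional[str]]] = {}
    for row in csv.DictReader(handle):
        if row["step"] == "TOTAL":
            summary[row["profile_name"]] = (float(row["delta_time"]), row["error"] or None)
    return summary


# In-memory stand-in for metrics.csv, using the TOTAL rows from the table.
sample = io.StringIO(
    "profile_name,step,delta_time,cumulative_time,cpu_time,memory_usage,memory_delta,error\n"
    "JohnSnowLabs/spark-nlp,TOTAL,7.740976418000173,7.740976418000173,180.9221699,569249792,0,LOW_IMPORT_RESOLUTION_RATE\n"
    "Lightning-AI/lightning,TOTAL,24.394700584999555,24.394700584999555,211.5088282,1536241664,0,\n"
)
summary = summarize_totals(sample)
failed = sorted(name for name, (_, error) in summary.items() if error)
```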
