Commit a32c133

Divya Sankar authored and committed
CLDK documentation
1 parent 9730268 commit a32c133

File tree

8 files changed

+378
-0
lines changed

docs/README.md

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
![codellm-devkit logo](https://github.com/IBM/codellm-devkit/blob/main/docs/assets/cldk.png?raw=true)
2+
3+
[![arXiv](https://img.shields.io/badge/arXiv-2410.13007-b31b1b.svg)](https://arxiv.org/abs/2410.13007)
4+
[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
5+
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
6+
[![Documentation](https://img.shields.io/badge/GitHub%20Pages-Docs-blue)](https://ibm.github.io/codellm-devkit/)
7+
[![PyPI version](https://badge.fury.io/py/cldk.svg)](https://badge.fury.io/py/cldk)
8+
9+
# CodeLLM-Devkit: A Python library for seamless interaction with CodeLLMs
10+
11+
Codellm-devkit (CLDK) is a multilingual program analysis framework that bridges the gap between traditional static analysis tools and Large Language Models (LLMs) specialized for code (CodeLLMs). Codellm-devkit allows developers to streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs.
12+
13+
Codellm-devkit simplifies the complex process of analyzing codebases that span multiple programming languages, making it easier to extract meaningful insights and drive LLM-based code analysis. `CLDK` achieves this through an open-source Python library that abstracts the intricacies of program analysis and LLM interactions.
14+
15+
**The purpose of Codellm-devkit is to enable the development of, and experimentation with, robust analysis pipelines that harness the power of both traditional program analysis tools and CodeLLMs.**
16+
By providing a consistent and extensible framework, Codellm-devkit aims to reduce the friction associated with multi-language code analysis and ensure compatibility across different analysis tools and LLM platforms.
17+
18+
Codellm-devkit is designed to integrate seamlessly with a variety of popular analysis tools, such as WALA, Tree-sitter, LLVM, and CodeQL, each implemented in different languages. Codellm-devkit acts as a crucial intermediary layer, enabling efficient and consistent communication between these tools and the CodeLLMs.
19+
20+
Codellm-devkit is constantly evolving to include new tools and frameworks, ensuring it remains a versatile solution for code analysis and LLM integration.
21+
22+
Codellm-devkit is:
23+
24+
- **Unified**: Provides a single framework for integrating multiple analysis tools and CodeLLMs, regardless of the programming languages involved.
25+
- **Extensible**: Designed to support new analysis tools and LLM platforms, making it adaptable to the evolving landscape of code analysis.
26+
- **Streamlined**: Simplifies the process of transforming raw code into structured, LLM-ready inputs, reducing the overhead typically associated with multi-language analysis.
27+
28+
## Architectural and Design Overview
29+
30+
Below is a very high-level overview of the architecture of CLDK:
31+
32+
33+
```mermaid
34+
graph TD
35+
User <--> A[CLDK]
36+
A --> 15[Retrieval ‡]
37+
A --> 16[Prompting ‡]
38+
A[CLDK] <--> B[Languages]
39+
B --> C[Java, Python, Go ‡, C ‡, JavaScript ‡, TypeScript ‡, Rust ‡]
40+
C --> D[Data Models]
41+
D --> 13{Pydantic}
42+
13 --> 7
43+
C --> 7{backends}
44+
7 <--> 9[WALA]
45+
9 <--> 14[Analysis]
46+
7 <--> 10[Tree-sitter]
47+
10 <--> 14[Analysis]
48+
7 <--> 11[LLVM ‡]
49+
11 <--> 14[Analysis]
50+
7 <--> 12[CodeQL ‡]
51+
12 <--> 14[Analysis]
52+
53+
54+
55+
X[‡ Yet to be implemented]
56+
```
57+
58+
The user interacts by invoking the CLDK API. The CLDK API is responsible for handling the user requests and delegating them to the appropriate language-specific modules.
59+
60+
Each language comprises two key components: data models and backends.
61+
62+
1. **Data Models:** These are high-level abstractions that represent the various language constructs and components in a structured format using Pydantic. This makes the models flexible and extensible, and allows easy access to individual data components via simple dot notation. The data models are also designed to be easily serializable and deserializable, which makes it straightforward to store and retrieve data from various sources.
63+
64+
2. **Analysis Backends:** These components interface with the various program analysis tools. The core backends are Tree-sitter, Javaparser, WALA, LLVM, and CodeQL. A backend receives the user's request, delegates it to the appropriate analysis tool, and returns that tool's results to the user. The user merely calls one of several high-level API functions, such as `get_method_body`, `get_method_signature`, or `get_call_graph`, and the backend takes care of the rest.
65+
66+
Some languages may have multiple backends. For example, Java has WALA, Javaparser, Tree-sitter, and CodeQL backends. The user is free to choose the backend that best suits their needs.
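
To make the data-model idea concrete, here is a minimal sketch of structured models with dot-notation access and a JSON round trip. It deliberately uses standard-library `dataclasses` as a stand-in for Pydantic so that it runs without extra dependencies; the class and field names (`MethodModel`, `ClassModel`, `methods`) are hypothetical and are not CLDK's actual models.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MethodModel:
    """Hypothetical stand-in for a method-level data model."""

    signature: str
    body: str


@dataclass
class ClassModel:
    """Hypothetical stand-in for a class-level data model."""

    name: str
    methods: dict[str, MethodModel] = field(default_factory=dict)


def to_json(model: ClassModel) -> str:
    """Serialize a model and its nested models to a JSON string."""
    return json.dumps(asdict(model))


def from_json(payload: str) -> ClassModel:
    """Rebuild the nested models from a JSON string."""
    raw = json.loads(payload)
    methods = {key: MethodModel(**value) for key, value in raw["methods"].items()}
    return ClassModel(name=raw["name"], methods=methods)


if __name__ == "__main__":
    greeter = ClassModel(
        name="Greeter",
        methods={"greet()": MethodModel(signature="greet()", body='{ return "hi"; }')},
    )
    # Dot notation reaches into nested components.
    print(greeter.methods["greet()"].signature)
    # A round trip through JSON yields an equal object.
    print(from_json(to_json(greeter)) == greeter)
```

With Pydantic, the serialization helpers would not need to be written by hand; the sketch only illustrates the access pattern and the round trip.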
67+
68+
We are currently working on implementing the retrieval and prompting components. The retrieval component will be responsible for retrieving the relevant code snippets from the codebase for RAG use cases. The prompting component will be responsible for generating the prompts for the CodeLLMs using popular prompting frameworks such as `PDL`, `Guidance`, or `LMQL`.
69+
70+
## Contact
71+
72+
For any questions, feedback, or suggestions, please contact the authors:
73+
74+
| Name | Email |
75+
| ---- | ----- |
76+
| Rahul Krishna | [[email protected]](mailto:[email protected]) |
77+
| Rangeet Pan | [[email protected]](mailto:[email protected]) |
78+
| Saurabh Sinha | [[email protected]](mailto:[email protected]) |

docs/api_reference.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
# API Reference
2+
3+
This page documents the API functionalities.
4+
5+
## Models
6+
7+
## Python
8+
::: cldk.models.python.models
9+
10+
## Java
11+
::: cldk.models.java.models
12+
13+
## Treesitter
14+
::: cldk.models.treesitter.models
15+
16+
## Python
17+
::: cldk.analysis.python
18+
19+
## Java
20+
::: cldk.analysis.java
21+
File renamed without changes.

docs/css/index.css

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
html,body {
2+
font-family: 'IBM Plex Sans', 'Helvetica Neue', Arial, sans-serif;
3+
background: var(--cds-background, #ffffff);
4+
}
5+
6+
#mainview {
7+
margin-top: 4rem;
8+
height: calc(100% - 4rem);
9+
}

docs/publications.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
### Publications (papers and blog posts related to CLDK)
2+
1. Krishna, Rahul, Rangeet Pan, Raju Pavuluri, Srikanth Tamilselvam, Maja Vukovic, and Saurabh Sinha. "[Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights.](https://arxiv.org/pdf/2410.13007)" arXiv preprint arXiv:2410.13007 (2024).
3+
2. Pan, Rangeet, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. "[Multi-language Unit Test Generation using LLMs.](https://arxiv.org/abs/2409.03093)" arXiv preprint arXiv:2409.03093 (2024).
4+
3. Pan, Rangeet, Rahul Krishna, Raju Pavuluri, Saurabh Sinha, and Maja Vukovic. "[Simplify your Code LLM solutions using CodeLLM Dev Kit (CLDK).](https://www.linkedin.com/pulse/simplify-your-code-llm-solutions-using-codellm-dev-kit-rangeet-pan-vnnpe/?trackingId=kZ3U6d8GSDCs8S1oApXZgg%3D%3D)" Blog.

docs/stylesheets/extra.css

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
:root {
2+
--md-primary-fg-color: #0f62fe;
3+
--md-primary-fg-color--light: #4589ff;
4+
--md-primary-fg-color--dark: #002d9c;
5+
}

docs/walkthrough.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
## Quick Start: Example Walkthrough
2+
3+
In this section, we will walk through a simple example to demonstrate how to use CLDK. We will:
4+
5+
* Set up a local ollama server to interact with CodeLLMs
6+
* Build a simple code summarization pipeline for a Java and a Python application.
7+
8+
### Prerequisites
9+
10+
Before we begin, make sure you have the following prerequisites installed:
11+
12+
* Python 3.11 or later
13+
* Ollama v0.3.4 or later
14+
15+
### Step 1: Set up an Ollama server
16+
17+
If you don't already have Ollama, please download and install it from here: [Ollama](https://ollama.com/download).
18+
19+
Once you have ollama, start the server and make sure it is running.
20+
21+
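If you're on Linux (or WSL with systemd enabled), you can check that the server is running with the following command. On macOS, `systemctl` is not available; there the server is typically started by launching the Ollama app or by running `ollama serve`.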
22+
23+
```bash
24+
sudo systemctl status ollama
25+
```
26+
27+
You should see an output similar to the following:
28+
29+
```bash
30+
➜ sudo systemctl status ollama
31+
● ollama.service - Ollama Service
32+
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
33+
Active: active (running) since Sat 2024-08-10 20:39:56 EDT; 17s ago
34+
Main PID: 23069 (ollama)
35+
Tasks: 19 (limit: 76802)
36+
Memory: 1.2G (peak: 1.2G)
37+
CPU: 6.745s
38+
CGroup: /system.slice/ollama.service
39+
└─23069 /usr/local/bin/ollama serve
40+
```
41+
42+
If not, you may have to start the server manually. You can do this by running the following command:
43+
44+
```bash
45+
sudo systemctl start ollama
46+
```
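
As a platform-independent alternative to `systemctl`, the server can also be probed over HTTP. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`; the helper name `ollama_is_running` is hypothetical, and the address should be adjusted if your server is configured differently.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

This check only confirms that something answers on that address; it does not verify that a particular model has been pulled.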
47+
48+
#### Pull the latest version of the Granite 8b instruct model from Ollama
49+
50+
To pull the latest version of the Granite 8b instruct model from ollama, run the following command:
51+
52+
```bash
53+
ollama pull granite-code:8b-instruct
54+
```
55+
56+
Check to make sure the model was successfully pulled by running the following command:
57+
58+
```bash
59+
ollama run granite-code:8b-instruct 'Write a function to print hello world in python'
60+
```
61+
62+
The output should be similar to the following:
63+
64+
```
65+
➜ ollama run granite-code:8b-instruct 'Write a function to print hello world in python'
66+
67+
def say_hello():
68+
print("Hello World!")
69+
```
70+
71+
### Step 2: Install CLDK
72+
73+
You may install the latest version of CLDK from [PyPI](https://pypi.org/project/cldk/):
74+
75+
```bash
76+
pip install cldk
77+
```
78+
79+
Once CLDK is installed, you can import it into your Python code:
80+
81+
```python
82+
from cldk import CLDK
83+
```
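
If the import fails, it can help to confirm which version of the package, if any, is installed in the active environment. The following is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("cldk")
    print(f"cldk version: {found}" if found else "cldk is not installed in this environment")
```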
84+
85+
### Step 3: Build a code summarization pipeline
86+
87+
Now that we have set up the ollama server and installed CLDK, we can build a simple code summarization pipeline for a Java application.
88+
89+
1. Let's download a sample Java application (apache-commons-cli):
90+
91+
* Download and unzip the sample Java application:
92+
```bash
93+
wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O commons-cli-1.7.0.zip && unzip commons-cli-1.7.0.zip
94+
```
95+
* Record the path to the sample Java application:
96+
```bash
97+
export JAVA_APP_PATH=/path/to/commons-cli-1.7.0
98+
```
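
The pipeline reads this path from the environment, so an unset or mistyped variable would only surface later as a less obvious error. A minimal sketch of validating it up front is shown below; `resolve_app_path` is a hypothetical helper and is not part of CLDK.

```python
import os
from pathlib import Path


def resolve_app_path(env_var: str = "JAVA_APP_PATH") -> Path:
    """Read a project path from the environment and check that it is a directory."""
    value = os.getenv(env_var)
    if not value:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    path = Path(value).expanduser().resolve()
    if not path.is_dir():
        raise RuntimeError(f"{env_var} does not point to a directory: {path}")
    return path
```

The returned path could then be passed as the `project_path` argument when the analysis object is created.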
99+
100+
Below is a simple code summarization pipeline for a Java application using CLDK. It does the following things:
101+
102+
* Creates a new instance of the CLDK class (see comment `# (1)`)
103+
* Creates an analysis object over the Java application (see comment `# (2)`)
104+
* Iterates over all the files in the project (see comment `# (3)`)
105+
* Iterates over all the classes in the file (see comment `# (4)`)
106+
* Iterates over all the methods in the class (see comment `# (5)`)
107+
* Gets the code body of the method (see comment `# (6)`)
108+
* Initializes the treesitter utils for the class file content (see comment `# (7)`)
109+
* Sanitizes the class for analysis (see comment `# (8)`)
110+
* Formats the instruction for the given focal method and class (see comment `# (9)`)
111+
* Prompts the local model on Ollama (see comment `# (10)`)
112+
* Prints the instruction and LLM output (see comment `# (11)`)
113+
114+
```python
# code_summarization_for_java.py
# Requires the `ollama` Python client in addition to `cldk` (e.g. `pip install ollama`).

import os
from pathlib import Path

import ollama
from cldk import CLDK


def format_inst(code, focal_method, focal_class, language="java"):
    """
    Format the instruction for the given focal method and class.
    """
    inst = f"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\n"

    inst += "\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst


def prompt_ollama(message: str, model_id: str = "granite-code:8b-instruct") -> str:
    """Prompt local model on Ollama"""
    response_object = ollama.generate(model=model_id, prompt=message)
    return response_object["response"]


if __name__ == "__main__":
    # (1) Create a new instance of the CLDK class
    cldk = CLDK(language="java")

    # (2) Create an analysis object over the java application
    analysis = cldk.analysis(project_path=os.getenv("JAVA_APP_PATH"))

    # (3) Iterate over all the files in the project
    for file_path, class_file in analysis.get_symbol_table().items():
        class_file_path = Path(file_path).absolute().resolve()
        # (4) Iterate over all the classes in the file
        for type_name, type_declaration in class_file.type_declarations.items():
            # (5) Iterate over all the methods in the class
            for method in type_declaration.callable_declarations.values():

                # (6) Get the code body: read the source of the file that declares the method
                code_body = class_file_path.read_text()

                # (7) Initialize the treesitter utils for the class file content
                tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)

                # (8) Sanitize the class for analysis
                sanitized_class = tree_sitter_utils.sanitize_focal_class(method.declaration)

                # (9) Format the instruction for the given focal method and class
                instruction = format_inst(
                    code=sanitized_class,
                    focal_method=method.declaration,
                    focal_class=type_name,
                )

                # (10) Prompt the local model on Ollama
                # (uses the 8b model pulled in Step 1)
                llm_output = prompt_ollama(
                    message=instruction,
                    model_id="granite-code:8b-instruct",
                )

                # (11) Print the instruction and LLM output
                print(f"Instruction:\n{instruction}")
                print(f"LLM Output:\n{llm_output}")
```
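
To preview what the prompt looks like without CLDK or Ollama installed, the formatting step can be exercised on its own. The sketch below is a standalone re-implementation of the prompt formatter with an explicit `language` parameter (an assumption made here so the fence label is always defined), applied to a made-up Java snippet; `FENCE` is a helper constant introduced only for this illustration.

```python
FENCE = "`" * 3  # Markdown code fence, built programmatically for easy embedding


def format_inst(code: str, focal_method: str, focal_class: str, language: str = "java") -> str:
    """Wrap the code in a fenced block and prepend a summarization question."""
    inst = (
        f"Question: Can you write a brief summary for the method `{focal_method}` "
        f"in the class `{focal_class}` below?\n"
    )
    inst += "\n"
    inst += f"{FENCE}{language}\n"
    inst += code
    # Close the fence on its own line whether or not the code ends with a newline.
    inst += FENCE if code.endswith("\n") else f"\n{FENCE}"
    inst += "\n"
    return inst


if __name__ == "__main__":
    sample = 'class Greeter {\n    String greet() { return "hi"; }\n}\n'
    print(format_inst(code=sample, focal_method="greet()", focal_class="Greeter"))
```

Keeping the formatter a pure function of its arguments makes it easy to inspect or unit test prompts before sending them to a model.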

mkdocs.yaml

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
1+
# Config for MkDocs
2+
# To build:
3+
# 1. Ensure you have MkDocs and friends: `pip install -e ".[docs]"`
4+
# 2. `mkdocs serve` to view site locally with live refresh
5+
# 3. `mkdocs build` to build the static site.
6+
# In the future, a GitHub action should do this.
7+
8+
site_name: CodeLLM-Devkit
9+
repo_url: https://github.com/IBM/codellm-devkit
10+
repo_name: IBM/codellm-devkit
11+
# edit_uri: docs/
12+
# site_url: https://ibm.github.io/cldk/
13+
14+
# This unusual configuration of docs/site directories
15+
# is due to GitHub not presenting /site/ as an option
16+
# when selecting the dir for Pages.
17+
docs_dir: docs
18+
site_dir: _site
19+
20+
copyright: Copyright &copy; 2024 IBM
21+
22+
theme:
23+
name: material
24+
font:
25+
text: IBM Plex Sans
26+
code: IBM Plex Mono
27+
icon:
28+
repo: fontawesome/brands/github
29+
features: # see https://squidfunk.github.io/mkdocs-material/setup/
30+
- search.highlight
31+
- search.suggest
32+
- navigation.sections
33+
- navigation.path
34+
- navigation.footer
35+
- navigation.indexes
36+
- navigation.top
37+
- toc.follow
38+
- toc.integrate
39+
- navigation.tabs
40+
- content.action.edit
41+
logo: assets/cldk.png
42+
palette:
43+
- primary: custom
44+
nav:
45+
- Home: README.md
46+
- Walkthrough: walkthrough.md
47+
- API Reference: api_reference.md
48+
- Publications: publications.md
49+
- Contribute: contribute.md
50+
51+
# Define some IBM colors
52+
extra_css:
53+
- stylesheets/extra.css
54+
55+
plugins:
56+
- search
57+
- mkdocstrings: # see https://mkdocstrings.github.io/python/usage/
58+
handlers:
59+
python:
60+
options:
61+
allow_inspection: true
62+
show_source: true
63+
show_bases: true
64+
show_symbol_type_toc: true
65+
show_submodules: false
66+
show_root_toc_entry: true
67+
docstring_section_style: table
68+
inherited_members: false
69+
summary: true
70+
docstring_style: google
71+
show_if_no_docstring: false
72+
show_labels: true
73+
heading_level: 3
74+
show_symbol_type_heading: true
75+
show_signature: true
76+
show_signature_annotations: true
77+
members_order: source
78+
79+
markdown_extensions:
80+
- admonition
81+
- toc:
82+
toc_depth: 3
