Commit 6036748

Updated Java code summarization and test generation examples
Signed-off-by: Saurabh Sinha <[email protected]>
1 parent 810cc5d commit 6036748

2 files changed

+86
-82
lines changed

docs/examples/java/notebook/code_summarization.ipynb

Lines changed: 75 additions & 72 deletions
@@ -9,29 +9,20 @@
99
"source": [
1010
"# Using CLDK to explain Java methods\n",
1111
"\n",
12-
"In this tutorial, we will use CLDK to explain or generate code summary for all the methods in a Java Application.\n",
12+
"In this tutorial, we will use CLDK to explain or generate code summary for a Java method. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis and build an LLM-based code summarizer. By the end of this tutorial, you will have implemented such a tool and generated code summary for a Java method.\n",
1313
"\n",
14-
"By the end of this tutorial, you will have code summary for all the methods in a Java application. You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis and build a LLM-based code summary generation.\n",
14+
"Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code summarization:\n",
1515
"\n",
16-
"You will learn how to do the following:\n",
16+
"1. Create a new instance of the CLDK class.\n",
17+
"2. Create an analysis object for the target Java application.\n",
18+
"3. Iterate over all files in the application.\n",
19+
"4. Iterate over all classes in a file.\n",
20+
"5. Initialize treesitter utils for the class content.\n",
21+
"6. Iterate over all methods in a class.\n",
22+
"7. Get the code body of a method.\n",
23+
"8. Sanitize the class for prompting the LLM.\n",
1724
"\n",
18-
"<ol>\n",
19-
"<li> Create a new instance of the CLDK class.\n",
20-
"<li> Create an analysis object over the Java application.\n",
21-
"<li> Iterate over all the files in the project.\n",
22-
"<li> Iterate over all the classes in the file.\n",
23-
"<li> Iterate over all the methods in the class.\n",
24-
"<li> Get the code body of the method.\n",
25-
"<li> Initialize the treesitter utils for the class file content.\n",
26-
"<li> Sanitize the class for analysis.\n",
27-
"</ol>\n",
28-
"Next, we will write a couple of helper methods to:\n",
29-
"\n",
30-
"<ol>\n",
31-
"<li> Format the instruction for the given focal method and class.\n",
32-
"<li> Prompts the local model on Ollama.\n",
33-
"<li> Use CLDK to analyze code and get context information for generating code summary.\n",
34-
"</ol>"
25+
"We will write a couple of helper methods to (1) format the LLM instruction for summarizing a given target method and (2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate the summary for the target method."
3526
]
3627
},
3728
{
@@ -45,12 +36,11 @@
4536
"\n",
4637
"Before we get started, let's make sure you have the following installed:\n",
4738
"\n",
48-
"<ol>\n",
49-
"<li> Python 3.11 or later\n",
50-
"<li> Ollama 0.3.4 or later\n",
51-
"<li> Java 11 or later\n",
52-
"</ol>\n",
53-
"We will use ollama to spin up a local granite model that will act as our LLM for this turorial."
39+
"1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n",
40+
"2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n",
41+
"3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n",
42+
"\n",
43+
"We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial."
5444
]
5545
},
5646
{
@@ -60,12 +50,28 @@
6050
"collapsed": false
6151
},
6252
"source": [
63-
"### Prerequisite 1: Install ollama\n",
53+
"### Download Granite code model\n",
6454
"\n",
65-
"If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n",
66-
"Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n",
67-
"There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n",
68-
"Let's make sure the model is downloaded by running the following command:"
55+
"After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))."
56+
]
57+
},
58+
{
59+
"cell_type": "code",
60+
"execution_count": null,
61+
"id": "627e7184",
62+
"metadata": {},
63+
"outputs": [],
64+
"source": [
65+
"%%bash\n",
66+
"ollama pull granite-code:8b-instruct"
67+
]
68+
},
69+
{
70+
"cell_type": "markdown",
71+
"id": "8cc1ca5b",
72+
"metadata": {},
73+
"source": [
74+
" Let's make sure the model is downloaded by running the following command:"
6975
]
7076
},
7177
{
@@ -88,7 +94,7 @@
8894
"collapsed": false
8995
},
9096
"source": [
91-
"### Prerequisite 3: Install ollama Python SDK"
97+
"### Install Ollama Python SDK"
9298
]
9399
},
94100
{
@@ -110,8 +116,8 @@
110116
"collapsed": false
111117
},
112118
"source": [
113-
"### Prerequisite 4: Install CLDK\n",
114-
"CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:"
119+
"### Install CLDK\n",
120+
"CLDK is available at https://github.com/IBM/codellm-devkit. You can install it by running the following command:"
115121
]
116122
},
117123
{
@@ -134,7 +140,7 @@
134140
},
135141
"source": [
136142
"### Step 1: Get the sample Java application\n",
137-
"For this tutorial, we will use apache commons cli. You can download the source code to a temporary directory by running the following command:"
143+
"For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the sample Java application. You can download the source code to a temporary directory by running the following command:"
138144
]
139145
},
140146
{
@@ -157,7 +163,8 @@
157163
"collapsed": false
158164
},
159165
"source": [
160-
"The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location."
166+
"The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n",
167+
"<!-- We'll remove these files later, so don't worry about the location. -->"
161168
]
162169
},
163170
{
@@ -167,12 +174,9 @@
167174
"collapsed": false
168175
},
169176
"source": [
170-
"### Generate code summary\n",
171-
"Code summarization or code explanation is a task that converts a code written in a programming language to a natural language. This particular task has several\n",
172-
"benefits, such as understanding code without looking at its intrinsic details, documenting code for better maintenance, etc. To do that, one needs to\n",
173-
"understand the basic details of code structure works, and use that knowledge to generate the summary using various AI-based approaches. In this particular\n",
174-
"example, we will be using Large Language Models (LLM), specifically Granite 8B, an open-source model built by IBM. We will show how easily a developer can use\n",
175-
"CLDK to expose various parts of the code by calling various APIs without implementing various time-intensive program analyses from scratch."
177+
"## Generate code summary\n",
178+
"\n",
179+
"Code summarization or code explanation is the task of converting code written in a programming language to natural language. It has several benefits, such as understanding code without looking at its intrinsic details, documenting code for better maintenance, etc. To perform code summarization, one needs to understand the basic details of code implementation, and use that knowledge to generate the summary using various AI-based approaches. In this tutorial, we will use LLMs, specifically Granite code 8b-instruct. We will show how a developer can easily use CLDK to analyze code by calling various APIs without having to implement such analyses."
176180
]
177181
},
178182
{
@@ -182,7 +186,7 @@
182186
"collapsed": false
183187
},
184188
"source": [
185-
"Step 1: Add all the neccessary imports"
189+
"Step 1: Add the necessary imports"
186190
]
187191
},
188192
{
@@ -194,7 +198,6 @@
194198
},
195199
"outputs": [],
196200
"source": [
197-
"from pathlib import Path\n",
198201
"import ollama\n",
199202
"from cldk import CLDK\n",
200203
"from cldk.analysis import AnalysisLevel"
@@ -207,8 +210,7 @@
207210
"collapsed": false
208211
},
209212
"source": [
210-
"Step 2: Formulate the LLM prompt. The prompt can be tailored towards various needs. In this case, we show a simple example of generating summary for each\n",
211-
"method in a Java class"
213+
"Step 2: Define a function for creating the LLM prompt, which instructs the LLM to summarize a Java method and includes relevant code for the task."
212214
]
213215
},
214216
{
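Review note: the code cell that defines the prompt helper is outside this hunk, so its body is not visible here. As a minimal sketch of what such a helper might look like, assuming only the `code` and `focal_method` keyword arguments that appear in the `format_inst(...)` call later in this diff (the `focal_class` and `language` parameters and the prompt wording are hypothetical, not taken from the notebook):

```python
def format_inst(code: str, focal_method: str, focal_class: str = "", language: str = "Java") -> str:
    """Build a summarization prompt for one target method.

    Illustrative sketch only: `focal_class`, `language`, and the wording of
    the instruction are assumptions, not the notebook's actual prompt.
    """
    # First line: the instruction, naming the target method (and class, if given).
    instruction = f"Summarize the {language} method `{focal_method}`"
    if focal_class:
        instruction += f" of class `{focal_class}`"
    instruction += " in a few sentences."

    # Then the sanitized class source that gives the LLM its context.
    return "\n".join([instruction, "", "Relevant code:", code.strip()])


example_prompt = format_inst(
    code="class A { void bar() { baz(); } void baz() {} }",
    focal_method="void bar()",
    focal_class="A",
)
print(example_prompt)
```

Keeping the helper a pure string function makes it easy to inspect the prompt before any model is called.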
@@ -249,7 +251,7 @@
249251
"collapsed": false
250252
},
251253
"source": [
252-
"Step 3: Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally."
254+
"Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama."
253255
]
254256
},
255257
{
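Review note: the cell that calls the model is likewise outside this hunk. A minimal sketch of a `prompt_ollama`-style helper, assuming the Ollama Python SDK's `ollama.generate(model=..., prompt=...)` call and its `response` field; the `model_id` default and the `generate_fn` parameter are hypothetical additions, the latter so the function can be exercised without a running Ollama server:

```python
def prompt_ollama(message: str, model_id: str = "granite-code:8b-instruct", generate_fn=None) -> str:
    """Send a prompt to a local model and return the generated text.

    Illustrative sketch only: `generate_fn` is a hypothetical hook for
    substituting a stub; by default the Ollama SDK is used.
    """
    if generate_fn is None:
        # Imported lazily so the helper can be defined without the SDK installed.
        import ollama
        generate_fn = ollama.generate

    result = generate_fn(model=model_id, prompt=message)
    # The generated text is read from the "response" field of the result.
    return result["response"]


# Usage with a stub in place of a live server:
def fake_generate(model, prompt):
    return {"response": f"[{model}] summary of: {prompt}"}

print(prompt_ollama("Summarize method bar", generate_fn=fake_generate))
```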
@@ -274,7 +276,7 @@
274276
"collapsed": false
275277
},
276278
"source": [
277-
"Step 4: Create an object of CLDK and provide the programming language of the source code."
279+
"Step 4: Create an instance of CLDK and provide the programming language of the source code."
278280
]
279281
},
280282
{
@@ -286,7 +288,7 @@
286288
},
287289
"outputs": [],
288290
"source": [
289-
"# Create a new instance of the CLDK class\n",
291+
"# Create an instance of CLDK for Java analysis\n",
290292
"cldk = CLDK(language=\"java\")"
291293
]
292294
},
@@ -297,10 +299,7 @@
297299
"collapsed": false
298300
},
299301
"source": [
300-
"Step 5: CLDK uses different analysis engine--Codeanalyzer (built using WALA and Javaparser), Treesitter, and CodeQL (future). By default, codenanalyzer has\n",
301-
"been selected as the default analysis engine. Also, CLDK support different analysis levels--(a) symbol table, (b) call graph, (c) program dependency graph, and\n",
302-
"(d) system dependency graph. Analysis engine can be selected using ```AnalysisLevel``` enum. In this example, we will generate summarization of all the methods\n",
303-
"of an application. "
302+
"Step 5: Select the analysis engine and analysis level. CLDK uses different analysis engines---[CodeAnalyzer](https://github.com/IBM/codenet-minerva-code-analyzer) (built over [WALA](https://github.com/wala/WALA) and [JavaParser](https://github.com/javaparser/javaparser)), [Treesitter](https://tree-sitter.github.io/tree-sitter/), and [CodeQL](https://codeql.github.com/) (future)---with CodeAnalyzer being the default analysis engine. CLDK supports different analysis levels: (1) symbol table, (2) call graph, (3) program dependency graph, and (4) system dependency graph. The analysis level can be selected using the `AnalysisLevel` enumerated type. For this example, we select the symbol-table analysis level, with CodeAnalyzer as the default analysis engine."
304303
]
305304
},
306305
{
@@ -323,9 +322,9 @@
323322
"collapsed": false
324323
},
325324
"source": [
326-
"Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a customized Java class in the prompt. For instance,\n",
325+
"Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a sanitized Java class in the prompt, containing only the relevant information for summarizing the target method. To illustrate, consider the following class:\n",
327326
"\n",
328-
"```\n",
327+
"```java\n",
329328
"package com.ibm.org;\n",
330329
"import A.B.C.D;\n",
331330
"...\n",
@@ -345,7 +344,7 @@
345344
" // do something\n",
346345
" } \n",
347346
"```\n",
348-
"Given the above class, let's say we want to generate a summary for the ```bar``` method. To understand what it does, we add the callee of this method in the prompt, which in this case is ```baz```. We also remove imports, comments, etc. All of these are done using a single call to ```sanitize_focal_class``` API. In this process, we also use Treesitter to analyze the code. Once the input code has been sanitized, we call the ```format_inst``` method to create the LLM prompt, which has been passed to ```prompt_ollama``` method to generate the summary using LLM."
347+
"Let's say we want to generate a summary for method `bar`. To understand what it does, we add the callees of this method in the prompt, which in this case includes `baz`. We remove the other methods, imports, comments, etc. All of this can be achieved with a single call to CLDK's `sanitize_focal_class` API. In this process, we also use Treesitter to analyze the code. After creating the sanitized code, we call the previously defined `format_inst` method to create the LLM prompt and pass the prompt to `prompt_ollama` to generate the method summary."
349348
]
350349
},
351350
{
@@ -357,30 +356,34 @@
357356
},
358357
"outputs": [],
359358
"source": [
360-
"# For simplicity, we run the code summarization for a single class and method. One can remove that filter to run this code for the entire application\n",
361-
"qualified_class_name = 'org.apache.commons.cli.GnuParser'\n",
362-
"method_signature = 'flatten(Options, String[], boolean)'\n",
363-
"# Iterate over all the files in the project\n",
359+
"# For simplicity, we run the code summarization for a single class and method (this filter can be removed to run this code over the entire application)\n",
360+
"target_class = \"org.apache.commons.cli.GnuParser\"\n",
361+
"target_method = \"flatten(Options, String[], boolean)\"\n",
362+
"\n",
363+
"# Iterate over all classes in the application\n",
364364
"for class_name in analysis.get_classes():\n",
365-
"    if class_name==qualified_class_name:\n",
365+
"    if class_name == target_class:\n",
366366
"        class_file_path = analysis.get_java_file(qualified_class_name=class_name)\n",
367-
"        # Iterate over all the methods in the class\n",
367+
"\n",
368+
"        # Read code for the class\n",
369+
"        with open(class_file_path, 'r') as f:\n",
370+
"            code_body = f.read()\n",
371+
"\n",
372+
"        # Initialize treesitter utils for the class file content\n",
373+
"        tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n",
374+
" \n",
375+
"        # Iterate over all methods in class\n",
368376
"        for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n",
369-
"            if method==method_signature:\n",
370-
"                # Get code body of the method\n",
371-
"                with open(class_file_path, 'r') as f:\n",
372-
"                    code_body = f.read()\n",
377+
"            if method == target_method:\n",
373378
" \n",
374-
"                # Initialize the treesitter utils for the class file content\n",
375-
"                tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n",
376-
" \n",
377379
"                # Get all the method details\n",
378380
"                method_details = analysis.get_method(qualified_class_name=class_name,\n",
379381
"                                                     qualified_method_name=method)\n",
380-
"                # Sanitize the class for analysis\n",
382+
" \n",
383+
"                # Sanitize the class for analysis with respect to the target method\n",
381384
"                sanitized_class = tree_sitter_utils.sanitize_focal_class(method_details.declaration)\n",
382385
" \n",
383-
"                # Format the instruction for the given focal method and class\n",
386+
"                # Format the instruction for the given target method and class\n",
384387
"                instruction = format_inst(\n",
385388
"                    code=sanitized_class,\n",
386389
"                    focal_method=method_details.declaration,\n",
@@ -417,7 +420,7 @@
417420
"name": "python",
418421
"nbconvert_exporter": "python",
419422
"pygments_lexer": "ipython3",
420-
"version": "3.11.4"
423+
"version": "3.11.9"
421424
}
422425
},
423426
"nbformat": 4,
