diff --git a/.github/workflows/mkdocs-deploy.yml b/.github/workflows/mkdocs-deploy.yml index 95a9508c..f17ea8df 100644 --- a/.github/workflows/mkdocs-deploy.yml +++ b/.github/workflows/mkdocs-deploy.yml @@ -2,7 +2,8 @@ name: mkdocs-deploy on: push: branches: - - docs + - docs + permissions: contents: write jobs: @@ -19,6 +20,6 @@ jobs: with: python-version: "3.11" - name: Install Dependencies - run: pip install mkdocs-material mkdocs-autorefs mkdocs-get-deps mkdocs-material-extensions mkdocstrings mkdocstrings-python + run: pip install mkdocs-material[imaging] mkdocs-autorefs mkdocs-get-deps mkdocs-material-extensions mkdocstrings mkdocstrings-python mkdocs-git-revision-date-localized-plugin - name: Deploy docs run: mkdocs gh-deploy --force --verbose diff --git a/.gitignore b/.gitignore index e79973ed..bce9afb2 100644 --- a/.gitignore +++ b/.gitignore @@ -14,6 +14,9 @@ # BlueJ files *.ctxt +# Cache files +.cache/ + # Mobile Tools for Java (J2ME) .mtj.tmp/ diff --git a/README.md b/README.md index 1c06e0a8..3d4d4110 100644 --- a/README.md +++ b/README.md @@ -25,24 +25,24 @@
-Codellm-devkit (CLDK) is a multilingual program analysis framework that bridges the gap between traditional static analysis tools and Large Language Models (LLMs) specialized for code (CodeLLMs). Codellm-devkit allows developers to streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. +Codellm-Devkit (CLDK) is a multilingual program analysis framework that bridges the gap between traditional static analysis tools and Large Language Models (LLMs) specialized for code (CodeLLMs). Codellm-Devkit allows developers to streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. -Codellm-devkit simplifies the complex process of analyzing codebases that span multiple programming languages, making it easier to extract meaningful insights and drive LLM-based code analysis. `CLDK` achieves this through an open-source Python library that abstracts the intricacies of program analysis and LLM interactions. With this library, developer can streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. +Codellm-Devkit simplifies the complex process of analyzing codebases that span multiple programming languages, making it easier to extract meaningful insights and drive LLM-based code analysis. `CLDK` achieves this through an open-source Python library that abstracts the intricacies of program analysis and LLM interactions. 
With this library, developers can streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. -**The purpose of Codellm-devkit is to enable the development and experimentation of robust analysis pipelines that harness the power of both traditional program analysis tools and CodeLLMs.** -By providing a consistent and extensible framework, Codellm-devkit aims to reduce the friction associated with multi-language code analysis and ensure compatibility across different analysis tools and LLM platforms. +**The purpose of Codellm-Devkit is to enable the development and experimentation of robust analysis pipelines that harness the power of both traditional program analysis tools and CodeLLMs.** +By providing a consistent and extensible framework, Codellm-Devkit aims to reduce the friction associated with multi-language code analysis and ensure compatibility across different analysis tools and LLM platforms. -Codellm-devkit is designed to integrate seamlessly with a variety of popular analysis tools, such as WALA, Tree-sitter, LLVM, and CodeQL, each implemented in different languages. Codellm-devkit acts as a crucial intermediary layer, enabling efficient and consistent communication between these tools and the CodeLLMs. +Codellm-Devkit is designed to integrate seamlessly with a variety of popular analysis tools, such as WALA, Tree-sitter, LLVM, and CodeQL, each implemented in different languages. Codellm-Devkit acts as a crucial intermediary layer, enabling efficient and consistent communication between these tools and the CodeLLMs. -Codellm-devkit is constantly evolving to include new tools and frameworks, ensuring it remains a versatile solution for code analysis and LLM integration. 
+Codellm-Devkit is constantly evolving to include new tools and frameworks, ensuring it remains a versatile solution for code analysis and LLM integration. -Codellm-devkit is: +Codellm-Devkit is: - **Unified**: Provides a single framework for integrating multiple analysis tools and CodeLLMs, regardless of the programming languages involved. - **Extensible**: Designed to support new analysis tools and LLM platforms, making it adaptable to the evolving landscape of code analysis. - **Streamlined**: Simplifies the process of transforming raw code into structured, LLM-ready inputs, reducing the overhead typically associated with multi-language analysis. -Codellm-devkit is an ongoing project, developed at IBM Research. +Codellm-Devkit is an ongoing project, developed at IBM Research. ## Contact @@ -55,17 +55,16 @@ For any questions, feedback, or suggestions, please contact the authors: | Saurabh Sihna | [sinhas@us.ibm.com](mailto:sinhas@us.ibm.com) | ## Table of Contents -- [CodeLLM-Devkit: A Python library for seamless interaction with CodeLLMs](#codellm-devkit-a-python-library-for-seamless-interaction-with-codellms) - - [Contact](#contact) - - [Table of Contents](#table-of-contents) - - [Architectural and Design Overview](#architectural-and-design-overview) - - [Quick Start: Example Walkthrough](#quick-start-example-walkthrough) - - [Prerequisites](#prerequisites) - - [Step 1: Set up an Ollama server](#step-1--set-up-an-ollama-server) - - [Pull the latest version of Granite 8b instruct model from ollama](#pull-the-latest-version-of-granite-8b-instruct-model-from-ollama) - - [Step 2: Install CLDK](#step-2--install-cldk) - - [Step 3: Build a code summarization pipeline](#step-3--build-a-code-summarization-pipeline) - - [Publication (papers and blogs related to CLDK)](#publication-papers-and-blogs-related-to-cldk) +- [Contact](#contact) +- [Table of Contents](#table-of-contents) +- [Architectural and Design Overview](#architectural-and-design-overview) +- [Quick Start: 
Example Walkthrough](#quick-start-example-walkthrough) + - [Prerequisites](#prerequisites) + - [Step 1: Set up an Ollama server](#step-1--set-up-an-ollama-server) + - [Pull the latest version of Granite 8b instruct model from ollama](#pull-the-latest-version-of-granite-8b-instruct-model-from-ollama) + - [Step 2: Install CLDK](#step-2--install-cldk) + - [Step 3: Build a code summarization pipeline](#step-3--build-a-code-summarization-pipeline) + - [Publication (papers and blogs related to CLDK)](#publication-papers-and-blogs-related-to-cldk) ## Architectural and Design Overview diff --git a/docs/CNAME b/docs/CNAME new file mode 100644 index 00000000..d048e9b8 --- /dev/null +++ b/docs/CNAME @@ -0,0 +1 @@ +codellm-devkit.info diff --git a/docs/README.md b/docs/README.md deleted file mode 100644 index 4338d8f2..00000000 --- a/docs/README.md +++ /dev/null @@ -1,78 +0,0 @@ - - -[](https://arxiv.org/abs/2410.13007) -[](https://www.python.org/downloads/release/python-3110/) -[](https://opensource.org/licenses/Apache-2.0) -[](https://ibm.github.io/codellm-devkit/) -[](https://badge.fury.io/py/cldk) - -# CodeLLM-Devkit: A Python library for seamless interaction with CodeLLMs - -Codellm-devkit (CLDK) is a multilingual program analysis framework that bridges the gap between traditional static analysis tools and Large Language Models (LLMs) specialized for code (CodeLLMs). Codellm-devkit allows developers to streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. - -Codellm-devkit simplifies the complex process of analyzing codebases that span multiple programming languages, making it easier to extract meaningful insights and drive LLM-based code analysis. `CLDK` achieves this through an open-source Python library that abstracts the intricacies of program analysis and LLM interactions. 
With this library, developer can streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. - -**The purpose of Codellm-devkit is to enable the development and experimentation of robust analysis pipelines that harness the power of both traditional program analysis tools and CodeLLMs.** -By providing a consistent and extensible framework, Codellm-devkit aims to reduce the friction associated with multi-language code analysis and ensure compatibility across different analysis tools and LLM platforms. - -Codellm-devkit is designed to integrate seamlessly with a variety of popular analysis tools, such as WALA, Tree-sitter, LLVM, and CodeQL, each implemented in different languages. Codellm-devkit acts as a crucial intermediary layer, enabling efficient and consistent communication between these tools and the CodeLLMs. - -Codellm-devkit is constantly evolving to include new tools and frameworks, ensuring it remains a versatile solution for code analysis and LLM integration. - -Codellm-devkit is: - -- **Unified**: Provides a single framework for integrating multiple analysis tools and CodeLLMs, regardless of the programming languages involved. -- **Extensible**: Designed to support new analysis tools and LLM platforms, making it adaptable to the evolving landscape of code analysis. -- **Streamlined**: Simplifies the process of transforming raw code into structured, LLM-ready inputs, reducing the overhead typically associated with multi-language analysis. 
- -## Architectural and Design Overview - -Below is a very high-level overview of the architectural of CLDK: - - -```mermaid -graph TD -User <--> A[CLDK] - A --> 15[Retrieval ‡] - A --> 16[Prompting ‡] - A[CLDK] <--> B[Languages] - B --> C[Java, Python, Go ‡, C ‡, JavaScript ‡, TypeScript ‡, Rust ‡] - C --> D[Data Models] - D --> 13{Pydantic} - 13 --> 7 - C --> 7{backends} - 7 <--> 9[WALA] - 9 <--> 14[Analysis] - 7 <--> 10[Tree-sitter] - 10 <--> 14[Analysis] - 7 <--> 11[LLVM ‡] - 11 <--> 14[Analysis] - 7 <--> 12[CodeQL ‡] - 12 <--> 14[Analysis] - - - -X[‡ Yet to be implemented] -``` - -The user interacts by invoking the CLDK API. The CLDK API is responsible for handling the user requests and delegating them to the appropriate language-specific modules. - -Each language comprises of two key components: data models and backends. - -1. **Data Models:** These are high level abstractions that represent the various language constructs and componentes in a structured format using pydantic. This confers a high degree of flexibility and extensibility to the models as well as allowing for easy accees of various data components via a simple dot notation. In addition, the data models are designed to be easily serializable and deserializable, making it easy to store and retrieve data from various sources. - -2. **Analysis Backends:** These are the components that are responsible for interfacing with the various program analysis tools. The core backends are Treesitter, Javaparse, WALA, LLVM, and CodeQL. The backends are responsible for handling the user requests and delegating them to the appropriate analysis tools. The analysis tools perfrom the requisite analysis and return the results to the user. The user merely calls one of several high-level API functions such as `get_method_body`, `get_method_signature`, `get_call_graph`, etc. and the backend takes care of the rest. - - Some langugages may have multiple backends. 
For example, Java has WALA, Javaparser, Treesitter, and CodeQL backends. The user has freedom to choose the backend that best suits their needs. - -We are currently working on implementing the retrieval and prompting components. The retrieval component will be responsible for retrieving the relevant code snippets from the codebase for RAG usecases. The prompting component will be responsible for generating the prompts for the CodeLLMs using popular prompting frameworks such as `PDL`, `Guidance`, or `LMQL`. - -## Contact - -For any questions, feedback, or suggestions, please contact the authors: - -| Name | Email | -| ---- | ----- | -| Rahul Krishna | [i.m.ralk@gmail.com](mailto:imralk+oss@gmail.com) | -| Rangeet Pan | [rangeet.pan@ibm.com](mailto:rangeet.pan@gmail.com) | -| Saurabh Sihna | [sinhas@us.ibm.com](mailto:sinhas@us.ibm.com) | \ No newline at end of file diff --git a/docs/api_reference.md b/docs/api_reference.md deleted file mode 100644 index f94e373b..00000000 --- a/docs/api_reference.md +++ /dev/null @@ -1,21 +0,0 @@ -# API Reference - -This page documents the API functionalities. - -## Models - -## Python -::: cldk.models.python.models - -## Java -::: cldk.models.java.models - -## Treesitter -::: cldk.models.treesitter.models - -## Python -::: cldk.analysis.python - -## Java -::: cldk.analysis.java - diff --git a/docs/assets/cldk-dark.ai b/docs/assets/cldk-dark.ai new file mode 100644 index 00000000..6a99523f --- /dev/null +++ b/docs/assets/cldk-dark.ai @@ -0,0 +1,1915 @@