
Commit 6458c76

feat: add Inconsistent Description heuristic

Signed-off-by: Amine <[email protected]>
1 parent e07fff6

File tree

15 files changed (+292, −123 lines)

src/macaron/ai/README.md

Lines changed: 7 additions & 7 deletions

@@ -5,13 +5,13 @@ This module provides the foundation for interacting with Large Language Models (
 ## Module Components
 
 - **ai_client.py**
-  Defines the abstract [`AIClient`](./ai_client.py) class. This class handles the initialization of LLM configuration from the defaults and serves as the base for all specific AI client implementations.
+  Defines the abstract [`AIClient`](./clients/base.py) class. This class handles the initialization of LLM configuration from the defaults and serves as the base for all specific AI client implementations.
 
 - **openai_client.py**
-  Implements the [`OpenAiClient`](./openai_client.py) class, a concrete subclass of [`AIClient`](./ai_client.py). This client interacts with OpenAI-like APIs by sending requests using HTTP and processing the responses. It also validates and structures responses using the tools provided.
+  Implements the [`OpenAiClient`](./clients/openai_client.py) class, a concrete subclass of [`AIClient`](./ai_client.py). This client interacts with OpenAI-like APIs by sending requests using HTTP and processing the responses. It also validates and structures responses using the tools provided.
 
 - **ai_factory.py**
-  Contains the [`AIClientFactory`](./ai_factory.py) class, which is responsible for reading provider configuration from the defaults and creating the correct AI client instance.
+  Contains the [`AIClientFactory`](./clients/base.py) class, which is responsible for reading provider configuration from the defaults and creating the correct AI client instance.
 
 - **ai_tools.py**
   Offers utility functions such as `structure_response` to assist with parsing and validating the JSON response returned by an LLM. These functions ensure that responses conform to a given Pydantic model for easier downstream processing.
@@ -22,11 +22,11 @@ This module provides the foundation for interacting with Large Language Models (
    The module reads the LLM configuration from the application defaults (using the `defaults` module). Make sure that the `llm` section in your configuration includes valid settings such as `enabled`, `api_key`, `api_endpoint`, `model`, and `context_window`.
 
 2. **Creating a Client:**
-   Use the [`AIClientFactory`](./ai_factory.py) to create an AI client instance. The factory checks the configured provider and returns a client (e.g., an instance of [`OpenAiClient`](./openai_client.py)) that can be used to invoke the LLM.
+   Use the [`AIClientFactory`](./clients/ai_factory.py) to create an AI client instance. The factory checks the configured provider and returns a client (e.g., an instance of [`OpenAiClient`](./clients/openai_client.py)) that can be used to invoke the LLM.
 
    Example:
    ```py
-   from macaron.ai.ai_factory import AIClientFactory
+   from macaron.ai.clients.ai_factory import AIClientFactory
 
    factory = AIClientFactory()
    client = factory.create_client(system_prompt="You are a helpful assistant.")
@@ -45,6 +45,6 @@ This module provides the foundation for interacting with Large Language Models (
 ## Extensibility
 
 The design of the AI module is provider-agnostic. To add support for additional LLM providers:
-- Implement a new client by subclassing [`AIClient`](./ai_client.py).
-- Add the new client to the [`PROVIDER_MAPPING`](./ai_factory.py).
+- Implement a new client by subclassing [`AIClient`](./clients/base.py).
+- Add the new client to the [`PROVIDER_MAPPING`](./clients/ai_factory.py).
 - Update the configuration defaults accordingly.

src/macaron/ai/ai_tools.py

Lines changed: 6 additions & 16 deletions

@@ -5,32 +5,26 @@
 import json
 import logging
 import re
-from typing import TypeVar
-
-from pydantic import BaseModel, ValidationError
-
-T = TypeVar("T", bound=BaseModel)
+from typing import Any
 
 logger: logging.Logger = logging.getLogger(__name__)
 
 
-def structure_response(response_text: str, response_model: type[T]) -> T | None:
+def extract_json(response_text: str) -> Any:
     """
-    Structure and parse the response from the LLM.
+    Parse the response from the LLM.
 
     If raw JSON parsing fails, attempts to extract a JSON object from text.
 
     Parameters
     ----------
     response_text: str
         The response text from the LLM.
-    response_model: Type[T]
-        The Pydantic model to structure the response against.
 
     Returns
     -------
-    T | None
-        The structured Pydantic model instance.
+    dict[str, Any] | None
+        The structured JSON object.
     """
     try:
         data = json.loads(response_text)
@@ -46,8 +40,4 @@ def structure_response(response_text: str, response_model: type[T]) -> T | None:
             logger.debug("Failed to parse extracted JSON: %s", e)
             return None
 
-    try:
-        return response_model.model_validate(data)
-    except ValidationError as e:
-        logger.debug("Validation failed against response model: %s", e)
-        return None
+    return data
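The replacement helper drops Pydantic validation entirely and returns a plain parsed JSON object. A minimal standalone sketch of the behavior `extract_json` implements; the fallback regex is not visible in this hunk, so the pattern used below is an assumption:

```python
import json
import re
from typing import Any


def extract_json(response_text: str) -> Any:
    """Parse JSON from an LLM response, falling back to an embedded {...} span."""
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        # Assumed fallback: take the outermost brace-delimited span in the text.
        match = re.search(r"\{.*\}", response_text, re.DOTALL)
        if match is None:
            return None
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None


print(extract_json('Sure! Here is the result: {"consistent": true}'))  # → {'consistent': True}
```

Callers that previously received a validated model instance now get a `dict` (or `None`) and must validate the schema themselves, e.g. via the `response_format` parameter added to the clients.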

src/macaron/ai/clients/__init__.py

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+# Copyright (c) 2025 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
+
+"""This module provides a mapping of AI client providers to their respective client classes."""
+
+from macaron.ai.clients.base import AIClient
+from macaron.ai.clients.openai_client import OpenAiClient
+
+PROVIDER_MAPPING: dict[str, type[AIClient]] = {"openai": OpenAiClient}

src/macaron/ai/ai_factory.py renamed to src/macaron/ai/clients/ai_factory.py

Lines changed: 13 additions & 21 deletions

@@ -5,8 +5,8 @@
 
 import logging
 
-from macaron.ai.ai_client import AIClient
-from macaron.ai.openai_client import OpenAiClient
+from macaron.ai.clients import PROVIDER_MAPPING
+from macaron.ai.clients.base import AIClient
 from macaron.config.defaults import defaults
 from macaron.errors import ConfigurationError
 
@@ -16,37 +16,30 @@
 class AIClientFactory:
     """Factory to create AI clients based on provider configuration."""
 
-    PROVIDER_MAPPING: dict[str, type[AIClient]] = {"openai": OpenAiClient}
-
     def __init__(self) -> None:
         """
         Initialize the AI client.
 
         The LLM configuration is read from defaults.
         """
-        self.defaults = self._load_defaults()
+        self.params = self._load_defaults()
 
-    def _load_defaults(self) -> dict:
+    def _load_defaults(self) -> dict | None:
         section_name = "llm"
         default_values = {
             "enabled": False,
             "provider": "",
             "api_key": "",
             "api_endpoint": "",
             "model": "",
-            "context_window": 10000,
         }
 
         if defaults.has_section(section_name):
             section = defaults[section_name]
             default_values["enabled"] = section.getboolean("enabled", default_values["enabled"])
-            default_values["api_key"] = str(section.get("api_key", default_values["api_key"])).strip().lower()
-            default_values["api_endpoint"] = (
-                str(section.get("api_endpoint", default_values["api_endpoint"])).strip().lower()
-            )
-            default_values["model"] = str(section.get("model", default_values["model"])).strip().lower()
-            default_values["provider"] = str(section.get("provider", default_values["provider"])).strip().lower()
-            default_values["context_window"] = section.getint("context_window", 10000)
+            for key, default_value in default_values.items():
+                if isinstance(default_value, str):
+                    default_values[key] = str(section.get(key, default_value)).strip().lower()
 
         if default_values["enabled"]:
             for key, value in default_values.items():
@@ -59,12 +52,11 @@ def _load_defaults(self) -> dict:
 
     def create_client(self, system_prompt: str) -> AIClient | None:
         """Create an AI client based on the configured provider."""
-        client_class = self.PROVIDER_MAPPING.get(self.defaults["provider"])
-        if client_class is None:
-            logger.error("Provider '%s' is not supported.", self.defaults["provider"])
+        if not self.params or not self.params["enabled"]:
             return None
-        return client_class(system_prompt, self.defaults)
 
-    def list_available_providers(self) -> list[str]:
-        """List all registered providers."""
-        return list(self.PROVIDER_MAPPING.keys())
+        client_class = PROVIDER_MAPPING.get(self.params["provider"])
+        if client_class is None:
+            logger.error("Provider '%s' is not supported.", self.params["provider"])
+            return None
+        return client_class(system_prompt, self.params)

src/macaron/ai/ai_client.py renamed to src/macaron/ai/clients/base.py

Lines changed: 8 additions & 16 deletions

@@ -3,36 +3,28 @@
 
 """This module defines the abstract AIClient class for implementing AI clients."""
 
-import logging
 from abc import ABC, abstractmethod
-from typing import Any, TypeVar
-
-from pydantic import BaseModel
-
-T = TypeVar("T", bound=BaseModel)
-
-logger: logging.Logger = logging.getLogger(__name__)
 
 
 class AIClient(ABC):
     """This abstract class is used to implement ai clients."""
 
-    def __init__(self, system_prompt: str, defaults: dict) -> None:
+    def __init__(self, system_prompt: str, params: dict) -> None:
         """
         Initialize the AI client.
 
         The LLM configuration is read from defaults.
         """
         self.system_prompt = system_prompt
-        self.defaults = defaults
+        self.params = params
 
     @abstractmethod
     def invoke(
         self,
         user_prompt: str,
         temperature: float = 0.2,
-        structured_output: type[T] | None = None,
-    ) -> Any:
+        response_format: dict | None = None,
+    ) -> dict:
         """
         Invoke the LLM and optionally validate its response.
 
@@ -42,12 +34,12 @@ def invoke(
             The user prompt to send to the LLM.
         temperature: float
             The temperature for the LLM response.
-        structured_output: Optional[Type[T]]
-            The Pydantic model to validate the response against. If provided, the response will be parsed and validated.
+        response_format: dict | None
            The json schema to validate the response against.
 
         Returns
         -------
-        Optional[T | str]
-            The validated Pydantic model instance if `structured_output` is provided,
+        dict
+            The validated schema if `response_format` is provided,
             or the raw string response if not.
         """

src/macaron/ai/openai_client.py renamed to src/macaron/ai/clients/openai_client.py

Lines changed: 11 additions & 22 deletions

@@ -8,8 +8,8 @@
 
 from pydantic import BaseModel
 
-from macaron.ai.ai_client import AIClient
-from macaron.ai.ai_tools import structure_response
+from macaron.ai.ai_tools import extract_json
+from macaron.ai.clients.base import AIClient
 from macaron.errors import ConfigurationError, HeuristicAnalyzerValueError
 from macaron.util import send_post_http_raw
 
@@ -25,7 +25,7 @@ def invoke(
         self,
         user_prompt: str,
         temperature: float = 0.2,
-        structured_output: type[T] | None = None,
+        response_format: dict | None = None,
         max_tokens: int = 4000,
         timeout: int = 30,
     ) -> Any:
@@ -38,8 +38,8 @@ def invoke(
             The user prompt to send to the LLM.
         temperature: float
             The temperature for the LLM response.
-        structured_output: Optional[Type[T]]
-            The Pydantic model to validate the response against. If provided, the response will be parsed and validated.
+        response_format: dict
+            The json schema to validate the response against. If provided, the response will be parsed and validated.
         max_tokens: int
             The maximum number of tokens for the LLM response.
         timeout: int
@@ -56,28 +56,21 @@
         HeuristicAnalyzerValueError
             If there is an error in parsing or validating the response.
         """
-        if not self.defaults["enabled"]:
+        if not self.params["enabled"]:
             raise ConfigurationError("AI client is not enabled. Please check your configuration.")
 
-        if len(user_prompt.split()) > self.defaults["context_window"]:
-            logger.warning(
-                "User prompt exceeds context window (%s words). "
-                "Truncating the prompt to fit within the context window.",
-                self.defaults["context_window"],
-            )
-            user_prompt = " ".join(user_prompt.split()[: self.defaults["context_window"]])
-
-        headers = {"Content-Type": "application/json", "Authorization": f"Bearer {self.defaults["api_key"]}"}
+        headers = {"Content-Type": "application/json", "Authorization": f"Bearer {self.params['api_key']}"}
         payload = {
-            "model": self.defaults["model"],
+            "model": self.params["model"],
             "messages": [{"role": "system", "content": self.system_prompt}, {"role": "user", "content": user_prompt}],
+            "response_format": response_format,
             "temperature": temperature,
            "max_tokens": max_tokens,
        }
 
        try:
            response = send_post_http_raw(
-                url=self.defaults["api_endpoint"], json_data=payload, headers=headers, timeout=timeout
+                url=self.params["api_endpoint"], json_data=payload, headers=headers, timeout=timeout
            )
            if not response:
                raise HeuristicAnalyzerValueError("No response received from the LLM.")
@@ -89,11 +82,7 @@
             logger.info("LLM call token usage: %s", usage_str)
 
             message_content = response_json["choices"][0]["message"]["content"]
-
-            if not structured_output:
-                logger.debug("Returning raw message content (no structured output requested).")
-                return message_content
-            return structure_response(message_content, structured_output)
+            return extract_json(message_content)
 
         except Exception as e:
             logger.error("Error during LLM invocation: %s", e)
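The request body now forwards `response_format` straight through to the OpenAI-compatible endpoint. A sketch of how such a JSON-schema response format is commonly shaped for these APIs; the field names follow the OpenAI structured-outputs convention and the schema content is a hypothetical example for this heuristic, neither is taken from the diff itself:

```python
# Hypothetical schema asking the model for a single boolean verdict.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "inconsistent_description",
        "schema": {
            "type": "object",
            "properties": {"inconsistent": {"type": "boolean"}},
            "required": ["inconsistent"],
        },
    },
}

# Shape of the payload the client builds (model name is a placeholder).
payload = {
    "model": "some-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Is this package description consistent with its code?"},
    ],
    "response_format": response_format,
    "temperature": 0.2,
    "max_tokens": 4000,
}
```

Note that passing `"response_format": None` when no schema is requested relies on the endpoint ignoring a null field; endpoints that reject nulls would need the key omitted instead.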

src/macaron/ai/prompts/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+# Copyright (c) 2025 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

src/macaron/ai/schemas/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+# Copyright (c) 2025 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

src/macaron/config/defaults.ini

Lines changed: 0 additions & 3 deletions

@@ -647,6 +647,3 @@ api_key =
 api_endpoint =
 # The model to use for the LLM service.
 model =
-# The context window size for the LLM service.
-# This is the maximum number of tokens that the LLM can process in a single request.
-context_window = 10000

src/macaron/malware_analyzer/pypi_heuristics/heuristics.py

Lines changed: 3 additions & 0 deletions

@@ -52,6 +52,9 @@ class Heuristics(str, Enum):
     #: Indicates that the package contains some code that doesn't match the docstrings.
     MATCHING_DOCSTRINGS = "matching_docstrings"
 
+    #: Indicates that the package description is inconsistent.
+    INCONSISTENT_DESCRIPTION = "inconsistent_description"
+
 
 class HeuristicResult(str, Enum):
     """Result type indicating the outcome of a heuristic."""
