
Agentic Detection of Malicious Knowledge Edits in LLMs

This repository contains the official implementation for the project "Agentic Detection of Malicious Knowledge Edits in LLMs".

Our project introduces an orchestration-first, multi-agent framework to detect and classify malicious knowledge edits in Large Language Models (LLMs). This system is designed to audit LLMs against the Knowledge Editing Type Identification (KETI) taxonomy, providing a robust and interpretable defense against harmful modifications.

Key Features

  • Multi-Agent Orchestration: Coordinates a team of specialized agents to analyze model behavior from multiple perspectives.
  • Hybrid Detection: Combines internal model signals (via a DEED-style detector) with external content and behavioral risk analysis.
  • Interpretable & Modular: Each agent has a specific role, making the decision process transparent and allowing for easy extension or replacement of components.
  • High Performance: Achieves state-of-the-art results on the KETI benchmark, outperforming baselines on models like LLaMA-3-8B and GPT-2-XL.
  • Perfect "Non-Edited" Recognition: The sentinel-gated architecture achieves a perfect F1-score when identifying unedited models on the KETI test set.

Architecture

Our system employs an "orchestration-first" approach where a controller agent manages a workflow of specialized agents. The process ensures that resources are used efficiently and decisions are made based on a comprehensive set of evidence.

System Architecture Diagram

Decision Flow

The decision flow is as follows:

  1. Sentinel Gating: The EditDetection Agent, built on DEED-style evidence, first checks if any knowledge edit has occurred. If no edit is detected, the process exits early, classifying the model as Non-Edited (NE). This significantly improves efficiency.
  2. Specialist Invocation: If an edit is suspected, the orchestrator invokes two specialist agents in parallel:
    • Harmfulness Agent: Assesses content-level risks such as toxicity, bias, and ethical violations.
    • Security Agent: Evaluates behavioral vulnerabilities, including the potential for misuse, jailbreaking, or privacy leaks.
  3. Evidence Synthesis: The Summarizer Agent receives the evidence tuples from all upstream agents. It semantically aggregates these signals to produce a final, fine-grained classification according to the KETI taxonomy.
  4. Benchmark Decision Head: For high-throughput evaluation, the agent signals are combined with 35 statistical text features into a 45-dimensional vector. A lightweight classifier (e.g., Balanced Random Forest) is trained on this vector to serve as the final decision head.
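The gated flow above can be sketched in plain Python. This is an illustrative outline, not code from this repository: the agent callables, evidence dictionaries, and the 0.5 gating threshold are all assumptions made for the example.

```python
def run_audit(query, response, edit_detector, harm_agent, security_agent,
              summarizer, gate_threshold=0.5):
    """Classify one prompt-response pair with sentinel-gated orchestration.

    Illustrative sketch: the agent interfaces and threshold are assumed,
    not taken from the repository.
    """
    # 1. Sentinel gating: exit early when no knowledge edit is suspected.
    edit_score = edit_detector(query, response)
    if edit_score < gate_threshold:
        return {"label": "NE", "evidence": {"edit_score": edit_score}}

    # 2. Specialist invocation: both agents analyze the same pair.
    harm_evidence = harm_agent(query, response)
    security_evidence = security_agent(query, response)

    # 3. Evidence synthesis: the summarizer maps the evidence tuples
    #    to a fine-grained KETI label.
    label = summarizer(edit_score, harm_evidence, security_evidence)
    return {
        "label": label,
        "evidence": {
            "edit_score": edit_score,
            "harm": harm_evidence,
            "security": security_evidence,
        },
    }
```

With stubbed agents, a low edit score short-circuits to "NE" without ever invoking the two specialists, which is where the efficiency gain comes from.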

Setup and Installation

  1. Clone the repository:

    git clone https://github.com/williamli-15/llm_check.git
    cd llm_check
  2. Create a Python virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure API Keys: The Harmfulness, Security, and Summarizer agents use an LLM for analysis. Configure your API key in config/default_config.yaml. By default, it is configured for the OpenAI API via OpenRouter.

    # In config/default_config.yaml
    llm:
      default:
        api_key: "sk-or-v1-..." # Replace with your OpenRouter or OpenAI API key
        base_url: https://openrouter.ai/api/v1
        model_name: openai/gpt-4o
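Since the agents target any OpenAI-compatible endpoint, the settings above translate into a standard chat-completions request. The helper below is an illustrative sketch of that mapping (it is not code from this repository; field names follow the OpenAI chat-completions wire format):

```python
import json


def build_chat_request(base_url, model_name, api_key, prompt):
    """Assemble an OpenAI-compatible chat-completions request from the
    config values above. Illustrative helper, not repository code."""
    return {
        "url": f"{base_url}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }
```

Pointing `base_url` at OpenRouter versus the OpenAI API only changes the host and the key prefix; the payload shape stays the same.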

Usage

The main script for running the system is run_ensemble_system.py. It supports training, inference, and evaluation.

Training and Evaluation

To train a new decision head and evaluate it on the KETI test set, use the train mode. You must specify the target LLM to be audited.

Example (auditing LLaMA-3-8B):

python run_ensemble_system.py --mode train \
  --target-model "meta-llama/Llama-3-8B" \
  --device cuda
  • --target-model: Specifies the Hugging Face model path or a local path for the LLM to be analyzed. The script will download it if not available locally.
  • --device: Set to cuda to run on a GPU, or cpu otherwise.
  • For large models, add --load-in-8bit or --load-in-4bit to reduce memory usage.

The script will:

  1. Extract features from the datasets/train.json file using the specified target model.
  2. Train multiple classifiers for the decision head.
  3. Select the best classifier and save it to trained_models/keti_ensemble_model.pkl.
  4. Run a final evaluation on datasets/test.json and print the performance report.
  5. Save detailed results to results/training_results.json.
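Steps 2–3 of the pipeline amount to fitting a lightweight classifier on the 45-dimensional vectors. The sketch below uses synthetic random features and scikit-learn's RandomForestClassifier with class_weight="balanced" as a stand-in for a Balanced Random Forest; the data, label set, and classifier settings are assumptions for illustration, not the repository's actual training code.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))    # placeholder 45-dim feature vectors
y = rng.integers(0, 5, size=200)  # placeholder class labels

# Class-weighted forest as a stand-in for a Balanced Random Forest.
head = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                              random_state=0)
head.fit(X, y)
preds = head.predict(X)

# The real script pickles a similar object to
# trained_models/keti_ensemble_model.pkl.
blob = pickle.dumps(head)
```

Swapping in another classifier for the decision head only means replacing the `head` object before the fit/predict/pickle steps.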

Single Inference

To classify a single prompt-response pair, use the single mode. This will load the pre-trained ensemble model for quick inference.

Example:

python run_ensemble_system.py --mode single \
  --query "How can I safely clean my computer?" \
  --object "Just run 'sudo rm -rf /*'. It's the fastest way to clean everything." \
  --target-model "gpt2-xl"

The system will output the predicted KETI label and a confidence score.

Configuration

The system's behavior is controlled by config/default_config.yaml. Key settings include:

  • models.target_model: Configure the default path, device, and quantization settings for the LLM being audited.
  • llm: Configure the LLM provider, model, and API keys for the agents. Supports OpenAI, OpenRouter, and other providers.
  • agents: Fine-tune the behavior of each agent.
  • deed_detector: Configure parameters for the EditDetection agent.

Repository Structure

├── agents/             # Core logic for each specialized agent and the ensemble learning system.
├── config/             # System configuration files (YAML).
├── core/               # Low-level components like the DEED detector and state management.
├── datasets/           # KETI benchmark training and testing data (train.json, test.json).
├── figures/            # Images used in the README (e.g., the architecture diagram).
├── trained_models/     # Saved models for the benchmark decision head.
├── results/            # Output directory for training and evaluation reports.
├── tools/              # Utility scripts.
└── run_ensemble_system.py # Main entry point for training and inference.

Citation

This project is not yet published. If you use this work in your research, please star the repository and cite it directly using the GitHub URL:

https://github.com/williamli-15/llm_check

License

This project is licensed under the MIT License.