diff --git a/recipes/Contract-Analysis/Granite_Docling_Price_Adjustment_Detection.ipynb b/recipes/Contract-Analysis/Granite_Docling_Price_Adjustment_Detection.ipynb new file mode 100644 index 00000000..b0643060 --- /dev/null +++ b/recipes/Contract-Analysis/Granite_Docling_Price_Adjustment_Detection.ipynb @@ -0,0 +1,461 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "05f23d02", + "metadata": {}, + "source": [ + "# Contract Price Adjustment Detection using IBM Granite Models\n", + "\n", + "**Granite 4 Small × Granite Docling 258M** for clause extraction and classification" + ] + }, + { + "cell_type": "markdown", + "id": "bed88299", + "metadata": {}, + "source": [ + "In this notebook, we demonstrate how to automatically **detect and classify price adjustment clauses** (such as CPI-based or cost-based increases) in B2B contracts.\n", + "\n", + "Automating this analysis helps procurement, finance, and legal teams quickly identify pricing flexibility and escalation risk across large volumes of supplier agreements.\n", + "\n", + "We use:\n", + "- **Granite Docling (`ibm-granite/granite-docling-258M-mlx`)** for PDF-to-Text conversion \n", + "- **Granite 4 Small (`ibm/granite-4-h-small`)** for semantic analysis and clause classification \n", + "\n", + "This workflow extracts clauses like:\n", + "- CPI-linked adjustments (inflation, cost-of-living)\n", + "- Cost-based adjustments (materials, energy, logistics)\n", + "- Penalty or performance-based price changes\n", + "- Explicitly fixed price (no adjustment)\n" + ] + }, + { + "cell_type": "markdown", + "id": "c7012a83", + "metadata": {}, + "source": [ + "### 1. Install Dependencies\n", + "\n", + "Install all required libraries for document parsing and LLM-based classification." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2cc74a9d", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install \"git+https://github.com/ibm-granite-community/utils\" \\\n", + " docling \\\n", + " langchain \\\n", + " langchain_ibm \\\n", + " langchain_community \\\n", + " transformers \\\n", + " mlx-vlm \n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "72ba1cea", + "metadata": {}, + "source": [ + "### 2. Import Libraries\n", + "\n", + "Load all dependencies and configure the core components for extraction, logging, and analysis.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47d21c20", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json\n", + "import logging\n", + "import pandas as pd\n", + "from IPython.display import display\n", + "from docling.datamodel import vlm_model_specs\n", + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import VlmPipelineOptions\n", + "from docling.document_converter import DocumentConverter, PdfFormatOption\n", + "from docling.pipeline.vlm_pipeline import VlmPipeline\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_ibm import WatsonxLLM\n" + ] + }, + { + "cell_type": "markdown", + "id": "93311d77", + "metadata": {}, + "source": [ + "### 3. Initialize Granite Docling & WatsonxLLM\n", + "\n", + "Before extracting and classifying contracts, we need to initialize our two main engines:\n", + "\n", + "- **Granite Docling** (`ibm-granite/granite-docling-258M-mlx`): a multimodal image-text-to-text model that converts complex documents (PDFs, scanned images, etc.) into structured, machine-readable formats such as Markdown, HTML, or JSON.\n", + "\n", + "- **Granite 4 Small** (`ibm/granite-4-h-small`): the LLM used to semantically classify clauses." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b9264e9", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "logging.basicConfig(\n", + " level=logging.INFO,\n", + " format=\"%(asctime)s - %(levelname)s - %(message)s\"\n", + ")\n", + "logger = logging.getLogger(__name__)\n", + "\n", + "#Initialize Granite Docling\n", + "pipeline_options = VlmPipelineOptions(\n", + " vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,\n", + ")\n", + "\n", + "doc_converter = DocumentConverter(\n", + " format_options={\n", + " InputFormat.PDF: PdfFormatOption(\n", + " pipeline_cls=VlmPipeline,\n", + " pipeline_options=pipeline_options,\n", + " )\n", + " }\n", + ")\n", + "logger.info(\"Granite Docling initialized\")\n", + "\n", + "# Initialize WatsonxLLM\n", + "api_key = os.getenv(\"WATSON_API_KEY\")\n", + "project_id = os.getenv(\"WATSON_PROJECT_ID\")\n", + "watsonx_url = os.getenv(\"WATSON_URL\", \"https://us-south.ml.cloud.ibm.com\")\n", + "\n", + "\n", + "if not api_key or not project_id:\n", + " logger.error(\"WATSON_API_KEY or WATSON_PROJECT_ID environment variables not set\")\n", + " raise ValueError(\"Missing required environment variables\")\n", + "\n", + "llm = WatsonxLLM(\n", + " model_id=\"ibm/granite-4-h-small\",\n", + " apikey=api_key,\n", + " url=watsonx_url,\n", + " project_id=project_id,\n", + " params={\"decoding_method\": \"greedy\", \"max_new_tokens\": 3000},\n", + ")\n", + "\n", + "logger.info(\"WatsonxLLM initialized\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b5b55a43", + "metadata": {}, + "source": [ + "### 4. 
Define Helper Functions\n", + "\n", + "These utility functions perform the following key tasks:\n", + "\n", + "- **`extract_contract_text()`** → Converts PDF/DOCX files into Markdown text using Granite Docling \n", + "- **`classify_contract()`** → Clause detection and classification \n", + "- **`process_contracts()`** → Batch-processes all contracts in a directory \n", + "\n", + "Each function includes logging for traceability and robust error handling." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e494d8d", + "metadata": {}, + "outputs": [], + "source": [ + "def extract_contract_text(file_path, max_chars=32000):\n", + " \"\"\"Extract contract text using Granite Docling\"\"\"\n", + " try:\n", + " logger.info(f\"Extracting: {file_path}\")\n", + " result = doc_converter.convert(file_path)\n", + " text = result.document.export_to_markdown()\n", + " \n", + " # Validate text length\n", + " if len(text) > max_chars:\n", + " logger.warning(f\"Contract exceeds {max_chars} chars ({len(text)}), truncating\")\n", + " text = text[:max_chars]\n", + " \n", + " logger.info(f\"Extracted {len(text)} characters\")\n", + " return text\n", + " except Exception as e:\n", + " logger.error(f\"Extraction failed: {e}\")\n", + " return None\n", + "\n", + "def classify_contract(contract_name, contract_text):\n", + " \"\"\"Classify contract using Granite 4 with robust JSON-only prompt\"\"\"\n", + "\n", + " prompt = f\"\"\"\n", + "You are a professional contract analysis model trained to identify all pricing mechanisms in B2B service agreements.\n", + "\n", + "Your goal is to find, classify, and explain **all clauses that describe how prices can change, or confirm that prices cannot change**.\n", + "\n", + "### Classification categories\n", + "- **\"CPI-based\"** – Price changes tied to inflation, CPI-U, CPI-W, cost of living, or similar indices.\n", + "- **\"Cost-based\"** – Price changes tied to supplier costs, fuel, materials, energy, labor, or other input 
variations.\n", + "- **\"Penalty-based\"** – Adjustments linked to performance, service levels, or penalties (e.g., late payments, SLA breaches).\n", + "- **\"No price increase\"** – Clauses explicitly stating that prices are fixed, capped, or not subject to increase for the term.\n", + "\n", + "### Important rules\n", + "1. If a single section contains multiple mechanisms (e.g. CPI + cost), create **separate clause entries**.\n", + "2. Only mark `has_price_increases = true` if **at least one** clause allows upward price movement.\n", + "3. If the contract says prices are fixed or capped for the term, mark `\"No price increase\"` and `has_price_increases = false`.\n", + "4. If wording is ambiguous (e.g. “subject to market review”), classify as `\"Cost-based\"` with `\"confidence\": \"Low\"`.\n", + "5. Confidence levels:\n", + " - **High** – Clear, explicit adjustment wording.\n", + " - **Medium** – Indirect or conditional adjustment language.\n", + " - **Low** – Unclear, inferred, or conflicting statements.\n", + "\n", + "### Output rules\n", + "- Output **only one valid JSON object**, starting with '{' and ending with '}'.\n", + "- Do not include any markdown, code fences, or text outside JSON.\n", + "- Use **double quotes only** for strings.\n", + "- Return empty lists if no clauses are found.\n", + "\n", + "### Output schema\n", + "{{\n", + " \"contract_name\": \"{contract_name}\",\n", + " \"total_clauses_found\": ,\n", + " \"price_adjustment_clauses\": [\n", + " {{\n", + " \"clause_id\": ,\n", + " \"classification\": \"\",\n", + " \"section\": \"
\",\n", + " \"supporting_clause\": \"\",\n", + " \"confidence\": \"\",\n", + " \"explanation\": \"\"\n", + " }}\n", + " ],\n", + " \"summary\": {{\n", + " \"has_price_increases\": ,\n", + " \"adjustment_types\": [],\n", + " \"overall_assessment\": \"\"\n", + " }}\n", + "}}\n", + "\n", + "If no price-related clauses are found:\n", + "- Set \"total_clauses_found\": 0,\n", + "- \"price_adjustment_clauses\": [],\n", + "- \"summary.has_price_increases\": false,\n", + "- \"summary.overall_assessment\": \"No price adjustment or escalation clauses detected; pricing appears fixed.\"\n", + "\n", + "### CONTRACT TEXT\n", + "{contract_text}\n", + "\"\"\"\n", + "\n", + " try:\n", + " logger.info(f\"Classifying: {contract_name}\")\n", + " response = llm.invoke(prompt).strip()\n", + "\n", + " # Extract JSON if LLM wraps it with extra characters\n", + " if \"{\" in response:\n", + " start = response.find(\"{\")\n", + " end = response.rfind(\"}\") + 1\n", + " response = response[start:end]\n", + "\n", + " # Parse JSON\n", + " result = json.loads(response)\n", + "\n", + " # Validate JSON structure\n", + " if not isinstance(result.get(\"price_adjustment_clauses\"), list):\n", + " logger.error(\"Invalid price_adjustment_clauses format\")\n", + " return None\n", + "\n", + " if not isinstance(result.get(\"summary\"), dict):\n", + " logger.error(\"Invalid summary format\")\n", + " return None\n", + "\n", + " logger.info(f\"Found {len(result['price_adjustment_clauses'])} price adjustment clause(s)\")\n", + " return result\n", + "\n", + " except json.JSONDecodeError as e:\n", + " logger.error(f\"JSON decode error: {e}\")\n", + " logger.error(f\"Response snippet: {response[:300]}...\")\n", + " return None\n", + " except Exception as e:\n", + " logger.error(f\"Classification error: {e}\")\n", + " return None\n", + "\n", + "def process_contracts(data_dir):\n", + " \"\"\"Process all contracts in directory\"\"\"\n", + " \n", + " results = []\n", + " \n", + " if not os.path.exists(data_dir):\n", + " logger.error(f\"Directory not 
found: {data_dir}\")\n", + " return results\n", + " \n", + " pdf_files = sorted(f for f in os.listdir(data_dir) if f.lower().endswith(\".pdf\"))\n", + " logger.info(f\"\\nProcessing {len(pdf_files)} contracts...\\n\")\n", + " \n", + " for idx, file_name in enumerate(pdf_files, 1):\n", + " file_path = os.path.join(data_dir, file_name)\n", + " logger.info(f\"\\n[{idx}/{len(pdf_files)}] {file_name}\")\n", + " \n", + " # Extract with Granite Docling\n", + " contract_text = extract_contract_text(file_path)\n", + " \n", + " if not contract_text:\n", + " logger.warning(f\"Extraction failed for {file_name}, skipping\")\n", + " continue\n", + " \n", + " classification = classify_contract(file_name, contract_text)\n", + " \n", + " if classification:\n", + " results.append(classification)\n", + " else:\n", + " logger.warning(f\"Classification failed for {file_name}\")\n", + " \n", + " return results\n" + ] + }, + { + "cell_type": "markdown", + "id": "43041dd6", + "metadata": {}, + "source": [ + "### 5. Run the Processing Pipeline\n", + "\n", + "Specify the directory containing your contracts; `process_contracts()` only picks up files with a `.pdf` extension.\n", + "The pipeline will extract text, classify clauses, and store results as structured JSON." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2009e3b6", + "metadata": {}, + "outputs": [], + "source": [ + "data_dir = \"./Price_Detection_Sample_Contracts\"\n", + "logger.info(\"=\"*80)\n", + "logger.info(\"Starting contract classification pipeline\")\n", + "logger.info(\"=\"*80)\n", + "\n", + "# Process all contracts\n", + "results = process_contracts(data_dir)\n", + "\n", + "if results:\n", + " # Save JSON output\n", + " output_path = \"./outputs/contract_classifications.json\"\n", + " os.makedirs(\"./outputs\", exist_ok=True)\n", + " \n", + " with open(output_path, \"w\", encoding=\"utf-8\") as f:\n", + " json.dump(results, f, indent=2, ensure_ascii=False)\n", + " \n", + " logger.info(f\"\\nJSON output saved: {output_path}\")\n", + " logger.info(f\"{len(results)} contract(s) processed successfully\")\n", + "else:\n", + " logger.error(\"No contracts processed\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "736bcb12", + "metadata": {}, + "source": [ + "### 6. Generate Summary & Clause-Level Tables\n", + "\n", + "The notebook generates two key outputs:\n", + "\n", + "1. **Contract Summary Table**: One row per contract summarizing detected pricing mechanisms.\n", + "\n", + "2. **Clause-Level Table**: Detailed breakdown of each detected clause (classification, text snippet, and confidence)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6639a647", + "metadata": {}, + "outputs": [], + "source": [ + "# Flatten the summary\n", + "df_summary = pd.json_normalize(results, sep='_')\n", + "logger.info(results)\n", + "df_clauses = pd.json_normalize(\n", + " results,\n", + " record_path=['price_adjustment_clauses'],\n", + " meta=['contract_name'],\n", + " sep='_'\n", + ")\n", + "# === DISPLAY CLEAN TABLES ===\n", + "print(\"=== Contract Summary ===\")\n", + "display(\n", + " df_summary[[\n", + " \"contract_name\",\n", + " \"summary_has_price_increases\",\n", + " \"summary_adjustment_types\",\n", + " \"summary_overall_assessment\"\n", + " ]].style.set_table_styles([\n", + " {\"selector\": \"th\", \"props\": [(\"text-align\", \"left\")]},\n", + " {\"selector\": \"td\", \"props\": [(\"text-align\", \"left\")]}\n", + " ])\n", + ")\n", + "contracts_with_increases = [\n", + " r for r in results \n", + " if r.get(\"summary\", {}).get(\"has_price_increases\") is True\n", + "]\n", + "\n", + "if contracts_with_increases:\n", + " print(\"=== Clause-Level Details ===\")\n", + " display(df_clauses[[\n", + " \"contract_name\",\n", + " \"clause_id\",\n", + " \"classification\",\n", + " \"section\",\n", + " \"confidence\",\n", + " \"supporting_clause\",\n", + " \"explanation\"\n", + " ]].style.set_table_styles([\n", + " {\"selector\": \"th\", \"props\": [(\"text-align\", \"left\")]},\n", + " {\"selector\": \"td\", \"props\": [(\"text-align\", \"left\")]}\n", + " ])\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "id": "83ba4ef1", + "metadata": {}, + "source": [ + "### 7. 
Summary\n", + "This workflow demonstrates how IBM Granite models can automatically extract, interpret, and structure price adjustment logic from complex contract documents.\n", + "\n", + "By combining Granite Docling’s multimodal document understanding with Granite 4 Small’s clause reasoning, legal and procurement teams can dramatically accelerate contract review, compliance checks, and financial risk analysis." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/BRIGHTEDGE CORPORATE CATERING SERVICES.pdf b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/BRIGHTEDGE CORPORATE CATERING SERVICES.pdf new file mode 100644 index 00000000..7a7a2a57 Binary files /dev/null and b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/BRIGHTEDGE CORPORATE CATERING SERVICES.pdf differ diff --git a/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/BlueStream_Energy.pdf b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/BlueStream_Energy.pdf new file mode 100644 index 00000000..a3815ab3 Binary files /dev/null and b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/BlueStream_Energy.pdf differ diff --git a/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/IT SUPPORT AND MAINTENANCE AGREEMENT.pdf b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/IT SUPPORT AND MAINTENANCE AGREEMENT.pdf new file mode 100644 index 00000000..191636e5 Binary files /dev/null and b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/IT SUPPORT AND MAINTENANCE AGREEMENT.pdf differ diff --git 
a/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/UrbanWorks Facility Services.pdf b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/UrbanWorks Facility Services.pdf new file mode 100644 index 00000000..0c301852 Binary files /dev/null and b/recipes/Contract-Analysis/Price_Detection_Sample_Contracts/UrbanWorks Facility Services.pdf differ