feat: enhance pre-commit hook with validation and detailed setup instructions

gaurpulkit · gaurpulkit · commit 0a0ca374c9a8 · 2025-08-20T15:04:25.000+05:30
- Updated the pre-commit hook description for clarity.
- Added validation for the configuration file in the executor hook to ensure it exists and is not empty.
- Expanded README and documentation to include detailed setup instructions for the pre-commit hook, including configuration options and best practices.
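
Note on the `${DATAPILOT_TOKEN}` / `${DATAPILOT_INSTANCE}` placeholders used in the documented `args`: the hook in this commit forwards `--token` and `--instance-name` to the insight generator unchanged. Assuming pre-commit passes `args` to the hook verbatim, without shell-style expansion, the hook would receive the literal placeholder text. A minimal sketch of how the hook could resolve such placeholders itself follows; `resolve_env_placeholder` is a hypothetical helper and is not part of this change.

```python
import os
import re
from typing import Mapping, Optional

# Matches a value that consists solely of one ${NAME} placeholder.
_PLACEHOLDER = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_env_placeholder(
    value: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve a '${NAME}' argument against the environment.

    Hypothetical helper, not part of datapilot. Values that are not a
    single placeholder are returned unchanged; an unset variable
    resolves to None so the existing "no token provided" warning fires.
    """
    if value is None:
        return None
    env = os.environ if env is None else env
    match = _PLACEHOLDER.match(value)
    if match is None:
        return value
    return env.get(match.group(1))
```

In `main`, this could wrap the existing lookups, for example `token = resolve_env_placeholder(getattr(args[0], "token", None))`, leaving literal tokens passed on the command line untouched.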
diff --git a/.pre-commit-hooks.yaml b/.pre-commit-hooks.yaml
@@ -1,7 +1,13 @@
 - id: datapilot_run_dbt_checks
   name: datapilot run dbt checks
-  description: datapilot run dbt checks
+  description: Run DataPilot dbt project health checks on changed files
   entry: datapilot_run_dbt_checks
   language: python
   types_or: [yaml, sql]
   require_serial: true
+  # Optional arguments that can be passed to the hook:
+  # --config-path: Path to configuration file
+  # --token: API token for authentication
+  # --instance-name: Tenant/instance name
+  # --backend-url: Backend URL (defaults to https://api.myaltimate.com)
+  # --config-name: Name of config to use from API
diff --git a/README.md b/README.md
@@ -42,6 +42,38 @@ The [--config-path] is an optional argument. You can provide a yaml file with ov
 
 Note: The dbt docs generate requires an active database connection and may take a long time for projects with large number of models.
 
+### Pre-commit Hook Integration
+
+DataPilot provides a pre-commit hook that automatically runs health checks on changed files before each commit. This ensures code quality and catches issues early in the development process.
+
+#### Quick Setup
+
+1. Install pre-commit:
+```bash
+pip install pre-commit
+```
+
+2. Add to your `.pre-commit-config.yaml`:
+```yaml
+repos:
+  - repo: https://github.com/AltimateAI/datapilot-cli
+    rev: v0.0.23  # Always use a specific version tag
+    hooks:
+      - id: datapilot_run_dbt_checks
+        args: [
+          "--config-path", "./datapilot-config.yaml",
+          "--token", "${DATAPILOT_TOKEN}",
+          "--instance-name", "${DATAPILOT_INSTANCE}"
+        ]
+```
+
+3. Install the hook:
+```bash
+pre-commit install
+```
+
+For detailed setup instructions, see the [Pre-commit Hook Setup Guide](docs/pre-commit-setup.md).
+
 ### Checks
 
 The following checks are available:
diff --git a/docs/hooks.rst b/docs/hooks.rst
@@ -11,37 +11,153 @@ To use the DataPilot pre-commit hook, follow these steps:
 
 1. Install the `pre-commit` package if you haven't already:
 
-```
-pip install pre-commit
-```
+.. code-block:: shell
+
+    pip install pre-commit
 
 2. Add the following configuration to your .pre-commit-config.yaml file in the root of your repository:
 
-```
+.. code-block:: yaml
+
     repos:
-  - repo: https://github.com/AltimateAI/datapilot-cli
-    rev: <revision>
-    hooks:
-      - id: datapilot_run_dbt_checks
-        args: ["--config-path", "path/to/your/config/file"]
-```
+      - repo: https://github.com/AltimateAI/datapilot-cli
+        rev: v0.0.23  # Use a specific version tag, not 'main'
+        hooks:
+          - id: datapilot_run_dbt_checks
+            args: [
+              "--config-path", "./datapilot-config.yaml",
+              "--token", "${DATAPILOT_TOKEN}",
+              "--instance-name", "${DATAPILOT_INSTANCE}"
+            ]
+
+Configuration Options
+---------------------
+
+The DataPilot pre-commit hook supports several configuration options:
+
+**Required Configuration:**
+
+- ``rev``: Always use a specific version tag (e.g., ``v0.0.23``) instead of ``main`` for production stability
+
+**Optional Arguments:**
+
+- ``--config-path``: Path to your DataPilot configuration file
+- ``--token``: Your API token for authentication (can use environment variables)
+- ``--instance-name``: Your tenant/instance name (can use environment variables)
+- ``--backend-url``: Backend URL (defaults to https://api.myaltimate.com)
+- ``--config-name``: Name of config to use from API
+- ``--base-path``: Base path of the dbt project (defaults to current directory)
+
+**Environment Variables:**
+
+You can use environment variables for sensitive information:
+
+.. code-block:: yaml
+
+    repos:
+      - repo: https://github.com/AltimateAI/datapilot-cli
+        rev: v0.0.23
+        hooks:
+          - id: datapilot_run_dbt_checks
+            args: [
+              "--config-path", "./datapilot-config.yaml",
+              "--token", "${DATAPILOT_TOKEN}",
+              "--instance-name", "${DATAPILOT_INSTANCE}"
+            ]
+
+**Configuration File Example:**
+
+Create a ``datapilot-config.yaml`` file in your project root:
+
+.. code-block:: yaml
+
+    # DataPilot Configuration
+    disabled_insights:
+      - "hard_coded_references"
+      - "duplicate_sources"
 
-Replace <revision> with the desired revision of the DataPilot repository and "path/to/your/config/file" with the path to your configuration file.
+    # Custom settings for your project
+    project_settings:
+      max_fanout: 10
+      require_tests: true
 
 3. Install the pre-commit hook:
 
-```
-pre-commit install
-```
+.. code-block:: shell
+
+    pre-commit install
 
 Usage
 -----
 
-Once the hook is installed, it will run automatically before each commit. If any issues are detected, the commit will be aborted, and you will be prompted to fix the issues before retrying the commit.
+Once the hook is installed, it will run automatically before each commit. The hook will:
+
+1. **Validate Configuration**: Check that your config file exists and is valid
+2. **Authenticate**: Use your provided token and instance name to authenticate
+3. **Analyze Changes**: Only analyze files that have changed in the commit
+4. **Report Issues**: Display any issues found and prevent the commit if problems are detected
+
+**Manual Execution:**
+
+To manually run all pre-commit hooks on a repository:
+
+.. code-block:: shell
+
+    pre-commit run --all-files
+
+To run individual hooks:
+
+.. code-block:: shell
 
+    pre-commit run datapilot_run_dbt_checks
 
-If you want to manually run all pre-commit hooks on a repository, run `pre-commit run --all-files`. To run individual hooks use `pre-commit run <hook_id>`.
+**Troubleshooting:**
 
+- **Authentication Issues**: Ensure your token and instance name are correctly set
+- **Empty Config Files**: The hook will fail if your config file is empty or invalid
+- **No Changes**: If no relevant files have changed, the hook will skip execution
+- **Network Issues**: Ensure you have access to the DataPilot API
+
+Best Practices
+--------------
+
+1. **Use Version Tags**: Always specify a version tag in the ``rev`` field, never use ``main``
+2. **Environment Variables**: Use environment variables for sensitive information like tokens
+3. **Configuration Files**: Create a dedicated config file for your project settings
+4. **Regular Updates**: Update to new versions when they become available
+5. **Team Coordination**: Ensure all team members use the same configuration
+
+Example Complete Setup
+----------------------
+
+Here's a complete example of a ``.pre-commit-config.yaml`` file:
+
+.. code-block:: yaml
+
+    # .pre-commit-config.yaml
+    exclude: '^(\.tox|ci/templates|\.bumpversion\.cfg)(/|$)'
+
+    repos:
+      - repo: https://github.com/astral-sh/ruff-pre-commit
+        rev: v0.1.14
+        hooks:
+          - id: ruff
+            args: [--fix, --exit-non-zero-on-fix, --show-fixes]
+
+      - repo: https://github.com/psf/black
+        rev: 23.12.1
+        hooks:
+          - id: black
+
+      - repo: https://github.com/AltimateAI/datapilot-cli
+        rev: v0.0.23
+        hooks:
+          - id: datapilot_run_dbt_checks
+            args: [
+              "--config-path", "./datapilot-config.yaml",
+              "--token", "${DATAPILOT_TOKEN}",
+              "--instance-name", "${DATAPILOT_INSTANCE}"
+            ]
 
 Feedback and Contributions
 --------------------------
diff --git a/src/datapilot/core/platforms/dbt/hooks/executor_hook.py b/src/datapilot/core/platforms/dbt/hooks/executor_hook.py
@@ -1,5 +1,7 @@
 import argparse
+import sys
 import time
+from pathlib import Path
 from typing import Optional
 from typing import Sequence
 
@@ -13,6 +15,23 @@
 from datapilot.utils.utils import generate_partial_manifest_catalog
 
 
+def validate_config_file(config_path: str) -> bool:
+    """Validate that the config file exists and is not empty."""
+    if not Path(config_path).exists():
+        print(f"Error: Config file '{config_path}' does not exist.", file=sys.stderr)
+        return False
+
+    try:
+        config = load_config(config_path)
+        if not config:
+            print(f"Error: Config file '{config_path}' is empty or invalid.", file=sys.stderr)
+            return False
+        return True
+    except Exception as e:
+        print(f"Error: Failed to load config file '{config_path}': {e}", file=sys.stderr)
+        return False
+
+
 def main(argv: Optional[Sequence[str]] = None):
     start_time = time.time()
     parser = argparse.ArgumentParser()
@@ -28,58 +47,103 @@ def main(argv: Optional[Sequence[str]] = None):
         help="Base path of the dbt project",
     )
 
+    parser.add_argument(
+        "--token",
+        help="Your API token for authentication.",
+    )
+
+    parser.add_argument(
+        "--instance-name",
+        help="Your tenant/instance name.",
+    )
+
+    parser.add_argument("--backend-url", help="Altimate's Backend URL", default="https://api.myaltimate.com")
+
+    parser.add_argument(
+        "--config-name",
+        help="Name of the dbt config to use from the API",
+    )
+
     args = parser.parse_known_args(argv)
-    # print(f"args: {args}", file=sys.__stdout__)
+
+    # Validate config file if provided
     config = {}
     if hasattr(args[0], "config_path") and args[0].config_path:
-        # print(f"Using config file: {args[0].config_path[0]}")
-        config = load_config(args[0].config_path[0])
+        config_path = args[0].config_path[0]
+        if not validate_config_file(config_path):
+            print("Pre-commit hook failed: Invalid config file.", file=sys.stderr)
+            sys.exit(1)
+        config = load_config(config_path)
 
     base_path = "./"
     if hasattr(args[0], "base_path") and args[0].base_path:
         base_path = args[0].base_path[0]
 
+    # Get authentication parameters
+    token = getattr(args[0], "token", None)
+    instance_name = getattr(args[0], "instance_name", None)
+    backend_url = getattr(args[0], "backend_url", "https://api.myaltimate.com")
+
+    # Validate authentication parameters
+    if not token:
+        print("Warning: No API token provided. Using default configuration.", file=sys.stderr)
+        print("To specify a token, use: --token 'your-token'", file=sys.stderr)
+
+    if not instance_name:
+        print("Warning: No instance name provided. Using default configuration.", file=sys.stderr)
+        print("To specify an instance, use: --instance-name 'your-instance'", file=sys.stderr)
+
     changed_files = args[1]
-    # print(f"Changed files: {changed_files}")
 
     if not changed_files:
-        # print("No changed files detected - test. Exiting...")
+        print("No changed files detected. Skipping datapilot checks.", file=sys.stderr)
         return
 
-    # print(f"Changed files: {changed_files}", file=sys.__stdout__)
-    selected_models, manifest, catalog = generate_partial_manifest_catalog(changed_files, base_path=base_path)
-    # print("se1ected models", selected_models, file=sys.__stdout__)
-    insight_generator = DBTInsightGenerator(
-        manifest=manifest,
-        catalog=catalog,
-        config=config,
-        selected_model_ids=selected_models,
-    )
-    reports = insight_generator.run()
-    if reports:
-        model_report = generate_model_insights_table(reports[MODEL])
-        if len(model_report) > 0:
-            print("--" * 50)
-            print("Model Insights")
-            print("--" * 50)
-        for model_id, report in model_report.items():
-            print(f"Model: {model_id}")
-            print(f"File path: {report['path']}")
-            print(tabulate_data(report["table"], headers="keys"))
-            print("\n")
-
-        project_report = generate_project_insights_table(reports[PROJECT])
-        if len(project_report) > 0:
-            print("--" * 50)
-            print("Project Insights")
-            print("--" * 50)
-            print(tabulate_data(project_report, headers="keys"))
-
-        exit(1)
+    try:
+        selected_models, manifest, catalog = generate_partial_manifest_catalog(changed_files, base_path=base_path)
+
+        insight_generator = DBTInsightGenerator(
+            manifest=manifest,
+            catalog=catalog,
+            config=config,
+            selected_model_ids=selected_models,
+            token=token,
+            instance_name=instance_name,
+            backend_url=backend_url,
+        )
+
+        reports = insight_generator.run()
+
+        if reports:
+            model_report = generate_model_insights_table(reports[MODEL])
+            if len(model_report) > 0:
+                print("--" * 50, file=sys.stderr)
+                print("Model Insights", file=sys.stderr)
+                print("--" * 50, file=sys.stderr)
+            for model_id, report in model_report.items():
+                print(f"Model: {model_id}", file=sys.stderr)
+                print(f"File path: {report['path']}", file=sys.stderr)
+                print(tabulate_data(report["table"], headers="keys"), file=sys.stderr)
+                print("\n", file=sys.stderr)
+
+            project_report = generate_project_insights_table(reports[PROJECT])
+            if len(project_report) > 0:
+                print("--" * 50, file=sys.stderr)
+                print("Project Insights", file=sys.stderr)
+                print("--" * 50, file=sys.stderr)
+                print(tabulate_data(project_report, headers="keys"), file=sys.stderr)
+
+            print("\nPre-commit hook failed: DataPilot found issues that need to be addressed.", file=sys.stderr)
+            sys.exit(1)
+
+    except Exception as e:
+        print(f"Error running DataPilot checks: {e}", file=sys.stderr)
+        print("Pre-commit hook failed due to an error.", file=sys.stderr)
+        sys.exit(1)
 
     end_time = time.time()
     total_time = end_time - start_time
-    print(f"Total time taken: {round(total_time, 2)} seconds")
+    print(f"DataPilot checks completed successfully in {round(total_time, 2)} seconds", file=sys.stderr)
 
 
 if __name__ == "__main__":