A tool for sanitising HAR (HTTP Archive) files to safely prepare browser logs for sharing with third parties.
This tool sanitises HAR files by removing or redacting sensitive and unnecessary data whilst preserving the essential application-specific information needed for analysis. It's designed with security in mind, using a whitelist-based approach to minimise accidental data leaks.
- Application-Specific Log Filtering: Retains only necessary information for debugging and analysis
- Robust Sensitive Data Detection: Uses advanced regex patterns with validation to identify and redact sensitive information
- Consistent Anonymisation: Replaces sensitive values with hashed versions to maintain referential integrity
- Error Resilience: Continues processing even if individual entries have issues
Sanitises the following types of sensitive information:
-
Server Information:
- Internal/external server names and hostnames
-
Authentication & Authorisation Data:
- OAuth tokens, JWT tokens (with proper JWT structure validation)
- Authorisation headers
- API keys and secrets
- Session IDs and access tokens
-
Personal Identifiable Information (PII):
- Email addresses (RFC 5322 compliant detection)
- Phone numbers (E.164 and common regional formats)
- IP addresses (with proper range validation)
- GUIDs/UUIDs (RFC 4122 compliant)
- Credit card numbers (with Luhn algorithm validation)
-
Headers & Parameters:
- Custom proprietary headers
- Cookies and session data
- Sensitive query parameters
- Tracking parameters (utm_*, fbclid, etc.)
-
Content:
- Request payloads
- Response bodies
- Form data
- Base64-encoded content (with validation for actual encoded text)
- Whitelist-based approach for maximum security
- Consistent hashing to maintain data relationships
- Preservation of error message context whilst removing sensitive data
- Timestamps are preserved to maintain activity timing information
- Sanitisation metadata added to track processing
- Proper validation to reduce false positives
har-file-sanitiser/
├── src/ # Source code
│ └── har_sanitiser/ # Main package
│ ├── __init__.py # Package exports
│ ├── core.py # Core sanitisation functionality
│ ├── utils.py # Utility functions
│ └── cli.py # Command-line interface
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── fixtures/ # Test fixtures
├── examples/ # Example usage
│ ├── example_usage.py # Example script
│ ├── sample.har # Sample HAR file
│ └── custom_config.json # Example configuration
├── docs/ # Documentation
├── setup.py # Package setup script
├── pyproject.toml # Project metadata
└── requirements.txt # Dependencies
- Python 3.6 or higher (recommended due to type hints)
- Git (for cloning the repository)
You can check your Python version with:
python --version
This package can be installed either directly from the repository or by installing the requirements.
git clone https://github.com/grounzero/har-file-sanitiser.git
cd har-file-sanitiser
Using a virtual environment ensures that the dependencies for this project don't interfere with other Python projects on your system.
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
When you're done working with the package, you can deactivate the virtual environment:
deactivate
Choose one of the following installation methods:
- Option 1: Install as an editable package (recommended for development)
pip install -e .
- Option 2: Install requirements only
pip install -r requirements.txt
After installation, it's recommended to run the tests to ensure everything is working correctly:
# Install pytest if not already installed
pip install pytest
# Run the tests
python -m pytest
Basic usage:
# Activate the virtual environment (if not already activated)
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Run the sanitiser
har-sanitiser input.har sanitised.har
With additional options:
# Enable verbose logging
har-sanitiser --verbose input.har sanitised.har
# Use a custom configuration file
har-sanitiser --config my_config.json input.har sanitised.har
# Disable parallel processing (parallel processing is enabled by default)
har-sanitiser --no-parallel input.har sanitised.har
# Combine options
har-sanitiser --verbose --config my_config.json --no-parallel input.har sanitised.har
# Use default output filename (creates input_sanitised.har)
har-sanitiser --verbose --config my_config.json input.har
The sanitiser automatically detects large HAR files and uses streaming processing to handle them efficiently:
- Reduces memory usage by processing entries incrementally
- Provides progress reporting with a progress bar
- Automatically selects the best processing method based on file sise
- Uses parallel processing by default for faster sanitisation
# Disable parallel processing for large files
har-sanitiser --no-parallel input.har output.har
# Specify the number of processes to use for parallel processing
har-sanitiser --processes 4 input.har output.har
Example configuration file (JSON format):
{
"log_level": "debug",
"sensitive_headers": ["x-custom-header", "x-internal-id"],
"sensitive_params": ["account", "user_id"],
"preserve_timestamps": true,
"remove_tracking_params": true
}
from har_sanitiser import HARSanitiser, read_har_file, write_har_file
# Initialise the sanitiser
sanitiser = HARSanitiser()
# Load a HAR file using the utility function
har_data = read_har_file('input.har')
# Sanitise the HAR data
sanitised_har = sanitiser.sanitise(har_data)
# Save the sanitised HAR file using the utility function
write_har_file('sanitised.har', sanitised_har)
# For large files, use the streaming API (parallel processing is enabled by default)
duration = sanitiser.sanitise_har_streaming('input.har', 'sanitised.har')
print(f"Sanitisation completed in {duration:.2f} seconds")
print(f"Sensitive data found: {sum(sanitiser.metrics['sensitive_data_found'].values())} instances")
# To disable parallel processing
duration = sanitiser.sanitise_har_streaming('input.har', 'sanitised.har', use_parallel=False)
# Configure with custom options
config = {
"log_level": "info",
"preserve_timestamps": True,
"remove_tracking_params": True
}
sanitiser = HARSanitiser(config=config)
The configuration file supports the following options:
Option | Description | Default |
---|---|---|
log_level |
Logging level (debug, info, warning, error) | info |
sensitive_headers |
List of headers to redact | ['authorization', 'cookie', 'set-cookie', 'x-api-key', 'x-client-id', 'x-session-id', 'user-agent', 'referer', 'x-forwarded-for', 'etag', 'if-modified-since', 'last-modified'] |
sensitive_params |
List of query parameters to redact | ['token', 'key', 'password', 'secret', 'auth'] |
tracking_params |
List of tracking parameters to remove | ['utm_', 'fbclid', 'gclid', '_ga'] |
preserve_timestamps |
Whether to preserve original timestamps | true |
remove_tracking_params |
Whether to remove tracking parameters | true |
hash_sensitive_values |
Whether to hash sensitive values | true |
redact_base64_content |
Whether to redact base64 content | true |
parallel |
Whether to use parallel processing | true |
processes |
Number of processes to use for parallel processing | number of CPU cores |
excluded_domains |
List of domains to exclude from sanitisation | [] |
content_types_to_sanitise |
List of content types to sanitise | ['application/json', 'application/x-www-form-urlencoded', 'text/plain', 'text/html'] |
sanitisation_options |
Options for specific data types | {'ip_addresses': true, 'email_addresses': true, 'credit_cards': true, 'phone_numbers': true, 'guid_uuid': true, 'jwt_tokens': true} |
The project includes a comprehensive test suite to verify sanitisation functionality:
# Install pytest if not already installed
pip install pytest
# Run the tests
python -m pytest
The test suite covers:
- Unit tests for individual components
- Validation for data pattern recognition
- Credit card Luhn algorithm checks
- Integration tests for full sanitisation flow
Data Type | Treatment |
---|---|
Authentication Tokens | Redacted with hash |
Cookies | Completely redacted |
API Keys | Redacted with hash |
PII (email, phone) | Replaced with pattern markers |
Credit Cards | Validated with Luhn check before redaction, supports various formats (spaces, dashes) |
Base64 Content | Removed after validation |
Query Parameters | Sensitive ones redacted, tracking ones removed |
User-Agent | Redacted |
Referer | Redacted |
JSON Content | Recursively sanitised |
Problematic Entries | Logged and skipped |
Before (simplified excerpt):
{
"request": {
"headers": [
{ "name": "Authorization", "value": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." },
{ "name": "User-Agent", "value": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..." }
],
"url": "https://api.example.com/users?token=12345&[email protected]"
}
}
After sanitisation:
{
"request": {
"headers": [
{ "name": "Authorization", "value": "[REDACTED-a1b2c3d4e5f6]" },
{ "name": "User-Agent", "value": "[REDACTED-7890abcdef12]" }
],
"url": "https://api.example.com/users?token=[REDACTED-abcdef123456]&email=[REDACTED-email]"
}
}
- Enhanced Regex Patterns: Improved regex patterns for better detection and fewer false positives
- Comprehensive Credit Card Detection: Support for various credit card formats including those with spaces, dashes, and other separators
- Luhn Algorithm: Credit card numbers are validated with the Luhn check algorithm
- Error Handling: Graceful handling of problematic entries
- Directory Creation: Automatically creates output directories if they don't exist
- Configurable Logging: Adjustable log levels via configuration or command line flags
- Modern Timestamp Handling: Uses timezone-aware datetime objects for accurate metadata
- This tool is designed for preparing HAR files for sharing with third parties.
- Always review sanitised files before sharing to ensure all sensitive data has been removed.
- Consider implementing additional domain-specific sanitisation as needed.
-
Missing dependencies: If you encounter import errors, ensure you've installed the package correctly using one of the installation methods.
-
Invalid HAR file: If you get JSON parsing errors, verify that your HAR file is valid. You can use online JSON validators to check.
-
Configuration errors: If the sanitiser isn't behaving as expected, check your configuration file for syntax errors or invalid options.
Enable verbose logging to get more information about the sanitisation process:
python har_sanitiser.py --verbose input.har sanitised.har
MIT
Contributions are welcome! Please feel free to submit a Pull Request.