gordonstevens/regex-html-crawler

Regex HTML Crawler

regex-html-crawler.py is a versatile Python script designed to crawl local directories or web URLs, search for a specified regular expression within target files, and output the results to an Excel spreadsheet. This tool is useful for web developers, security researchers, or anyone needing to audit content across numerous web files.

Features

  • Dual Crawling Modes: Scan either a local directory (and its subdirectories) or recursively crawl a website starting from a given URL.
  • Regex Search: Find specific text patterns using regular expressions.
  • Configurable Target Extensions: Specify which file extensions (e.g., .html, .php, .shtm) the crawler should process. A comprehensive list of default extensions is provided.
  • Excel Output: Generates an Excel .xls file with three organized worksheets:
    • "Found": Lists files/URLs where the regex was found.
    • "Not Found": Lists target files/URLs where the regex was not found.
    • "Not Processed": Lists files/URLs that were not processed (e.g., non-target file types, network errors during URL crawls).
  • Command-Line Interface: Supports arguments for non-interactive use.
  • Interactive Prompts: If arguments are omitted, the script will prompt the user for necessary information.
  • Error Handling: Includes checks for invalid paths, regex errors, and network issues.
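The local crawling mode described above is essentially a recursive directory walk plus a regex test per file, with results sorted into the three buckets the script reports. A minimal sketch of that idea (function and variable names here are illustrative, not taken from the script):

```python
import os
import re

def crawl_directory(root, pattern, extensions):
    """Walk a directory tree and sort files into found / not-found / skipped,
    mirroring the script's three result buckets (illustrative sketch)."""
    regex = re.compile(pattern)
    found, not_found, not_processed = [], [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Non-target extensions go straight to the "Not Processed" bucket.
            if not name.lower().endswith(tuple(extensions)):
                not_processed.append(path)
                continue
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    bucket = found if regex.search(fh.read()) else not_found
                bucket.append(path)
            except OSError:
                # Unreadable files also land in "Not Processed".
                not_processed.append(path)
    return found, not_found, not_processed
```

The URL mode works analogously, except that links are followed instead of subdirectories and network errors replace file-read errors.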

Installation

1. Install Python

If you don't have Python installed, follow these steps:

  • Windows:

    1. Go to the official Python website: https://www.python.org/downloads/windows/

    2. Download the latest stable version installer (e.g., "Windows installer (64-bit)").

    3. Run the installer. Crucially, make sure to check the box that says "Add Python X.X to PATH" during installation. This will allow you to run Python from your command prompt.

    4. Click "Install Now" and follow the prompts.

    5. Verify the installation by opening Command Prompt (CMD) or PowerShell and typing:
      python --version

      You should see the Python version printed.

  • macOS:

    1. Python is often pre-installed on macOS, but it might be an older version. It's recommended to install a newer version using Homebrew. If you don't have Homebrew, install it first:
      /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

    2. Once Homebrew is installed, install Python:
      brew install python

    3. Verify the installation by opening Terminal and typing:
      python3 --version

  • Linux (Ubuntu/Debian-based):

    1. Python 3 is usually pre-installed. You can check:
      python3 --version

    2. If you need to install it:
      sudo apt update
      sudo apt install python3 python3-pip

2. Install Required Python Libraries

This script uses several external libraries. You can install them using pip, Python's package installer.

Open your terminal or command prompt and run:

pip install requests beautifulsoup4 xlwt

Usage

1. Save the Script

Save the provided Python code as regex-html-crawler.py in a directory of your choice.

2. Run the Program

You can run the program in interactive mode or by providing command-line arguments.

Interactive Mode

If you run the script without any arguments, it will prompt you for the necessary information (directory/URL, regex, and extensions).

python regex-html-crawler.py
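Interactive mode presumably works by prompting only for values that were not supplied as arguments. A sketch of that common pattern (the helper name is hypothetical, not the script's actual code):

```python
def prompt_if_missing(value, prompt, default=None):
    """Return the supplied value, or ask the user, falling back to a default."""
    if value:
        return value
    answer = input(prompt).strip()
    return answer or default

# e.g. pattern = prompt_if_missing(args.regex, "Enter a regex pattern: ")
```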

Using Command-Line Arguments

You can specify the crawling parameters directly when running the script.

  • Crawl a Local Directory:
    python regex-html-crawler.py -d /path/to/your/html/folder -r "your_regex_pattern" -e "html,php"

    (Replace /path/to/your/html/folder with the actual path to your directory.)

  • Crawl a URL:
    python regex-html-crawler.py -u https://example.com -r "copyright" -e "html,shtm"

  • Full Example (Directory):
    python regex-html-crawler.py --directory "C:\webpages" --regex "\bJavaScript\b" --extensions "html,shtm,inc"

  • Full Example (URL):
    python regex-html-crawler.py --url "http://localhost:8000/my_site" --regex "API_KEY"
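Based on the flags shown above, the script's argument parser likely looks something like the following `argparse` setup. This is a hedged reconstruction of the interface, not the script's exact code:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Search files in a local directory or URL tree for a regex.")
    # -d and -u are alternatives: crawl either a directory or a website.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("-d", "--directory", help="local directory to crawl")
    source.add_argument("-u", "--url", help="start URL to crawl recursively")
    parser.add_argument("-r", "--regex", help="regular expression to search for")
    parser.add_argument("-e", "--extensions",
                        help="comma-separated target extensions, e.g. 'html,php'")
    return parser

# Example: parse the directory invocation shown above.
args = build_parser().parse_args(
    ["-d", "/path/to/your/html/folder", "-r", "API_KEY", "-e", "html,php"])
```

Omitting `-r`, `-d`, and `-u` leaves those attributes as `None`, which is where the interactive prompts take over.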

Help Message

To view the usage instructions and available arguments, use the help flag:

python regex-html-crawler.py -h
# or
python regex-html-crawler.py --help
# or (on Windows)
python regex-html-crawler.py /?

Output

The script will generate an Excel file named website-scan-<date>-<time>.xls (e.g., website-scan-2023-10-27-14-30-05.xls) in the same directory where you run the script. This file will contain three worksheets:

  • Found: Lists the "Location Type" (Directory or URL) and "Path/URL" for all target files where the specified regex pattern was found.
  • Not Found: Lists the "Location Type" and "Path/URL" for all target files that were scanned but did not contain the regex pattern.
  • Not Processed: Lists the "Location Type" and "Path/URL" for any files/URLs that could not be processed (e.g., non-target file types, files not found, network errors).
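The timestamped filename described above can be produced with a single `strftime` call. A sketch of the naming scheme (the function name is illustrative):

```python
from datetime import datetime

def output_filename(now=None):
    """Build a website-scan-<date>-<time>.xls name,
    e.g. website-scan-2023-10-27-14-30-05.xls."""
    now = now or datetime.now()
    return now.strftime("website-scan-%Y-%m-%d-%H-%M-%S.xls")
```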

Default Extensions

The default extensions targeted by the crawler are: .html, .htm, .shtm, .shtml, .php, .inc, .razor, .twig, .latte, .mustache, .tpl, .tpml, .dhtm, .dhtml, .phtm, .phtml, .jhtm, .jhtml, .mhtm, .mhtml, .rhtm, .rhtml, .zhtm, .zhtml.
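Checking a candidate file against this list amounts to a case-insensitive suffix test. A sketch using the default list above (the names `DEFAULT_EXTENSIONS` and `is_target` are illustrative):

```python
DEFAULT_EXTENSIONS = (
    ".html", ".htm", ".shtm", ".shtml", ".php", ".inc", ".razor", ".twig",
    ".latte", ".mustache", ".tpl", ".tpml", ".dhtm", ".dhtml", ".phtm",
    ".phtml", ".jhtm", ".jhtml", ".mhtm", ".mhtml", ".rhtm", ".rhtml",
    ".zhtm", ".zhtml",
)

def is_target(path, extensions=DEFAULT_EXTENSIONS):
    """True when the path's extension (case-insensitive) is a target type."""
    return path.lower().endswith(extensions)
```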

About

Use a Regular Expression (RegEx) to search HTML files in a local directory or a URL.
