This Python script is a web spider that crawls a website to extract URLs and subdomains from a given starting URL. It uses the requests library to fetch web pages, BeautifulSoup for HTML parsing, and regular expressions to extract relevant links.
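At a high level, the crawl step can be sketched as follows. This is a minimal illustration of the technique described above, not the actual index.py; the function name crawl_page and all variable names are assumptions.

```python
import re
import sys
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_page(start_url: str) -> tuple[set[str], set[str]]:
    """Collect URLs and subdomains reachable from start_url (illustrative sketch)."""
    base_domain = urlparse(start_url).netloc
    urls, subdomains = set(), set()

    response = requests.get(start_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # BeautifulSoup finds links in anchor tags; urljoin resolves relative hrefs.
    for anchor in soup.find_all("a", href=True):
        urls.add(urljoin(start_url, anchor["href"]))

    # A regular expression catches subdomain references anywhere in the raw HTML.
    pattern = re.compile(r"https?://([A-Za-z0-9.-]+\." + re.escape(base_domain) + r")")
    subdomains.update(pattern.findall(response.text))

    return urls, subdomains


if __name__ == "__main__":
    found_urls, found_subdomains = crawl_page(sys.argv[1])
    print(found_urls, found_subdomains)
```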
- Make sure you have Python installed.
- Clone this repository:
  git clone https://github.com/thib-web3/website_urls_finder.git
- Go to the root folder:
  cd website_urls_finder
- Install the required packages using the following command:
  pip install -r requirements.txt
Run the index.py script with the URL you want to start crawling from as a command-line argument. For example:
python index.py https://example.com
The script will initiate the crawling process and extract URLs and subdomains related to the provided starting URL. It will then filter the results and save them as a JSON file: /example.json.
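The /example.json mentioned above suggests the output file is named after the crawled domain. As a hedged sketch (the naming scheme and the function name save_results are assumptions, not taken from index.py), the save step might look like:

```python
import json
from urllib.parse import urlparse


def save_results(start_url: str, urls: set[str]) -> str:
    """Write the filtered URLs to <domain>.json, e.g. example.json for example.com."""
    # Hypothetical naming scheme inferred from the /example.json mentioned above.
    name = urlparse(start_url).netloc.split(".")[0]
    path = f"{name}.json"
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sorted(urls), fh, indent=2)
    return path
```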
- Extracts URLs and subdomains from a starting URL.
- Filters out irrelevant URLs and subdomains, such as JavaScript links and anchor tags (see the sketch after this list).
- Removes duplicate URLs.
- Saves the extracted and filtered URLs locally.
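A minimal sketch of the filtering and de-duplication steps listed above; the helper name filter_urls is illustrative and the actual index.py logic may differ:

```python
def filter_urls(urls: list[str]) -> list[str]:
    """Drop JavaScript pseudo-links and bare anchors, then de-duplicate."""
    kept = []
    for url in urls:
        # Skip javascript: links and anchor-only fragments such as "#top".
        if url.startswith("javascript:") or url.startswith("#"):
            continue
        kept.append(url)
    # dict.fromkeys removes duplicates while preserving discovery order.
    return list(dict.fromkeys(kept))
```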
- The script utilizes the requests library for making HTTP requests.
- HTML parsing is performed using BeautifulSoup.
- Regular expressions are used for URL extraction and manipulation.
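Given the libraries above, requirements.txt presumably lists at least the following packages (version pins are omitted here as an assumption):

```
requests
beautifulsoup4
```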
This project is licensed under the MIT License.