Website URLs Finder πŸ•·οΈ

This Python script is a spider that crawls a given website, extracting the URLs and subdomains reachable from a starting URL. It uses the requests library to fetch web pages, BeautifulSoup to parse the HTML, and regular expressions to extract and clean the relevant URLs.
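The core extract-and-filter step described above can be sketched as follows. This is an illustrative, dependency-free version: it uses Python's built-in html.parser instead of BeautifulSoup so it runs without installing anything, and the function names (LinkExtractor, same_domain_urls) are this sketch's own, not the script's.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def same_domain_urls(html, base_url):
    """Resolve every href against base_url and keep only those whose
    domain matches (or is a subdomain of) the starting URL's domain."""
    parser = LinkExtractor()
    parser.feed(html)
    base_netloc = urlparse(base_url).netloc
    urls = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # handles relative links like /about
        if urlparse(absolute).netloc.endswith(base_netloc):
            urls.append(absolute)
    return urls
```

The real script fetches each page with requests and feeds the response body into a step like this, repeating for every newly discovered same-domain URL.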

Getting Started πŸš€

  1. Make sure you have Python installed.

  2. Clone this repository: git clone https://github.com/thib-web3/website_urls_finder.git

  3. Go into the root folder: cd website_urls_finder

  4. Install the required packages using the following command: pip install -r requirements.txt

Usage πŸ“‹

Run the index.py script with the URL you want to start crawling from as a command-line argument. For example:

python index.py https://example.com

The script will initiate the crawling process and extract the URLs and subdomains related to the provided starting URL. It will then filter the results and save them as a JSON file, e.g. /example.json for example.com.
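The save step might look like the sketch below. How the script actually derives the filename is inferred from the /example.json example above; here it is assumed to be the first label of the domain, and save_results is an illustrative name, not one from the repository.

```python
import json
from urllib.parse import urlparse

def save_results(start_url, urls, subdomains, path=None):
    """Write crawl results to <label>.json, e.g. example.com -> example.json.
    (Filename rule is an assumption based on the README's example.)"""
    if path is None:
        path = urlparse(start_url).netloc.split(".")[0] + ".json"
    data = {
        "start_url": start_url,
        "urls": sorted(set(urls)),            # duplicates removed
        "subdomains": sorted(set(subdomains)),
    }
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    return path
```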

Features πŸ› οΈ

  • Extracts URLs and subdomains from a starting URL.
  • Filters out irrelevant URLs and subdomains such as JavaScript links and anchor tags.
  • Removes duplicate URLs.
  • Saves the extracted and filtered URLs locally.
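The filtering described in the list above can be sketched like this. It is a minimal interpretation of the README's feature list (dropping javascript: pseudo-links and anchor fragments, then deduplicating); filter_urls is an illustrative name, not necessarily the function used in index.py.

```python
def filter_urls(urls):
    """Drop javascript: pseudo-links and pure fragment anchors,
    strip fragments from the remaining URLs, and remove duplicates
    while preserving the original order."""
    seen = []
    for url in urls:
        if url.startswith("javascript:") or url.startswith("#"):
            continue
        url = url.split("#", 1)[0]  # discard the #fragment part
        if url and url not in seen:
            seen.append(url)
    return seen
```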

Credits πŸ™Œ

  • HTTP requests are made with the requests library.
  • HTML parsing is performed with BeautifulSoup.
  • Regular expressions are used for URL extraction and manipulation.

License πŸ“„

This project is licensed under the MIT License.

Author πŸ‘€

  • titi
