🛠️ Implementation Notes
The following technologies and tools were utilized throughout the development of the project:
- Operating System: Initially developed on Windows. Later stages involved running the project on Ubuntu 24.04.1 LTS via WSL, in accordance with the project requirements. The project is compatible with both operating systems.
- Programming Language: Python 3.12.3
- IDE: Development was primarily conducted in PyCharm Community Edition. Later, PyCharm Professional was used to enable seamless integration with WSL.
- Version Control: GitHub was used for version control. You are in the right place!
- Dependency Management: All project dependencies are listed in the requirements.txt file, located in the main code directory, following standard Python practices.
- Packaging: The entire project was bundled into a single executable file using PyInstaller.
- The codebase follows the PEP 8 Python coding convention.
- Code style enforcement was supported by the IDE.
- Docstrings were added to every class and method. Minimal or no inline documentation is included; instead, emphasis was placed on writing clean and readable code.
The application involves multi-threaded execution. To ensure thread safety, the following shared data structures were managed as described:
- urls_to_visit (Queue): Accessed and modified by all threads. The queue.Queue class from Python's standard library is used, which provides built-in thread safety.
- visited_urls (Set): Used by all crawling threads. Access to this set is synchronized using a lock to ensure that only one thread can read or modify it at a time.
- broken_urls (List): Used concurrently by the crawling threads (writers) and the main thread (reader). A lock is used to protect write operations. Once the crawling phase is complete, the list becomes read-only and is accessed exclusively by the main thread to generate reports.
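The scheme described above can be illustrated with a minimal sketch. It is not taken from the BLC codebase: the helper names `is_broken`, `mark_visited`, `worker`, and `crawl`, the two lock variables, and the `None` shutdown sentinel are all assumptions made for illustration, and the "broken link" check is a stand-in that performs no network access.

```python
import queue
import threading

# Shared state, mirroring the three structures described above.
urls_to_visit = queue.Queue()      # queue.Queue does its own locking
visited_urls = set()               # guarded by visited_lock
visited_lock = threading.Lock()
broken_urls = []                   # writes guarded by broken_lock
broken_lock = threading.Lock()


def is_broken(url):
    """Hypothetical stand-in for a real HTTP check (no network access)."""
    return url.endswith("/broken")


def mark_visited(url):
    """Check and add under one lock so two threads cannot both claim a URL."""
    with visited_lock:
        if url in visited_urls:
            return False
        visited_urls.add(url)
        return True


def worker():
    """Consume URLs until the None sentinel (an assumed shutdown signal)."""
    while True:
        url = urls_to_visit.get()
        try:
            if url is None:
                return
            if mark_visited(url) and is_broken(url):
                with broken_lock:
                    broken_urls.append(url)
        finally:
            urls_to_visit.task_done()


def crawl(seed_urls, thread_count=4):
    """Run the crawling phase, then read broken_urls from the main thread."""
    for url in seed_urls:
        urls_to_visit.put(url)
    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    urls_to_visit.join()           # wait until every queued URL is processed
    for _ in threads:
        urls_to_visit.put(None)    # one sentinel per worker
    for thread in threads:
        thread.join()
    # All writers have finished, so the list is now effectively read-only.
    return sorted(broken_urls)
```

Performing the membership test and the insertion inside a single `with visited_lock:` block is what prevents two threads from both treating the same URL as new. The main thread reads `broken_urls` only after every worker has been joined, which matches the read-only reporting phase described above.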
- 🚀 Usage Instructions: Learn how to configure, run, and customize BLC.
- 📐 High-Level Design: Explore the project's origin and architecture.
- 🛠️ Implementation Notes: Tools, technologies, and key implementation decisions.
- 🚫 Sites That Restrict Automated Crawling: See what was done and achieved.
- 🚀 Thread Count Optimization: Discussion on thread number optimization.
- 📄 Sample Outputs: See it in action!