This project crawls the ETSU Computing department website, extracts content, and organizes it into a structured format that can be easily imported into LLMs for summarization or fact extraction. It's designed to help you prepare talking points for campus tours and keep track of the latest achievements, courses, student organizations, and other important information.
- Smart Content Extraction: Properly extracts content from complex web structures, including accordion menus (`<details>` and `<summary>` elements), tables, and lists (see the extraction sketch after this list)
- Path Restriction: Stays within the Computing department by only crawling URLs with the `/cbat/computing` prefix
- Depth Control: Limits crawling to a specified depth from the starting page
- Content Organization: Automatically categorizes extracted content by topic
- Key Facts Extraction: Identifies potential talking points from the extracted content
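The following is a minimal sketch of how accordion content can be pulled out with BeautifulSoup. It illustrates the general technique only; the selectors, helper names, and output formatting in `web_crawler.py` may differ.

```python
# Sketch: extracting accordion (<details>/<summary>) sections with BeautifulSoup.
# Illustrative only -- the real extraction logic in web_crawler.py may differ.
import requests
from bs4 import BeautifulSoup

def extract_accordions(url):
    """Return (heading, body) pairs for each <details>/<summary> block on a page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    sections = []
    for details in soup.find_all("details"):
        summary = details.find("summary")
        heading = summary.get_text(strip=True) if summary else "Untitled section"
        if summary:
            summary.extract()  # remove the heading so only the body text remains
        body = details.get_text(" ", strip=True)
        sections.append((heading, body))
    return sections
```

Removing the `<summary>` node before calling `get_text` keeps the heading and body cleanly separated; the same walk-the-tree approach extends naturally to tables and lists.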
The project consists of three main Python scripts:
- `web_crawler.py` - Crawls the ETSU Computing website and extracts content into a raw markdown file
- `content_extractor.py` - Processes the raw content and organizes it into categories
- `run.py` - Orchestrates the entire process with a simple command (see the sketch below)
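For orientation, the orchestration amounts to running the crawler and then the extractor in sequence. The sketch below assumes only the command-line flags documented later in this README; the actual `run.py` forwards more options (page limit, delay, depth, and so on).

```python
# Sketch: chaining the crawler and the extractor the way an orchestration
# script might. Illustrative only -- the real run.py handles more options.
import subprocess
import sys

def main():
    # Step 1: crawl the site into a raw markdown file.
    subprocess.run(
        [sys.executable, "web_crawler.py",
         "--start-url", "https://www.etsu.edu/cbat/computing/",
         "--output", "etsu_content.md"],
        check=True,
    )
    # Step 2: categorize the raw content into per-topic markdown files.
    subprocess.run(
        [sys.executable, "content_extractor.py",
         "--input", "etsu_content.md",
         "--output-dir", "extracted_content"],
        check=True,
    )

if __name__ == "__main__":
    main()
```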
- Python 3.7 or higher
- Required Python packages:
  - requests
  - beautifulsoup4
  - argparse (part of the Python standard library, so no separate install is needed)
Install the required packages:
```bash
pip install requests beautifulsoup4
```
Run the entire process with default settings:
```bash
python run.py
```
This will:
- Crawl the ETSU Computing website starting from https://www.etsu.edu/cbat/computing/
- Extract content into a raw markdown file
- Process and categorize the content
- Generate organized markdown files by category and a key facts file
Control crawling behavior:
```bash
python run.py --start-url https://www.etsu.edu/cbat/computing/ --max-pages 50 --delay 2.0 --max-depth 1 --path-prefix /cbat/computing
```
The `--max-depth` parameter controls how many "hops" from the starting URL the crawler will follow:
- `--max-depth 1` (default): Only crawls links found on the start page (main navigation links)
- `--max-depth 2`: Crawls the main navigation links and the links found on those pages
- Higher values will crawl deeper into the site structure
The `--path-prefix` parameter (default: `/cbat/computing`) restricts the crawler to URLs that start with the given path. This prevents the crawler from navigating to other departments or colleges.
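To make these two options concrete, the sketch below shows how a breadth-first crawl can combine a depth limit with a path-prefix check. The function and variable names are illustrative and not taken from `web_crawler.py`.

```python
# Sketch: combining a depth limit with a path-prefix restriction in a
# breadth-first crawl. Illustrative only -- names differ from web_crawler.py.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def page_links(url):
    """Return all href values on a page (hypothetical helper for this sketch)."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def crawl(start_url, path_prefix="/cbat/computing", max_depth=1, max_pages=50):
    visited = set()
    queue = deque([(start_url, 0)])  # (url, hops from the start page)
    crawled = []

    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        crawled.append(url)

        if depth >= max_depth:
            continue  # depth control: do not follow links beyond the limit
        for href in page_links(url):
            absolute = urljoin(url, href)
            # Path restriction: stay inside the Computing department's section.
            if urlparse(absolute).path.startswith(path_prefix):
                queue.append((absolute, depth + 1))
    return crawled
```

With `max_depth=1`, only links discovered on the start page are queued, which matches the default behavior described above.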
Skip crawling and process existing content:
```bash
python run.py --skip-crawl
```
Specify output directory:
```bash
python run.py --output-dir my_etsu_data
```
If you prefer to run each component separately:
- Run the web crawler:

  ```bash
  python web_crawler.py --start-url https://www.etsu.edu/cbat/computing/ --output etsu_content.md
  ```

- Process the extracted content:

  ```bash
  python content_extractor.py --input etsu_content.md --output-dir extracted_content
  ```
The script generates several output files in the specified directory:
- `etsu_computing_raw_[timestamp].md` - Raw content from all crawled pages
- `extracted_[timestamp]/` - Directory containing organized content:
  - `achievements.md` - Department achievements and awards
  - `courses.md` - Information about courses
  - `concentrations.md` - Degree programs and concentrations
  - `student_organizations.md` - Student clubs and organizations
  - `faculty.md` - Faculty information
  - `research.md` - Research activities
  - `events.md` - Events and activities
  - `facilities.md` - Information about facilities
  - `general_info.md` - General information
  - `key_facts.md` - Extracted key facts that can serve as talking points
  - `all_content.md` - All content organized by category
The generated markdown files can be imported into LLMs (like Claude) for summarization or to extract specific information. For example:
- Import `key_facts.md` and ask the LLM to prepare a 5-minute tour script based on the highlights
- Import `student_organizations.md` and ask the LLM to summarize all active student clubs
- Import `courses.md` and ask the LLM to identify new courses or program changes
To customize how content is categorized, edit the `category_keywords` dictionary in `content_extractor.py`. You can add new categories or modify the keywords for existing ones to better match the structure of your department's website.
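The dictionary maps each category name to the keywords that route content into it. The entries below are illustrative examples rather than the exact values in `content_extractor.py`:

```python
# Illustrative shape of the category_keywords dictionary -- the actual keywords
# in content_extractor.py may differ. Adding a new key creates a new category.
category_keywords = {
    "achievements": ["award", "achievement", "recognition", "winner"],
    "courses": ["course", "curriculum", "syllabus", "prerequisite"],
    "student_organizations": ["club", "organization", "chapter", "society"],
    # Example of a custom addition:
    "internships": ["internship", "co-op", "industry partner"],
}
```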
The content extraction logic in `web_crawler.py` can be customized to better target specific elements on the ETSU Computing website. Look for the `_extract_content` method to adjust the HTML selectors.
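The kind of selector logic to look for and adjust resembles the following sketch; it is not the actual method body, and the element names are assumptions about a typical page layout.

```python
# Sketch: adjustable selector logic for pulling text out of a page.
# Illustrative only -- see the real _extract_content method in web_crawler.py.
from bs4 import BeautifulSoup

def extract_content(html):
    soup = BeautifulSoup(html, "html.parser")

    # Narrow extraction to the main content region; change this selector if the
    # site wraps its content differently (e.g. "div#content" or "article").
    main = soup.select_one("main") or soup.body or soup

    parts = []
    for element in main.find_all(["h1", "h2", "h3", "p", "li", "td", "summary"]):
        text = element.get_text(" ", strip=True)
        if text:
            parts.append(text)
    return "\n".join(parts)
```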
- The crawler respects robots.txt by default, which may limit what content it can access (see the robots.txt sketch after this list)
- Very dynamic content (loaded via JavaScript) might not be captured
- PDF, DOC, and other non-HTML content is not processed
- Images and media content are not included in the extraction
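For context, honoring robots.txt with Python's standard library looks roughly like the sketch below; the crawler's own check may be implemented differently.

```python
# Sketch: checking robots.txt with the standard library before fetching a URL.
# Illustrative only -- web_crawler.py's actual check may differ.
from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url("https://www.etsu.edu/robots.txt")
robots.read()

url = "https://www.etsu.edu/cbat/computing/"
if robots.can_fetch("*", url):
    print(f"Allowed to crawl {url}")
else:
    print(f"robots.txt disallows {url}; skipping")
```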
To keep your talking points up to date, run this tool periodically to capture new content and changes on the website.