This project contains a script for identifying and classifying websites based on the presence of user-defined keywords. Users configure a set of keywords, and the script finds websites that contain relevant information, which makes it useful for data collection.
It processes the content of web pages, classifies them into different confidence levels based on keyword density, and generates a prediction file.
As a demonstration, this project has been set up with a dataset to identify information about medical tumor boards, but the script can be adapted for other keyword detection tasks.
The script also includes a feature to extract more specific, secondary information related to the primary keywords, such as identifying keyword sub-types or extracting schedule-like patterns (days, frequencies, times).
The project uses the following data structure, with the htmls directory representing a cached version of website content for testing:
- train.csv: Contains identifiers for websites, along with their corresponding label.
- test.csv: Contains identifiers for websites that require a prediction.
- htmls/: A directory containing synthetically created HTML content.
- keyword_type.csv: A configurable list of keywords for detection.
The labels in train.csv signify the confidence level of a website containing the desired information:
- 1 (No Evidence): The website does not contain the target keywords.
- 2 (Medium Confidence): Keywords are mentioned, but the site is not primarily dedicated to the topic.
- 3 (High Confidence): The site is largely dedicated to the topic defined by the keywords.
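The three labels above can be derived from keyword density. The following is a minimal sketch of that idea, not the project's actual implementation: the function names, the tokenization, and both thresholds are assumptions chosen for illustration, since the cut-offs used by the script are not stated here.

```python
import re

# Hypothetical thresholds; the script's real cut-offs are not documented here.
MEDIUM_THRESHOLD = 0.0   # any keyword hit at all counts as medium confidence
HIGH_THRESHOLD = 0.05    # keyword words exceed 5% of all words on the page


def keyword_density(text, keywords):
    """Return the fraction of words in `text` that belong to keyword matches."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    normalized = " ".join(words)
    hits = 0
    for keyword in keywords:
        kw_words = re.findall(r"\w+", keyword.lower())
        if not kw_words:
            continue
        # Match the keyword as a whole phrase, allowing any whitespace between words.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in kw_words) + r"\b"
        hits += len(re.findall(pattern, normalized)) * len(kw_words)
    return hits / len(words)


def classify(text, keywords):
    """Map keyword density to label 1 (none), 2 (medium), or 3 (high)."""
    density = keyword_density(text, keywords)
    if density > HIGH_THRESHOLD:
        return 3
    if density > MEDIUM_THRESHOLD:
        return 2
    return 1
```

With these assumed thresholds, a page that never mentions a keyword gets label 1, a long page with a single passing mention gets label 2, and a short page that repeats the keyword gets label 3. In practice the thresholds would be tuned against train.csv.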
For websites classified with medium or high confidence, the script also attempts to extract:
- Keyword Type: Identifies specific sub-types associated with the keywords as defined in the keyword_type.csv file.
- Schedule Information: Uses regular expressions to find details like frequencies (e.g., weekly), days (e.g., Friday), and times (e.g., 8:00 AM).
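The schedule extraction can be illustrated with a few regular expressions. This is a sketch under assumptions: the patterns, the list of frequency words, and the extract_schedule name are hypothetical, and the script's own expressions may differ.

```python
import re

# Illustrative patterns only; the script's actual expressions may differ.
FREQUENCY_RE = re.compile(r"\b(daily|weekly|biweekly|monthly)\b", re.IGNORECASE)
DAY_RE = re.compile(
    r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)
TIME_RE = re.compile(r"\b(\d{1,2}:\d{2}\s?(?:AM|PM))\b", re.IGNORECASE)


def extract_schedule(text):
    """Collect frequency words, weekday names, and clock times found in `text`."""
    return {
        "frequencies": [m.lower() for m in FREQUENCY_RE.findall(text)],
        "days": [m.capitalize() for m in DAY_RE.findall(text)],
        "times": [m.upper() for m in TIME_RE.findall(text)],
    }


if __name__ == "__main__":
    print(extract_schedule("The board meets weekly on Friday at 8:00 AM."))
```

Each pattern is kept separate so that a page mentioning only a day or only a time still yields a partial result.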
Running the script will:
- Print the accuracy of the model on the training data.
- Display a confusion matrix.
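For reference, both reported metrics can be computed from the true and predicted labels alone. The sketch below uses only the standard library and hypothetical function names; the script itself may rely on a different library or output layout.

```python
from collections import Counter

LABELS = (1, 2, 3)  # the three confidence levels used in train.csv


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


if __name__ == "__main__":
    y_true = [1, 2, 3, 3]
    y_pred = [1, 3, 3, 3]
    print(f"Accuracy: {accuracy(y_true, y_pred):.2f}")
    for row in confusion_matrix(y_true, y_pred):
        print(row)
```

Off-diagonal entries show which confidence levels get confused with each other, which is more informative than accuracy alone when the labels are imbalanced.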