Keyword Detection

Overview

This project contains a script that identifies and classifies websites based on the presence of user-defined keywords. Users configure a set of keywords, and the script finds websites containing relevant information, which makes it suitable for data collection.

It processes the content of web pages, classifies them into different confidence levels based on keyword density, and generates a prediction file.
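
As an illustration of the page-processing step, the sketch below reduces an HTML document to its visible text before any keyword matching happens. It uses only the standard library; the names VisibleTextParser and html_to_text are hypothetical and are not taken from keyword_detection.py, which may handle HTML differently.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Tumor Board</h1><p>Meets weekly</p></body></html>"
    print(html_to_text(page))
```

The resulting text can then be passed to whatever keyword counting the script performs.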

As a demonstration, this project has been set up with a dataset to identify information about medical tumor boards, but the script can be adapted for other keyword detection tasks.

The script also includes a feature to extract more specific, secondary information related to the primary keywords, such as identifying keyword sub-types or extracting schedule-like patterns (days, frequencies, times).

Data Description

The project uses the following files. The htmls directory stands in for a cached copy of website content and is used for testing:

train.csv: Contains identifiers for websites, along with their corresponding labels.
test.csv: Contains identifiers for websites that require a prediction.
htmls/: A directory containing synthetically created HTML content.
keyword_type.csv: A configurable list of keywords for detection.

Labels

The labels in train.csv signify the confidence level of a website containing the desired information:

1 (No Evidence): The website does not contain the target keywords.
2 (Medium Confidence): Keywords are mentioned, but the site is not primarily dedicated to the topic.
3 (High Confidence): The site is largely dedicated to the topic defined by the keywords.
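
One way to map keyword density onto these three labels is sketched below. The functions keyword_density and classify, and the 0.05 threshold, are assumptions chosen for illustration; the actual script may compute density and choose cut-offs differently.

```python
import re


def keyword_density(text, keywords):
    """Return keyword occurrences per word of text (0.0 for empty text)."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    if not words:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(keyword.lower()) + r"\b", lowered))
        for keyword in keywords
    )
    return hits / len(words)


def classify(text, keywords, high_threshold=0.05):
    """Map density to a label: 1 = no evidence, 2 = medium, 3 = high.

    high_threshold is a hypothetical cut-off, not a value from the project.
    """
    density = keyword_density(text, keywords)
    if density == 0:
        return 1
    if density >= high_threshold:
        return 3
    return 2


if __name__ == "__main__":
    keywords = ["tumor board"]
    print(classify("Contact us for parking information.", keywords))
    print(classify("Our tumor board meets weekly.", keywords))
```

A page with no matches gets label 1, a page with occasional mentions gets label 2, and a page where the keywords make up a large share of the text gets label 3.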

Additional Information Extraction

For websites classified with medium or high confidence, the script also attempts to extract:

Keyword Type: Identifies specific sub-types associated with the keywords as defined in the keyword_type.csv file.
Schedule Information: Uses regular expressions to find details like frequencies (e.g., weekly), days (e.g., Friday), and times (e.g., 8:00 AM).
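
The schedule extraction described above can be sketched with a few regular expressions. The patterns and the extract_schedule function are illustrative assumptions; the expressions used in keyword_detection.py may cover more or different cases.

```python
import re

# Hypothetical patterns for frequencies, weekdays, and clock times.
FREQUENCY_RE = re.compile(
    r"\b(daily|weekly|biweekly|monthly|quarterly)\b", re.IGNORECASE
)
DAY_RE = re.compile(
    r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)
TIME_RE = re.compile(r"\b(\d{1,2}:\d{2}\s?(?:am|pm))\b", re.IGNORECASE)


def extract_schedule(text):
    """Return the frequencies, days, and times mentioned in text."""
    return {
        "frequencies": [m.lower() for m in FREQUENCY_RE.findall(text)],
        "days": [m.capitalize() for m in DAY_RE.findall(text)],
        "times": [m.upper() for m in TIME_RE.findall(text)],
    }


if __name__ == "__main__":
    print(extract_schedule("The board meets weekly on Friday at 8:00 AM."))
```

Each pattern is matched independently, so a page that mentions only a day or only a time still yields partial schedule information.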

Output

Running the script will:

Print the accuracy of the model on the training data.
Display a confusion matrix.
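
For reference, both outputs can be computed from the true and predicted labels as in the sketch below. It uses plain Python so it runs without dependencies; the function names are hypothetical, and the script itself may rely on a library for these metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels=(1, 2, 3)):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


if __name__ == "__main__":
    y_true = [1, 2, 3, 3]
    y_pred = [1, 2, 2, 3]
    print("Accuracy:", accuracy(y_true, y_pred))
    for row in confusion_matrix(y_true, y_pred):
        print(row)
```

Off-diagonal entries in the matrix show which confidence levels are confused with each other, for example high-confidence sites predicted as medium.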


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Keyword Detection

Overview

Data Description

Labels

Additional Information Extraction

Output

About

Uh oh!

Releases

Packages

Languages

License

filtercodes/keyword_detector

Folders and files

Latest commit

History

Repository files navigation

Keyword Detection

Overview

Data Description

Labels

Additional Information Extraction

Output

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages