filtercodes/keyword_detector

Keyword Detection

Overview

This project contains a script that identifies and classifies websites by the presence of user-defined keywords. It is intended for data collection: users configure a set of keywords, and the script finds websites containing relevant information.

The script processes the content of web pages, assigns each page a confidence level based on keyword density, and generates a prediction file.
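As a rough illustration of density-based classification, the sketch below counts keyword hits per 100 words and maps the result to the labels 1 to 3 used by this project. The function names and the `high` threshold are assumptions made for this example, not the script's actual implementation or values.

```python
import re


def _normalize(text):
    """Lowercase the text and reduce it to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_density(text, keywords):
    """Return keyword hits per 100 words (0.0 for empty text)."""
    normalized = _normalize(text)
    if not normalized:
        return 0.0
    word_count = len(normalized.split())
    hits = 0
    for keyword in keywords:
        key = _normalize(keyword)
        if key:  # ignore keywords that are empty after normalization
            hits += len(re.findall(r"\b" + re.escape(key) + r"\b", normalized))
    return 100.0 * hits / word_count


def classify(text, keywords, high=3.0):
    """Map density to a label: 1 = no evidence, 2 = medium, 3 = high.

    The `high` threshold is a hypothetical default chosen for illustration.
    """
    density = keyword_density(text, keywords)
    if density == 0:
        return 1
    return 3 if density >= high else 2
```

A single threshold keeps the sketch small; in practice the cut-off would be tuned against the labels in train.csv.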

As a demonstration, this project has been set up with a dataset to identify information about medical tumor boards, but the script can be adapted for other keyword detection tasks.

The script also includes a feature to extract more specific, secondary information related to the primary keywords, such as identifying keyword sub-types or extracting schedule-like patterns (days, frequencies, times).

Data Description

The project uses the following data structure, with the htmls directory representing a cached version of website content for testing:

  • train.csv: Contains identifiers for websites, along with their corresponding labels.
  • test.csv: Contains identifiers for websites that require a prediction.
  • htmls/: A directory containing synthetically created HTML content.
  • keyword_type.csv: A configurable list of keywords for detection.
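The following sketch shows one way to turn a cached page from htmls/ into plain text before keyword matching, using only the standard library. The `<id>.html` file naming and the helper names are assumptions for this example; the repository may store and parse pages differently.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def load_page_text(page_id, html_dir="htmls"):
    """Read htmls/<page_id>.html (hypothetical naming) and return its text."""
    path = Path(html_dir) / f"{page_id}.html"
    return html_to_text(path.read_text(encoding="utf-8", errors="replace"))
```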

Labels

The labels in train.csv signify the confidence level of a website containing the desired information:

  • 1 (No Evidence): The website does not contain the target keywords.
  • 2 (Medium Confidence): Keywords are mentioned, but the site is not primarily dedicated to the topic.
  • 3 (High Confidence): The site is largely dedicated to the topic defined by the keywords.

Additional Information Extraction

For websites classified with medium or high confidence, the script also attempts to extract:

  • Keyword Type: Identifies specific sub-types associated with the keywords as defined in the keyword_type.csv file.
  • Schedule Information: Uses regular expressions to find details like frequencies (e.g., weekly), days (e.g., Friday), and times (e.g., 8:00 AM).
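A minimal sketch of regex-based schedule extraction is shown here. The patterns and the `extract_schedule` name are illustrative assumptions; the script's own expressions may cover more or different cases.

```python
import re

# Illustrative patterns only; not the script's actual expressions.
FREQ_RE = re.compile(
    r"\b(?:daily|bi-?weekly|weekly|monthly|quarterly|annually)\b", re.IGNORECASE
)
DAY_RE = re.compile(
    r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)s?\b",
    re.IGNORECASE,
)
# 12-hour clock times such as "8:00 AM", "8am", or "12:30 p.m."
TIME_RE = re.compile(
    r"\b(?:1[0-2]|0?[1-9])(?::[0-5]\d)?\s*[ap]\.?m\b", re.IGNORECASE
)


def _unique(items):
    """Drop duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


def extract_schedule(text):
    """Return the frequencies, days, and times mentioned in the text."""
    return {
        "frequencies": _unique(m.lower() for m in FREQ_RE.findall(text)),
        "days": _unique(m.capitalize() for m in DAY_RE.findall(text)),
        "times": _unique(TIME_RE.findall(text)),
    }
```

For instance, `extract_schedule("The tumor board meets weekly on Fridays at 8:00 AM.")` yields `weekly`, `Friday`, and `8:00 AM` in the three respective lists.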

Output

Running the script will:

  • Print the accuracy of the model on the training data.
  • Display a confusion matrix.
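For reference, accuracy and a confusion matrix over the three labels can be computed as in the dependency-free sketch below. This only illustrates the two metrics; the script may well rely on a library for them, and the function names here are assumptions.

```python
LABELS = (1, 2, 3)  # no evidence, medium confidence, high confidence


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix
```

Reading the matrix row by row shows where each true class ends up; off-diagonal entries between labels 2 and 3 indicate confusion between medium and high confidence.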