Ad-Rian815/Unza_journal_key_word_classification

1. Business Understanding

1.1 Problem Statement

Keywords associated with research articles in UNZA journals are currently not organized or classified automatically. This makes it difficult for researchers, students, and librarians to quickly identify relevant articles or track research trends. The manual process is time-consuming, inconsistent, and limits the usefulness of the institutional repository; automated classification would make navigating the journals far easier.

1.2 Business Objectives

The goal of this project is to build a system that can automatically classify keywords from articles into meaningful categories (e.g., Agriculture, Medicine, Computer Science).

From a real-world perspective, success means:

  • Improving searchability and retrieval of research articles.
  • Helping researchers discover related works faster.
  • Supporting administrators in analyzing research output trends at UNZA.

1.3 Data Mining Goals

The technical approach to achieving these objectives is structured into the following data mining goals:

1. Classification Model Development: A classification model will be built to categorize article keywords into predefined classes.

2. Text Preprocessing: The raw text data will be prepared for machine learning using standard preprocessing techniques, including tokenization, stop-word removal, and TF-IDF (Term Frequency–Inverse Document Frequency) vectorization (see the sketch after this list).

3. Algorithm Experimentation: The performance of several classification algorithms will be evaluated to determine the most effective one. The algorithms to be tested include Naïve Bayes, Support Vector Machines (SVM), and Decision Trees.
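To make the preprocessing goal concrete, the sketch below combines tokenization, stop-word removal, and TF-IDF vectorization using scikit-learn. The example strings are hypothetical placeholders, not actual UNZA keywords, and the real pipeline may differ.

```python
# Minimal preprocessing sketch; the example strings are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "covid-19 pandemic response",
    "disability rehabilitation services",
    "adolescent reproductive health",
]

# TfidfVectorizer tokenizes, removes English stop words, and
# produces the TF-IDF matrix in a single step.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)                           # (n_documents, n_terms)
print(vectorizer.get_feature_names_out())  # surviving vocabulary
```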

In summary, the main goal of this project is to build a machine learning model that can automatically classify article keywords. We have broken the work into two key parts. First, we focus on data preparation: cleaning the text with tokenization and stop-word removal, then using TF-IDF to convert the text into numerical features for the models.

Second, we experiment with different algorithms: Naïve Bayes, SVM, and Decision Trees. By evaluating them with metrics such as precision and F1-score, we can determine which one performs best. Following these steps, we aim to produce a model that not only classifies keywords effectively but also demonstrates a solid grasp of the data mining process.

Summary: The workflow involves two main stages. First, text preprocessing to clean and transform raw keywords into numerical representations. Second, experimentation with multiple algorithms to identify the most accurate and robust classifier.
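To illustrate the second stage, here is a hedged sketch of the algorithm comparison using cross-validated macro F1 scores. `X` and `y` are assumed to be the TF-IDF features and category labels from the preprocessing stage; with a dataset this small, the fold count may need lowering so every class appears in each fold.

```python
# Comparison sketch: the three candidate algorithms scored with
# 5-fold cross-validated macro F1. X (TF-IDF features) and y
# (category labels) are assumed to exist from the preprocessing stage.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    # Macro-averaged F1 weights every category equally.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```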

1.3.1 Data Preparation

Before building the model, the keyword data will be:

  • Tokenized into smaller units.

  • Cleaned by removing stop words and irrelevant terms.

  • Converted into numerical representations using TF-IDF for input into machine learning models.

1.4 Project Success Criteria

  • The model should achieve at least 80% accuracy on the test dataset (a verification sketch follows this list).
  • The classification results must be interpretable and consistent across different domains.
  • The classification outputs should be clear and easily explainable to non-technical stakeholders.
  • The system should reduce the time required to organize keywords compared to manual methods.
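A minimal sketch of checking the accuracy criterion on a held-out test set; `X`, `y`, and `model` are assumed to come from the steps above.

```python
# Verify the 80% accuracy criterion on a held-out test set.
# X, y, and model are assumed from the preprocessing/experimentation steps.
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.2%}")
print("Meets the 80% criterion:", acc >= 0.80)
# The per-class precision/recall/F1 breakdown also supports the
# interpretability criterion.
print(classification_report(y_test, y_pred))
```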

2. Data Understanding

We loaded the raw UNZA journals dataset into a Pandas DataFrame and performed an initial exploration (a sketch follows this list):

  • Structural overview: .head(), .info(), .describe(), .shape
  • Quality checks: missing values, duplicates, unique counts
  • Distributions: histograms for numeric columns and bar charts for categorical columns
  • Frequency analysis (top 5 keywords)
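The sketch below illustrates that exploration; the file name unza_journals.csv is an assumption about how the raw data is stored.

```python
# Exploration sketch; "unza_journals.csv" is an assumed file name.
import pandas as pd

df = pd.read_csv("unza_journals.csv")

# Structural overview
print(df.head())
print(df.shape)
df.info()
print(df.describe())

# Quality checks
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # duplicate rows
print(df.nunique())           # unique counts per column

# Distributions (requires matplotlib):
# df.hist() plots histograms for numeric columns;
# df["column"].value_counts().plot(kind="bar") for categorical columns.
```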

Initial Summary

The dataset contains 18 articles with 6 columns. There are no missing values in the dataset. Abstracts have an average length of about 275 words. The most frequent authors are:

  • Brian Chanda Chiluba: 2 article(s)
  • Esther Munalula Nkandu: 2 article(s)
  • Munalula Muyangwa Munalula: 2 article(s)
  • Kris Kapp: 2 article(s)
  • Kweleka Mwanza: 1 article(s)

The most common keywords are:
  • covid-19: 6 occurrence(s)
  • disability: 3 occurrence(s)
  • adolescent reproductive health: 2 occurrence(s)
  • anthrax: 1 occurrence(s)
  • bacillus anthracis: 1 occurrence(s)
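The counts above can be reproduced with a frequency analysis along these lines; the authors and keywords column names and the semicolon delimiter are assumptions about the schema.

```python
# Frequency analysis sketch; the "authors"/"keywords" column names and
# the semicolon delimiter are assumptions about the dataset schema.
from collections import Counter

author_counts = Counter(
    a.strip() for cell in df["authors"].dropna() for a in cell.split(";")
)
keyword_counts = Counter(
    k.strip().lower() for cell in df["keywords"].dropna() for k in cell.split(";")
)

print(author_counts.most_common(5))   # most frequent authors
print(keyword_counts.most_common(5))  # top 5 keywords
```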

3. Data Preparation

The data preparation phase involved the following steps:

  1. Data cleaning: removed duplicates, handled missing data, and standardized text.
  2. Feature engineering: created derived features such as text statistics, author counts, and keyword counts.
  3. Data transformation: applied TF-IDF vectorization to abstracts and normalized numerical features.
  4. Final dataset preparation: selected relevant features, concatenated them, and exported a ready-to-use dataset (a condensed sketch follows).

==================================================
DATA PREPARATION SUMMARY
==================================================
Original dataset shape: (6, 5)
Prepared dataset shape: (6, 60)
Number of new features created: 55
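A condensed sketch of these four steps follows; `df` is assumed to be the DataFrame from the understanding phase, and the abstract and keywords column names (plus the TF-IDF feature count) are illustrative assumptions.

```python
# Condensed preparation sketch. `df` and the "abstract"/"keywords"
# column names are assumptions about the dataset from the prior phase.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# 1. Data cleaning
df = df.drop_duplicates().dropna(subset=["abstract"])
df["abstract"] = df["abstract"].str.lower().str.strip()

# 2. Feature engineering (derived text statistics)
df["abstract_word_count"] = df["abstract"].str.split().str.len()
df["keyword_count"] = df["keywords"].str.split(";").str.len()

# 3. Data transformation: TF-IDF for abstracts, scaling for numeric features
tfidf = TfidfVectorizer(stop_words="english", max_features=50)
abstract_tfidf = pd.DataFrame(
    tfidf.fit_transform(df["abstract"]).toarray(),
    columns=tfidf.get_feature_names_out(),
    index=df.index,
)
numeric = df[["abstract_word_count", "keyword_count"]]
numeric = pd.DataFrame(
    MinMaxScaler().fit_transform(numeric),
    columns=numeric.columns,
    index=df.index,
)

# 4. Final dataset: concatenate and export for the modeling phase
prepared = pd.concat([numeric, abstract_tfidf], axis=1)
prepared.to_csv("prepared_dataset.csv", index=False)
```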

Data preparation steps completed:

  1. ✅ Data cleaning (handled missing values, text cleaning)
  2. ✅ Feature engineering (created 10+ new features)
  3. ✅ Data transformation (encoding, normalization, vectorization)
  4. ✅ Final dataset preparation
  5. ✅ Data saved for modeling phase
