

[🇧🇷 Português] [🇺🇸 English]



5- Data Mining / Data Cleaning, Preparation and Detection of Anomalies (Outlier Detection)



Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)



Sponsor Quantum Software Development






Important

⚠️ Heads Up







🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

📺 For better resolution, watch the video on YouTube.



Tip

This repository is part of the Data Mining course in the undergraduate program Humanities, AI and Data Science at PUC-SP.

Access the Data Mining Main Repository



This repository addresses fundamental concepts and methodologies in Data Mining, with an emphasis on data cleaning, preparation, and the identification of anomalies and outliers. The material is grounded in a comprehensive reference document that integrates theoretical foundations with practical applications, including Python-based implementations for the treatment of heterogeneous and noisy datasets.

It constitutes a structured starting point for the systematic study and application of Data Mining techniques, particularly those related to data preprocessing, anomaly and outlier detection, and validation. The repository also provides contextualized examples and executable Python code to support empirical exploration and reproducibility.






Introduction

The exponential growth of data generation necessitates intelligent techniques such as Data Mining to extract valuable knowledge from raw data. This process involves cleaning, preparing, mining, and validating data to enable effective decision-making.



Dataset for Study

We use a publicly available, small, dirty dataset exemplifying missing values, duplicates, and inconsistencies to demonstrate concepts of data cleaning and anomaly detection.

Example dataset: Titanic dataset
This dataset contains missing values and requires preprocessing.



Pandas Functions for Data Exploration


  • dataframe.describe()
    Displays statistical summary including count, mean, std, min, quartiles, and max.

  • dataframe.info()
    Shows information such as number of non-null entries and data types for each column.

Example usage:


import pandas as pd

df = pd.read_csv('titanic.csv')
print(df.describe())   # statistical summary of the numeric columns
df.info()              # prints dtypes and non-null counts; info() returns None, so print(df.info()) would also print "None"
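
After exploring the data, a typical next step is basic cleaning. Below is a minimal sketch, assuming the same titanic.csv file and the standard Kaggle column names 'Age' and 'Embarked'; adapt the column names to your copy of the dataset.

import pandas as pd

df = pd.read_csv('titanic.csv')

# Remove exact duplicate rows, if any
df = df.drop_duplicates()

# Impute missing numerical values with the median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing categorical values with the most frequent value (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Confirm how many missing values remain per column
print(df.isnull().sum())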



Key Concepts

Anomaly / Outlier

Anomalies or outliers are data points that deviate significantly from the majority and may indicate errors, rare events, or fraud.


Anomaly Detection

Techniques to identify such unusual data points, including statistical, proximity-based, and machine learning methods.
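
As one concrete illustration of a machine-learning detector (only a sketch, not the method used later in this repository), the example below fits scikit-learn's IsolationForest to synthetic two-dimensional data; the contamination value is an illustrative assumption about the fraction of anomalies.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 200 'normal' points plus 5 injected anomalies far from the main cloud
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.uniform(6, 8, (5, 2))])

# contamination = assumed fraction of anomalies in the data
iso = IsolationForest(contamination=0.03, random_state=42).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print("Detected anomalies:", int(np.sum(labels == -1)))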


Fraud Detection

Identifying fraudulent transactions or activities that typically manifest as anomalies in data.


Tips for Efficient and Effective Analysis

  • Significance of Mining
    • Statistical significance: Confidence in results, ensured by properly prepared datasets.
    • Practical significance: Real-world applicability of insights.

  • Data Characteristics Influence Results
    The properties of the dataset affect analysis outcomes significantly.

  • Know Your Data
    Preliminary exploration and descriptive statistics help understand data distributions.

  • Parsimony Principle
    Choose models that balance complexity and interpretability.

  • Error Verification & Model Performance
    Check prediction errors, rule significance, and algorithm performance rigorously.

  • Validation of Results
    Compare multiple methods; assess generalization capacity; combine techniques; involve domain experts to validate findings.



Formulas and Concepts

Interquartile Range (IQR) rule for outliers:


$\Huge IQR = Q_3 - Q_1$






$\Huge \text{Outlier if } x < Q_1 - 1.5 \times IQR \text{ or } x > Q_3 + 1.5 \times IQR$






Z-Score for detecting outliers:



$\Huge Z = \frac{x - \mu}{\sigma}$




where $x$ is a data point, $\mu$ is the mean, and $\sigma$ is the standard deviation.
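
A minimal sketch applying both rules to a numeric pandas Series; the values below are synthetic, and in practice you would pass a real column such as df['Fare'] (column name assumed for illustration).

import numpy as np
import pandas as pd

# Synthetic values with one obvious outlier; replace with a real column, e.g. df['Fare']
s = pd.Series([7.3, 8.1, 7.9, 8.4, 7.6, 8.0, 7.8, 55.0])

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule (a common threshold is |Z| > 3; with few points a lower cut such as 2 is used here)
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 2]

print("IQR outliers:\n", iqr_outliers)
print("Z-score outliers:\n", z_outliers)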



Learning Paradigms


| Paradigm | Description | Example Algorithms |
|---|---|---|
| Supervised Learning | Training with labeled data; learns a mapping from inputs to outputs | Decision Trees, Random Forest, SVM |
| Unsupervised Learning | Training with unlabeled data; discovers patterns or groups | K-Means Clustering, DBSCAN, PCA |
| Lazy Learning | Defers generalization until a query is made | K-Nearest Neighbors (KNN) |



Example: Decision Tree

A decision tree is trained by recursively partitioning the data on attribute splits that optimize a criterion such as information gain.


Example: K-Nearest Neighbors (KNN)

Classifies new data by looking at the 'k' closest known examples (lazy learning).
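
A short, self-contained sketch contrasting the eager decision tree with the lazy KNN on the Iris dataset; the hyperparameters (max_depth=3, k=5) are illustrative choices, not settings prescribed by the course material.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Eager learner: builds the tree at training time, splitting on an impurity criterion (information gain via entropy)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42).fit(X_train, y_train)

# Lazy learner: stores the training data and defers generalization to prediction time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Decision Tree accuracy:", tree.score(X_test, y_test))
print("KNN accuracy:", knn.score(X_test, y_test))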



Applications

Extensive use of data mining techniques in:

  • Credit analysis and prediction
  • Fraud detection
  • Financial market prediction
  • Customer relationship management
  • Corporate bankruptcy prediction
  • Energy sector
  • Education, logistics, supply chain management
  • Environment, social networks, e-commerce



Sentiment Analysis in Social Networks

Classifying texts based on expressed sentiments (positive, negative, neutral) to measure public opinion, marketing effectiveness, and product feedback.



Non-Technical Losses in Electrical Energy

  • Technical losses: Intrinsic to electrical systems.
  • Commercial losses: Errors, unmeasured consumption, fraud.

Data mining supports identifying irregularities and optimizing inspections.



Energy Load Segmentation

Use clustering to segment typical daily electricity consumption patterns to improve demand prediction.
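
As a hedged sketch of the idea, the code below clusters synthetic daily load curves (24 hourly readings each) with K-Means to obtain typical consumption profiles; the three base profiles and k=3 are assumptions for illustration, and real meter data would replace them.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for meter data: 300 daily load curves built from three base profiles plus noise
hours = np.arange(24)
base_profiles = [
    np.sin((hours - 18) / 24 * 2 * np.pi) + 2,  # evening-peak profile
    np.sin((hours - 12) / 24 * 2 * np.pi) + 2,  # midday-peak profile
    np.full(24, 1.5),                           # flat (e.g., industrial) profile
]
X = np.vstack([p + rng.normal(0, 0.2, 24) for p in base_profiles for _ in range(100)])

# Segment the curves into typical profiles; k=3 is an illustrative choice
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Curves per cluster:", np.bincount(kmeans.labels_))
print("Cluster centers shape:", kmeans.cluster_centers_.shape)  # (3, 24): typical daily profiles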



Steel Process Modeling

Data mining is used to predict chemical composition and optimize production processes in the steel industry.



Credit Card Fraud Detection

Fraud categories:

  • Application Fraud: Using fake personal info to obtain cards.
  • Behavioral Fraud: Unauthorized use of a genuine cardholder's data.

Fraud mitigation includes prevention (security measures) and detection (rapid identification of suspicious transactions).


The code linked below walks through loading data, exploratory analysis, cleaning, outlier detection, normalization, modeling, and validation.


Tip

Access Code: Titanic - Exploratory Data Analysis






Below is the structured fraud detection code, organized cell by cell. It includes explanations about the dataset, along with additional techniques such as SMOTE for class balancing, Random Forest hyperparameter tuning, and model accuracy testing.

The evaluation covers key performance metrics, including:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • ROC-AUC

Fraud Detection with Random Forest & Logistic Regression


Tip

Access Code: Fraud Detection with Random Forest & Logistic Regression



Cell 1 - Data Loading and Initial Understanding



import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load a smaller dataset (e.g., Iris dataset for binary classification - e.g., Versicolor vs Virginica)
# Carregar um conjunto de dados menor (por exemplo, conjunto de dados Iris para classificação binária - por exemplo, Versicolor vs Virginica)
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# For binary classification, let's use only two classes (e.g., 1 and 2)
# Para classificação binária, vamos usar apenas duas classes (por exemplo, 1 e 2)
df_binary = df[df['target'].isin([1, 2])].copy()  # .copy() avoids SettingWithCopyWarning when modifying the slice below
df_binary['target'] = df_binary['target'].replace({1: 0, 2: 1}) # Rename classes to 0 and 1
# Renomear classes para 0 e 1

# 2. Display the first few rows of the loaded DataFrame.
# Exibir as primeiras linhas do DataFrame carregado.
print("First 5 rows of the dataset:")
# Primeiras 5 linhas do conjunto de dados:
display(df_binary.head())

# 3. Display concise information about the DataFrame.
# Exibir informações concisas sobre o DataFrame.
print("\nDataset Info:")
# Informações do conjunto de dados:
df_binary.info()

# 4. Calculate and display the distribution of the target variable.
# Calcular e exibir a distribuição da variável alvo.
print("\nClass Distribution:")
# Distribuição de Classes:
display(df_binary['target'].value_counts())

# 5. Set up matplotlib for dark mode plotting.
# Configurar matplotlib para plotagem em modo escuro.
plt.style.use('dark_background')

# Set text color to white for better visibility in dark mode
# Definir a cor do texto para branco para melhor visibilidade no modo escuro
plt.rcParams['text.color'] = 'white'
plt.rcParams['axes.labelcolor'] = 'white'
plt.rcParams['xtick.color'] = 'white'
plt.rcParams['ytick.color'] = 'white'
plt.rcParams['axes.edgecolor'] = 'white'
plt.rcParams['figure.facecolor'] = '#2b2b2b' # Dark background for figure
plt.rcParams['axes.facecolor'] = '#2b2b2b' # Dark background for axes

# 6. Define a turquoise color palette.
# Definir uma paleta de cores turquesa.
turquoise_palette = ['#40E0D0', '#48D1CC', '#00CED1', '#5F9EA0', '#008B8B']



Cell 2 - Exploratory Data Analysis (EDA)


This code block carries out the initial steps of a data analysis workflow. In essence, it prepares the dataset for further exploration and offers a first look at its main characteristics, laying the groundwork for more detailed analysis or modeling.



import os

# Define the directory for saving plots
plot_dir = '/content/plots'
if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

# 1. Create histograms for each feature in df_binary with dual-language titles and labels.
# 1. Criar histogramas para cada característica em df_binary com títulos e rótulos em dois idiomas.
print("Feature Distributions (Histograms):")
# Distribuições das Características (Histogramas):
# Use only the first color from the palette for histograms
df_binary.hist(figsize=(12, 10), color=turquoise_palette[0], bins=15)
plt.suptitle('Feature Distributions / Distribuições das Características', y=1.02, fontsize=16)
plt.tight_layout()
plt.savefig(f'{plot_dir}/feature_histograms.png') # Save histogram plot (save before show, which closes the figure)
plt.show()

# 2. Generate box plots for each feature, comparing distributions across target classes with dual-language titles and labels.
# 2. Gerar box plots para cada característica, comparando as distribuições entre as classes alvo com títulos e rótulos em dois idiomas.
print("\nFeature Distributions by Target Class (Box Plots):")
# Distribuições das Características por Classe Alvo (Box Plots):
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for i, col in enumerate(df_binary.columns[:-1]):
    # Removed palette argument from boxplot as it's not used with hue and causes a warning
    sns.boxplot(x='target', y=col, data=df_binary, ax=axes[i])
    axes[i].set_title(f'{col} Distribution by Target Class / Distribuição de {col} por Classe Alvo')
    axes[i].set_xlabel('Target Class / Classe Alvo')
    axes[i].set_ylabel(col)
plt.tight_layout()
plt.savefig(f'{plot_dir}/feature_box_plots.png') # Save box plot (save before show, which closes the figure)
plt.show()

# 3. Create a pair plot of the features in df_binary, colored by the 'target' variable, with a dual-language title.
# 3. Criar um pair plot das características em df_binary, colorido pela variável 'target', com um título em dois idiomas.
print("\nPair Plot of Features by Target Class:")
# Pair Plot das Características por Classe Alvo:
# Use only the first two colors from the palette for the two classes
sns.pairplot(df_binary, hue='target', palette=turquoise_palette[:2], diag_kind='kde')
plt.suptitle('Pair Plot of Features by Target Class / Pair Plot das Características por Classe Alvo', y=1.02, fontsize=16)
plt.savefig(f'{plot_dir}/feature_pair_plot.png') # Save pair plot (save before show, which closes the figure)
plt.show()

# 4. Calculate and display the correlation matrix for the features in df_binary and visualize it with a heatmap and dual-language titles and labels.
# 4. Calcular e exibir a matriz de correlação para as características em df_binary e visualizá-la com um heatmap e títulos e rótulos em dois idiomas.
print("\nCorrelation Matrix:")
# Matriz de Correlação:
correlation_matrix_binary = df_binary.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_binary, annot=True, cmap='viridis', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix / Matriz de Correlação', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig(f'{plot_dir}/correlation_matrix_heatmap.png') # Save heatmap plot (save before show, which closes the figure)
plt.show()



Generated plots (also saved to /content/plots):

1- Feature Distributions (Histograms)

2- sepal length (cm) Distribution by Target Class (Box Plots)

3- Pair Plot of Features by Target Class

4- Correlation Matrix



Cell 3 - Data Preparation


# 1. Check for missing values in the df_binary DataFrame and print the count for each column.
# 1. Verificar valores ausentes no DataFrame df_binary e imprimir a contagem para cada coluna.
print("Checking for missing values / Verificando valores ausentes:")
print(df_binary.isnull().sum())

# 2. If missing values are found, handle them appropriately for numerical data (e.g., imputation with the mean or median).
# Based on the previous df_binary.info() output, there are no missing values.
# Com base na saída anterior de df_binary.info(), não há valores ausentes.
# No action needed for missing values in this case.
# Nenhuma ação necessária para valores ausentes neste caso.

# 3. Separate the features (X) and the target variable (y) from the df_binary DataFrame.
# 3. Separar as características (X) e a variável alvo (y) do DataFrame df_binary.
X = df_binary.drop('target', axis=1)
y = df_binary['target']
print("\nFeatures (X) and Target (y) separated. / Características (X) e Alvo (y) separados.")

# 4. Scale the numerical features using StandardScaler.
# Fit the scaler only on the training data to prevent data leakage.
# 4. Escalar as características numéricas usando StandardScaler.
# Ajustar o scaler apenas nos dados de treinamento para evitar vazamento de dados.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# 5. Split the data into training and testing sets.
# 5. Dividir os dados em conjuntos de treinamento e teste.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Using stratify for balanced classes
# Usando stratify para classes balanceadas

# Fit and transform the scaler on the training data
# Ajustar e transformar o scaler nos dados de treinamento
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the fitted scaler
# Transformar os dados de teste usando o scaler ajustado
X_test_scaled = scaler.transform(X_test)

print("\nData split into training and testing sets (80/20). / Dados divididos em conjuntos de treinamento e teste (80/20).")
print("Features scaled using StandardScaler. / Características escaladas usando StandardScaler.")
print(f"X_train shape: {X_train_scaled.shape}, X_test shape: {X_test_scaled.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")



Cell 4 - Handle Class Imbalance


# 1. Check the class distribution of the training set (y_train) to confirm if class imbalance exists.
# Print the value counts with a dual-language explanation.
print("Class distribution in the training set (y_train):")
# Distribuição de classes no conjunto de treinamento (y_train):
display(y_train.value_counts())
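
Because the two Iris classes selected above are balanced, SMOTE is not strictly required here; the sketch below only illustrates how it would be applied to the scaled training set if an imbalance were found. It assumes the imbalanced-learn package is installed (pip install imbalanced-learn).

# A hedged sketch: oversample the minority class with SMOTE (only needed when y_train is imbalanced).
# Uma ilustração: sobreamostrar a classe minoritária com SMOTE (necessário apenas quando y_train está desbalanceado).
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print("\nClass distribution after SMOTE / Distribuição de classes após SMOTE:")
display(pd.Series(y_train_resampled).value_counts())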



=======================================Still Surfing this Repo 🏄 ===========================================























1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence – A Machine Learning Approach. 2nd Ed. LTC.

3. Larson & Farber (2015). Applied Statistics. Pearson.







🛸๋ My Contacts Hub





────────────── 🔭⋆ ──────────────

➣➢➤ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
