

[🇧🇷 Português] [🇺🇸 English]



5- Data Mining / Data Cleaning, Preparation and Detection of Anomalies (Outlier Detection)



Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)



Sponsor Quantum Software Development






Important

⚠️ Heads Up







🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

📺 For better resolution, watch the video on YouTube.



Tip

This repository is part of the Data Mining course in the undergraduate program Humanities, AI and Data Science at PUC-SP.

Access the Data Mining Main Repository



This repository addresses fundamental concepts and methodologies in Data Mining, with an emphasis on data cleaning, preparation, and the identification of anomalies and outliers. The material is grounded in a comprehensive reference document that integrates theoretical foundations with practical applications, including Python-based implementations for the treatment of heterogeneous and noisy datasets.

It constitutes a structured starting point for the systematic study and application of Data Mining techniques, particularly those related to data preprocessing, anomaly and outlier detection, and validation. The repository also provides contextualized examples and executable Python code to support empirical exploration and reproducibility.






Introduction

The exponential growth of data generation necessitates intelligent techniques such as Data Mining to extract valuable knowledge from raw data. This process involves cleaning, preparing, mining, and validating data to enable effective decision-making.



Dataset for Study

We use a publicly available, small, dirty dataset exemplifying missing values, duplicates, and inconsistencies to demonstrate concepts of data cleaning and anomaly detection.

Example dataset: Titanic dataset
This dataset contains missing values and requires preprocessing.



Pandas Functions for Data Exploration


  • dataframe.describe()
    Displays statistical summary including count, mean, std, min, quartiles, and max.

  • dataframe.info()
    Shows information such as number of non-null entries and data types for each column.

Example usage:


import pandas as pd

df = pd.read_csv('titanic.csv')
print(df.describe())   # statistical summary of the numeric columns
df.info()              # prints dtypes and non-null counts; info() returns None, so print(df.info()) would also print "None"
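
After exploring the data, a typical next step is basic cleaning. Below is a minimal sketch, assuming the same titanic.csv file and the standard Kaggle column names 'Age' and 'Embarked'; adapt the column names to your copy of the dataset.

import pandas as pd

df = pd.read_csv('titanic.csv')

# Remove exact duplicate rows, if any
df = df.drop_duplicates()

# Impute missing numerical values with the median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing categorical values with the most frequent value (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Confirm how many missing values remain per column
print(df.isnull().sum())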



Key Concepts

Anomaly / Outlier

Anomalies or outliers are data points that deviate significantly from the majority and may indicate errors, rare events, or fraud.


Anomaly Detection

Techniques to identify such unusual data points, including statistical, proximity-based, and machine learning methods.
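
As one concrete illustration of a machine-learning detector (only a sketch, not the method used later in this repository), the example below fits scikit-learn's IsolationForest to synthetic two-dimensional data; the contamination value is an illustrative assumption about the fraction of anomalies.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 200 'normal' points plus 5 injected anomalies far from the main cloud
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.uniform(6, 8, (5, 2))])

# contamination = assumed fraction of anomalies in the data
iso = IsolationForest(contamination=0.03, random_state=42).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print("Detected anomalies:", int(np.sum(labels == -1)))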


Fraud Detection

Identifying fraudulent transactions or activities that typically manifest as anomalies in data.


Tips for Efficient and Effective Analysis

  • Significance of Mining
    • Statistical significance: Confidence in results, ensured by properly prepared datasets.
    • Practical significance: Real-world applicability of insights.

  • Data Characteristics Influence Results
    The properties of the dataset affect analysis outcomes significantly.

  • Know Your Data
    Preliminary exploration and descriptive statistics help understand data distributions.

  • Parsimony Principle
    Choose models that balance complexity and interpretability.

  • Error Verification & Model Performance
    Check prediction errors, rule significance, and algorithm performance rigorously.

  • Validation of Results
    Compare multiple methods; assess generalization capacity; combine techniques; involve domain experts to validate findings.



Formulas and Concepts

Interquartile Range (IQR) rule for outliers:


$\Huge IQR = Q_3 - Q_1$






$\Huge \text{Outlier if } x < Q_1 - 1.5 \times IQR \text{ or } x > Q_3 + 1.5 \times IQR$






Z-Score for detecting outliers:



$\Huge Z = \frac{x - \mu}{\sigma}$




where $x$ is a data point, $\mu$ is the mean, and $\sigma$ is the standard deviation.
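
A minimal sketch applying both rules to a numeric pandas Series; the values below are synthetic, and in practice you would pass a real column such as df['Fare'] (column name assumed for illustration).

import numpy as np
import pandas as pd

# Synthetic values with one obvious outlier; replace with a real column, e.g. df['Fare']
s = pd.Series([7.3, 8.1, 7.9, 8.4, 7.6, 8.0, 7.8, 55.0])

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule (a common threshold is |Z| > 3; with few points a lower cut such as 2 is used here)
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 2]

print("IQR outliers:\n", iqr_outliers)
print("Z-score outliers:\n", z_outliers)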



Learning Paradigms


| Paradigm | Description | Example Algorithms |
|---|---|---|
| Supervised Learning | Training with labeled data; learns a mapping from inputs to outputs | Decision Trees, Random Forest, SVM |
| Unsupervised Learning | Training with unlabeled data; discovers patterns or groups | K-Means Clustering, DBSCAN, PCA |
| Lazy Learning | Defers generalization until a query is made | K-Nearest Neighbors (KNN) |



Example: Decision Tree

A decision tree is trained by recursively partitioning the data on attribute splits that optimize a criterion such as information gain.


Example: K-Nearest Neighbors (KNN)

Classifies new data by looking at the 'k' closest known examples (lazy learning).
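
A short, self-contained sketch contrasting the eager decision tree with the lazy KNN on the Iris dataset; the hyperparameters (max_depth=3, k=5) are illustrative choices, not settings prescribed by the course material.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Eager learner: builds the tree at training time, splitting on an impurity criterion (information gain via entropy)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42).fit(X_train, y_train)

# Lazy learner: stores the training data and defers generalization to prediction time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Decision Tree accuracy:", tree.score(X_test, y_test))
print("KNN accuracy:", knn.score(X_test, y_test))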



Applications

Extensive use of data mining techniques in:

  • Credit analysis and prediction
  • Fraud detection
  • Financial market prediction
  • Customer relationship management
  • Corporate bankruptcy prediction
  • Energy sector
  • Education, logistics, supply chain management
  • Environment, social networks, e-commerce



Sentiment Analysis in Social Networks

Classifying texts based on expressed sentiments (positive, negative, neutral) to measure public opinion, marketing effectiveness, and product feedback.



Non-Technical Losses in Electrical Energy

  • Technical losses: Intrinsic to electrical systems.
  • Commercial losses: Errors, unmeasured consumption, fraud.

Data mining supports identifying irregularities and optimizing inspections.



Energy Load Segmentation

Use clustering to segment typical daily electricity consumption patterns to improve demand prediction.
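
As a hedged sketch of the idea, the code below clusters synthetic daily load curves (24 hourly readings each) with K-Means to obtain typical consumption profiles; the three base profiles and k=3 are assumptions for illustration, and real meter data would replace them.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for meter data: 300 daily load curves built from three base profiles plus noise
hours = np.arange(24)
base_profiles = [
    np.sin((hours - 18) / 24 * 2 * np.pi) + 2,  # evening-peak profile
    np.sin((hours - 12) / 24 * 2 * np.pi) + 2,  # midday-peak profile
    np.full(24, 1.5),                           # flat (e.g., industrial) profile
]
X = np.vstack([p + rng.normal(0, 0.2, 24) for p in base_profiles for _ in range(100)])

# Segment the curves into typical profiles; k=3 is an illustrative choice
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Curves per cluster:", np.bincount(kmeans.labels_))
print("Cluster centers shape:", kmeans.cluster_centers_.shape)  # (3, 24): typical daily profiles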



Steel Process Modeling

Data mining is used to predict chemical composition and optimize production processes in the steel industry.



Credit Card Fraud Detection

Fraud categories:

  • Application Fraud: Using fake personal info to obtain cards.
  • Behavioral Fraud: Unauthorized use of a genuine cardholder's data.

Fraud mitigation includes prevention (security measures) and detection (rapid identification of suspicious transactions).


The code linked below walks through loading data, exploratory analysis, cleaning, outlier detection, normalization, modeling, and validation.


Tip

Access Code: Titanic - Exploratory Data Analysis






Below is the structured fraud detection code, organized cell by cell. It includes explanations about the dataset, along with additional techniques such as SMOTE for class balancing, Random Forest hyperparameter tuning, and model accuracy testing.

The evaluation covers key performance metrics, including:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • ROC-AUC

Fraud Detection with Random Forest & Logistic Regression


Tip

Access Code: Fraud Detection with Random Forest & Logistic Regression



Cell 1 - Data Loading and Initial Understanding



import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load a smaller dataset (e.g., Iris dataset for binary classification - e.g., Versicolor vs Virginica)
# Carregar um conjunto de dados menor (por exemplo, conjunto de dados Iris para classificação binária - por exemplo, Versicolor vs Virginica)
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# For binary classification, let's use only two classes (e.g., 1 and 2)
# Para classificação binária, vamos usar apenas duas classes (por exemplo, 1 e 2)
df_binary = df[df['target'].isin([1, 2])].copy()  # .copy() avoids SettingWithCopyWarning when modifying the slice below
df_binary['target'] = df_binary['target'].replace({1: 0, 2: 1}) # Rename classes to 0 and 1
# Renomear classes para 0 e 1

# 2. Display the first few rows of the loaded DataFrame.
# Exibir as primeiras linhas do DataFrame carregado.
print("First 5 rows of the dataset:")
# Primeiras 5 linhas do conjunto de dados:
display(df_binary.head())

# 3. Display concise information about the DataFrame.
# Exibir informações concisas sobre o DataFrame.
print("\nDataset Info:")
# Informações do conjunto de dados:
df_binary.info()

# 4. Calculate and display the distribution of the target variable.
# Calcular e exibir a distribuição da variável alvo.
print("\nClass Distribution:")
# Distribuição de Classes:
display(df_binary['target'].value_counts())

# 5. Set up matplotlib for dark mode plotting.
# Configurar matplotlib para plotagem em modo escuro.
plt.style.use('dark_background')

# Set text color to white for better visibility in dark mode
# Definir a cor do texto para branco para melhor visibilidade no modo escuro
plt.rcParams['text.color'] = 'white'
plt.rcParams['axes.labelcolor'] = 'white'
plt.rcParams['xtick.color'] = 'white'
plt.rcParams['ytick.color'] = 'white'
plt.rcParams['axes.edgecolor'] = 'white'
plt.rcParams['figure.facecolor'] = '#2b2b2b' # Dark background for figure
plt.rcParams['axes.facecolor'] = '#2b2b2b' # Dark background for axes

# 6. Define a turquoise color palette.
# Definir uma paleta de cores turquesa.
turquoise_palette = ['#40E0D0', '#48D1CC', '#00CED1', '#5F9EA0', '#008B8B']



Cell 2 - Exploratory Data Analysis (EDA)


This code block carries out the initial steps of a data analysis workflow. In essence, it prepares the dataset for further exploration and offers a first look at its main characteristics, laying the groundwork for more detailed analysis or modeling.



import os

# Define the directory for saving plots
plot_dir = '/content/plots'
if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

# 1. Create histograms for each feature in df_binary with dual-language titles and labels.
# 1. Criar histogramas para cada característica em df_binary com títulos e rótulos em dois idiomas.
print("Feature Distributions (Histograms):")
# Distribuições das Características (Histogramas):
# Use only the first color from the palette for histograms
df_binary.hist(figsize=(12, 10), color=turquoise_palette[0], bins=15)
plt.suptitle('Feature Distributions / Distribuições das Características', y=1.02, fontsize=16)
plt.tight_layout()
plt.savefig(f'{plot_dir}/feature_histograms.png') # Save histogram plot (save before show, which closes the figure)
plt.show()

# 2. Generate box plots for each feature, comparing distributions across target classes with dual-language titles and labels.
# 2. Gerar box plots para cada característica, comparando as distribuições entre as classes alvo com títulos e rótulos em dois idiomas.
print("\nFeature Distributions by Target Class (Box Plots):")
# Distribuições das Características por Classe Alvo (Box Plots):
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for i, col in enumerate(df_binary.columns[:-1]):
    # Removed palette argument from boxplot as it's not used with hue and causes a warning
    sns.boxplot(x='target', y=col, data=df_binary, ax=axes[i])
    axes[i].set_title(f'{col} Distribution by Target Class / Distribuição de {col} por Classe Alvo')
    axes[i].set_xlabel('Target Class / Classe Alvo')
    axes[i].set_ylabel(col)
plt.tight_layout()
plt.savefig(f'{plot_dir}/feature_box_plots.png') # Save box plot (save before show, which closes the figure)
plt.show()

# 3. Create a pair plot of the features in df_binary, colored by the 'target' variable, with a dual-language title.
# 3. Criar um pair plot das características em df_binary, colorido pela variável 'target', com um título em dois idiomas.
print("\nPair Plot of Features by Target Class:")
# Pair Plot das Características por Classe Alvo:
# Use only the first two colors from the palette for the two classes
sns.pairplot(df_binary, hue='target', palette=turquoise_palette[:2], diag_kind='kde')
plt.suptitle('Pair Plot of Features by Target Class / Pair Plot das Características por Classe Alvo', y=1.02, fontsize=16)
plt.savefig(f'{plot_dir}/feature_pair_plot.png') # Save pair plot (save before show, which closes the figure)
plt.show()

# 4. Calculate and display the correlation matrix for the features in df_binary and visualize it with a heatmap and dual-language titles and labels.
# 4. Calcular e exibir a matriz de correlação para as características em df_binary e visualizá-la com um heatmap e títulos e rótulos em dois idiomas.
print("\nCorrelation Matrix:")
# Matriz de Correlação:
correlation_matrix_binary = df_binary.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_binary, annot=True, cmap='viridis', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix / Matriz de Correlação', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig(f'{plot_dir}/correlation_matrix_heatmap.png') # Save heatmap plot (save before show, which closes the figure)
plt.show()



Generated plots (also saved to /content/plots):

1- Feature Distributions (Histograms)

2- sepal length (cm) Distribution by Target Class (Box Plots)

3- Pair Plot of Features by Target Class

4- Correlation Matrix



Cell 3 - Data Preparation


# 1. Check for missing values in the df_binary DataFrame and print the count for each column.
# 1. Verificar valores ausentes no DataFrame df_binary e imprimir a contagem para cada coluna.
print("Checking for missing values / Verificando valores ausentes:")
print(df_binary.isnull().sum())

# 2. If missing values are found, handle them appropriately for numerical data (e.g., imputation with the mean or median).
# Based on the previous df_binary.info() output, there are no missing values.
# Com base na saída anterior de df_binary.info(), não há valores ausentes.
# No action needed for missing values in this case.
# Nenhuma ação necessária para valores ausentes neste caso.

# 3. Separate the features (X) and the target variable (y) from the df_binary DataFrame.
# 3. Separar as características (X) e a variável alvo (y) do DataFrame df_binary.
X = df_binary.drop('target', axis=1)
y = df_binary['target']
print("\nFeatures (X) and Target (y) separated. / Características (X) e Alvo (y) separados.")

# 4. Scale the numerical features using StandardScaler.
# Fit the scaler only on the training data to prevent data leakage.
# 4. Escalar as características numéricas usando StandardScaler.
# Ajustar o scaler apenas nos dados de treinamento para evitar vazamento de dados.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# 5. Split the data into training and testing sets.
# 5. Dividir os dados em conjuntos de treinamento e teste.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Using stratify for balanced classes
# Usando stratify para classes balanceadas

# Fit and transform the scaler on the training data
# Ajustar e transformar o scaler nos dados de treinamento
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the fitted scaler
# Transformar os dados de teste usando o scaler ajustado
X_test_scaled = scaler.transform(X_test)

print("\nData split into training and testing sets (80/20). / Dados divididos em conjuntos de treinamento e teste (80/20).")
print("Features scaled using StandardScaler. / Características escaladas usando StandardScaler.")
print(f"X_train shape: {X_train_scaled.shape}, X_test shape: {X_test_scaled.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")



Cell 4 - Handle Class Imbalance


# 1. Check the class distribution of the training set (y_train) to confirm if class imbalance exists.
# Print the value counts with a dual-language explanation.
print("Class distribution in the training set (y_train):")
# Distribuição de classes no conjunto de treinamento (y_train):
display(y_train.value_counts())
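
Because the two Iris classes selected above are balanced, SMOTE is not strictly required here; the sketch below only illustrates how it would be applied to the scaled training set if an imbalance were found. It assumes the imbalanced-learn package is installed (pip install imbalanced-learn).

# A hedged sketch: oversample the minority class with SMOTE (only needed when y_train is imbalanced).
# Uma ilustração: sobreamostrar a classe minoritária com SMOTE (necessário apenas quando y_train está desbalanceado).
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print("\nClass distribution after SMOTE / Distribuição de classes após SMOTE:")
display(pd.Series(y_train_resampled).value_counts())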



=======================================Still Surfing this Repo 🏄 ===========================================























1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence – A Machine Learning Approach. 2nd Ed. LTC.

3. Larson & Farber (2015). Applied Statistics. Pearson.







🛸๋ My Contacts Hub





────────────── 🔭⋆ ──────────────

➣➢➤ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
