5- Data Mining / Data Cleaning, Preparation and Detection of Anomalies (Outlier Detection)
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva, PhD in Mathematics
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Video: Statistical Measures and Banking Sector Analysis at Bovespa (Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4)
📺 For better resolution, watch the video on YouTube.
Tip
This repository is part of the Data Mining course in the undergraduate program in Humanistic AI and Data Science at PUC-SP.
Access the Data Mining Main Repository
This repository addresses fundamental concepts and methodologies in Data Mining, with an emphasis on data cleaning, preparation, and the identification of anomalies and outliers. The material is grounded in a comprehensive reference document that integrates theoretical foundations with practical applications, including Python-based implementations for the treatment of heterogeneous and noisy datasets.
It constitutes a structured starting point for the systematic study and application of Data Mining techniques, particularly those related to data preprocessing, anomaly and outlier detection, and validation. The repository also provides contextualized examples and executable Python code to support empirical exploration and reproducibility.
- Introduction
- Dataset for Study
- Pandas Functions for Data Exploration
- Key Concepts
- Anomaly
- Outlier
- Anomaly Detection
- Fraud Detection
- Tips for Efficient and Effective Analysis
- Statistical and Practical Significance
- Characteristics and Understanding of Data
- Parsimony Principle in Model Selection
- Error Checking and Validation
- Learning Paradigms
- Applications
- Sentiment Analysis in Social Networks
- Credit Card Fraud Detection
- Non-Technical Losses in Electrical Energy
- Energy Load Segmentation
- Steel Process Modeling
The exponential growth of data generation necessitates intelligent techniques such as Data Mining to extract valuable knowledge from raw data. This process involves cleaning, preparing, mining, and validating data to enable effective decision-making.
We use a small, publicly available "dirty" dataset containing missing values, duplicates, and inconsistencies to demonstrate data cleaning and anomaly detection concepts.
Example dataset: Titanic dataset
This dataset contains missing values and requires preprocessing.
dataframe.describe()
Displays statistical summary including count, mean, std, min, quartiles, and max.
dataframe.info()
Shows information such as number of non-null entries and data types for each column.
Example usage:
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df.describe())
df.info()  # info() prints its summary directly; wrapping it in print() would also output 'None'
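Since the Titanic data requires preprocessing, a minimal cleaning sketch follows. It assumes the standard Kaggle Titanic column names (`Age`, `Cabin`, `Embarked`); adjust them to the columns actually present in `titanic.csv`.

```python
# A minimal preprocessing sketch, assuming the standard Kaggle Titanic columns
# ('Age', 'Cabin', 'Embarked'); adjust the column names to match the actual CSV.
import pandas as pd

df = pd.read_csv('titanic.csv')

# Count missing values per column to decide on a treatment strategy
print(df.isnull().sum())

# Impute numerical gaps with the median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute categorical gaps with the mode (most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop a column that is mostly empty rather than imputing it
df = df.drop(columns=['Cabin'])

# Remove exact duplicate rows
df = df.drop_duplicates()
```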
Anomalies or outliers are data points that deviate significantly from the majority and may indicate errors, rare events, or fraud.
Techniques to identify such unusual data points, including statistical, proximity-based, and machine learning methods.
Identifying fraudulent transactions or activities that typically manifest as anomalies in data.
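As an illustration (not part of the course's reference code), the sketch below applies two common detectors from scikit-learn to synthetic data: Isolation Forest (machine learning based) and Local Outlier Factor (proximity based). The contamination rate and the synthetic data are assumptions.

```python
# A minimal sketch of proximity-based and machine learning outlier detection
# on synthetic 2-D data; the contamination rate (5%) is an assumption.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # bulk of the data
               rng.normal(8, 1, size=(5, 2))])    # a few far-away points

# Machine learning approach: Isolation Forest flags points that are easy to isolate
iso = IsolationForest(contamination=0.05, random_state=42)
iso_labels = iso.fit_predict(X)          # -1 = outlier, 1 = inlier

# Proximity-based approach: Local Outlier Factor compares local densities
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)          # -1 = outlier, 1 = inlier

print("Isolation Forest outliers:", np.where(iso_labels == -1)[0])
print("LOF outliers:", np.where(lof_labels == -1)[0])
```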
- Significance of Mining
- Statistical significance: Confidence in results, ensured by properly prepared datasets.
- Practical significance: Real-world applicability of insights.
- Data Characteristics Influence Results
The properties of the dataset affect analysis outcomes significantly.
- Know Your Data
Preliminary exploration and descriptive statistics help understand data distributions.
- Parsimony Principle
Choose models that balance complexity and interpretability.
- Error Verification & Model Performance
Check prediction errors, rule significance, and algorithm performance rigorously.
- Validation of Results
Compare multiple methods; assess generalization capacity; combine techniques; involve domain experts to validate findings.
$$\Huge IQR = Q_3 - Q_1$$

$$\Huge \text{Outlier if } x < Q_1 - 1.5 \times IQR \text{ or } x > Q_3 + 1.5 \times IQR$$

$$\Huge Z = \frac{x - \mu}{\sigma}$$

Where $x$ is a data point, $\mu$ is the mean, and $\sigma$ is the standard deviation.
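A small numeric sketch of both rules, using made-up values:

```python
# IQR and Z-score outlier rules applied to a tiny, made-up series.
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # 102 is the obvious outlier

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print("IQR outliers:\n", iqr_outliers)

# Z-score rule. With only 10 points a sample Z-score can never exceed
# (n-1)/sqrt(n) ≈ 2.85, so a threshold of 2.5 is used here; |Z| > 3 is the
# common choice for larger samples.
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 2.5]
print("Z-score outliers:\n", z_outliers)
```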
| Paradigm | Description | Example Algorithms |
|---|---|---|
| Supervised Learning | Training with labeled data; learns a mapping from inputs to outputs | Decision Trees, Random Forest, SVM |
| Unsupervised Learning | Training with unlabeled data; discovers patterns or groups | K-Means Clustering, DBSCAN, PCA |
| Lazy Learning | Defers generalization until a query is made | K-Nearest Neighbors (KNN) |
Decision Trees: the model is trained by recursively partitioning the data on attribute splits that optimize a criterion such as information gain.
K-Nearest Neighbors (KNN): classifies new data by looking at the 'k' closest known examples (lazy learning).
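The sketch below contrasts the two on the Iris dataset (the same data used in the fraud-detection cells further down); the hyperparameters are illustrative choices, not the course's settings.

```python
# Eager learner (Decision Tree) vs. lazy learner (KNN) on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree: builds its splits (here maximizing information gain via 'entropy') at training time
tree = DecisionTreeClassifier(criterion='entropy', random_state=42).fit(X_train, y_train)

# KNN: simply stores the training data and defers the work to prediction time (lazy learning)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Decision Tree accuracy:", tree.score(X_test, y_test))
print("KNN accuracy:", knn.score(X_test, y_test))
```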
Extensive use of data mining techniques in:
- Credit analysis and prediction
- Fraud detection
- Financial market prediction
- Customer relationship management
- Corporate bankruptcy prediction
- Energy sector
- Education, logistics, supply chain management
- Environment, social networks, e-commerce
Classifying texts based on expressed sentiments (positive, negative, neutral) to measure public opinion, marketing effectiveness, and product feedback.
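A toy sketch of this idea, using TF-IDF features and Logistic Regression on a handful of invented sentences (purely illustrative, not a production sentiment pipeline):

```python
# Toy sentiment classification: TF-IDF + Logistic Regression on made-up texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, loved it", "terrible service, very disappointed",
         "excellent support and fast delivery", "awful experience, would not recommend",
         "really happy with this purchase", "worst purchase I have ever made"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["fast delivery and great support", "very disappointed with the service"]))
```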
- Technical losses: Intrinsic to electrical systems.
- Commercial losses: Errors, unmeasured consumption, fraud.
Data mining supports identifying irregularities and optimizing inspections.
Use clustering to segment typical daily electricity consumption patterns to improve demand prediction.
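A minimal sketch of this idea with K-Means on synthetic 24-hour load profiles; the two assumed consumption patterns and the number of clusters are illustrative assumptions.

```python
# Segmenting synthetic daily electricity consumption profiles with K-Means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hours = np.arange(24)

# Two synthetic consumption patterns: an evening (residential) peak and a daytime (commercial) peak
evening_peak = 1 + np.exp(-((hours - 20) ** 2) / 8)
daytime_peak = 1 + np.exp(-((hours - 13) ** 2) / 18)
profiles = np.vstack([evening_peak + rng.normal(0, 0.05, (50, 24)),
                      daytime_peak + rng.normal(0, 0.05, (50, 24))])

# Cluster the daily profiles into typical patterns
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
print("Profiles per cluster:", np.bincount(kmeans.labels_))
print("Cluster centers (typical daily curves) shape:", kmeans.cluster_centers_.shape)
```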
Data mining to predict chemical composition and optimize production processes in steel industry.
Fraud categories:
- Application Fraud: Using fake personal info to obtain cards.
- Behavioral Fraud: Unauthorized use of a genuine cardholder's data.
Fraud mitigation includes prevention (security measures) and detection (rapid identification of suspicious transactions).
Python Example: Titanic - Exploratory Data Analysis
This code walks through loading the data, exploratory analysis, cleaning, outlier detection, normalization, modeling, and validation.
Tip
Access Code: Titanic - Exploratory Data Analysis
Python Example: Fraud Detection with Mini Data
Below is the structured fraud detection code, organized cell by cell. It includes explanations about the dataset, along with additional techniques such as SMOTE for class balancing, Random Forest hyperparameter tuning, and model accuracy testing.
The evaluation covers key performance metrics, including the following (a minimal computation sketch appears right after the list):
- Accuracy
- Precision
- Recall
- F1-Score
- ROC-AUC
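The sketch below shows how these metrics are typically computed with scikit-learn; `y_test`, `y_pred`, and `y_proba` are placeholders for a fitted classifier's test labels, class predictions, and positive-class probabilities.

```python
# Minimal metric-reporting sketch; the arguments are placeholders from a fitted binary classifier.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def report_metrics(y_test, y_pred, y_proba):
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1-Score :", f1_score(y_test, y_pred))
    print("ROC-AUC  :", roc_auc_score(y_test, y_proba))
```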
Fraud Detection with Random Forest & Logistic Regression
Tip
Access Code: Fraud Detection with Random Forest & Logistic Regression
Cell 1 - Data Loading and Initial Understanding
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Load a smaller dataset (e.g., Iris dataset for binary classification - e.g., Versicolor vs Virginica)
# Carregar um conjunto de dados menor (por exemplo, conjunto de dados Iris para classificação binária - por exemplo, Versicolor vs Virginica)
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
# For binary classification, let's use only two classes (e.g., 1 and 2)
# Para classificação binária, vamos usar apenas duas classes (por exemplo, 1 e 2)
df_binary = df[df['target'].isin([1, 2])].copy()  # .copy() avoids a SettingWithCopyWarning on the next line
df_binary['target'] = df_binary['target'].replace({1: 0, 2: 1}) # Rename classes to 0 and 1
# Renomear classes para 0 e 1
# 2. Display the first few rows of the loaded DataFrame.
# Exibir as primeiras linhas do DataFrame carregado.
print("First 5 rows of the dataset:")
# Primeiras 5 linhas do conjunto de dados:
display(df_binary.head())
# 3. Display concise information about the DataFrame.
# Exibir informações concisas sobre o DataFrame.
print("\nDataset Info:")
# Informações do conjunto de dados:
df_binary.info()
# 4. Calculate and display the distribution of the target variable.
# Calcular e exibir a distribuição da variável alvo.
print("\nClass Distribution:")
# Distribuição de Classes:
display(df_binary['target'].value_counts())
# 5. Set up matplotlib for dark mode plotting.
# Configurar matplotlib para plotagem em modo escuro.
plt.style.use('dark_background')
# Set text color to white for better visibility in dark mode
# Definir a cor do texto para branco para melhor visibilidade no modo escuro
plt.rcParams['text.color'] = 'white'
plt.rcParams['axes.labelcolor'] = 'white'
plt.rcParams['xtick.color'] = 'white'
plt.rcParams['ytick.color'] = 'white'
plt.rcParams['axes.edgecolor'] = 'white'
plt.rcParams['figure.facecolor'] = '#2b2b2b' # Dark background for figure
plt.rcParams['axes.facecolor'] = '#2b2b2b' # Dark background for axes
# 6. Define a turquoise color palette.
# Definir uma paleta de cores turquesa.
turquoise_palette = ['#40E0D0', '#48D1CC', '#00CED1', '#5F9EA0', '#008B8B']
Cell 2 - Exploratory Data Analysis (EDA)
This code block carries out the initial steps of a data analysis workflow. In essence, it prepares the dataset for further exploration and offers a first look at its main characteristics, laying the groundwork for more detailed analysis or modeling.
import os
# Define the directory for saving plots
plot_dir = '/content/plots'
if not os.path.exists(plot_dir):
os.makedirs(plot_dir)
# 1. Create histograms for each feature in df_binary with dual-language titles and labels.
# 1. Criar histogramas para cada característica em df_binary com títulos e rótulos em dois idiomas.
print("Feature Distributions (Histograms):")
# Distribuições das Características (Histogramas):
# Use only the first color from the palette for histograms
df_binary.hist(figsize=(12, 10), color=turquoise_palette[0], bins=15)
plt.suptitle('Feature Distributions / Distribuições das Características', y=1.02, fontsize=16)
plt.tight_layout()
plt.savefig(f'{plot_dir}/feature_histograms.png') # Save the histogram plot before show(), which clears the current figure
plt.show()
# 2. Generate box plots for each feature, comparing distributions across target classes with dual-language titles and labels.
# 2. Gerar box plots para cada característica, comparando as distribuições entre as classes alvo com títulos e rótulos em dois idiomas.
print("\nFeature Distributions by Target Class (Box Plots):")
# Distribuições das Características por Classe Alvo (Box Plots):
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for i, col in enumerate(df_binary.columns[:-1]):
# Removed palette argument from boxplot as it's not used with hue and causes a warning
sns.boxplot(x='target', y=col, data=df_binary, ax=axes[i])
axes[i].set_title(f'{col} Distribution by Target Class / Distribuição de {col} por Classe Alvo')
axes[i].set_xlabel('Target Class / Classe Alvo')
axes[i].set_ylabel(col)
plt.tight_layout()
plt.savefig(f'{plot_dir}/feature_box_plots.png') # Save the box plots before show()
plt.show()
# 3. Create a pair plot of the features in df_binary, colored by the 'target' variable, with a dual-language title.
# 3. Criar um pair plot das características em df_binary, colorido pela variável 'target', com um título em dois idiomas.
print("\nPair Plot of Features by Target Class:")
# Pair Plot das Características por Classe Alvo:
# Use only the first two colors from the palette for the two classes
sns.pairplot(df_binary, hue='target', palette=turquoise_palette[:2], diag_kind='kde')
plt.suptitle('Pair Plot of Features by Target Class / Pair Plot das Características por Classe Alvo', y=1.02, fontsize=16)
plt.savefig(f'{plot_dir}/feature_pair_plot.png') # Save the pair plot before show()
plt.show()
# 4. Calculate and display the correlation matrix for the features in df_binary and visualize it with a heatmap and dual-language titles and labels.
# 4. Calcular e exibir a matriz de correlação para as características em df_binary e visualizá-la com um heatmap e títulos e rótulos em dois idiomas.
print("\nCorrelation Matrix:")
# Matriz de Correlação:
correlation_matrix_binary = df_binary.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_binary, annot=True, cmap='viridis', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix / Matriz de Correlação', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig(f'{plot_dir}/correlation_matrix_heatmap.png') # Save the heatmap before show()
plt.show()




Cell 3 - Data Preparation
# 1. Check for missing values in the df_binary DataFrame and print the count for each column.
# 1. Verificar valores ausentes no DataFrame df_binary e imprimir a contagem para cada coluna.
print("Checking for missing values / Verificando valores ausentes:")
print(df_binary.isnull().sum())
# 2. If missing values are found, handle them appropriately for numerical data (e.g., imputation with the mean or median).
# Based on the previous df_binary.info() output, there are no missing values.
# Com base na saída anterior de df_binary.info(), não há valores ausentes.
# No action needed for missing values in this case.
# Nenhuma ação necessária para valores ausentes neste caso.
# 3. Separate the features (X) and the target variable (y) from the df_binary DataFrame.
# 3. Separar as características (X) e a variável alvo (y) do DataFrame df_binary.
X = df_binary.drop('target', axis=1)
y = df_binary['target']
print("\nFeatures (X) and Target (y) separated. / Características (X) e Alvo (y) separados.")
# 4. Scale the numerical features using StandardScaler.
# Fit the scaler only on the training data to prevent data leakage.
# 4. Escalar as características numéricas usando StandardScaler.
# Ajustar o scaler apenas nos dados de treinamento para evitar vazamento de dados.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# 5. Split the data into training and testing sets.
# 5. Dividir os dados em conjuntos de treinamento e teste.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Using stratify for balanced classes
# Usando stratify para classes balanceadas
# Fit and transform the scaler on the training data
# Ajustar e transformar o scaler nos dados de treinamento
X_train_scaled = scaler.fit_transform(X_train)
# Transform the test data using the fitted scaler
# Transformar os dados de teste usando o scaler ajustado
X_test_scaled = scaler.transform(X_test)
print("\nData split into training and testing sets (80/20). / Dados divididos em conjuntos de treinamento e teste (80/20).")
print("Features scaled using StandardScaler. / Características escaladas usando StandardScaler.")
print(f"X_train shape: {X_train_scaled.shape}, X_test shape: {X_test_scaled.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
Cell 4 - Handle Class Imbalance
# 1. Check the class distribution of the training set (y_train) to confirm if class imbalance exists.
# Print the value counts with a dual-language explanation.
print("Class distribution in the training set (y_train):")
# Distribuição de classes no conjunto de treinamento (y_train):
display(y_train.value_counts())
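A sketch of how the SMOTE balancing step mentioned above could continue from here, assuming the `imbalanced-learn` package is installed (`pip install imbalanced-learn`) and that `X_train_scaled` / `y_train` come from Cell 3. Note that this Iris subset is already balanced, so the step is shown only to illustrate the technique.

```python
# Sketch of SMOTE oversampling on the scaled training data; requires imbalanced-learn.
import pandas as pd
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print("Class distribution after SMOTE / Distribuição de classes após SMOTE:")
print(pd.Series(y_train_balanced).value_counts())
```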
=======================================Still Surfing this Repo 🏄 ===========================================
1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence – A Machine Learning Approach. 2nd Ed. LTC.
3. Larson & Farber (2015). Applied Statistics. Pearson.
🛸๋ My Contacts Hub
────────────── 🔭⋆ ──────────────
➣➢➤ Back to Top
Copyright 2025 Quantum Software Development. Code released under the MIT License.