
4- Data mining identifies patterns in large datasets, while exploratory analysis examines and summarizes data to understand its initial characteristics.

Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis



[🇧🇷 Português] [🇺🇸 English]





Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)



Sponsor Quantum Software Development






Important

⚠️ Heads Up







🎶 Prelude Suite No. 1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

📺 For better resolution, watch the video on YouTube.



Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

Access Data Mining Main Repository

If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.





This project provides a comprehensive introduction to Data Mining and AI, based on the workbook from PUC-SP by Prof. Dr. Daniel Rodrigues da Silva. It covers:

  • Foundations of data mining and KDD
  • The relationship between data, information, and knowledge
  • Mathematical foundations (inflection points, maxima, minima)
  • Overview of supervised and unsupervised learning
  • Practical coding examples in Python and R
  • Real-world cases such as credit classification, fruit clustering, anomaly detection, and industrial applications




Machine learning project development follows an iterative, structured process:


+------------------------------+
|        Dataset / Data        |
+------------------------------+
               |
               v
+------------------------------+
|        Preprocessing         |
|   (Cleaned / Labeled Data)   |
+------------------------------+
               |
               v
+------------------------------+
|       Cluster Training       |
| (Parallel across GPUs/CPUs)  |
+------------------------------+
               |
               v
+------------------------------+
|     Validation & Tuning      |
+------------------------------+
               |
               v
+------------------------------+
|        Trained Model         |
+------------------------------+
               |
               v
+------------------------------+
|   Inference in Production    |
+------------------------------+
               |
               v
+------------------------------+
|     Feedback (New Data)      |
+------------------------------+




  1. Data Collection (Dataset): Everything starts with collecting the dataset used for training. Example: 1 million images for a facial recognition model.

  2. Preprocessing: Clean, standardize, and organize the data. Example: resize images, remove noise, and label them properly. Simple machines or servers suffice.

  3. Sending to the Cluster: Data is sent to clusters (dozens or hundreds of GPUs/CPUs) for parallel processing. Example: upload to AWS, Google Cloud, or a private cluster.

  4. Training on the Cluster: The workload is split across multiple machines for faster training. Example: GPUs process parts of the batches; the results are combined into the final model.

  5. Validation and Tuning: Test on a validation subset to check accuracy; tune hyperparameters until the objectives are met.

  6. Inference in Production: Deploy the trained model on servers/clusters for real-time predictions. Example: a user uploads a photo; the model recognizes the face in seconds.

  7. Feedback and Update: Collect new data to retrain and continually improve the model. Example: user data expands the dataset for the next training cycle (a minimal code sketch of this loop follows the list).
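
The sketch below mirrors this loop on a very small scale with scikit-learn on a synthetic dataset. It is illustrative only: the dataset, the parameter grid, and the model choice are assumptions made here, and no real cluster is involved.

# Minimal sketch: collect -> preprocess -> train -> validate/tune -> predict
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Steps 1-2: "collect" and preprocess (here, a synthetic, already-clean dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 4-5: train and tune hyperparameters with cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print("Best C:", grid.best_params_['C'])
print("Validation accuracy:", grid.best_estimator_.score(X_val, y_val))

# Step 6: inference on new, unseen data
print("Prediction for one new sample:", grid.best_estimator_.predict(X_val[:1]))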



Tip

This Repository offers a complete, clear, and accessible guide for learners and practitioners working on data mining and AI projects.





pip install pandas numpy scikit-learn seaborn matplotlib



Include a requirements.txt for convenience:


pandas
numpy
scikit-learn
seaborn
matplotlib




install.packages(c("tidyverse", "caret", "cluster"))



Python Clustering Example


Cell 1: KMeans Clustering Algorithm Example with Two Clusters for Data Segmentation


# Example of clustering data using KMeans to segment into two clusters
import pandas as pd
from sklearn.cluster import KMeans

# Creating a DataFrame with two features
data = pd.DataFrame({'shape': [5, 4, 1, 2], 'color': [7, 8, 2, 1]})

# Initializing KMeans with 2 clusters and fitting to the data
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)

# Adding a 'cluster' column indicating the cluster assignment for each point
data['cluster'] = kmeans.labels_

# Printing the data with the cluster labels
print(data)

Cell 2: Classification Example

from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.DataFrame({
    'income': [2300, 4000, 1200, 6000],
    'score': [600, 700, 550, 800],
    'approved': [0, 1, 0, 1]
})
X = df[['income', 'score']]
y = df['approved']
model = LogisticRegression().fit(X, y)
print(model.predict([[3000, 650]]))

Cell 3: Creating a Sample Dataset (Python code with comments)


# Import pandas for data manipulation
import pandas as pd

# Create a simple dataset with categorical and numerical data
data = pd.DataFrame({
    'age': [25, 30, 22, 40],         # Age in years
    'height': [186, 164, 175, 180],  # Height in cm
    'gender': ['male', 'female', 'male', 'female']  # Gender category
})

# Display the dataset
print("Sample Dataset:")
print(data)

Cell 4: K-Means Clustering Example (Python code with comments)


from sklearn.cluster import KMeans

# Sample data with two features: shape and color
fruit_data = pd.DataFrame({
    'shape': [5, 4, 1, 2],
    'color': [7, 8, 2, 1],
})

# Create K-Means model with 2 clusters and fixed random state
kmeans = KMeans(n_clusters=2, random_state=42).fit(fruit_data)

# Assign cluster labels to data points
fruit_data['cluster'] = kmeans.labels_

# Show clustered data
print("Clustered Data:")
print(fruit_data)

Cell 5: Logistic Regression for Credit Approval (Python code with comments)


from sklearn.linear_model import LogisticRegression

# Sample credit data: income, score, and approval status
credit_data = pd.DataFrame({
    'income': [2300, 4000, 1200, 6000],
    'score': [600, 700, 550, 800],
    'approved': [0, 1, 0, 1]  # 0 - Denied, 1 - Approved
})

# Features and target variable
X = credit_data[['income', 'score']]
y = credit_data['approved']

# Build logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X, y)

# Predict credit approval for a new client
new_client = [[3000, 650]]
prediction = model.predict(new_client)

# Print prediction result
print(f"Prediction for new client {new_client}: {'Approved' if prediction[^0] == 1 else 'Denied'}")

Cell 6: Visualizing Cluster Centers (Python code with comments)


import matplotlib.pyplot as plt

# Get cluster centers
centers = kmeans.cluster_centers_

# Plot data points colored by cluster
plt.scatter(fruit_data['shape'], fruit_data['color'], c=fruit_data['cluster'], cmap='viridis')

# Plot cluster centers as red 'X'
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')

# Set labels and title
plt.xlabel("Shape")
plt.ylabel("Color")
plt.title("K-Means Clustering with Centroids")

# Show plot
plt.show()



R Clustering Example

fruit <- data.frame(shape=c(5,4,1,2), color=c(7,8,2,1))
km <- kmeans(fruit, centers=2)
print(km$cluster)

R Classification Example

data <- data.frame(income=c(2300,4000,1200,6000), score=c(600,700,550,800), approved=c(0,1,0,1))
model <- glm(approved ~ income + score, data = data, family = binomial)
predict(model, newdata=data.frame(income=3000, score=650), type="response")



Knowledge Discovery in Databases (KDD)

KDD spans the data selection, preprocessing, mining, and validation steps, ensuring that the extracted knowledge is meaningful and valuable.
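
As a rough illustration (not taken from the workbook), the KDD steps can be sketched on a toy DataFrame. The column names, the choice of K-Means for the mining step, and the silhouette check for validation are assumptions made here.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

raw = pd.DataFrame({
    'income': [2300, 4000, 1200, 6000, 3500, 800],
    'score':  [600, 700, 550, 800, 650, 500],
    'notes':  ['a', 'b', 'c', 'd', 'e', 'f'],   # attribute irrelevant to the task
})

# Selection: keep only the relevant attributes
selected = raw[['income', 'score']]

# Preprocessing: put the features on a comparable scale
scaled = StandardScaler().fit_transform(selected)

# Mining: extract patterns (here, two clusters via K-Means)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(scaled)

# Validation: check whether the discovered structure is meaningful
print("Silhouette score:", silhouette_score(scaled, labels))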



| Concept | Description | Formula |
|---|---|---|
| Inflection | Point where the curvature changes sign | $f''(x_0) = 0$, with a change of concavity |
| Maximum | Local peak | $f'(x_0) = 0$, $f''(x_0) < 0$ |
| Minimum | Local trough | $f'(x_0) = 0$, $f''(x_0) > 0$ |
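
A short SymPy check (an illustration added in this guide, not part of the workbook; it assumes `pip install sympy`) shows how these conditions are applied to a sample function:

import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                 # sample function chosen for illustration

f1 = sp.diff(f, x)             # f'(x)  = 3x^2 - 3
f2 = sp.diff(f, x, 2)          # f''(x) = 6x

for c in sp.solve(f1, x):      # critical points: f'(x) = 0
    concavity = f2.subs(x, c)
    kind = "maximum" if concavity < 0 else "minimum" if concavity > 0 else "inconclusive"
    print(f"x = {c}: f''(x) = {concavity} -> local {kind}")

print("Inflection candidates (f''(x) = 0):", sp.solve(f2, x))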




| Type | Example |
|---|---|
| Discrete | Loan approved: Yes/No |
| Continuous | Loan amount: $1000 to $10,000+ |












| Name | Use | Language |
|---|---|---|
| pandas | Data manipulation | Python |
| NumPy | Numerical computations | Python |
| seaborn | Data visualization | Python |
| scikit-learn | Machine learning algorithms | Python |
| tidyverse | Data wrangling and plotting | R |
| caret | Machine learning in R | R |




| Type | Known Labels | Purpose | Algorithms |
|---|---|---|---|
| Supervised | Yes | Predict labels/values | Logistic Regression, SVM, Decision Trees |
| Unsupervised | No | Discover patterns or clusters | K-Means, Hierarchical Clustering, DBSCAN |




Clusters group similar data by minimizing intra-cluster distances. The Euclidean distance is calculated as:



$$ \Huge d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2} $$




Important

  • Objects are assigned to the cluster with the nearest centroid (see the sketch below).
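
A small NumPy sketch (illustrative values, not from the workbook) of this nearest-centroid rule using the Euclidean distance above:

import numpy as np

points = np.array([[5, 7], [4, 8], [1, 2], [2, 1]])    # (shape, color) pairs
centroids = np.array([[4.5, 7.5], [1.5, 1.5]])          # two cluster centers

# Euclidean distance from every point to every centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each object goes to the cluster whose centroid is closest
assignments = distances.argmin(axis=1)
print(assignments)   # expected: [0 0 1 1]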



Predict if credit will be approved based on income and score data.



Attributes like shape and color are used to cluster fruits into categories with K-means.



- Detect rare, irregular events (fraud, anomalies) statistically or by distance metrics (a small sketch follows this list).

- Association rules discover frequently co-occurring attributes (e.g., smartphone buyers often subscribe to data plans).
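
As a hedged sketch of the anomaly-detection idea, scikit-learn's IsolationForest is used below as one possible technique; the transaction data and the contamination rate are assumptions made for illustration.

import pandas as pd
from sklearn.ensemble import IsolationForest

transactions = pd.DataFrame({
    'amount':  [120, 80, 95, 110, 5000, 100],   # one clearly unusual value
    'per_day': [3, 2, 4, 3, 40, 2],
})

# fit_predict returns 1 for normal observations and -1 for anomalies
transactions['anomaly'] = IsolationForest(
    contamination=0.2, random_state=0).fit_predict(transactions)
print(transactions)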



- Finance: fraud detection, credit scoring

- Energy: load forecasting, loss reduction

- Agriculture: crop yield prediction

- Web: sentiment analysis, customer segmentation



- Contributions are welcome via pull requests

- See CONTRIBUTING.md for details



1. Castro, L. N. and Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd ed. LTC.

3. Larson, R. and Farber, B. (2015). Applied Statistics. Pearson.







🛸 My Contacts Hub





────────────── 🔭⋆ ──────────────

➣➒➀ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
