
4- Data mining identifies patterns in large datasets, while exploratory analysis examines and summarizes data to understand its initial characteristics.

Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis



[🇧🇷 Português] [🇺🇸 English]





Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)



Sponsor Quantum Software Development






Important

⚠️ Heads Up







🎶 Prelude Suite No. 1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

📺 For better resolution, watch the video on YouTube.



Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

Access Data Mining Main Repository

If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.





This project provides a comprehensive introduction to Data Mining and AI, based on the workbook from PUC-SP by Prof. Dr. Daniel Rodrigues da Silva. It covers:

  • Foundations of data mining and KDD
  • The relationship between data, information, and knowledge
  • Mathematical foundations (inflection points, maxima, minima)
  • Overview of supervised and unsupervised learning
  • Practical coding examples in Python and R
  • Real-world cases such as credit classification, fruit clustering, anomaly detection, and industrial applications




Machine learning project development follows an iterative, structured process:


+------------------------------+
|        Dataset / Data        |
+------------------------------+
               |
               v
+------------------------------+
|        Preprocessing         |
|   (Cleaned / Labeled Data)   |
+------------------------------+
               |
               v
+------------------------------+
|       Cluster Training       |
| (Parallel across GPUs/CPUs)  |
+------------------------------+
               |
               v
+------------------------------+
|     Validation & Tuning      |
+------------------------------+
               |
               v
+------------------------------+
|        Trained Model         |
+------------------------------+
               |
               v
+------------------------------+
|   Inference in Production    |
+------------------------------+
               |
               v
+------------------------------+
|     Feedback (New Data)      |
+------------------------------+




  1. Data Collection (Dataset): Everything starts with collecting the dataset used for training. Example: 1 million images for a facial recognition model.

  2. Preprocessing: Clean, standardize, and organize the data. Example: resize images, remove noise, and label them properly. Simple machines or servers suffice.

  3. Sending to the Cluster: Data is sent to clusters (dozens or hundreds of GPUs/CPUs) for parallel processing. Example: upload to AWS, Google Cloud, or a private cluster.

  4. Training on the Cluster: The workload is split across multiple machines for faster training. Example: GPUs process parts of the batches; the results are combined into the final model.

  5. Validation and Tuning: Test on a validation subset to check accuracy; tune hyperparameters until the objectives are met.

  6. Inference in Production: Deploy the trained model on servers/clusters for real-time predictions. Example: a user uploads a photo; the model recognizes the face in seconds.

  7. Feedback and Update: Collect new data to retrain and continually improve the model. Example: user data expands the dataset for the next training cycle (a minimal code sketch of this loop follows the list).
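
The sketch below mirrors this loop on a very small scale with scikit-learn on a synthetic dataset. It is illustrative only: the dataset, the parameter grid, and the model choice are assumptions made here, and no real cluster is involved.

# Minimal sketch: collect -> preprocess -> train -> validate/tune -> predict
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Steps 1-2: "collect" and preprocess (here, a synthetic, already-clean dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 4-5: train and tune hyperparameters with cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print("Best C:", grid.best_params_['C'])
print("Validation accuracy:", grid.best_estimator_.score(X_val, y_val))

# Step 6: inference on new, unseen data
print("Prediction for one new sample:", grid.best_estimator_.predict(X_val[:1]))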



Tip

This Repository offers a complete, clear, and accessible guide for learners and practitioners working on data mining and AI projects.





pip install pandas numpy scikit-learn seaborn matplotlib



Include a requirements.txt for convenience:


pandas
numpy
scikit-learn
seaborn
matplotlib




install.packages(c("tidyverse", "caret", "cluster"))



Python Clustering Example


Cell 1: KMeans Clustering Algorithm Example with Two Clusters for Data Segmentation


# Example of clustering data using KMeans to segment into two clusters
import pandas as pd
from sklearn.cluster import KMeans

# Creating a DataFrame with two features
data = pd.DataFrame({'shape': [5, 4, 1, 2], 'color': [7, 8, 2, 1]})

# Initializing KMeans with 2 clusters and fitting to the data
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)

# Adding a 'cluster' column indicating the cluster assignment for each point
data['cluster'] = kmeans.labels_

# Printing the data with the cluster labels
print(data)

Cell 2: Classification Example

from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.DataFrame({
    'income': [2300, 4000, 1200, 6000],
    'score': [600, 700, 550, 800],
    'approved': [0, 1, 0, 1]
})
X = df[['income', 'score']]
y = df['approved']
model = LogisticRegression().fit(X, y)
print(model.predict([[3000, 650]]))

Cell 3: Creating a Sample Dataset (Python code with comments)


# Import pandas for data manipulation
import pandas as pd

# Create a simple dataset with categorical and numerical data
data = pd.DataFrame({
    'age': [25, 30, 22, 40],         # Age in years
    'height': [186, 164, 175, 180],  # Height in cm
    'gender': ['male', 'female', 'male', 'female']  # Gender category
})

# Display the dataset
print("Sample Dataset:")
print(data)

Cell 4: K-Means Clustering Example (Python code with comments)


from sklearn.cluster import KMeans

# Sample data with two features: shape and color
fruit_data = pd.DataFrame({
    'shape': [5, 4, 1, 2],
    'color': [7, 8, 2, 1],
})

# Create K-Means model with 2 clusters and fixed random state
kmeans = KMeans(n_clusters=2, random_state=42).fit(fruit_data)

# Assign cluster labels to data points
fruit_data['cluster'] = kmeans.labels_

# Show clustered data
print("Clustered Data:")
print(fruit_data)

Cell 5: Logistic Regression for Credit Approval (Python code with comments)


from sklearn.linear_model import LogisticRegression

# Sample credit data: income, score, and approval status
credit_data = pd.DataFrame({
    'income': [2300, 4000, 1200, 6000],
    'score': [600, 700, 550, 800],
    'approved': [0, 1, 0, 1]  # 0 - Denied, 1 - Approved
})

# Features and target variable
X = credit_data[['income', 'score']]
y = credit_data['approved']

# Build logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X, y)

# Predict credit approval for a new client
new_client = [[3000, 650]]
prediction = model.predict(new_client)

# Print prediction result
print(f"Prediction for new client {new_client}: {'Approved' if prediction[^0] == 1 else 'Denied'}")

Cell 6: Visualizing Cluster Centers (Python code with comments)


import matplotlib.pyplot as plt

# Get cluster centers
centers = kmeans.cluster_centers_

# Plot data points colored by cluster
plt.scatter(fruit_data['shape'], fruit_data['color'], c=fruit_data['cluster'], cmap='viridis')

# Plot cluster centers as red 'X'
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')

# Set labels and title
plt.xlabel("Shape")
plt.ylabel("Color")
plt.title("K-Means Clustering with Centroids")

# Show plot
plt.show()



R Clustering Example

fruit <- data.frame(shape=c(5,4,1,2), color=c(7,8,2,1))
km <- kmeans(fruit, centers=2)
print(km$cluster)

R Classification Example

data <- data.frame(income=c(2300,4000,1200,6000), score=c(600,700,550,800), approved=c(0,1,0,1))
model <- glm(approved ~ income + score, data = data, family = binomial)
predict(model, newdata=data.frame(income=3000, score=650), type="response")



Knowledge Discovery in Databases (KDD)

KDD spans the data selection, preprocessing, mining, and validation steps, ensuring that the extracted knowledge is meaningful and valuable.
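
As a rough illustration (not taken from the workbook), the KDD steps can be sketched on a toy DataFrame. The column names, the choice of K-Means for the mining step, and the silhouette check for validation are assumptions made here.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

raw = pd.DataFrame({
    'income': [2300, 4000, 1200, 6000, 3500, 800],
    'score':  [600, 700, 550, 800, 650, 500],
    'notes':  ['a', 'b', 'c', 'd', 'e', 'f'],   # attribute irrelevant to the task
})

# Selection: keep only the relevant attributes
selected = raw[['income', 'score']]

# Preprocessing: put the features on a comparable scale
scaled = StandardScaler().fit_transform(selected)

# Mining: extract patterns (here, two clusters via K-Means)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(scaled)

# Validation: check whether the discovered structure is meaningful
print("Silhouette score:", silhouette_score(scaled, labels))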



| Concept | Description | Formula |
|---|---|---|
| Inflection | Point where the curvature changes sign | $f''(x_0) = 0$, with a change of concavity |
| Maximum | Local peak | $f'(x_0) = 0$, $f''(x_0) < 0$ |
| Minimum | Local trough | $f'(x_0) = 0$, $f''(x_0) > 0$ |
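
A short SymPy check (an illustration added in this guide, not part of the workbook; it assumes `pip install sympy`) shows how these conditions are applied to a sample function:

import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                 # sample function chosen for illustration

f1 = sp.diff(f, x)             # f'(x)  = 3x^2 - 3
f2 = sp.diff(f, x, 2)          # f''(x) = 6x

for c in sp.solve(f1, x):      # critical points: f'(x) = 0
    concavity = f2.subs(x, c)
    kind = "maximum" if concavity < 0 else "minimum" if concavity > 0 else "inconclusive"
    print(f"x = {c}: f''(x) = {concavity} -> local {kind}")

print("Inflection candidates (f''(x) = 0):", sp.solve(f2, x))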




| Type | Example |
|---|---|
| Discrete | Loan approved: Yes/No |
| Continuous | Loan amount: $1000 to $10,000+ |












| Name | Use | Language |
|---|---|---|
| pandas | Data manipulation | Python |
| NumPy | Numerical computations | Python |
| seaborn | Data visualization | Python |
| scikit-learn | Machine learning algorithms | Python |
| tidyverse | Data wrangling and plotting | R |
| caret | Machine learning in R | R |




| Type | Known Labels | Purpose | Algorithms |
|---|---|---|---|
| Supervised | Yes | Predict labels/values | Logistic Regression, SVM, Decision Trees |
| Unsupervised | No | Discover patterns or clusters | K-Means, Hierarchical Clustering, DBSCAN |




Clusters group similar data by minimizing intra-cluster distances. The Euclidean distance is calculated as:



$$ \Huge d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2} $$




Important

  • Objects are assigned to the cluster with the nearest centroid (see the sketch below).
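
A small NumPy sketch (illustrative values, not from the workbook) of this nearest-centroid rule using the Euclidean distance above:

import numpy as np

points = np.array([[5, 7], [4, 8], [1, 2], [2, 1]])    # (shape, color) pairs
centroids = np.array([[4.5, 7.5], [1.5, 1.5]])          # two cluster centers

# Euclidean distance from every point to every centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each object goes to the cluster whose centroid is closest
assignments = distances.argmin(axis=1)
print(assignments)   # expected: [0 0 1 1]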



Predict if credit will be approved based on income and score data.



Attributes like shape and color are used to cluster fruits into categories with K-means.



- Detect rare, irregular events (fraud, anomalies) statistically or by distance metrics (a small sketch follows this list).

- Association rules discover frequently co-occurring attributes (e.g., smartphone buyers often subscribe to data plans).
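
As a hedged sketch of the anomaly-detection idea, scikit-learn's IsolationForest is used below as one possible technique; the transaction data and the contamination rate are assumptions made for illustration.

import pandas as pd
from sklearn.ensemble import IsolationForest

transactions = pd.DataFrame({
    'amount':  [120, 80, 95, 110, 5000, 100],   # one clearly unusual value
    'per_day': [3, 2, 4, 3, 40, 2],
})

# fit_predict returns 1 for normal observations and -1 for anomalies
transactions['anomaly'] = IsolationForest(
    contamination=0.2, random_state=0).fit_predict(transactions)
print(transactions)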



- Finance: fraud detection, credit scoring

- Energy: load forecasting, loss reduction

- Agriculture: crop yield prediction

- Web: sentiment analysis, customer segmentation



- Contributions are welcome via pull requests

- See CONTRIBUTING.md for details



1. Castro, L. N. and Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd ed. LTC.

3. Larson, R. and Farber, B. (2015). Applied Statistics. Pearson.







🛸 My Contacts Hub





────────────── 🔭⋆ ──────────────

➣➒➀ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
