[🇧🇷 Português] [🇺🇸 English]
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access Data Mining Main Repository
If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.
- Project Overview
- AI Project Workflow
- Installation and Requirements
- Usage Examples
- Knowledge Discovery in Databases (KDD)
- Mathematical Concepts
- Discrete vs Continuous Values
- Primary Libraries and Tools
- Supervised vs Unsupervised Learning
- Clustering and Distance Calculations
- Credit Analysis: Classification Example
- Fruit Clustering Example
- Anomaly Detection and Association Rules
- Applications
- Repository Structure and Documentation
- Contributing and License
- References
This project provides a comprehensive introduction to Data Mining and AI, based on the workbook from PUC-SP by Prof. Dr. Daniel Rodrigues da Silva. It covers:
- Foundations of data mining and KDD
- The relationship between data, information, and knowledge
- Mathematical foundations (inflection points, maxima, minima)
- Overview of supervised and unsupervised learning
- Practical coding examples in Python and R
- Real-world cases such as credit classification, fruit clustering, anomaly detection, and industrial applications
Machine learning project development follows an iterative, structured process:
+-----------------+      +---------------------+      +----------------------+
|  Dataset/Data   | ---> |    Preprocessing    | ---> |   Cluster Training   |
+-----------------+      +---------------------+      +----------------------+
                                    |                            |
                                    v                            v
                         +---------------------+      +----------------------+
                         |   Cleaned/Labeled   |      |  Parallel Training   |
                         |        Data         |      +----------------------+
                         +---------------------+                 |
                                    |                            v
                                    v                 +----------------------+
                         +---------------------+      |     Validation &     |
                         |     Validation      |      |        Tuning        |
                         +---------------------+      +----------------------+
                                                                 |
                                                                 v
                                                      +---------------------+
                                                      |    Trained Model    |
                                                      +---------------------+
                                                                 |
                                                                 v
                                                    +-------------------------+
                                                    | Inference in Production |
                                                    +-------------------------+
                                                                 |
                                                                 v
                                                    +-------------------------+
                                                    |   Feedback (New Data)   |
                                                    +-------------------------+
- Data Collection (Dataset): Everything starts with collecting the dataset used for training. Example: 1 million images for a facial recognition model.
- Preprocessing: Clean, standardize, and organize the data. Example: resize images, remove noise, and label them properly. An ordinary machine or server is enough for this step.
- Sending to the Cluster: The data is sent to a cluster (dozens or hundreds of GPUs/CPUs) for parallel processing. Example: upload to AWS, Google Cloud, or a private cluster.
- Training on the Cluster: The workload is split across multiple machines for faster training. Example: each GPU processes part of a batch, and the results are combined into the final model.
- Validation and Tuning: Test the model on a validation subset to check accuracy, and tune hyperparameters until the objectives are met.
- Inference in Production: Deploy the trained model on servers or clusters for real-time predictions. Example: a user uploads a photo and the model recognizes the face in seconds.
- Feedback and Update: Collect new data to retrain and improve the model continually. Example: user data expands the dataset for the next training cycle.
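The steps above can be sketched on a single machine with scikit-learn. This is only a minimal illustration, not the course's reference pipeline: the synthetic dataset from `make_classification`, the candidate values of `C`, and all variable names are assumptions, and cluster-scale parallel training is out of scope.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: a synthetic stand-in for a real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a validation subset that is used only for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocess and train, then tune one hyperparameter on the validation set.
best_score, best_model = -1.0, None
for C in (0.01, 0.1, 1.0, 10.0):  # hypothetical candidate values
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy
    if score > best_score:
        best_score, best_model = score, model

# Inference: predict the label of one "new" observation.
new_sample = X_val[:1]
print("validation accuracy:", round(best_score, 3))
print("prediction:", best_model.predict(new_sample)[0])
```

In a feedback cycle, newly collected rows would be appended to `X` and `y` and the same loop run again.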
Tip
This repository offers a clear and accessible guide for learners and practitioners working on data mining and AI projects.
pip install pandas numpy scikit-learn seaborn matplotlib
pandas
numpy
scikit-learn
seaborn
matplotlib
install.packages(c("tidyverse", "caret", "cluster"))
Python Clustering Example
Cell 1: KMeans Clustering Algorithm Example with Two Clusters for Data Segmentation
# Example of clustering data using KMeans to segment into two clusters
import pandas as pd
from sklearn.cluster import KMeans
# Creating a DataFrame with two features
data = pd.DataFrame({'shape': [5, 4, 1, 2], 'color': [7, 8, 2, 1]})
# Initializing KMeans with 2 clusters and fitting to the data
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)
# Adding a 'cluster' column indicating the cluster assignment for each point
data['cluster'] = kmeans.labels_
# Printing the data with the cluster labels
print(data)
Cell 2: Classification Example
from sklearn.linear_model import LogisticRegression
import pandas as pd
df = pd.DataFrame({
'income': [2300, 4000, 1200, 6000],
'score': [600, 700, 550, 800],
'approved': [0, 1, 0, 1]
})
X = df[['income', 'score']]
y = df['approved']
model = LogisticRegression().fit(X, y)
print(model.predict([[3000, 650]]))
Cell 3: Creating a Sample Dataset (Python code with comments)
# Import pandas for data manipulation
import pandas as pd
# Create a simple dataset with categorical and numerical data
data = pd.DataFrame({
'age': [25, 30, 22, 40], # Age in years
'height': [186, 164, 175, 180], # Height in cm
'gender': ['male', 'female', 'male', 'female'] # Gender category
})
# Display the dataset
print("Sample Dataset:")
print(data)
Cell 4: K-Means Clustering Example (Python code with comments)
import pandas as pd
from sklearn.cluster import KMeans
# Sample data with two features: shape and color
fruit_data = pd.DataFrame({
'shape': [5, 4, 1, 2],
'color': [7, 8, 2, 1],
})
# Create K-Means model with 2 clusters and fixed random state
kmeans = KMeans(n_clusters=2, random_state=42).fit(fruit_data)
# Assign cluster labels to data points
fruit_data['cluster'] = kmeans.labels_
# Show clustered data
print("Clustered Data:")
print(fruit_data)
Cell 5: Logistic Regression for Credit Approval (Python code with comments)
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Sample credit data: income, score, and approval status
credit_data = pd.DataFrame({
'income': [2300, 4000, 1200, 6000],
'score': [600, 700, 550, 800],
'approved': [0, 1, 0, 1] # 0 - Denied, 1 - Approved
})
# Features and target variable
X = credit_data[['income', 'score']]
y = credit_data['approved']
# Build logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X, y)
# Predict credit approval for a new client
new_client = [[3000, 650]]
prediction = model.predict(new_client)
# Print prediction result
print(f"Prediction for new client {new_client}: {'Approved' if prediction[0] == 1 else 'Denied'}")
Cell 6: Visualizing Cluster Centers (Python code with comments)
import matplotlib.pyplot as plt
# Get cluster centers
centers = kmeans.cluster_centers_
# Plot data points colored by cluster
plt.scatter(fruit_data['shape'], fruit_data['color'], c=fruit_data['cluster'], cmap='viridis')
# Plot cluster centers as red 'X'
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
# Set labels and title
plt.xlabel("Shape")
plt.ylabel("Color")
plt.title("K-Means Clustering with Centroids")
# Show plot
plt.show()
R Clustering Example
fruit <- data.frame(shape=c(5,4,1,2), color=c(7,8,2,1))
km <- kmeans(fruit, centers=2)
print(km$cluster)
R Classification Example
data <- data.frame(income=c(2300,4000,1200,6000), score=c(600,700,550,800), approved=c(0,1,0,1))
model <- glm(approved ~ income + score, data, family=binomial)
predict(model, newdata=data.frame(income=3000, score=650), type="response")
Knowledge Discovery in Databases (KDD) spans data selection, preprocessing, mining, and validation. Each step helps ensure that the extracted knowledge is meaningful and valuable.
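As a rough illustration of these KDD stages, the sketch below walks a tiny table through selection, preprocessing, mining, and validation. The table, its column names, the choice of K-Means for the mining step, and the silhouette score for validation are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical raw table; values and column names are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "income": [2300, 4000, None, 6000, 1200, 5800],
    "score": [600, 700, 650, 800, 550, 790],
})

# Selection: keep only the attributes relevant to the question.
selected = raw[["income", "score"]]

# Preprocessing: drop incomplete rows, then standardize each column.
clean = selected.dropna()
scaled = (clean - clean.mean()) / clean.std()

# Mining: group the records into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Validation: the silhouette score (between -1 and 1) measures cluster separation.
quality = silhouette_score(scaled, kmeans.labels_)
print(clean.assign(cluster=kmeans.labels_))
print("silhouette:", round(quality, 3))
```

A silhouette score close to 1 suggests well-separated clusters. With so few rows the number is only indicative.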
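Concept | Description | Formula |
---|---|---|
Inflection | Point where curvature changes sign | $f''(x) = 0$, with $f''$ changing sign at $x$ |
Maximum | Local peak where the slope is zero and the curve is concave down | $f'(x) = 0$ and $f''(x) < 0$ |
Minimum | Local trough where the slope is zero and the curve is concave up | $f'(x) = 0$ and $f''(x) > 0$ |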
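To make these concepts concrete, the sketch below applies the standard second-derivative test with NumPy: a critical point has $f'(x) = 0$ and is a maximum when $f''(x) < 0$ or a minimum when $f''(x) > 0$, while an inflection point is where $f''$ changes sign. The polynomial $f(x) = x^3 - 3x$ is an arbitrary example, not one taken from the course workbook.

```python
import numpy as np

# Example polynomial chosen for illustration: f(x) = x^3 - 3x
f = np.poly1d([1, 0, -3, 0])
f1 = f.deriv()   # f'(x)  = 3x^2 - 3
f2 = f.deriv(2)  # f''(x) = 6x

# Critical points: real roots of f'(x) = 0, classified by the sign of f''.
critical = {}
for x in sorted(f1.roots.real):
    kind = "maximum" if f2(x) < 0 else "minimum" if f2(x) > 0 else "inconclusive"
    critical[round(float(x), 6)] = kind
    print(f"x = {x:+.1f}: local {kind}, f(x) = {f(x):+.1f}")

# Inflection points: roots of f''(x) = 0 where f'' actually changes sign.
eps = 1e-3
inflections = [float(x) for x in f2.roots.real if f2(x - eps) * f2(x + eps) < 0]
print("inflection points:", inflections)
```

For this polynomial the test finds a local maximum at x = -1, a local minimum at x = 1, and an inflection point at x = 0.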
Type | Example |
---|---|
Discrete | Loan approved: Yes/No |
Continuous | Loan amount: $1,000 to $10,000+ |
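The distinction also shows up in how the two kinds of values are stored and summarized. The following pandas sketch uses a hypothetical `loans` table: the discrete attribute is stored as a categorical column and summarized by counting, while the continuous one is summarized with descriptive statistics.

```python
import pandas as pd

# Hypothetical loan records; values are illustrative only.
loans = pd.DataFrame({
    "approved": ["Yes", "No", "Yes", "Yes"],      # discrete: two possible values
    "amount": [1000.0, 2500.5, 7300.0, 10000.0],  # continuous: any value in a range
})

# Store the discrete attribute as a categorical type.
loans["approved"] = loans["approved"].astype("category")

# Discrete values are summarized by counting, continuous ones by statistics.
print(loans["approved"].value_counts())
print(loans["amount"].describe())
```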
Name | Use | Language |
---|---|---|
pandas | Data manipulation | Python |
NumPy | Numerical computations | Python |
seaborn | Data visualization | Python |
scikit-learn | Machine learning algorithms | Python |
tidyverse | Data wrangling and plotting | R |
caret | Machine learning in R | R |
Type | Known Labels | Purpose | Algorithms |
---|---|---|---|
Supervised | Yes | Predict labels/values | Logistic Regression, SVM, Decision Trees |
Unsupervised | No | Discover patterns or clusters | K-Means, Hierarchical Clustering, DBSCAN |
Clustering groups similar data points by minimizing intra-cluster distances. The Euclidean distance between two points $\mathbf{x}$ and $\mathbf{y}$ with $n$ attributes is calculated as:
$$
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
$$
Important
- Each object is assigned to the cluster whose centroid is nearest.
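The distance calculation and the nearest-centroid rule can be sketched with NumPy as follows. The points reuse the (shape, color) values from the fruit example, while the two centroids are hypothetical values picked by hand. In K-Means they would be updated iteratively rather than fixed.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# Points reuse the (shape, color) values from the fruit example.
points = np.array([[5, 7], [4, 8], [1, 2], [2, 1]], dtype=float)
# Hypothetical centroids, chosen by hand for illustration.
centroids = np.array([[4.5, 7.5], [1.5, 1.5]])

# Distance from every point to every centroid (rows: points, columns: centroids).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
# Each point goes to the cluster with the nearest centroid.
assignments = distances.argmin(axis=1)

print(euclidean(points[0], points[2]))  # sqrt(4**2 + 5**2) = sqrt(41)
print(assignments)
```

The first two points fall in cluster 0 and the last two in cluster 1, which matches the visual grouping of the data.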
Predict whether credit will be approved based on the applicant's income and credit score.
Attributes like shape and color are used to cluster fruits into categories with K-means.
- Detect rare, irregular events (fraud, anomalies) statistically or by distance metrics.
- Association rules discover frequently co-occurring attributes (e.g., smartphone buyers often subscribe to data plans).
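Both ideas can be illustrated with a few lines of standard-library Python. The sketch below flags an anomaly with a simple z-score rule and then computes the support and confidence of the rule {smartphone} -> {data_plan}. The amounts, the baskets, and the 2.5 threshold are made-up assumptions for the example.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts; the last one is deliberately extreme.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 101, 500]
mu, sigma = mean(amounts), pstdev(amounts)

# Flag values more than 2.5 standard deviations from the mean (assumed threshold).
anomalies = [i for i, a in enumerate(amounts) if abs(a - mu) / sigma > 2.5]
print("anomalous positions:", anomalies)

# Hypothetical shopping baskets for the rule {smartphone} -> {data_plan}.
baskets = [
    {"smartphone", "data_plan"},
    {"smartphone", "data_plan", "case"},
    {"smartphone"},
    {"laptop"},
    {"smartphone", "data_plan"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

# Confidence: of the baskets with a smartphone, how many also have a data plan.
rule_support = support({"smartphone", "data_plan"})
confidence = rule_support / support({"smartphone"})
print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
```

Here only the last amount is flagged. The rule has support 0.60 and confidence 0.75, meaning three of the four smartphone baskets also contain a data plan.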
- Finance: fraud detection, credit scoring
- Energy: load forecasting, loss reduction
- Agriculture: crop yield prediction
- Web: sentiment analysis, customer segmentation
- Contributions are welcome via pull requests.
- See CONTRIBUTING.md for details.
1. Castro, L. N. and Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd ed. LTC.
3. Larson, R. and Farber, B. (2015). Applied Statistics. Pearson.
Copyright 2025 Quantum Software Development. Code released under the MIT License.