This project involves data exploration, cleaning, visualization, and statistical analysis on a dataset of customers from a flight loyalty program. The goal is to better understand customer behavior and characteristics, and to answer key business questions using Python tools.
To keep the code organized and reusable, a .py file called etl_funciones.py was created. It contains specific functions for the ETL process, such as:
- Data cleaning
- Null value handling
- Data type conversions
- Merging tables
- Saving data in different formats
These functions allowed efficient and consistent processing during the exploratory and analytical phases.
Two CSV files were merged into a single DataFrame called customer_info, which includes:
- Customer demographic data
- Flight history
- Earned and redeemed loyalty points
- Type of loyalty card and signup details
- Detected null values (e.g., cancellation date)
- Found duplicates (one record per customer per month)
- Used
info(),describe(), andvalue_counts()to evaluate data quality and structure
- Handled nulls based on relevance
- Grouped by Loyalty Number to remove duplicates for customer-level analysis
- Converted columns to efficient types like
categoryorInt64for optimization
Visualizations were created with matplotlib and seaborn to answer the following key questions:
- Grouped by booking month
- Bar plot to identify trends and seasonality
- Aggregated data per customer
- Scatter plot with regression line showing a positive correlation
- Unique customers grouped by province
- Count plot sorted by frequency
- Salaries grouped by education
- Bar and box plots to compare means and variation
- Pie chart showing distribution by card type
- Grouped bar plot (
hue='Gender') to visually compare categories
Test whether there are significant differences in flight bookings by education level.
- Aggregated monthly bookings per customer
- Filtered relevant columns:
Flights BookedandEducation
- Calculated means, standard deviation, and percentiles for each group
- Normality: Shapiro-Wilk or Kolmogorov-Smirnov → non-normal distribution
- Variance Homogeneity: Not assumed due to lack of normality
- Hypothesis Test: Kruskal-Wallis test showed no significant differences among groups
This analysis revealed useful customer behavior patterns. The statistical test confirmed that education level does not significantly affect the number of flights booked.
Marketing strategies can therefore focus on other influential variables like loyalty card type, flight frequency, points earned, or Customer Lifetime Value (CLV). These insights support more effective decision-making in personalization, loyalty programs, and campaign targeting.
funciones_etl.py: Custom ETL functionsejercicio_tecnico_m3_maria_tapia.ipynb: Jupyter Notebook with full analysisCustomer_Info.csv: Merged dataset used in the projectCustomer_Flight_Activity_limpio,Customer_Loyalty_History_limpio: clean datasets
- Build interactive dashboards in Tableau or Power BI
- Apply clustering techniques to segment customer types
- Integrate additional data sources for deeper insights
- Automate ETL with scheduled Python scripts
Feel free to reach out or explore more of my work: