R-Promperu-Clustering/Capstone CYO Project.Rmd at main · gabdmns/R-Promperu-Clustering · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
---
title: "Data Promperú Clustering PAM FAMD"
author: "Gabriel Gonzalo Ojeda Cárcamo"
date: "26/11/2025"
output:
  pdf_document:
    toc: true
    toc_depth: 3
    latex_engine: pdflatex
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\newpage
# Introduction

This project applies mixed-data clustering techniques, combining Factor Analysis of Mixed Data (FAMD) for dimensionality reduction and the Partitioning Around Medoids (PAM) algorithm for cluster formation. This methodological approach makes it possible to identify latent patterns within the population of international tourists by simultaneously incorporating sociodemographic characteristics, travel motivations, consumption preferences, and spending habits.

The resulting segmentation reveals five clearly differentiated tourist profiles, providing valuable insights for marketing strategies, product development, and destination management. Identifying these groups enhances understanding of the heterogeneity among international visitors and supports evidence-based decision-making to strengthen the competitiveness and sustainability of the tourism sector.

## Study Objective

- Identify groups of tourists who share similar characteristics

- Understand behavioral patterns related to travel, consumption, and personal attributes

- Support decision-making in tourism promotion, product design, and destination planning

## Data Source

The analysis is based on the 2024 Foreign Tourist Profile Survey, conducted in February, May, August, and November of 2024. The dataset includes 5,268 international tourists, aged 15 years and older, surveyed at the Jorge Chávez International Airport upon entering the country. The survey collects detailed information on motivations, expectations, sociodemographic characteristics, and travel behavior, making it a robust source for statistical analysis and clustering.

### Loading Packages
```{r, warning=FALSE,message=FALSE}
#Loading packages
required_packages <- c(
  "haven",        # read .sav (SPSS) files
  "dplyr",        # data manipulation
  "tidyr",        # tidy data tools
  "purrr",        # functional programming
  "janitor",      # clean tables, crosstabs
  "sjlabelled",   # convert SPSS labelled → factor
  "ggplot2",      # visualization
  "GGally",       # ggpairs, exploratory plots
  "naniar",       # missing data visualization
  "VIM",          # alternative missing data plots
  "missForest",   # mixed-type imputation
  "FactoMineR",   # FAMD (mixed data PCA)
  "factoextra",   # visualizations for FAMD/clustering
  "cluster",      # PAM, silhouette
  "clusterCrit",  # CH, Dunn indices
  "clustertend",  # Hopkins, VAT (clusterability)
  "fpc",          # cluster stability (clusterboot)
  "ggrepel",      # nicer text labels
  "scales",       # formatting (percentages, numbers)
  "stringr",      # string operations
  "forcats",      # factor manipulation
  "knitr",
  "kableExtra"
)

# Install missing packages and load them
for (pkg in required_packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  } else {
    library(pkg, character.only = TRUE)
  }
}

```

### Loading Data

```{r}
# Loading Data

url_sav  <- "https://github.com/gabdmns/Capstone_project_2/raw/main/PTE2024PROMPERU.sav"
tmp_file <- tempfile(fileext = ".sav")

download.file(
  url    = url_sav,
  destfile = tmp_file,
  mode   = "wb",
  method = "libcurl"  # <- clave en Windows moderno
)

data_sav <- read_sav(tmp_file)

nrow(data_sav)  # Number of observations
ncol(data_sav)  # Number of variables

```


### Data Wrangling

The selection of variables was based on the indicators reported by TurismoIN of PROMPERÚ, specifically on the set of summary measures from the survey. Variables related to the main purpose of travel, sociodemographic characteristics (age, gender, marital status, educational level, employment sector, and generational group), travel-planning aspects, and total spending were included.

Each variable was converted to its appropriate type (factor or numeric). For variables P59 and P43_1, the category “NS/NR” was assigned to missing values in P59, while P43_1 was log-transformed and its missing values were imputed using the missForest algorithm, which applies Random Forest models iteratively to handle mixed-type data.

```{r}
# NA percentage in each variable
na_pct <- sapply(data_sav, function(x) mean(is.na(x)) * 100)
# Filter variables >90% NA's
vars_mas_90_na <- na_pct[na_pct > 90]
# Count of variables with more than 90% NA'S
length(vars_mas_90_na)

# Variable selection

vars_final <- c("P01","P53_RANGO2","P54","P56","P57_1","P58","P59","P33","P34",
                "P33_34_RNG","P60","P61","P53_GENERACION","P43_1")
data_filtrada <- data_sav[, vars_final, drop = FALSE]

info_compacto <- tibble(
  variable  = vars_final,
  pregunta  = map_chr(
    vars_final,
    ~ {
      lab <- attr(data_sav[[.x]], "label")
      if (is.null(lab)) NA_character_ else lab
    }
  ),
  categorias = map_chr(
    vars_final,
    ~ {
      labs <- attr(data_sav[[.x]], "labels")
      if (is.null(labs)) return(NA_character_)
      paste(names(labs), collapse = " | ")
    }
  ),
  na_pct = na_pct[vars_final]
)

print(info_compacto)

# Correct Data type

data_fixed <- data_filtrada %>%
  # 1) labelled → factor
  mutate(across(
    where(~ sjlabelled::is_labelled(.x)),
    ~ sjlabelled::as_factor(.x, levels = "labels")
  )) %>%
  # 2) character → factor
  mutate(across(
    where(is.character),
    as.factor
  )) %>%
  # 3) Date → numeric
  mutate(across(
    where(~ inherits(.x, "Date")),
    ~ as.numeric(.x)
  )) %>%
  # 4) POSIXct → numeric
  mutate(across(
    where(~ inherits(.x, "POSIXt")),
    ~ as.numeric(.x)
  ))

# P59: add NS/NR category
if ("P59" %in% names(data_fixed) && is.factor(data_fixed[["P59"]])) {
  tmp <- as.character(data_fixed[["P59"]])
  tmp[is.na(tmp)] <- "NS/NR"
  data_fixed[["P59"]] <- factor(tmp)
}

# P43_1: log-transform and set invalids to NA
if ("P43_1" %in% names(data_fixed) && is.numeric(data_fixed[["P43_1"]])) {
  data_fixed <- data_fixed %>%
    mutate(
      P43_1 = ifelse(P43_1 <= 0, NA_real_, P43_1),
      P43_1 = log(P43_1)
    )
}

data_fixed_df <- as.data.frame(data_fixed)

colSums(is.na(data_fixed_df))

# Validation
which(sapply(data_fixed, is.list))

```

### Missing value imputation

```{r}
# Handling Missing Values


# Add a helper numeric variable from a factor (e.g., age range)
data_fixed_df$helper_num <- as.numeric(data_fixed_df$P53_RANGO2)

set.seed(123)
mf_res <- missForest(data_fixed_df)

# Check OOB error
mf_res$OOBerror

# Update data

data_fixed_df_imputed <- mf_res$ximp

# Drop helper variable
data_fixed_df_imputed$helper_num <- NULL

# Final imputed dataset
data_fixed <- data_fixed_df_imputed

```

Missing values were imputed using the missForest algorithm, which applies Random Forest regression and classification to iteratively estimate missing entries in mixed-type datasets. A helper numeric variable derived from the age-range factor was included to stabilize the procedure, and the spending variable was log-transformed prior to imputation. The algorithm fit Random Forest models for each incomplete variable using all others as predictors, updating imputed values until convergence. The out-of-bag error indicated good performance (NRMSE $\approx$ 0.19 for numeric variables, PFC = 0.00 for categorical variables), providing evidence that the imputation preserved the underlying structure of the data. The auxiliary variable was then removed, and the fully imputed dataset was used for subsequent FAMD and clustering analyses.


# Exploratory Data Analysis

```{r}
# Exploratory Data Analysis

# Distributions of key variables

# We work on data_fixed (already cleaned, 14 vars)
df <- data_fixed

# 1.1 Categorical variables: barplots in a loop --------------------------
cat_vars <- c("P01", "P53_RANGO2", "P54", "P56", "P57_1",
              "P58", "P59", "P33", "P34", "P33_34_RNG",
              "P60", "P61", "P53_GENERACION")

for (v in cat_vars) {
  p <- df %>%
    count(.data[[v]]) %>%
    mutate(prop = n / sum(n)) %>%
    ggplot(aes(x = .data[[v]], y = prop)) +
    geom_col() +
    geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
              vjust = -0.2, size = 3) +
    labs(
      title = paste("Distribution of", v),
      x = v, y = "Proportion"
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

  print(p)
}

# Numeric variable: log-transformed total spending (P43_1)

# Histogram of log-spending
ggplot(df, aes(x = P43_1)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log-Transformed Total Trip Spending",
    x = "log(Spending in USD)",
    y = "Count"
  ) +
  theme_minimal()

# Density plot (better for log data)
ggplot(df, aes(x = P43_1)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  labs(
    title = "Density of Log-Transformed Total Trip Spending",
    x = "log(Spending in USD)",
    y = "Density"
  ) +
  theme_minimal()

# Boxplot of log-spending
ggplot(df, aes(y = P43_1)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Log-Transformed Total Trip Spending",
    y = "log(Spending in USD)"
  ) +
  theme_minimal()

```


# Modelling

## FAMD

```{r}
# Prepare data for FAMD (ensure unique rownames)

data_famd_use <- as.data.frame(data_fixed)
rownames(data_famd_use) <- paste0("ind_", seq_len(nrow(data_famd_use)))

# FAMD on data without NA
famd_res <- FAMD(data_famd_use, graph = FALSE)

# Individual coordinates (for clustering)
head(famd_res$ind$coord)

# Contribution of variables to each dimension
var_contrib <- famd_res$var$contrib
var_contrib

```

\paragraph{Dimension 1.}
This dimension is mainly defined by \textbf{age range (P53\_RANGO2)}, \textbf{generational group (P53\_GENERACION)}, and \textbf{trip anticipation (P58)}, with additional influence from \textbf{travel-companion type (P57\_1)}, \textbf{employment status (P56)}, and \textbf{work sector (P59)}. Together, these variables form a sociodemographic axis that separates younger, spontaneous travelers from older and more structured visitors.

\paragraph{Dimension 2.}
Dimension 2 is again driven by \textbf{generation (P53\_GENERACION)}, \textbf{age range (P53\_RANGO2)}, \textbf{trip anticipation (P58)}, and \textbf{travel-companion characteristics (P57\_1)}. Unlike Dimension 1, it emphasizes differences in planning behavior within demographic groups, distinguishing early versus late planners across age cohorts.

\paragraph{Dimension 3.}
This dimension is dominated by \textbf{trip-month variables}, particularly \textbf{P33\_34\_RNG} and \textbf{P34}, followed by \textbf{sector of work (P59)} and the \textbf{log-transformed spending variable (P43\_1)}. These contributions indicate that seasonal timing and expenditure levels are closely related in the structure of the data.

\paragraph{Dimension 4.}
Dimension 4 reflects a pure seasonality structure, defined almost entirely by \textbf{trip-month (P34)}, \textbf{combined trip timing (P33\_34\_RNG)}, and \textbf{planning month (P33)}. This axis captures temporal clustering patterns that are independent from demographic or socioeconomic characteristics.

\paragraph{Dimension 5.}
This dimension reintroduces sociodemographic variation through \textbf{age range (P53\_RANGO2)}, \textbf{generation (P53\_GENERACION)}, \textbf{trip anticipation (P58)}, and \textbf{travel-companion patterns (P57\_1)}. It highlights subtle behavioral differences within demographic groups that are not explained by the earlier dimensions.

```{r}


#  Distance + PAM + Silhouette (k = 2 to 10)
dist_famd <- dist(famd_res$ind$coord)

max_k <- 10
sil_famd <- numeric(max_k)

for (k in 2:max_k) {
  pam_fit     <- pam(dist_famd, k = k, diss = TRUE)
  sil_vals    <- silhouette(pam_fit$clustering, dist_famd)
  sil_famd[k] <- mean(sil_vals[, 3])
  cat("k =", k, "| Silhouette (FAMD+PAM) =", round(sil_famd[k], 4), "\n")
}

plot(2:max_k, sil_famd[2:max_k], type = "b",
     xlab = "Número de clusters (K)",
     ylab = "Silhouette promedio",
     main = "Selección de K usando Silhouette (FAMD + PAM)")

# Select optimal k (maximum silhouette between 2 and max_k)
k_opt <- which.max(sil_famd[2:max_k]) + 1  # +1 porque el vector empieza en k=2
cat("k óptimo elegido:", k_opt, "\n")

# CALINSKI-HARABASZ INDEX
install.packages("clusterCrit")
library(clusterCrit)

# Convert FAMD coordinates to numeric matrix
X <- as.matrix(famd_res$ind$coord)

# Evaluate CH for k = 2 to 10
ch_values <- numeric(max_k)

for (k in 2:max_k) {
  pam_fit <- pam(dist_famd, k = k, diss = TRUE)

  ch_values[k] <- intCriteria(
    X,
    as.integer(pam_fit$clustering),
    c("calinski_harabasz")
  )$calinski_harabasz
}

plot(2:max_k, ch_values[2:max_k], type = "b",
     main = "Índice de Calinski–Harabasz",
     xlab = "Número de clusters (k)",
     ylab = "CH index")

# DUNN INDEX

dunn_values <- numeric(max_k)

for (k in 2:max_k) {
  pam_fit <- pam(dist_famd, k = k, diss = TRUE)

  dunn_values[k] <- intCriteria(
    X,
    as.integer(pam_fit$clustering),
    c("dunn")
  )$dunn
}

plot(2:max_k, dunn_values[2:max_k], type = "b",
     main = "Índice de Dunn",
     xlab = "Número de clusters (k)",
     ylab = "Dunn index")


```

\paragraph{Selection of the number of clusters.}
The Dunn index attains its highest value at $K = 2$, indicating excellent compactness and separation at this level; however, it drops sharply from $K = 3$ onward and remains low and stable across larger values, suggesting increased overlap between clusters when $K > 2$. In contrast, the Calinski--Harabasz index increases steadily and reaches its maximum at $K = 6$, after which it declines, indicating that overall between-group separation relative to within-group cohesion is strongest around six clusters. The Silhouette index achieves its best value at $K = 3$, with a secondary local improvement at $K = 5$, while solutions from $K = 6$ to $K = 9$ show considerably weaker structure.

Taken together, these indices suggest that no single value of $K$ is optimal across all criteria: $K = 2$ maximizes Dunn but produces overly coarse segmentation; $K = 3$ maximizes Silhouette but may overlook relevant subgroup variation; and $K = 6$ maximizes the Calinski--Harabasz index but risks generating less interpretable clusters. Considering the objective of obtaining differentiated, interpretable, and actionable tourist profiles, and given that both Silhouette and Calinski--Harabasz show local support for intermediate solutions, selecting $K = 5$ represents a balanced and defensible compromise, even if it is not the strictly optimal solution for any single index.


## Final clustering with PAM

```{r}
#  Final clustering with PAM

pam_famd <- pam(dist_famd, k = 5, diss = TRUE)
clusters_famd <- factor(pam_famd$clustering)

table(clusters_famd)

# Final average silhouette
sil_famd_kopt <- silhouette(pam_famd$clustering, dist_famd)
mean_sil_kopt <- mean(sil_famd_kopt[, 3])
cat("Silhouette promedio (k_opt) =", round(mean_sil_kopt, 4), "\n")

# Add clusters to dataset
data_clust <- data_famd_use %>%
  mutate(cluster = clusters_famd)

table(data_clust$cluster)

# FAMD plot: individuals by cluster (with ggplot2)

# Using coordinates of the first two dimensions:
ind_coords <- as.data.frame(famd_res$ind$coord[, 1:2])
colnames(ind_coords) <- c("Dim1", "Dim2")
ind_coords$cluster <- clusters_famd

ggplot(ind_coords, aes(x = Dim1, y = Dim2, color = cluster)) +
  geom_point(alpha = 0.6, size = 1) +
  stat_ellipse(aes(group = cluster), type = "norm", level = 0.95) +
  labs(
    title = "FAMD - Individuos por cluster (PAM sobre coordenadas FAMD)",
    x = "Dimensión 1",
    y = "Dimensión 2",
    color = "Cluster"
  ) +
  theme_minimal()
```

\paragraph{Interpretation of the Final PAM Clustering ($k = 5$).}
After selecting $k = 5$ as the preferred number of clusters, the PAM algorithm was applied to the FAMD distance matrix. The resulting clusters show a balanced and clearly differentiated structure, with the following group sizes: Cluster~1: 615 individuals; Cluster~2: 919 individuals; Cluster~3: 1{,}076 individuals; Cluster~4: 1{,}983 individuals; and Cluster~5: 675 individuals. Cluster~4 is noticeably the largest group, while Clusters~1 and~5 are smaller but still substantial, indicating that none of the clusters is excessively small or unstable due solely to sample size.

The average Silhouette value for the selected solution is $\text{Silhouette}(k = 5) = 0.3088$. A value around 0.30 . This supports the conclusion that the five-cluster solution provides meaningful, interpretable, and statistically sound segmentation without imposing overly rigid cluster boundaries.


## Cluster stability analysis

```{r}
# Cluster Stability Analysis

# Wrapper function for PAM (required by clusterboot)
pam_5 <- function(x, k) {
  pam(x, k = k, diss = FALSE)$clustering
}

# Cluster stability analysis with bootstrap
set.seed(123)
invisible(
  capture.output(
    cb <- clusterboot(
      X,
      B            = 300,         # number of bootstrap samples
      bootmethod   = "boot",      # resample rows with replacement
      clustermethod = pamkCBI,    # PAM method compatible with clusterboot
      k            = 5,           # selected number of clusters
      seed         = 123,
      showplots    = FALSE
    )
  )
)

cb
```


Cluster stability was evaluated through bootstrap resampling using the \texttt{clusterboot} procedure. The resulting Jaccard similarity coefficients---$[0.98,\; 0.88,\; 0.84,\; 0.91,\; 0.93]$---indicate strong and consistent recovery of the five clusters across resampled datasets. Clusters~1, 4, and~5 exhibit extremely high stability (Jaccard $> 0.90$), Cluster~2 shows very high stability (0.88), and Cluster~3 reaches acceptable stability (0.84), despite showing a small number of dissolutions (23 out of 300 runs). All other clusters experienced zero dissolutions, confirming excellent reproducibility. Overall, these results demonstrate that the five-cluster solution is empirically robust and that the identified segments persist under resampling rather than arising from random variation.


# Results

```{r}

# Cluster profiles (categorical variables)

perfil_var_cluster <- function(var_name) {
  tab <- table(data_clust[[var_name]], data_clust$cluster)
  prop_clust <- prop.table(tab, margin = 2)  # % within each cluster
  prop_total <- prop.table(tab)              # % overall

  list(
    tabla            = tab,
    prop_por_cluster = round(prop_clust * 100, 1),
    prop_global      = round(prop_total * 100, 1)
  )
}

# Variables used in FAMD (same vars_final),
# excluding numeric P43_1 from categorical profiling:
vars_perfil <- intersect(
  c("P01", "P53_RANGO2", "P54", "P56", "P57_1", "P58",
    "P59", "P33", "P34", "P33_34_RNG", "P60", "P61", "P53_GENERACION"),
  names(data_clust)
)

# Create a list of profiles per variable
lista_perfiles <- lapply(vars_perfil, perfil_var_cluster)
names(lista_perfiles) <- vars_perfil

# --- FUNCTION TO EXTRACT DETAILED VARIABLE INFORMATION ---
info_variable <- function(v) {
  etiqueta   <- attr(data_sav[[v]], "label")
  categorias <- attr(data_sav[[v]], "labels")

  cat("\n============================\n")
  cat("Variable:", v, "\n")
  cat("Question:", ifelse(is.null(etiqueta), "NO LABEL AVAILABLE", etiqueta), "\n")

  if (!is.null(categorias)) {
    cat("Categories (original SPSS labels):\n")
    print(categorias)
  } else {
    cat("Categories: NOT APPLICABLE (numeric or no labels)\n")
  }

  cat("----------------------------\n")
}

# --- DISPLAY CLUSTER PROFILES FOR CATEGORICAL VARIABLES ---
for (v in vars_perfil) {
  info_variable(v)   # show variable name + question + categories

  cat("Percentage distribution by cluster (%):\n")
  print(lista_perfiles[[v]]$prop_por_cluster)

  cat("\n")
}


# Numerical summary per cluster


# FUNCTION TO SHOW NUMERIC VARIABLE INFO (NAME + QUESTION)
show_numeric_info <- function(var) {
  etiqueta <- attr(data_sav[[var]], "label")

  cat("\n============================\n")
  cat("Numeric variable:", var, "\n")
  cat("Question:", ifelse(is.null(etiqueta), "NO LABEL AVAILABLE", etiqueta), "\n")
  cat("----------------------------\n")
}

# Add original-scale spending (USD) for reporting
data_clust <- data_clust %>%
  mutate(
    P43_1_usd = exp(P43_1)  # back-transform from log
  )

# Optional: add a descriptive label for the back-transformed variable
attr(data_clust$P43_1_usd, "label") <-
  "Total trip spending in USD (back-transformed from log)"

# Identify numeric variables in data_clust
num_vars <- names(data_clust)[sapply(data_clust, is.numeric)]

# Remove the log version of P43_1 from the summary (we only want original-scale USD)
num_vars <- setdiff(num_vars, "P43_1")

# Show info for numeric variables (only those that exist in data_sav or are P43_1_usd)
for (var in num_vars) {
  if (var %in% names(data_sav)) {
    show_numeric_info(var)
  } else if (var == "P43_1_usd") {
    cat("\n============================\n")
    cat("Numeric variable: P43_1_usd\n")
    cat("Question: Total trip spending in USD (back-transformed from log)\n")
    cat("----------------------------\n")
  }
}

# COMPUTE NUMERIC SUMMARY PER CLUSTER (using original-scale spending)
resumen_numerico <- data_clust %>%
  select(cluster, all_of(num_vars)) %>%
  group_by(cluster) %>%
  summarise(
    across(
      .cols = where(is.numeric),
      .fns  = list(
        mean = ~ mean(.x, na.rm = TRUE),
        sd   = ~ sd(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    ),
    .groups = "drop"
  )

cat("\nNUMERIC SUMMARY BY CLUSTER (original scale for spending):\n")
print(resumen_numerico)
```

\subsubsection*{Main Defining Variables by Cluster (with Percentages)}

\begin{itemize}

%--------------------------------------------------------
\item \textbf{Cluster 1 – Very young student leisure travelers}\\
Dominant characteristics:
\begin{itemize}
  \item \textbf{Age (P53\_RANGO2):} 15--24 years (\textbf{80.8\%}); 25--34 years (18.4\%).
  \item \textbf{Generation (P53\_GENERACION):} Centennials (\textbf{96.9\%}).
  \item \textbf{Marital Status (P56):} Single (\textbf{82.1\%}).
  \item \textbf{Children (P57\_1):} No children (\textbf{98.7\%}).
  \item \textbf{Occupation (P58):} Students (\textbf{83.6\%}).
  \item \textbf{Main Motive (P01):} Vacations / leisure (\textbf{63.6\%}).
  \item \textbf{Education (P60):} University completed (44.9\%), secondary (36.9\%).
  \item \textbf{Planning Horizon (P33\_34\_RNG):} 1--4 months (\textbf{61.6\%}).
  \item \textbf{Spending (P43\_1\_usd):} Mean USD \textbf{2,410}.
\end{itemize}

%--------------------------------------------------------
\item \textbf{Cluster 2 – Business-oriented working Millennials}\\
Dominant characteristics:
\begin{itemize}
  \item \textbf{Main Motive (P01):} Business (\textbf{60.3\%}).
  \item \textbf{Age (P53\_RANGO2):} 35--54 years (60.6\%), 25--34 years (30.9\%).
  \item \textbf{Generation:} Millennials (65.0\%), Gen X (19.5\%).
  \item \textbf{Marital Status:} Married/partnered (\textbf{51.5\%}).
  \item \textbf{Children:} Children under 14 (\textbf{37.3\%}).
  \item \textbf{Occupation:} Private-sector workers (\textbf{77.3\%}).
  \item \textbf{Sector (P59):} Services (7.5\%), transport (15.1\%), health (5.8\%).
  \item \textbf{Planning Horizon:} <1 month (\textbf{82.7\%}).
  \item \textbf{Education:} University completed (\textbf{50.7\%}); postgraduate (18.1\%).
  \item \textbf{Spending:} Mean USD \textbf{1,820}.
\end{itemize}

%--------------------------------------------------------
\item \textbf{Cluster 3 – Mid-life families with higher spending}\\
Dominant characteristics:
\begin{itemize}
  \item \textbf{Age:} 35--54 years (\textbf{79.2\%}); 55+ (18.6\%).
  \item \textbf{Generation:} Gen X (\textbf{66.6\%}).
  \item \textbf{Marital Status:} Married (\textbf{71.8\%}).
  \item \textbf{Children:} Children 0--14 (\textbf{29.3\%}); children >18 (\textbf{20.3\%}).
  \item \textbf{Occupation:} Private-sector (47.3\%), public-sector (35.5\%).
  \item \textbf{Sector:} Education (\textbf{22.5\%}), health (17.0\%), services (12.1\%).
  \item \textbf{Motive:} Vacations (63.8\%).
  \item \textbf{Planning Horizon:} 1--4 months (66.4\%).
  \item \textbf{Education:} University (30.5\%), master's (23.5\%), doctorate (15.5\%).
  \item \textbf{Spending:} Mean USD \textbf{3,070}.
\end{itemize}

%--------------------------------------------------------
\item \textbf{Cluster 4 – Young adult leisure travelers}\\
Dominant characteristics:
\begin{itemize}
  \item \textbf{Age:} 25--34 years (\textbf{65.3\%}); 15--24 years (7.5\%).
  \item \textbf{Generation:} Millennials (\textbf{55.9\%}), Centennials (41.7\%).
  \item \textbf{Marital Status:} Single (\textbf{60.1\%}).
  \item \textbf{Children:} No children (\textbf{94.4\%}).
  \item \textbf{Motive:} Vacations (\textbf{82.7\%}).
  \item \textbf{Occupation:} Private-sector (59.1\%), students (21.8\%).
  \item \textbf{Planning Horizon:} 1--4 months (\textbf{66.3\%}).
  \item \textbf{Education:} Mainly university (50.7\%).
  \item \textbf{Spending:} Mean USD \textbf{2,576}.
\end{itemize}

%--------------------------------------------------------
\item \textbf{Cluster 5 – Senior, high-income, high-spending travelers}\\
Dominant characteristics:
\begin{itemize}
  \item \textbf{Age:} 55+ years (\textbf{97.5\%}).
  \item \textbf{Generation:} Baby Boomers (\textbf{80.3\%}).
  \item \textbf{Marital Status:} Married (\textbf{67.4\%}).
  \item \textbf{Children:} Adult children living independently (\textbf{70.7\%}).
  \item \textbf{Occupation:} Retired (\textbf{44.4\%}).
  \item \textbf{Motive:} Vacations (\textbf{68.1\%}); visit family (17.3\%).
  \item \textbf{Planning Horizon:} 1--4 months (\textbf{58.4\%}).
  \item \textbf{Education:} University (44.3\%), master's (17.8\%), doctorate (7.4\%).
  \item \textbf{Spending:} Highest mean: USD \textbf{3,186}.
\end{itemize}

\end{itemize}

# Conclusion

This study demonstrates the effectiveness of unsupervised machine-learning methods for uncovering latent behavioral and demographic structures within the 2024 Foreign Tourist Profile Survey. By applying a rigorous workflow centered on dimensionality reduction (FAMD) and partitioning clustering (PAM), the analysis successfully identified complex patterns in a dataset characterized by mixed variable types, nonlinear relationships, and high heterogeneity—conditions for which unsupervised learning is particularly powerful. These methods enabled the discovery of natural groupings without imposing prior assumptions or predefined categories, allowing the data itself to reveal the underlying segmentation of international visitors to Peru.

The FAMD analysis, an unsupervised multivariate technique, reduced the complexity of the dataset while preserving key sources of variance. This step ensured that the clustering did not rely on raw, noisy, or redundant variables, but on a condensed representation that captures the true structural relationships among travelers.

Using this FAMD space, the PAM clustering algorithm, a classic unsupervised method designed for mixed-data applications, was used to partition the tourist population. Multiple internal validation metrics—Silhouette, Dunn, and Calinski–Harabasz—were evaluated to avoid relying on a single criterion. Although the indices pointed to different optimal k values (a frequent occurrence in unsupervised learning), their combined interpretation and practical considerations identified five clusters as the most balanced and actionable solution.

Crucially, the reliability of the unsupervised segmentation was assessed through bootstrap-based stability analysis using clusterboot. The resulting Jaccard coefficients (0.84–0.98) confirmed that the five clusters are not artifacts of noise or sampling variation but stable structural components consistently reproduced across resampled datasets. This step is essential in unsupervised learning, where model validation does not rely on ground-truth labels but on the reproducibility of discovered patterns.

The resulting five-segment structure—ranging from very young student leisure travelers to senior high-income visitors—captures meaningful and interpretable market segments that emerge organically from the data. Differences observed across age, generation, household structure, planning behavior, occupational status, travel motivation, and spending levels illustrate the power of unsupervised methods to reveal hidden behavioral profiles that supervised models or traditional descriptive statistics would not detect.

Overall, the study provides strong evidence that unsupervised learning is a robust and insightful approach for segmenting heterogeneous populations in tourism analytics. Through careful preprocessing, dimensionality reduction, clustering, and stability validation, the analysis delivered a statistically sound and practically useful segmentation. The five-cluster solution offers a comprehensive, data-driven understanding of international tourist behavior—supporting more precise marketing strategies, product differentiation, and policy planning in Peru’s tourism sector.