Balancing sections 2 and 3.

astamm · astamm · commit 79bef173f08b · 2026-01-27T12:36:43.000+01:00
diff --git a/_02_Material.qmd b/_02_Material.qmd
@@ -1,6 +1,8 @@
 # Material {#sec-material}
 
-This section is dedicated to describing the data that we used for segmenting gait in the present work. Essentially, we used two sources of data. First, we clipped a 9-axis inertial measurement unit (IMU) sensor at the level of the right hip and measured its orientation (that we assimilate to the hip orientation) over time during walking sessions. The data is recorded in the form of a unit quaternion time series. @sec-quaternions provides a brief overview of unit quaternions and their properties. Second, we used a pressure-sensitive walkway (GAITRite® mat) as a gold standard to label the gait events. @sec-data-acquisition elicits the data acquisition protocol while @sec-data-sets summarizes the two data sets. Finally, @sec-feature-space details the feature space that we constructed from the raw data to feed our machine learning models.
+In this work, the objective is to detect right and left heel strike and toe off events from hip orientation data over time. Knowledge of the occurrences of these events is critical to compute key gait parameters such as the mean and variability of stride duration, asymmetry metrics or the stance/swing ratio, which have been proven to be clinically relevant [@annweiler2009risk;@beauchet2016poor]. We propose to address this problem by training, tuning and comparing several supervised classification models. In detail, we consider the timepoints as the statistical units (observations) and we aim at training models to assign a *label* to each of them. This requires building a *labelled gait data* set in which we know which timepoints correspond to the occurrence of the key gait events.
+
+For this purpose, we used two sources of data. First, we clipped a 9-axis inertial measurement unit (IMU) at the level of the right hip and measured its orientation (which we take to be the hip orientation) over time during walking sessions. The data is recorded in the form of a unit quaternion time series. @sec-quaternions provides a brief overview of unit quaternions and their properties. Second, we used a pressure-sensitive walkway (GAITRite® mat) as a gold standard to record the occurrences of the gait events of interest. @sec-data-acquisition describes the data acquisition protocol while @sec-data-sets summarizes the two collected data sets and the construction of the final labelled data set. Finally, @sec-feature-space details the feature space that we constructed from the raw data to feed our machine learning models.
 
 ## Unit quaternions {#sec-quaternions}
 
@@ -197,7 +199,7 @@ smoothed_qts |>
 
 The code above illustrates some other nice S3 specializations implemented in the [{squat}](https://cran.r-project.org/package=squat/) package such as the `log()` and `exp()` functions to compute the logarithm and exponential of a unit QTS respectively. As mentioned in @sec-quaternions, the logarithm of a unit quaternion has a null scalar part, which is why we set the *w* coordinate to zero in the code above and only smooth the three other coordinates. The function `squat::qts2sqts()` is dedicated to performing this exact computation. @fig-smoothed-qts nicely shows the smoothing effect with subtle variations along the curves that are smoothed out.
 
-### Pressure mat data
+### Pressure mat data {#sec-gaitrite-data}
 
 The GAITRite® mat records the positions of the feet on the mat through pressure-sensitive sensors hidden beneath the mat. It returns a table of spatio-temporal parameters such as stride duration, stride length, walking speed, etc. @tbl-gaitrite-params in the Appendix provides the exhaustive list of all spatio-temporal gait parameters that the walkway outputs. It also returns the time of each event happening during a gait cycle such as the time where a foot touches or leaves the ground. These are the times we use to label our data to predict these events. Since the two devices were triggered simultaneously, the IMU sensor and the GAITRite® mat are assumed to share the same time clock. We use the pressure mat as a gold standard to label the observations into the different classes and train models on this labeled data.
 
@@ -392,6 +394,75 @@ plot_ts(imu_data[[33]], rhs33, rto33, lhs33, lto33)
 
 ## Feature space {#sec-feature-space}
 
+In this work, the objective is to identify which timepoints of the unit QTS correspond to the RHS, LTO, LHS and RTO events. For this purpose, we consider the timepoints as statistical units (observations) and we aim at labelling them by means of supervised classification models. We therefore need to define a so-called *feature space*, which consists of a data table listing the timepoints by row and collecting a number of features for each of them. A first important feature is the actual label that we want to predict with the trained model. @sec-labelled-gait-data details the elaboration of the labelled gait data set.
+
+### Labelled gait data {#sec-labelled-gait-data}
+
+To this end, we first create the data set that we will use for training. The following code achieves this task by binding together all timepoints from all walking sessions while attaching to each timepoint:
+
+- an `event_type` which assigns it to one of the five gait events defined in @sec-gaitrite-data;
+- a `phase_type` which assigns it to one of the four gait phases defined in @sec-gaitrite-data.
+
+```{r}
+events_to_phases <- function(events) {
+  events_of_interest <- events != "None"
+  first_event <- events[events_of_interest][1]
+  phase_durations <- diff(c(0, sort(which(events_of_interest))))
+  n_phases <- length(phase_durations)
+  phase_names <- switch(
+    first_event,
+    "RHS" = rep(
+      c("Swing", "Pre-Stance", "Stance", "Pre-Swing"),
+      times = n_phases
+    )[1:n_phases],
+    "LTO" = rep(
+      c("Pre-Stance", "Stance", "Pre-Swing", "Swing"),
+      times = n_phases
+    )[1:n_phases],
+    "LHS" = rep(
+      c("Stance", "Pre-Swing", "Swing", "Pre-Stance"),
+      times = n_phases
+    )[1:n_phases],
+    "RTO" = rep(
+      c("Pre-Swing", "Swing", "Pre-Stance", "Stance"),
+      times = n_phases
+    )[1:n_phases]
+  )
+  purrr::map2(
+    phase_names,
+    phase_durations,
+    \(phase_name, phase_duration) rep(phase_name, times = phase_duration)
+  ) |>
+    purrr::list_c()
+}
+
+labelled_gait_data <- purrr::map(1:nrow(bhg), \(session_index) {
+  gaitrite_data <- gaitrite_data |>
+    dplyr::filter(session == session_index) |>
+    dplyr::select(-session)
+  imu_data[[session_index]] |>
+    dplyr::left_join(gaitrite_data, by = c("time" = "event_time")) |>
+    dplyr::mutate(
+      event_type = dplyr::if_else(is.na(event_type), "None", event_type),
+      phase_type = events_to_phases(event_type)
+    )
+}) |>
+  dplyr::bind_rows(.id = "session") |>
+  dplyr::mutate(
+    session = as.numeric(session),
+    event_type = factor(
+      event_type,
+      levels = c("RHS", "LTO", "LHS", "RTO", "None")
+    ),
+    phase_type = factor(
+      phase_type,
+      levels = c("Pre-Stance", "Stance", "Pre-Swing", "Swing")
+    )
+  )
+class(labelled_gait_data) <- class(labelled_gait_data)[-1]
+head(labelled_gait_data)
+```
+
 The feature space is an important piece of machine learning models as it defines the data that will be used to train them. In our application, we constructed a feature space from the raw QTS data recorded by the IMU sensor. First, we need to define what is an observation in our context. An observation corresponds to the data recorded at a given time point $t_j$. Since QTS are ordered sets of unit quaternions, for the $j$-*th* observation (row) of the feature space, we can then use features computed from the data observed at time $t_j$ or any other time points preceding $t_j$. One feature is of course the label of the observation that we get from the gold standard. This variable is available for the train/test process but will not be available at prediction time since it corresponds to the events that we want to predict. The other features are predictors that we compute from the sensor data. The ultimate goal is to design a feature space to label each observation $t_j$ with the gait event happening at that time (if any) using only the predictors computed from the QTS at time points $t_k \le t_j$, with $1 \le k \le j$. In the remainder of this section, we describe the predictors that we computed to build our feature space.
 
 Angular velocity and acceleration vectors
diff --git a/_03_Methods.qmd b/_03_Methods.qmd
@@ -2,59 +2,12 @@
 
 ## Classification strategies {#sec-classification-strategies}
 
-Gait event detection is performed by evaluating and comparing two strategies to classify the observations.
+In this work, the objective is to identify which timepoints of the unit QTS correspond to the RHS, LTO, LHS and RTO events. For this purpose, we consider the timepoints as statistical units (observations) and we aim at labelling them by means of classification models. To this end, we first create the data set that we will use for training. The following code achieves this task by binding together all timepoints from all walking sessions while attaching to each timepoint:
 
-Strategy E: Predicting gait [E]{.underline}vents
-
-: The strategy E pertains to directly predicting the gait events occuring when walking. Specifically, time points are viewed as statistical units (observations) and we aim at classifying them into five categories:
-
-- *Right Heel Strike*,
-- *Left Toe Off*,
-- *Left Heel Strike*,
-- *Right Toe Off*,
-- *None* (all other times not corresponding to a certain event).
-
-The first four events  (RHS, LTO, LHS and RTO) are coined *events of interest* while the last one encodes the so-called *negative* class. While conveniently aiming at directly predicting the occurrence of gait events of interest, this strategy suffers from a severe class imbalance issue, with the *None* (negative) class being widely over-represented as summarized in @tbl-class-imbalance.
+- an `event_type` which assigns it to one of the five gait events defined in @sec-gaitrite-data;
+- a `phase_type` which assigns it to one of the four gait phases defined in @sec-gaitrite-data.
 
 ```{r}
-#| label: tbl-class-imbalance
-#| tbl-cap: "Strategy E: Count and proportion of observations in each class."
-#| tbl-pos: "H"
-tibble::tibble(
-  class = c(
-    "Right Heel Strike",
-    "Left Toe Off",
-    "Left Heel Strike",
-    "Right Toe Off",
-    "None"
-  ),
-  nb_obs = c("973", "1004", "994", "982", "158401"),
-  prop = c("0.60%", "0.62%", "0.61%", "0.60%", "97.57%")
-) |>
-  gt::gt() |>
-  gt::cols_label(
-    class = "Class",
-    nb_obs = "Number of observations",
-    prop = "Proportion"
-  ) |>
-  gt::cols_align(align = "center") |>
-  gt::tab_style(
-    style = list(gt::cell_text(style = "italic")),
-    locations = gt::cells_body(columns = class)
-  ) |>
-  gt::tab_options(column_labels.background.color = "#616161")
-```
-
-[AST] TO MODIFY
-
-Finally, the following code creates the labelled data set that we will use to elaborate the feature space and produces @tbl-class-summary which exhibits class frequencies whether we focus on gait events () or gait phases ().
-
-```{r}
-#| label: tbl-class-summary
-#| tbl-cap: Two tables
-#| tbl-subcap: ["mtcars", "Just cars"]
-#| layout-ncol: 2
-#| classes: plain
 events_to_phases <- function(events) {
   events_of_interest <- events != "None"
   first_event <- events[events_of_interest][1]
@@ -110,14 +63,82 @@ labelled_gait_data <- purrr::map(1:nrow(bhg), \(session_index) {
       levels = c("Pre-Stance", "Stance", "Pre-Swing", "Swing")
     )
   )
+class(labelled_gait_data) <- class(labelled_gait_data)[-1]
+head(labelled_gait_data)
+```
+
+We first need to decide what the models should predict. Two different strategies can be adopted.
+
+The most straightforward strategy is to predict the gait events of interest themselves. We call it **Strategy E**, where **E** stands for [E]{.underline}vents. This strategy requires a multiclass prediction model with 5 classes (RHS, LTO, LHS, RTO and None) as defined in @sec-gaitrite-data. The first four events (RHS, LTO, LHS and RTO) are coined *events of interest* while the last one encodes the so-called *negative* class. While conveniently aiming at directly predicting the occurrence of gait events of interest, this strategy suffers from a severe class imbalance issue, with the *None* (negative) class being widely over-represented as shown by @tbl-e-counts.
+
+A solution to mitigate this severe class imbalance issue is to predict gait phases instead of events, at the cost of some post-processing needed to identify the occurrences of RHS, LTO, LHS and RTO after the phase prediction step. As defined in @sec-gaitrite-data, there are four phases to predict (pre-stance, stance, pre-swing and swing). We call this strategy **Strategy P**, where **P** stands for [P]{.underline}hases. @tbl-p-counts exhibits the frequency of timepoints in each phase, which shows that this strategy substantially reduces class imbalance.
+
+```{r}
+#| label: tbl-class-imbalance
+#| tbl-cap: "Strategy E: Count and proportion of observations in each class."
+#| tbl-pos: "H"
+tibble::tibble(
+  class = c(
+    "Right Heel Strike",
+    "Left Toe Off",
+    "Left Heel Strike",
+    "Right Toe Off",
+    "None"
+  ),
+  nb_obs = c("973", "1004", "994", "982", "158401"),
+  prop = c("0.60%", "0.62%", "0.61%", "0.60%", "97.57%")
+) |>
+  gt::gt() |>
+  gt::cols_label(
+    class = "Class",
+    nb_obs = "Number of observations",
+    prop = "Proportion"
+  ) |>
+  gt::cols_align(align = "center") |>
+  gt::tab_style(
+    style = list(gt::cell_text(style = "italic")),
+    locations = gt::cells_body(columns = class)
+  ) |>
+  gt::tab_options(column_labels.background.color = "#616161")
+```
+
+[AST] TO MODIFY
+
+Finally, the following code creates the labelled data set that we will use to elaborate the feature space and produces @tbl-class-summary which exhibits class frequencies whether we focus on gait events () or gait phases ().
 
+```{r}
+#| label: tbl-class-summary
+#| tbl-cap: "Class frequencies in the labelled gait data."
+#| tbl-subcap: ["Gait events", "Gait phases"]
+#| layout-ncol: 2
+#| html-table-processing: none
 labelled_gait_data |>
   dplyr::count(event_type) |>
-  gt::gt()
+  gt::gt() |>
+  gt::cols_label(
+    event_type = "Event",
+    n = "Frequency"
+  ) |>
+  gt::opt_stylize(style = 6, color = 'gray') |>
+  gt::cols_align(align = "center") |>
+  gt::tab_style(
+    style = "vertical-align:top",
+    locations = gt::cells_column_labels()
+  )
 
 labelled_gait_data |>
   dplyr::count(phase_type) |>
-  gt::gt()
+  gt::gt() |>
+  gt::cols_label(
+    phase_type = "Phase",
+    n = "Frequency"
+  ) |>
+  gt::opt_stylize(style = 6, color = 'gray') |>
+  gt::cols_align(align = "center") |>
+  gt::tab_style(
+    style = "vertical-align:top",
+    locations = gt::cells_column_labels()
+  )
 ```
 
 We can notice that gait events of interest are largely under-represented while this class imbalance issue is moderate when we put the focus on gait phases.