# README.md

We have to predict cognitive state for time `t` and `t+3`. The target for `t` is equal to the value of the `induced_state` column.
I had several ideas for feature generation, and I combined them into the following groups.
1. Raw sensor data. Provide data “as is.”
2. Rolling statistics with different time windows (5, 999999 seconds) for both separate sessions and “global” (i.e. no separate sessions). Rolling statistics include: mean, std, z-score: [x - mean(x)] / std(x).
3. Shift features, i.e. the value of sensor data a second ago, two seconds ago, etc.
4. Features based on the interactions between sensor data, e.g., the value of Zephyr_HR divided by the value of Zephyr_HRV.
5. Features based on the distances between eye positions and gazing points.
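Groups 2–4 above can be sketched in pandas. This is a minimal illustration, not the actual pipeline; the tiny DataFrame and the derived column names are hypothetical, while `Zephyr_HR` and `Zephyr_HRV` follow the example in group 4:

```python
import pandas as pd

# Hypothetical one-second sensor readings; column names follow the examples above.
df = pd.DataFrame({
    "Zephyr_HR":  [60.0, 62.0, 61.0, 64.0, 66.0, 65.0],
    "Zephyr_HRV": [50.0, 49.0, 51.0, 48.0, 47.0, 49.0],
})

window = 5  # one of the rolling windows; 999999 seconds approximates "global"
roll = df["Zephyr_HR"].rolling(window, min_periods=1)
df["hr_mean"] = roll.mean()
df["hr_std"] = roll.std()
# z-score: [x - mean(x)] / std(x)
df["hr_zscore"] = (df["Zephyr_HR"] - df["hr_mean"]) / df["hr_std"]

# Shift features: the sensor value one and two seconds ago.
df["hr_lag1"] = df["Zephyr_HR"].shift(1)
df["hr_lag2"] = df["Zephyr_HR"].shift(2)

# Interaction feature: Zephyr_HR divided by Zephyr_HRV.
df["hr_div_hrv"] = df["Zephyr_HR"] / df["Zephyr_HRV"]
```

In a real pipeline the same transformations would be computed twice: once per session and once globally over the whole recording.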
# WHITEPAPER.md
## Overview
This codebase is a winner of the Topcoder NASA Cognitive State Determination marathon match. As part of the final submission, the competitor was asked to complete this document. Personal details have been removed.
## 1. Introduction
Tell us a bit about yourself, and why you have decided to participate in the contest.
- **Handle:** tEarth
- **Placement you achieved in the MM:** 2nd
- **About you:** Data Scientist at Unity
- **Why you participated in the MM:** money, fame
## 2. Solution Development
How did you solve the problem? What approaches did you try, what choices did you make, and why? What alternative approaches did you consider? Also describe the cross-validation approaches you used.
- During EDA, I noticed that the timestamps might have “holes” between neighbors. A hole is defined as follows: if the time delta between two neighboring rows is above one second, there is a “hole” between them, and the two rows belong to different “sessions.” I noticed that the sensor data might differ between sessions.
- I also noticed that the actual target is always constant within these sessions. I incorporated my findings into feature generation and postprocessing of my predictions.
- I used stratified group k-fold for validation. Stratified: each fold has approximately the same number of samples for each class. Group: groups are taken from the provided test_suite column.
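The “holes” heuristic can be sketched in pandas. This is a minimal illustration under assumed column names (`timestamp`, `session_id`): a gap of more than one second between neighboring rows starts a new session:

```python
import pandas as pd

# Hypothetical per-row timestamps in seconds; the 5 -> 10 jump is a "hole".
df = pd.DataFrame({"timestamp": [1, 2, 3, 4, 5, 10, 11, 12]})

# A new session starts wherever the delta to the previous row exceeds one second.
gap = df["timestamp"].diff() > 1
df["session_id"] = gap.cumsum()
```

The cumulative sum of the gap indicator assigns a monotonically increasing id to each session, which can then serve both per-session feature generation and prediction postprocessing.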
## 3. Solution Architecture
Please describe how your algorithm handles the following steps:
- **Feature generation:** I had several ideas for feature generation, and I combined them into the following groups.
  1. Raw sensor data. Provide data “as is.”
  2. Rolling statistics with different time windows (5, 999999 seconds) for both separate sessions and “global” (i.e. no separate sessions). Rolling statistics include: mean, std, z-score: [x - mean(x)] / std(x).
  3. Shift features, i.e. the value of sensor data a second ago, two seconds ago, etc.
  4. Features based on the interactions between sensor data, e.g., the value of Zephyr_HR divided by the value of Zephyr_HRV.
  5. Features based on the distances between eye positions and gazing points.
- **Correlation:** I used gradient boosting trees for classification. The model consists of many decision trees; each tree outputs a probability for a certain class. The final prediction is a weighted sum of the predictions of each tree.
- **Postprocessing:** The target is the same within a single session. Therefore, I post-processed predictions by calculating the rolling average of the model’s predictions from the beginning of the session up to and including time t, the time for which we’re making predictions.
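The postprocessing step amounts to an expanding mean per session. A minimal pandas sketch, assuming hypothetical `session_id` and `pred` columns:

```python
import pandas as pd

# Hypothetical per-second class probabilities within two sessions.
df = pd.DataFrame({
    "session_id": [0, 0, 0, 1, 1],
    "pred":       [0.2, 0.4, 0.6, 0.8, 1.0],
})

# Average of predictions from the beginning of each session up to and including time t.
df["pred_smoothed"] = (
    df.groupby("session_id")["pred"]
      .expanding()
      .mean()
      .reset_index(level=0, drop=True)
)
```

Because the true target is constant within a session, averaging over the session prefix reduces the variance of per-second predictions without mixing information across sessions.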
## 4. Final Approach
Please provide a bulleted description of your final approach. What ideas/decisions/features were found to be the most important for your solution's performance? Include the feature importance techniques used and how you selected the data tags used.
- **Target:** We have to predict cognitive state for time t and t+3. The target for t is equal to the value of the induced_state column. The target for t+3 is the same as the target for t, because the cognitive state is the same within a session. Since the data and target are the same, I decided to train a single model and use it for making predictions for both t and t+3. Note: there is a separation between the t and t+3 models in the code. I decided to keep it in case the data differs in the future.
- **Model:** I used the LightGBM classifier. I optimized hyperparameters using Optuna. The final prediction is the average of the predictions of several LightGBM classifiers with different hyperparameters.
- **Feature importance:** To determine the most important features for making predictions for time t, I used the built-in SHAP values calculation. Then, I selected the top-3 sensor features from the output.
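The ensembling step (averaging the probabilities of several classifiers) can be sketched in numpy. The three arrays below are hypothetical stand-ins for the class-probability outputs of LightGBM models trained with different hyperparameters:

```python
import numpy as np

# Hypothetical class-probability outputs of three models
# (rows = samples, columns = classes), each tuned with different hyperparameters.
preds = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.6, 0.4], [0.3, 0.7]]),
    np.array([[0.8, 0.2], [0.1, 0.9]]),
]

# The final prediction is the average of the individual models' probabilities.
final = np.mean(preds, axis=0)
predicted_class = final.argmax(axis=1)
```

Averaging probabilities (rather than hard labels) keeps the ensemble output usable for the per-session rolling-average postprocessing described earlier.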
## 5. Open Source Resources, Frameworks and Libraries
Please specify the name of the open-source resource, the URL to where it can be found, and its license type.
- All libraries are open-sourced.
- pandas==1.3.5
- numpy==1.22.1
- lightgbm==3.3.2
- scipy==1.7.3
- optuna==2.10.0
## 6. Potential Algorithm Improvements
Please specify any potential improvements that can be made to the algorithm:
- We could improve the model's hyperparameters by running additional hyperparameter tuning.
- The model is vulnerable to the absence of input features, so it’s a good idea to use augmented data (i.e. feature sets in which one or more input signals are missing) for training.
- We could test models other than gradient boosting, e.g. neural nets.
## 7. Algorithm Limitations
Please specify any potential limitations with the algorithm:
- The algorithm requires historical data for feature engineering. If historical data isn’t available, the algorithm's performance will drop.
- The model uses all available input features for making predictions. The absence of features may result in a worse score.
52
56
53
-
8. Deployment Guide
57
+
## 8. Deployment Guide
Please provide the exact steps required to build and deploy the code:
- Same as pt. 9 (below), excluding model training (step 2).
## 9. Final Verification
Please provide instructions that explain how to train the algorithm and have it execute against sample data: