|
2 | 2 | \graphicspath{{\subfix{../res/}}} |
3 | 3 | \begin{document} |
4 | 4 |
|
| 5 | +We begin this implementation by studying our cleaned raw dataset and defining the outcomes we want from our system. Once these two points have been discussed and the choices made, we explore the construction and results of our experimental system. |
| 6 | + |
| 7 | +\subsection{Analysing the raw dataset} |
| 8 | +We first studied our raw dataset (after some light cleaning) to understand the data we are working with. We started with the size of the data universe itself: how many entries do we have at hand? Table \ref{tab:students_per_year} shows the result of this first analysis: |
| 9 | +\begin{table}[h] |
| 10 | + \centering |
| 11 | + \begin{tabular}{|c|c|} |
| 12 | + \hline |
| 13 | + Academic Year & Number of Students \\ |
| 14 | + \hline |
| 15 | + 2018-2019 & 14 \\ |
| 16 | + 2019-2020 & 13 \\ |
| 17 | + 2020-2021 & 13 \\ |
| 18 | + 2021-2022 & 17 \\ |
| 19 | + 2022-2023 & 32 \\ |
| 20 | + \hline |
| 21 | + Total & 90 \\ |
| 22 | + \hline |
| 23 | + \end{tabular} |
| 24 | + \caption{Number of Students per Academic Year} |
| 25 | + \label{tab:students_per_year} |
| 26 | +\end{table} |
| 27 | + |
| 28 | +Our dataset is small for our needs, so we have to exclude some parts of our conceptual model in order to test at least our basic hypothesis: that the dataset can be split into excellent, average and at-risk students. |
| 29 | +Here is the model we work with in this first implementation, until more data arrives to test further modules of our system: |
| 30 | + |
| 31 | +\begin{figure}[H] |
| 32 | + \includegraphics[width=1\linewidth]{res//diagram/simplifiedmdl-Imp_mdl.drawio.png} |
| 33 | + \caption{Simplified algorithmic workflow used in this first implementation.} |
| 34 | + \label{fig:dataworkflow_simp} %lol |
| 35 | +\end{figure} |
| 36 | + |
| 37 | +Continuing our analysis, we looked at the age of our sample using the year of birth available in our dataset. Figure \ref{fig:heatmap_dob_acayear} is a heatmap of the academic years and their hotspots by year of birth: |
| 38 | +\begin{figure}[H] |
| 39 | + \includegraphics[width=1\linewidth]{res/graph/data_analysis/raw/heatymap_year.png} |
| 40 | + \caption{Heatmap of year of birth by academic year} |
| 41 | + \label{fig:heatmap_dob_acayear} |
| 42 | +\end{figure} |
| 43 | + |
| 44 | +As we can see, a strong hotspot is the year of birth 1998, with a frequency of 16 out of 90 samples. For the academic year 2022/2023, there is another strong hotspot at the year of birth 2000 (with a frequency of 9 out of 9 for this year of birth), which indicates that all students born in 2000 registered in the same academic year (2022/2023). |
| 45 | + |
| 46 | +Another variable we wanted to study is the academic mention obtained by students in each academic year, which shows which years were the best and worst academically in the dataset. |
| 47 | +\begin{figure}[H] |
| 48 | + \includegraphics[width=1\linewidth]{res/graph/data_analysis/raw/academicmentions_academic_year.png} |
| 49 | + \caption{Histogram of academic mentions by academic years} |
| 50 | + \label{fig:hist_acament_acayear} |
| 51 | +\end{figure} |
| 52 | + |
| 53 | + |
| 54 | +We can detect one \textit{outlier} among these five available academic years. The year 2022/2023 has been the best year so far, with an outstanding 4 \textit{Très bien} (very good) and 15 \textit{Bien} (good) mentions, while the year 2018/2019 lags behind, with 3 students out of 14 receiving a \textit{Passable} mention (only average). However, for both 2021/2022 and 2022/2023, we have more students with no mention at all. |
| 55 | + |
| 56 | +Finally, we wanted to see the number of students admitted each year, and to compare the mean of the students' final grades for each year. |
| 57 | +\begin{figure}[H] |
| 58 | + \includegraphics[width=1\linewidth]{res/graph/data_analysis/raw/nbadmissions_year.png} |
| 59 | + \caption{Evolution of the number of admissions by year} |
| 60 | + \label{fig:evol_nb_admis} |
| 61 | +\end{figure} |
| 62 | +\begin{table}[h] |
| 63 | + \centering |
| 64 | + \begin{tabular}{|c|c|c|c|} |
| 65 | + \hline |
| 66 | + Academic Year & Admitted & Adjourned & Total \\ |
| 67 | + \hline |
| 68 | + 2018-2019 & 12.89 & 0.00 & 12.89 \\ |
| 69 | + 2019-2020 & 13.82 & 0.00 & 13.82 \\ |
| 70 | + 2020-2021 & 13.53 & 0.00 & 13.53 \\ |
| 71 | + 2021-2022 & 13.98 & 11.49 & 13.83 \\ |
| 72 | + 2022-2023 & 14.41 & 8.67 & 14.05 \\ |
| 73 | + \hline |
| 74 | + \end{tabular} |
| 75 | + \caption{Admission Statistics (global grade mean) per Academic Year} |
| 76 | + \label{tab:admission_statistics} |
| 77 | +\end{table} |
| 78 | + |
| 79 | +As we can see, the global mean is quite similar from year to year, hovering around 13.5/20. |
| 80 | +Because our data is not normalized for an \textit{at risk} model, we will only consider predicting excellence in our dataset with our model. |
| 81 | + |
| 82 | +\subsection{Defining the outcomes} |
| 83 | + |
| 84 | +As discussed in our State of the Art (Section \ref{sec:soa}) and Analysis (Section \ref{sec:analysis}) sections, one difficult point in this kind of study is the definition of success. This broad question needs to be answered before constructing the system, as its parameters will be influenced by the outcomes we need. |
| 85 | +For this first implementation, and given the variables and data available in our dataset, we define success simply based on the final overall grade of the student. Thus, we can define \textbf{our} success as follows: |
| 86 | +\begin{quote} |
| 87 | + \textbf{Success:} A student who has completed the registration process correctly, is not barred from the diploma (forbidden students) and has an overall grade: |
| 88 | + \begin{equation} |
| 89 | + \mathit{grade} \geq 16 |
| 90 | + \end{equation} |
| 91 | +\end{quote} |
| 92 | + |
| 93 | +We chose 16 as the minimum value for success based on our dataset, which has a maximum of 17 for this variable, so as to have as much data as possible available to train our system. |
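The success definition above can be read as a simple labelling rule. The following is only a minimal sketch of that rule, not the implementation used in this work: the field names \texttt{registration\_complete}, \texttt{forbidden} and \texttt{overall\_grade} are hypothetical stand-ins for the real dataset columns, and the sample records are invented for illustration.

```python
from dataclasses import dataclass

# Minimum overall grade (out of 20) taken from the success definition.
SUCCESS_THRESHOLD = 16.0


@dataclass
class Student:
    # All field names are hypothetical; the real dataset columns may differ.
    registration_complete: bool
    forbidden: bool
    overall_grade: float


def is_success(student: Student) -> bool:
    """Return True when the student matches the success definition."""
    return (
        student.registration_complete
        and not student.forbidden
        and student.overall_grade >= SUCCESS_THRESHOLD
    )


# Made-up records for illustration only, not taken from the dataset.
sample = [
    Student(True, False, 16.0),   # meets every condition
    Student(True, False, 15.9),   # grade below the threshold
    Student(True, True, 17.0),    # forbidden student
    Student(False, False, 16.5),  # registration not completed
]
labels = [is_success(s) for s in sample]
```

Keeping the threshold in a single named constant makes it easy to revisit the choice of 16 once more data becomes available.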
| 94 | + |
| 95 | +\subsection{Building and setting up the system} |
| 96 | + |
| 97 | +From Figure \ref{fig:dataworkflow_simp}, we must build a system composed of four different \acrshort{ml} algorithms (\acrfull{knn}, \acrfull{if}, \acrfull{lr}), set up to find our definition of success within our dataset. This means creating a profile of an excellent student: an average grade of at least 16, not forbidden, and with a fully completed registration process. |
| 98 | +Then, by training our model on this profile, we teach it the correlations that let it find excellence within our registration datasets. |
| 99 | + |
| 100 | + |
| 101 | + |
5 | 102 |
|
6 | 103 | \end{document} |