UPPA-s-University-Projects
diff --git a/bib/glossaries/abbreviation.tex b/bib/glossaries/abbreviation.tex
Lines changed: 4 additions & 0 deletions
diff --git a/main.tex b/main.tex
Lines changed: 4 additions & 6 deletions
diff --git a/res/diagram/simplifiedmdl-Imp_mdl.drawio.png b/res/diagram/simplifiedmdl-Imp_mdl.drawio.png
86.5 KB
diff --git a/res/graph/data_analysis/raw/academicmentions_academic_year.png b/res/graph/data_analysis/raw/academicmentions_academic_year.png
194 KB
diff --git a/res/graph/data_analysis/raw/heatymap_year.png b/res/graph/data_analysis/raw/heatymap_year.png
13.7 KB
diff --git a/res/graph/data_analysis/raw/nbadmissions_year.png b/res/graph/data_analysis/raw/nbadmissions_year.png
129 KB
diff --git a/sections/conclusion.tex b/sections/conclusion.tex
Lines changed: 11 additions & 12 deletions
diff --git a/sections/conprop.tex b/sections/conprop.tex
Lines changed: 2 additions & 19 deletions
diff --git a/sections/imp.tex b/sections/imp.tex
Lines changed: 97 additions & 0 deletions
diff --git a/sections/soa/subsec:soa_predictingstudentdropout.tex b/sections/soa/subsec:soa_predictingstudentdropout.tex
Lines changed: 6 additions & 3 deletions
@@ -14,6 +14,10 @@
 
 \newacronym{rf}{RF}{Random Forest}
 
+\newacronym{if}{IF}{Isolation Forest}
+
+\newacronym{lr}{LR}{Lasso Regression}
+
 \newacronym{smote}{SMOTE}{Synthetic Minority Over-sampling TEchnique}
 
 \newacronym{roc}{ROC}{Receiver operating characteristic}
 
@@ -11,6 +11,7 @@
 \usepackage{float}
 \usepackage{subfiles}
 \usepackage[toc]{glossaries}
+\usepackage{listings}
 
 %Style coding
 %\restylefloat{table}
@@ -76,12 +77,9 @@ \section{Conceptual proposal}
 \label{sec:conprop}
 \subfile{sections/conprop}
 
-%Didn't had the time to recieve and experiment with the different datasets.
-%So the implementation will have to be added in a second time (version)
-%TODO: Add the implementation when experimentations are done
-% \section{Implementation}
-% \label{sec:imp}
-% \subfile{sections/imp}
+\section{Implementation}
+\label{sec:imp}
+\subfile{sections/imp}
 
 \section{Conclusion}
 \label{sec:conclusion}
 
@@ -10,20 +10,19 @@ \subsection{Summary of Literature Review Findings}
 First, let us return to our literature review and summarise the state of the art on the subject.
 We saw that much research has been done on predicting student success or dropout, or on similar subjects, using statistical analysis and/or \acrfull{ml} algorithms and \acrfull{ai}.
 We have considered predicting a student's success and predicting their failure to be the same problem, driven by the same human factors. When looking at the analytical part of the literature (not focusing on any machine learning model or algorithm), we were able to extract a list of factors commonly shown to influence a person's success or failure, including factors specific to students. As described in the analytical predictive approach of our state of the art (\ref{subsubsec:soa_analyticalapproach}), the factors were : 
-\cite{opazo_analysis_2021,tinto_dropout_1975,caspersen_teachers_2015,lidia_problema_2006,bejarano_caso_2017,sinchi_acceso_2018,cavero_voluntad_2011,velasco_alisis_nodate}: 
 
 \begin{itemize}
-    \item Family : Does that person got support from their family? Do they still have a family, are they in good term, are they living with them?
-    \item Previous educational background : What is this individual background on an educational level? What was their last diploma, which level are they on? 
-    \item Academic potential : Do they have already been approached as potential excellent student?
-    \item Normative congruence : Does the individual conform to societal rules? 
-    \item Friendship support : Does the individual have good support from friends? Do they have friends? How are they social life with other person (preferably from within their age range)?
-    \item Intellectual development : Has the individual been able to process and have a \textit{regular} intellectual development? Do they have a condition impacting this factor? 
-    \item Educational performance : Have they proven performant on an educational level already? How were they previous performance?
-    \item Social integration : Have they integrated fine with other student, staff and their new academic environment?
-    \item Satisfaction : Are they satisfied with their life's choice (More precisely, are they happy with their study choice?)
-    \item Institutional commitment : Do they commit to their success and to the institutional life? Or do they only go in class and do the bare minimum?
-    \item Student adaptation : Just like \textbf{Social integration} and \textbf{Normative congruence}, how does that individual adapt to its new environment and life?
+    \item Family
+    \item Previous educational background
+    \item Academic potential
+    \item Normative congruence
+    \item Friendship support
+    \item Intellectual development
+    \item Educational performance
+    \item Social integration
+    \item Satisfaction
+    \item Institutional commitment
+    \item Student adaptation
 \end{itemize}
 
 Once we had gathered enough information to determine which factors we needed to extract from our datasets to feed our models, we had to identify which models had already been tested and proven in the literature. 
 
@@ -14,29 +14,11 @@
 \subsection{Feeding data}
 \label{subsec:conprop_feedingdata}
 Our literature survey (\ref{subsubsec:soa_analyticalapproach}) identified several key factors influencing student retention and success. We can extrapolate and hypothesise that such broad factors could be used to determine a student's success.
-These factors, hypothesized to be critical in predicting student trajectories, are: \cite{opazo_analysis_2021,tinto_dropout_1975,caspersen_teachers_2015,lidia_problema_2006,bejarano_caso_2017,sinchi_acceso_2018,cavero_voluntad_2011,velasco_alisis_nodate}: 
-
-\begin{itemize}
-    \item Family : Does that person got support from their family? Do they still have a family, are they in good term, are they living with them?
-    \item Previous educational background : What is this individual background on an educational level? What was their last diploma, which level are they on? 
-    \item Academic potential : Do they have already been approached as potential excellent student?
-    \item Normative congruence : Does the individual conform to societal rules? 
-    \item Friendship support : Does the individual have good support from friends? Do they have friends? How are they social life with other person (preferably from within their age range)?
-    \item Intellectual development : Has the individual been able to process and have a \textit{regular} intellectual development? Do they have a condition impacting this factor? 
-    \item Educational performance : Have they proven performant on an educational level already? How were they previous performance?
-    \item Social integration : Have they integrated fine with other student, staff and their new academic environment?
-    \item Satisfaction : Are they satisfied with their life's choice (More precisely, are they happy with their study choice?)
-    \item Institutional commitment : Do they commit to their success and to the institutional life? Or do they only go in class and do the bare minimum?
-    \item Student adaptation : Just like \textbf{Social integration} and \textbf{Normative congruence}, how does that individual adapt to its new environment and life?
-\end{itemize}
-
 
 \subsection{Data workflow}
 \label{subsec:concimp_dataworkflow}
 Our workflow, as depicted in Figure \ref{fig:dataworkflow}, is designed to systematically transform raw data into actionable insights. Even though we are looking to find excellence in registration for students, this model could be used and/or improved as a security measure to detect students at risk of dropping out.
 
-
-
 Each component of the workflow serves a strategic purpose:
 
 \begin{enumerate}
@@ -60,6 +42,7 @@ \subsection{Available dataset}
     \item Institutional commitment : Do they commit to their success and to the institutional life? Or do they only go in class and do the bare minimum?
 \end{itemize}
 
+
 \subsection{Validation and Expected Outcomes}
 \label{subsec:concimp_validexcpecoutcomes}
  We anticipate that this workflow will yield a robust model capable of identifying excellent students. We will gauge the efficiency of our model through rigorous validation techniques such as \acrfull{roc}, \acrfull{pca}, etc. to ensure the reliability of our predictions. 
@@ -93,7 +76,7 @@ \subsection{Usage on the field}
     \label{fig:imp_fonc}
 \end{figure}
 
-This diagram\ref{fig_imp_fond} shows a possible way for institution on one implementation possibility following a pretty basic process. It starts from the data collection at the beginning, from one or multiple database from the institution. A pre-cleanup (and if needed data aggregation) should be deployed by the institution. We have left this part free of choice at  the moment.
+Figure \ref{fig:imp_fonc} shows one possible way for an institution to implement the framework, following a fairly basic process. It starts with data collection from one or more of the institution's databases. A pre-cleanup (and, if needed, data aggregation) should be deployed by the institution. We have left this part open for the moment.
 Then, once this new dataset has been constructed from the institution's data following our list of factors, we can send it through our framework model and wait for the output(s). Depending on the institution's \textbf{need} and \textbf{definition of success}, we can provide one or more outputs. We can also output model evaluation metrics if wanted or needed. 
 The dataset fed to the machine should include, in some form, the factors listed in subsection \ref{subsec:conprop_feedingdata}, so that our framework can create its student profiles and evaluate them.
 
 
@@ -2,5 +2,102 @@
 \graphicspath{{\subfix{../res/}}}
 \begin{document}
 
+Let us begin this implementation by studying our (cleaned) raw dataset and defining the outcomes we want from our system. Once these two points have been discussed and the choices made, we will explore the construction and results of our experimental system.
+
+\subsection{Analysing the raw dataset}
+We first studied our raw dataset (after some light cleaning) to understand the data we are working with. We started with the size of the data universe itself, that is, how many entries we have available. This is the result of this first analysis :
+\begin{table}[h]
+  \centering
+  \begin{tabular}{|c|c|}
+    \hline
+    Academic Year & Number of Students \\
+    \hline
+    2018-2019 & 14 \\
+    2019-2020 & 13 \\
+    2020-2021 & 13 \\
+    2021-2022 & 17 \\
+    2022-2023 & 32 \\
+    \hline
+    Total & 90 \\
+    \hline
+  \end{tabular}
+  \caption{Number of Students per Academic Year}
+  \label{tab:students_per_year}
+\end{table}
+
+Our dataset is small for our needs, so we have to exclude some parts of our conceptual model in order to test at least our base hypothesis: that we can split our dataset into excellent, average and at-risk students.
+Here is the model we are going to work with for this first implementation, until more data arrives to test more modules of our system : 
+
+\begin{figure}[H]      
+    \includegraphics[width=1\linewidth]{res/diagram/simplifiedmdl-Imp_mdl.drawio.png}
+    \caption{Simplified algorithmic workflow used in this first implementation.}
+    \label{fig:dataworkflow_simp} %lol
+\end{figure}
+
+Continuing our analysis, we wanted to see the age distribution of our sample using the year of birth available in our dataset. Here is a heatmap of year of birth against academic year, showing the hotspots : 
+\begin{figure}[H]      
+    \includegraphics[width=1\linewidth]{res/graph/data_analysis/raw/heatymap_year.png}
+    \caption{Heatmap of year of birth by academic year}
+    \label{fig:heatmap_dob_acayear}
+\end{figure}
+
+As we can see, a strong hotspot is the year of birth 1998, with a frequency of 16 out of 90 samples. For the academic year 2022/2023, there is another strong hotspot at the year of birth 2000 (with a frequency of 9 out of 9 for this year of birth), which indicates that all students born in 2000 registered in the same academic year (2022/2023).
+
+Another variable we wanted to study is the academic mention obtained by students in each academic year, which gives us a view of the best and worst years academically in the dataset.
+\begin{figure}[H]      
+    \includegraphics[width=1\linewidth]{res/graph/data_analysis/raw/academicmentions_academic_year.png}
+    \caption{Histogram of academic mentions by academic years}
+    \label{fig:hist_acament_acayear}
+\end{figure}
+
+
+We can detect one \textit{outlier} among the five available academic years. The year 2022/2023 has been the best so far, with an outstanding 4 \textit{Très bien} (very good) and 15 \textit{Bien} (good) mentions, while the year 2018/2019 lags behind with 3 students out of 14 obtaining a \textit{Passable} mention (only average). However, for both 2021/2022 and 2022/2023, we have more students with no mention at all.
+
+Finally, we wanted to see the number of students admitted per year and build a comparative table of the mean final grade for each year. 
+\begin{figure}[H]      
+    \includegraphics[width=1\linewidth]{res/graph/data_analysis/raw/nbadmissions_year.png}
+    \caption{Evolution on the number of admissions by year}
+    \label{fig:evol_nb_admis}
+\end{figure}
+\begin{table}[h]
+  \centering
+  \begin{tabular}{|c|c|c|c|}
+    \hline
+    Academic Year & Admitted & Adjourned & Total \\
+    \hline
+    2018-2019 & 12.89 & 0.00 & 12.89 \\
+    2019-2020 & 13.82 & 0.00 & 13.82 \\
+    2020-2021 & 13.53 & 0.00 & 13.53 \\
+    2021-2022 & 13.98 & 11.49 & 13.83 \\
+    2022-2023 & 14.41 & 8.67 & 14.05 \\
+    \hline
+  \end{tabular}
+  \caption{Admission Statistics (global grade mean) per Academic Year}
+  \label{tab:admission_statistics}
+\end{table}
+
+Because our data is not normalized for an \textit{at risk} model, we will only consider predicting excellence in our dataset with our model.
+As we can see, the global mean is quite similar from year to year, staying around 13.5/20.
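+
+As an illustrative consistency check, assume that the Total column of Table \ref{tab:admission_statistics} is the head-count weighted mean of the two other columns. The counts $n_{\mathrm{adm}}$ and $n_{\mathrm{adj}}$ are notation introduced here; they are not reported in the tables, and the values used below are an assumption :
+\begin{equation}
+    \bar{g}_{\mathrm{total}} = \frac{n_{\mathrm{adm}}\,\bar{g}_{\mathrm{adm}} + n_{\mathrm{adj}}\,\bar{g}_{\mathrm{adj}}}{n_{\mathrm{adm}} + n_{\mathrm{adj}}}
+\end{equation}
+For 2021-2022, with the 17 students of Table \ref{tab:students_per_year} and assuming a single adjourned student, this gives $(16 \times 13.98 + 1 \times 11.49)/17 \approx 13.83$, which matches the reported total. Under this reading, a value of 0.00 in the Adjourned column would simply mean that no student was adjourned that year.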
+
+\subsection{Defining the outcomes}
+
+As discussed in our State of the Art (Section \ref{sec:soa}) and Analysis (Section \ref{sec:analysis}) sections, one difficult point in this kind of study is the definition of success. This broad question needs to be answered before constructing the system, as its parameters will be influenced by the outcomes we need.
+For this first implementation, and given the variables and data available in our dataset, we define success simply from the final overall grade of the student. Thus, we can define \textbf{our} success as follows :
+\begin{quote}
+    \textbf{Success :} A student who has completed the registration process correctly, is not barred from the diploma (forbidden students) and has an overall grade :
+    \begin{equation}
+        \mathit{grade} \geq 16
+    \end{equation}
+\end{quote}
+
+We chose 16 as the minimum value for success based on our dataset, in which the maximum for this variable is 17, so as to keep as much data as possible available to train our system.
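+
+As a sketch only, the definition above can be written as a binary label. The symbols $y_i$, $g_i$, $r_i$ and $f_i$ are notation introduced here for illustration and are not variable names from the dataset :
+\begin{equation}
+    y_i = \left\{
+    \begin{array}{ll}
+        1 & \mbox{if } r_i = 1,\ f_i = 0 \mbox{ and } g_i \geq 16, \\
+        0 & \mbox{otherwise,}
+    \end{array}
+    \right.
+\end{equation}
+where $g_i$ is the overall grade of student $i$ out of 20, $r_i = 1$ when the registration process was completed and $f_i = 1$ when the student is forbidden.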
+
+\subsection{Building and setting up the system}
+
+From figure \ref{fig:dataworkflow_simp}, we must build a system composed of four different \acrshort{ml} algorithms (\acrfull{knn}, \acrfull{if}, \acrfull{lr}), set up to find our definition of success within our dataset. This means creating a profile of the excellent student: an average grade of at least 16, not forbidden, and a fully completed registration process. 
+Then, after training our model, we have to teach it the correlations within the profile we have set up, so that it can find excellence within our registration datasets.
+
+
+
 
 \end{document}
@@ -12,9 +12,9 @@
 \end{figure}
 
 If we dive deeper into the analytical search on these platforms (we are going to concentrate on Scopus for now), using this search term : 
-\begin{verbatim}
+\begin{lstlisting}[breaklines]
 TITLE-ABS-KEY ( student  AND dropout )  AND  ( LIMIT-TO ( SUBJAREA ,  "SOCI" )  OR  LIMIT-TO ( SUBJAREA ,  "COMP" )  OR  LIMIT-TO ( SUBJAREA ,  "PSYC" )  OR  LIMIT-TO ( SUBJAREA ,  "ENGI" )  OR  LIMIT-TO ( SUBJAREA ,  "MATH" ) ) 
-\end{verbatim}
+\end{lstlisting}
 We can follow the trend in the number of publications each year on the subject of student dropout prediction, and we can once again notice the longevity of the subject in research, dating all the way back to the 1950s.
 Here is the analysis as a graph extracted from Scopus : 
 \begin{figure}[H]
@@ -41,7 +41,10 @@
 
 We thus understand how universal this problem is, given the variety of top countries publishing on the subject since the 1950s. At least one country from each of the five continents has a publication on this subject. Moreover, many fields have looked into the subject, giving us many interesting points of view to analyse.
 
-Now, if we look for the same subject but adding the \acrshort{ml} or \acrshort{ai} to it :TITLE-ABS-KEY ( student  AND  dropout  AND  ( machine  AND  learning  OR  artificial  AND  intelligence ) )  AND  ( LIMIT-TO ( SUBJAREA ,  "SOCI" )  OR  LIMIT-TO ( SUBJAREA ,  "COMP" )  OR  LIMIT-TO ( SUBJAREA ,  "PSYC" )  OR  LIMIT-TO ( SUBJAREA ,  "ENGI" )  OR  LIMIT-TO ( SUBJAREA ,  "MATH" )  OR  LIMIT-TO ( SUBJAREA ,  "DECI" ) )|
+Now, if we search for the same subject but add \acrshort{ml} or \acrshort{ai} to it :
+\begin{lstlisting}[breaklines]
+TITLE-ABS-KEY ( student  AND  dropout  AND  ( machine  AND  learning  OR  artificial  AND  intelligence ) )  AND  ( LIMIT-TO ( SUBJAREA ,  "SOCI" )  OR  LIMIT-TO ( SUBJAREA ,  "COMP" )  OR  LIMIT-TO ( SUBJAREA ,  "PSYC" )  OR  LIMIT-TO ( SUBJAREA ,  "ENGI" )  OR  LIMIT-TO ( SUBJAREA ,  "MATH" )  OR  LIMIT-TO ( SUBJAREA ,  "DECI" ) )
+\end{lstlisting}
 
 We obtain the following graph :
 \begin{figure}[H]