Commit 3d1a85f

Added intros to the chapters and fixed a few things
1 parent 9cabef0 commit 3d1a85f

5 files changed: +23 -8 lines changed

Memoria TFM/Capitulos/03Capitulo3.tex

Lines changed: 3 additions & 7 deletions
@@ -16,6 +16,8 @@ \chapter{Data description and preprocessing}
 \end{Fuente}
 \end{FraseCelebre}
 
+This Chapter is devoted to the first, and a very important, stage of every Data Mining process: data preprocessing. In the following sections, the data we have worked with are described in depth, together with the way they have been preprocessed. By the end, the data will be ready to be analysed with the methodology specified in the next Chapter.
+
 %-------------------------------------------------------------------
 \section{Introduction}
 %-------------------------------------------------------------------
@@ -146,13 +148,7 @@ \section{Company Log}
 
 The dependent variable, or class, is a label which inherently assigns a decision (and thus the subsequent action) to every request. This can be \textit{ALLOW}, if the access is permitted according to the \ac{ISP}, or \textit{DENY}, if the connection is not permitted. These patterns are labelled using an `engine' based on a set of security rules that specify the decision to make. This process is described in Chapter \ref{cap4:methodology}.
 
-During the time of research for this Master Thesis, we have had access to three different log files:
-
-\begin{itemize}
-\item In the first one, data were gathered along a period of two hours, from 8.30 to 10.30 am (30 minutes after the work started), monitoring the activity of all the employees in a medium-size Spanish company (80-100 people), obtaining 100000 patterns. We consider this dataset quite complete because it contains a very diverse amount of connection patterns, going from personal (traditionally addressed at the first hour of work) to professional issues (the rest of the day). The file was in CSV format.
-\item The second one was provided in JSON format, and it contains log entries from 12 days, and 5 million patterns.
-\item A third one, again in CSV format, that is a subset of the second dataset and contains 1 million entries.
-\end{itemize}
+During the research for this Master's Thesis, we had access to a dataset gathered over a period of two hours, from 8.30 to 10.30 am (starting 30 minutes into the working day), monitoring the activity of all the employees of a medium-size Spanish company (80-100 people) and yielding 100,000 patterns. We consider this dataset quite complete because it contains a very diverse set of connection patterns, ranging from personal matters (traditionally addressed during the first hour of work) to professional ones (the rest of the day). The file was in CSV format.
 
 
 %-------------------------------------------------------------------

Memoria TFM/Capitulos/05Capitulo5.tex

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ \chapter{Results}
 \end{Fuente}
 \end{FraseCelebre}
 
+This Chapter reports all the experiments that have been conducted, from the first stage, using the initial (and unbalanced) log file, to the last, in which we used separate files for training and testing (see Section \ref{cap4:sec:traintest}). At the beginning, given that we were working with rules, the initial log file (with the entries labelled as allowed or denied) was tested under a \textit{cross-validation} partition with all the rule and tree classifiers available in Weka. From this initial ranking, we took the five classifiers that obtained the best results and continued testing with all the remaining partitions. For each section, the best results are discussed; conclusions are drawn in the next Chapter.
+
 %-------------------------------------------------------------------
 \section{Experiment results}
 %-------------------------------------------------------------------

Memoria TFM/Capitulos/06Capitulo6.tex

Lines changed: 8 additions & 1 deletion
@@ -16,6 +16,8 @@ \chapter{Conclusions and future work}
 \end{Fuente}
 \end{FraseCelebre}
 
+Finally, this last Chapter is devoted to drawing the conclusions of this research work and lists the papers that resulted from it. As we intend to continue researching this topic, objectives for future work are also set out.
+
 %-------------------------------------------------------------------
 \section{Discussion}
 %-------------------------------------------------------------------
@@ -53,7 +55,12 @@ \section{Future Work}
 %-------------------------------------------------------------------
 \label{cap6:sec:future}
 
-Future lines of work include conducting a deeper set of experiments trying to test the generalisation power of the method, maybe considering bigger data divisions, bigger data sets (from a whole day or working day), or adding some kind of `noise' to the dataset. One of the future steps to follow will be to perform experiments with the other two datasets (one bigger of 5 million entries, and a 1 million entries subset of it) described in Section \ref{cap3:sec:log}, and obtained almost at the end of this research work.
+Future lines of work include conducting a deeper set of experiments to test the generalisation power of the method, perhaps considering bigger data divisions, bigger datasets (covering a whole day or working day), or adding some kind of `noise' to the dataset. One of the next steps will be to perform experiments with two other datasets to which we have recently gained access:
+
+\begin{itemize}
+\item A log file provided in JSON format, which contains log entries from 12 days and 5 million patterns. As can be found in \ac{CPAN} \citep{cpan_json}, there is a Perl module to work directly with this format, so it is a great opportunity to continue testing.
+\item Another log file, in CSV format like the one processed during this research work, which is a subset of the previous dataset and contains 1 million entries. This one could be processed before the one in JSON format, given that we already have the implementation to preprocess CSV log files.
+\end{itemize}
 
 Therefore, considering the good classification results obtained, another next step could be the application of these methods in the real system from which the data were gathered, counting on the opinion of expert \ac{CSO}s, in order to assess the real value of the proposal.
 The study of other classification methods could be another research
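The labelling of each log entry with an ALLOW/DENY class, mentioned in the Future Work hunk above, would carry over to the JSON log. As a minimal illustration only (the thesis plans to use the Perl JSON module from CPAN; Python and the field names `src`, `dst`, and `decision` here are assumptions, not the actual log schema), reading a JSON-lines log and tallying the class labels could look like this:

```python
import json

# Hypothetical sketch, not part of the thesis code: we assume each log
# entry is one JSON object per line carrying a "decision" field labelled
# ALLOW or DENY, mirroring the class described for the CSV log.
sample_log = [
    '{"src": "10.0.0.5", "dst": "example.com", "decision": "ALLOW"}',
    '{"src": "10.0.0.7", "dst": "ads.example", "decision": "DENY"}',
    '{"src": "10.0.0.5", "dst": "intranet", "decision": "ALLOW"}',
]

def count_decisions(lines):
    """Count ALLOW/DENY class labels in a JSON-lines firewall log."""
    counts = {"ALLOW": 0, "DENY": 0}
    for line in lines:
        entry = json.loads(line)          # parse one log entry
        counts[entry["decision"]] += 1    # tally its class label
    return counts

print(count_decisions(sample_log))  # {'ALLOW': 2, 'DENY': 1}
```

Such a class distribution check would be a natural first step before reusing the existing CSV preprocessing pipeline on the new data.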

Memoria TFM/acronimos.gdf

Lines changed: 2 additions & 0 deletions
@@ -71,3 +71,5 @@
 @entry{CI, , \emph{Computational Intelligence}}
 
 @entry{MCT, , \emph{Main Content Type}}
+
+@entry{CPAN, , \emph{Comprehensive Perl Archive Network}}

Memoria TFM/otros.bib

Lines changed: 8 additions & 0 deletions
@@ -874,6 +874,14 @@ @incollection{cost_adjustment_07
 pages={169-178}
 }
 
+@misc{cpan_json,
+  author = {Makamaka Hannyaharamitu},
+  title = {JSON, a Perl module},
+  year = {2005},
+  webpage = {http://search.cpan.org/~makamaka/JSON-2.90/lib/JSON.pm},
+  lastaccess = {September, 2014}
+}
+
 %--------------------------------
 % Appendix references
 %--------------------------------
