Commit 3d1a85f

Added intros to the chapters and fixed a few things
1 parent 9cabef0 commit 3d1a85f

5 files changed: +23 -8 lines changed

Memoria TFM/Capitulos/03Capitulo3.tex

Lines changed: 3 additions & 7 deletions
@@ -16,6 +16,8 @@ \chapter{Data description and preprocessing}
 \end{Fuente}
 \end{FraseCelebre}
 
+This Chapter is devoted to the first, and a very important, stage of every Data Mining process: data preprocessing. In the following sections, the data we have worked with are described in depth, together with the way they have been preprocessed. By the end, the data will be ready to be analysed with the methodology specified in the next Chapter.
+
 %-------------------------------------------------------------------
 \section{Introduction}
 %-------------------------------------------------------------------
@@ -146,13 +148,7 @@ \section{Company Log}
 
 The dependent variable, or class, is a label which inherently assigns a decision (and thus the subsequent action) to every request. This can be \textit{ALLOW}, if the access is permitted according to the \ac{ISP}, or \textit{DENY}, if the connection is not permitted. These patterns are labelled using an `engine' based on a set of security rules that specify the decision to make. This process is described in Chapter \ref{cap4:methodology}.
 
-During the time of research for this Master Thesis, we have had access to three different log files:
-
-\begin{itemize}
-\item In the first one, data were gathered along a period of two hours, from 8.30 to 10.30 am (30 minutes after the work started), monitoring the activity of all the employees in a medium-size Spanish company (80-100 people), obtaining 100000 patterns. We consider this dataset quite complete because it contains a very diverse amount of connection patterns, going from personal (traditionally addressed at the first hour of work) to professional issues (the rest of the day). The file was in CSV format.
-\item The second one was provided in JSON format, and it contains log entries from 12 days, and 5 million patterns.
-\item A third one, again in CSV format, that is a subset of the second dataset and contains 1 million entries.
-\end{itemize}
+During the research for this Master's Thesis, we had access to a dataset gathered over a period of two hours, from 8.30 to 10.30 am (starting 30 minutes into the working day), monitoring the activity of all the employees of a medium-size Spanish company (80-100 people) and yielding 100,000 patterns. We consider this dataset quite complete because it contains a very diverse set of connection patterns, ranging from personal matters (traditionally addressed during the first hour of work) to professional ones (the rest of the day). The file was in CSV format.
 
 
 %-------------------------------------------------------------------

Memoria TFM/Capitulos/05Capitulo5.tex

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ \chapter{Results}
 \end{Fuente}
 \end{FraseCelebre}
 
+This Chapter reports all the experiments that have been conducted, from the first stage, using the initial (and unbalanced) log file, to the last, in which we used separate files for training and testing (see Section \ref{cap4:sec:traintest}). At the beginning, given that we were working with rules, the initial log file (with the entries labelled as allowed or denied) was tested under a \textit{cross-validation} partition with all the rule and tree classifiers available in Weka. From this initial ranking, we took the five classifiers that obtained the best results and continued testing with all the remaining partitions. For each section, the best results are discussed; conclusions are drawn in the next Chapter.
+
 %-------------------------------------------------------------------
 \section{Experiment results}
 %-------------------------------------------------------------------

Memoria TFM/Capitulos/06Capitulo6.tex

Lines changed: 8 additions & 1 deletion
@@ -16,6 +16,8 @@ \chapter{Conclusions and future work}
 \end{Fuente}
 \end{FraseCelebre}
 
+Finally, this last Chapter is devoted to drawing the conclusions of this research work and lists the papers that resulted from it. As we intend to continue researching this topic, objectives for future work are also set out.
+
 %-------------------------------------------------------------------
 \section{Discussion}
 %-------------------------------------------------------------------
@@ -53,7 +55,12 @@ \section{Future Work}
 %-------------------------------------------------------------------
 \label{cap6:sec:future}
 
-Future lines of work include conducting a deeper set of experiments trying to test the generalisation power of the method, maybe considering bigger data divisions, bigger data sets (from a whole day or working day), or adding some kind of `noise' to the dataset. One of the future steps to follow will be to perform experiments with the other two datasets (one bigger of 5 million entries, and a 1 million entries subset of it) described in Section \ref{cap3:sec:log}, and obtained almost at the end of this research work.
+Future lines of work include conducting a deeper set of experiments to test the generalisation power of the method, perhaps considering bigger data divisions, bigger datasets (covering a whole day or working day), or adding some kind of `noise' to the dataset. One of the next steps will be to perform experiments with two other datasets to which we have recently gained access:
+
+\begin{itemize}
+\item A log file provided in JSON format, which contains log entries from 12 days and 5 million patterns. As can be found in \ac{CPAN} \citep{cpan_json}, there is a Perl module to work directly with this format, so it is a great opportunity to continue testing.
+\item Another log file, in CSV format like the one processed during this research work, which is a subset of the previous dataset and contains 1 million entries. This one could be processed before the one in JSON format, given that we already have the implementation to preprocess CSV log files.
+\end{itemize}
 
 Therefore, considering the good classification results obtained, another next step could be the application of these methods in the real system from which the data were gathered, counting on the opinion of expert \ac{CSO}s, in order to assess the real value of the proposal.
 The study of other classification methods could be another research
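The labelling of each log entry with an ALLOW/DENY class, mentioned in the Future Work hunk above, would carry over to the JSON log. As a minimal illustration only (the thesis plans to use the Perl JSON module from CPAN; Python and the field names `src`, `dst`, and `decision` here are assumptions, not the actual log schema), reading a JSON-lines log and tallying the class labels could look like this:

```python
import json

# Hypothetical sketch, not part of the thesis code: we assume each log
# entry is one JSON object per line carrying a "decision" field labelled
# ALLOW or DENY, mirroring the class described for the CSV log.
sample_log = [
    '{"src": "10.0.0.5", "dst": "example.com", "decision": "ALLOW"}',
    '{"src": "10.0.0.7", "dst": "ads.example", "decision": "DENY"}',
    '{"src": "10.0.0.5", "dst": "intranet", "decision": "ALLOW"}',
]

def count_decisions(lines):
    """Count ALLOW/DENY class labels in a JSON-lines firewall log."""
    counts = {"ALLOW": 0, "DENY": 0}
    for line in lines:
        entry = json.loads(line)          # parse one log entry
        counts[entry["decision"]] += 1    # tally its class label
    return counts

print(count_decisions(sample_log))  # {'ALLOW': 2, 'DENY': 1}
```

Such a class distribution check would be a natural first step before reusing the existing CSV preprocessing pipeline on the new data.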

Memoria TFM/acronimos.gdf

Lines changed: 2 additions & 0 deletions
@@ -71,3 +71,5 @@
 @entry{CI, , \emph{Computational Intelligence}}
 
 @entry{MCT, , \emph{Main Content Type}}
+
+@entry{CPAN, , \emph{Comprehensive Perl Archive Network}}

Memoria TFM/otros.bib

Lines changed: 8 additions & 0 deletions
@@ -874,6 +874,14 @@ @incollection{cost_adjustment_07
 pages={169-178}
 }
 
+@misc{cpan_json,
+  author = {Makamaka Hannyaharamitu},
+  title = {JSON, a Perl module},
+  year = {2005},
+  webpage = {http://search.cpan.org/~makamaka/JSON-2.90/lib/JSON.pm},
+  lastaccess = {September, 2014}
+}
+
 %--------------------------------
 % Appendix references
 %--------------------------------
