The experimental results showed that extracting the component change meta-data is valuable, allowing us to predict, with useful precision and independently of the language, which files are more likely to have defects (Figure \ref{fig:dp-faults-position}).
The precision and recall of the defect probability estimation, shown in Figure \ref{fig:dp-precision-recall}, are also relevant to analyze. The precision improvement visible in this figure over a uniform probability distribution for clean and buggy components illustrates the information gain obtained with this solution.
However, the mean accuracy obtained when classifying the test folds selected by Stratified KFold (Figure \ref{fig:kfold-accuracy-dist}) is not reflected when classifying the test set, namely the project's current state. Overfitting could explain this, but we do not believe it is the problem. We analyzed it, tried normalized change values instead of the raw values plus date, tried using fewer features, and tested the various options by cutting out the most recent data and using the rest to build the model with KFold. The accuracy when classifying the most recent data was always much lower than the accuracy when predicting data from a closer time frame.
This may be explained by the fact that the evolution of the project affects which patterns identify faulty components, making data from within a closer time frame more valuable.
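The evaluation gap described above can be sketched by comparing stratified k-fold accuracy against a temporal holdout on the most recent data. The data and features below are hypothetical placeholders; only the scikit-learn names are real. With real project data the temporal holdout score was consistently lower, which this synthetic data does not necessarily reproduce.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical data: one row per component change, assumed ordered by
# commit date, with a binary label (1 = file later found to be buggy).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=600) > 1.5).astype(int)

clf = RandomForestClassifier(random_state=0)

# Stratified 10-fold ignores time: every fold mixes old and new commits.
kfold_acc = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10)).mean()

# Temporal holdout: train on the oldest 80%, test on the most recent 20%,
# mimicking classification of the project's current state.
cut = int(0.8 * len(X))
clf.fit(X[:cut], y[:cut])
holdout_acc = clf.score(X[cut:], y[cut:])

print(f"stratified k-fold: {kfold_acc:.2f}, temporal holdout: {holdout_acc:.2f}")
```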
The inability to obtain better and more consistent mean accuracy results, which vary mainly between $0.8$ and $0.95$ as illustrated in Figure \ref{fig:kfold-accuracy-dist}, may be caused by data imbalance and noise.
Since in each fix commit only a small percentage of components is changed and all the others are considered clean, the extracted data is extremely imbalanced and the number of faulty components in the training set is small. Using SMOTE improved the results, but the tendency to predict $0$ remains noticeable.
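The core idea of SMOTE, synthesising new minority samples by interpolating between existing faulty examples and their nearest minority-class neighbours, can be sketched in plain NumPy. This is an illustrative simplification with hypothetical data; in practice a library implementation such as imbalanced-learn's \texttt{SMOTE} would be used.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: pick a minority point, pick one
    of its k nearest minority neighbours, and interpolate at a random
    position on the segment between them."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()  # interpolation factor in [0, 1)
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

# Hypothetical imbalanced set: 25 faulty components vs 475 clean ones.
rng = np.random.default_rng(1)
X_faulty = rng.normal(size=(25, 4))
X_new = smote_oversample(X_faulty, n_new=450, rng=rng)
print(X_new.shape)  # → (450, 4): synthetic faulty samples to balance classes
```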
Analysis of the unmodified Barinel, illustrated in Figure \ref{fig:fault-positions}, showed how good its results already are and how small the percentage of tests is that can be improved by using our approach to modify the Barinel results.
\subsection{Results Modification}
In the best-case scenario, the results modification integration can improve $14.67\%$ of the tests, while in the worst-case scenario $29\%$ of the tests would worsen.
Figure \ref{fig:results-modification} shows that when all the components with a predicted defect probability above $0.6$ are considered faulty, the Barinel results improve with little or no error. Examining, for example, the results for a minimum predicted probability of $0.65$, where the delta is highest, $13.5\%$ of the possible improvements occurred and just one test worsened. Increasing the minimum diminishes both the number of improvements and errors; starting at $0.75$, errors are completely eliminated.
\subsection{Priors Replacement}
The best-case scenario showed that even with $100\%$ precision the priors replacement integration can result in worsened tests. This may be caused by Barinel calculating the defect probability for groups of components, with the defect probability of each component only computed at the end, based on the probabilities of the groups in which it appears. The changed priors may therefore also change the probability of a related component, which may negatively affect the results.
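One way to see this, assuming the standard Barinel prior formulation from Abreu et al., is that the prior of a diagnosis candidate $d$ (a set of components) combines the individual component priors $p_j$:
\[
\Pr(d) \;=\; \prod_{j \in d} p_j \, \prod_{j \notin d} \left(1 - p_j\right).
\]
Replacing the prior $p_j$ of a single component therefore changes the prior of every candidate that contains or excludes $j$, so a ranking change can propagate to components the prediction never targeted.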
Even so, the real results proved promising, improving approximately $43\%$ of the possible tests without damaging any. This illustrates how important language-agnostic defect probability prediction is for improving Barinel results.
\section{Threats to Validity}
There are some threats to the validity of this research. The first is the fact that the Math project (101 tests of 184) appeared to have flaky tests, since with the exact same configuration Barinel, which is deterministic, reported some value changes.
Using three open-source Java projects, with 184 tests, may also not be sufficient to predict the application's behavior on other, different projects.
Since this research revolves around defect probabilities, we know that the application built to estimate the defect probability can itself have defects that may affect the predictions. However, the application was heavily tested and many of its results were manually checked for validity.