Commit e5cab71

Merge branch 'develop' into ckelly_develop
2 parents f10fe03 + fcc43af commit e5cab71

5 files changed: 55 additions and 9 deletions
sphinx/source/api/api_code.rst

Lines changed: 7 additions & 0 deletions
@@ -156,6 +156,13 @@ ParamInterface
 :project: api
 :path: ../../../include/chimbuko/param.hpp
 
+CopodParam
+----------
+
+.. doxygenfile:: copod_param.hpp
+:project: api
+:path: ../../../include/chimbuko/param/copod_param.hpp
+
 HbosParam
 ---------
 

sphinx/source/appendix/appendix_usage.rst

Lines changed: 2 additions & 2 deletions
@@ -25,7 +25,7 @@ Options for the provenance database:
 Options for the parameter server:
 
 - **ad_win_size** : Number of events around an anomaly to store; provDB entry size is proportional to this
-- **ad_alg** : AD algorithm to use. "sstd" or "hbos"
+- **ad_alg** : AD algorithm to use. "sstd" or "hbos" or "copod"
 - **ad_outlier_sstd_sigma** : number of standard deviations that defines an outlier.
 - **ad_outlier_hbos_threshold** : The percentile of events outside of which are considered anomalies by the HBOS algorithm.
 

@@ -172,7 +172,7 @@ Additional AD Variables
 - **-program_idx** : For workflows with multiple component programs, a "program index" must be supplied to the AD instances attached to those processes.
 - **-rank** : By default the data rank assigned to an AD instance is taken from its MPI rank in MPI_COMM_WORLD. This rank is used to verify the incoming trace data. This option allows the user to manually set the rank index.
 - **-override_rank** : This option disables the data rank verification and instead overwrites the data rank of the incoming trace data with the data rank stored in the AD instance. The value supplied must be the original data rank (this is used to generate the correct trace filename).
-- **-ad_algorithm** : This sets the AD algorithm to use for online analysis: "sstd" or "hbos". Default value is "hbos".
+- **-ad_algorithm** : This sets the AD algorithm to use for online analysis: "sstd" or "hbos" or "copod". Default value is "hbos".
 - **-hbos_threshold** : This sets the threshold to control density of detected anomalies used by HBOS algorithm. Its value ranges between 0 and 1. Default value is 0.99
 
 

sphinx/source/introduction/ad.rst

Lines changed: 5 additions & 3 deletions
@@ -39,14 +39,16 @@ of a function :math:`i`, respectively, and :math:`\alpha` is a control parameter
 
 Advanced anomaly analysis
 ~~~~~~~~~~~~~~~~~~~~~~~~~
-A determistic and non-parametric statistical anomaly detection algorithm called Histogram Based Outilier Scoring (HBOS) is implemented as part of Chimbuko's anomaly analysis module. HBOS is an unsupervised anomaly detection algorithm which scores data in linear time. It supports dynamic bin widths which ensures long-tail distributions of function executions are captured and global anomalies are detected better. HBOS normalizes the histogram and calculates the anomaly scores by taking inverse of estimated densities of function executions. The score is a multiplication of the inverse of the estimated densities given by the following Equation
+1. Histogram Based Outlier Score (HBOS) is a deterministic and non-parametric statistical anomaly detection algorithm. It is implemented as part of Chimbuko's anomaly analysis module. HBOS is an unsupervised anomaly detection algorithm which scores data in linear time. It supports dynamic bin widths which ensures long-tail distributions of function executions are captured and global anomalies are detected better. HBOS normalizes the histogram and calculates the anomaly scores by taking inverse of estimated densities of function executions. The score is a multiplication of the inverse of the estimated densities given by the following Equation
 
 .. math::
 HBOS_{i} = \log_{2} (1 / density_{i})
 
-where :math:`i` is a function execution and :math:`density_{i}` is function execution probability. HBOS works in :math:`O(nlogn)` using dynamic bin-width or in linear time :math:`O(n)` using fixed bin width. After scoring, the top 1% of scores are filtered as anomalous function executions. This filter value can be set at runtime to adjust the density of detected anomalies.
+where :math:`i` is a function execution and :math:`density_{i}` is function execution probability. HBOS works in :math:`O(nlogn)` using dynamic bin-width or in linear time :math:`O(n)` using fixed bin width. After scoring, the top 1% of scores are filtered as anomalous function executions. This filter value can be set at runtime to adjust the density of detected anomalies.
 
-(See `ADOutlier <../api/api_code.html#adoutlier>`__ and `HbosParam <../api/api_code.html#hbosparam>`__).
+2. Another algorithm is added into Chimbuko's advanced anomaly analysis called the COPula based Outlier Detection (COPOD), which is a deterministic, parameter-free anomaly detection algorithm. It computes empirical copulas for each sample in the dataset. A copula defines the dependence structure between random variables. For each sample in the dataset, COPOD algorithm computes left-tail empirical copula from left-tail empirical cumulative distribution function, right-tail copula from right-tail empirical cumulative distribution function, and a skewness-corrected empirical copula using a skewness coefficient calculated from left-tail and right-tail empirical cumulative distribution functions. These three computed values are interpreted as left-tail, right-tail, and skewness-corrected probabilities, respectively. Lowest probability value results in largest negative-log value, which is the score assigned to the sample in the dataset. Samples with the highest scores in the dataset are tagged as anomalous.
+
+(See `ADOutlier <../api/api_code.html#adoutlier>`__, `HbosParam <../api/api_code.html#hbosparam>`__ and `CopodParam <../api/api_code.html#copodparam>`__).
 
 Provenance data collection
 --------------------------
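
Reviewer note: as a reading aid for the HBOS and COPOD descriptions in the ad.rst change above, here is a minimal Python sketch. It is not Chimbuko's C++ implementation. It assumes a single feature (function execution time), fixed-width bins for HBOS, and a one-dimensional reduction of COPOD in which each empirical copula collapses to the corresponding tail ECDF value, so the skewness-corrected term simply picks one of the two tails. The names `hbos_scores`, `copod_scores`, `flag_top_fraction`, and the `num_bins` parameter are hypothetical.

```python
import math


def hbos_scores(values, num_bins=10):
    """Score each value as log2(1 / density) using a fixed-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = [min(int((v - lo) / width), num_bins - 1) for v in values]
    counts = [0] * num_bins
    for b in bins:
        counts[b] += 1
    n = len(values)
    # Assumption: density_i is the fraction of samples in the bin holding value i.
    return [math.log2(n / counts[b]) for b in bins]


def copod_scores(values):
    """One-dimensional COPOD-style score: largest negative log tail probability."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    scores = []
    for v in values:
        left = sum(1 for u in values if u <= v) / n   # left-tail ECDF
        right = sum(1 for u in values if u >= v) / n  # right-tail ECDF
        corrected = left if skew < 0 else right       # skewness-corrected choice
        scores.append(max(-math.log(p) for p in (left, right, corrected)))
    return scores


def flag_top_fraction(scores, threshold=0.99):
    """Flag scores strictly above the `threshold` quantile (top 1% by default)."""
    ordered = sorted(scores)
    cutoff = ordered[int(threshold * (len(ordered) - 1))]
    return [s > cutoff for s in scores]


# Made-up execution times: 99 typical calls and one slow call.
durations = [1.0] * 99 + [100.0]
hbos_flags = flag_top_fraction(hbos_scores(durations))
copod_flags = flag_top_fraction(copod_scores(durations))
```

With this toy input both scorers give the slow call the highest score. The ECDF loop is quadratic for brevity; a sorted pass would be the obvious replacement for larger inputs.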

sphinx/source/introduction/ps.rst

Lines changed: 10 additions & 3 deletions
@@ -15,7 +15,7 @@ Design
 :scale: 50 %
 :alt: Simple parameter server architecture
 
-Parameter server architecture
+Parameter server architecture
 
 (**C**)lients (i.e. on-node AD modules) send requests with their locally-computed anomaly detection algorithm parameters to be aggregated with the global parameters and the updated parameters returned to the client. Network communication is performed using the `ZeroMQ <https://zeromq.org>`_ library and using `Cereal <https://uscilab.github.io/cereal/>`_ for data serialization.
 

@@ -24,11 +24,18 @@ via the **Backend** router in round-robin fashion. For the task of updating para
 
 A dedicated (**S**)treaming thread (cf. :ref:`api/api_code:PSstatSender`) is maintained that periodically sends the latest global statistics to the visualization server.
 
+Anomaly ranking metrics
+-----------------------
+
+Two metrics are developed that are assigned to each outlier that allow the user to focus on the subset of anomalies that are most important:
+the anomaly score reflects how unlikely an anomaly is, and the anomaly severity reflects how important the anomaly is to the runtime of the application.
+PS includes these values in the provenance information and allows for the convenient sorting and filtering
+of the anomalies in post-analysis. We have tested to present the individual choice of these metrics in the
+online visualization module.
 
 ..
-While testing has demonstratedThis simple parameter server becomes a bottleneck as the number of requests (or clients) are increasing.
+While testing has demonstratedThis simple parameter server becomes a bottleneck as the number of requests (or clients) are increasing.
 In the following subsection, we will describe the scalable parameter server.
 Scalable Parameter Server
 -------------------------
 TBD
-
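
Reviewer note: the sorting and filtering mentioned in the new "Anomaly ranking metrics" text can be pictured with a short Python sketch. It assumes each anomaly record is a flat dictionary with numeric `score` and `severity` fields. That layout, the `rank_anomalies` helper, and the sample values are illustrative only and do not come from Chimbuko.

```python
def rank_anomalies(records, key="severity", minimum=None, limit=None):
    """Sort records by `key`, largest first, after an optional threshold filter."""
    kept = [r for r in records if minimum is None or r[key] >= minimum]
    kept.sort(key=lambda r: r[key], reverse=True)
    return kept if limit is None else kept[:limit]


# Made-up records for illustration only.
anomalies = [
    {"fname": "solve", "score": 6.6, "severity": 120.0},
    {"fname": "io_write", "score": 9.1, "severity": 3.5},
    {"fname": "exchange", "score": 2.3, "severity": 48.0},
]

# The least likely anomaly versus the anomalies that cost the most runtime.
most_unlikely = rank_anomalies(anomalies, key="score", limit=1)
most_costly = rank_anomalies(anomalies, key="severity", minimum=10.0)
```

Keeping the two metrics as separate sort keys mirrors the text: an event can be very unlikely yet cheap, or common-looking yet expensive.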

sphinx/source/io_schema/pserver_schema.rst

Lines changed: 31 additions & 1 deletion
@@ -51,7 +51,7 @@ The schema for the **'anomaly_stats'** object is as follows:
 | [
 | {
 | **'data'**: *Number of anomalies and anomaly time window for process/rank broken down by io step (array)*
-| [
+| [
 | {
 | **'app'**: *Program index*,
 | **'max_timestamp'**: *Latest time of anomaly in io step*,
@@ -89,6 +89,36 @@ The schema for the **'anomaly_stats'** object is as follows:
 | },
 | ...
 | ], *end of* **anomaly** *array*
+| **'anomaly_metrics'**:
+| [
+| {
+| **'app'**: *Application*,
+| **'rank'**: *Program rank*,
+| **'fid'**: *function ID*,
+| **'fname'**: *function name*,
+| **'_id'**: *a global index to track each (app, rank, func), for internal use*,
+| **'new_data'**: *Statistics of anomaly metrics aggregated over multiple IO steps since the last pserver->viz send*
+| {
+| **'first_io_step'**: *first io step in sum*
+| **'last_io_step'**: *last io step in sum*
+| **'max_timestamp'**: *max timestamp of last IO step of this period*
+| **'min_timestamp'**: *min timestamp of first IO step of this period*
+| **'severity'**: *RunStats assigned severity*
+| **'score'**: *RunStats assigned score*
+| **'count'**: *RunStats count*
+| }
+| **'all_data'**: *Statistics of anomaly metrics aggregated since the beginning of the run*
+| {
+| **'first_io_step'**: *first io step in sum*
+| **'last_io_step'**: *last io step in sum*
+| **'max_timestamp'**: *max timestamp of last IO step since start of run*
+| **'min_timestamp'**: *min timestamp of first IO step since start of run*
+| **'severity'**: *RunStats assigned severity*
+| **'score'**: *RunStats score*
+| **'count'**: *RunStats count*
+| }
+| }
+| ], *end of* **anomaly_metrics**
 | **'func'**: *Statistics on anomalies broken down by function, collected over entire run to-date (array)*
 | [
 | {
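
Reviewer note: a consumer of the new **anomaly_metrics** array might want to check that each entry carries the keys listed in the schema above. The Python sketch below does only that. The key names are taken from the schema; the `missing_keys` and `placeholder_period` helpers and all values are hypothetical, and the internal layout of the RunStats fields is not specified here, so they are left as `None`.

```python
TOP_LEVEL_KEYS = {"app", "rank", "fid", "fname", "_id", "new_data", "all_data"}
PERIOD_KEYS = {
    "first_io_step", "last_io_step", "max_timestamp", "min_timestamp",
    "severity", "score", "count",
}


def missing_keys(entry):
    """Return the schema keys absent from one anomaly_metrics entry."""
    missing = sorted(TOP_LEVEL_KEYS - set(entry))
    for section in ("new_data", "all_data"):
        period = entry.get(section) or {}
        missing += [f"{section}.{key}" for key in sorted(PERIOD_KEYS - set(period))]
    return missing


def placeholder_period():
    """Build a period block with every key present and placeholder values."""
    return {key: None for key in PERIOD_KEYS}


# Placeholder entry: every value here is made up for illustration.
example = {
    "app": 0,
    "rank": 0,
    "fid": 0,
    "fname": "placeholder",
    "_id": 0,
    "new_data": placeholder_period(),
    "all_data": placeholder_period(),
}

problems = missing_keys(example)
```

An empty `problems` list means the entry has every documented key; nested gaps are reported with a `new_data.` or `all_data.` prefix.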
