Hi,
Thank you for creating a library with a wide variety of algorithms for outlier detection; it's been very helpful in my work! However, the presence of COPOD is misleading to the average user. The algorithm is the same as ECOD, yet its use of copulas suggests that it takes the dependency between dimensions into account. I'm not knowledgeable about copulas, so I hesitate to claim where an error or incorrect assumption was made, but there is some discussion of it in #548. Also, this issue is separate from the incorrect implementation of both COPOD and ECOD: #453 #493.
Here I'll try to show that the algorithms are the same with reference to the papers (COPOD, ECOD). See Algorithm 1 of both papers:
A quick look shows that they are similar, but to be thorough I'll go through each section.
The input and output formats are the same (see COPOD section II.D.).
They both have two consecutive loops, the first one looping through columns (j instead of d in ECOD), and the second looping through rows.
In both first loops we calculate the left & right tail ECDFs (with some notation change; the right tail ECDFs are calculated differently but equivalently) and then the skew (γ instead of b in ECOD) as the 3rd standardized moment, for each column.
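To illustrate the "differently but equivalently" point, here is a minimal NumPy sketch. It assumes that one formulation counts the samples greater than or equal to each value directly, while the other applies the left tail ECDF to the negated column; the function names are mine, not from either paper or from the library. The skewness helper uses the plain population form of the 3rd standardized moment, which may differ from the papers' exact normalization by a positive constant factor.

```python
import numpy as np


def left_tail_ecdf(x):
    """For each value in x, the fraction of samples that are <= that value."""
    x = np.asarray(x, dtype=float)
    # Entry [i, k] is True when x[k] <= x[i]; averaging over k gives the ECDF at x[i].
    return (x[None, :] <= x[:, None]).mean(axis=1)


def right_tail_ecdf_direct(x):
    """For each value in x, the fraction of samples that are >= that value."""
    x = np.asarray(x, dtype=float)
    return (x[None, :] >= x[:, None]).mean(axis=1)


def right_tail_ecdf_negated(x):
    """Right tail ECDF obtained by applying the left tail ECDF to -x."""
    return left_tail_ecdf(-np.asarray(x, dtype=float))


def skewness(x):
    """3rd standardized moment (population form); undefined for a constant column."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


column = np.array([1.0, 2.0, 2.0, 5.0, 10.0])
same = np.array_equal(right_tail_ecdf_direct(column), right_tail_ecdf_negated(column))
print("right tail formulations agree:", same)
print("skewness:", skewness(column))
```

Since `-x[k] <= -x[i]` holds exactly when `x[k] >= x[i]`, the two right tail versions agree element for element, ties included.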
The second loops are rearranged a bit. COPOD's lines 7-14 are condensed into ECOD's step 6: instead of first assigning U, V, and W, ECOD places the corresponding expressions directly inside the logs of the negative log sums. Nonetheless, in both loops, the left tail probabilities, the right tail probabilities, and the skew-dependent choice of tail probabilities are, for every row, separately aggregated over the columns with a negative log sum. Finally, COPOD's line 15 is the same as ECOD's step 7, where we choose the maximum of these aggregations as the final score for each row.
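The shared procedure described in the steps above can be sketched in a few lines of NumPy. This is only my reading of the two Algorithm 1 listings, not the library's implementation (which has its own issues, per #453 and #493). The function name `tail_scores` is hypothetical, the loops are vectorized, and I am assuming that the skew-dependent choice takes the left tail for negatively skewed columns and the right tail otherwise.

```python
import numpy as np


def tail_scores(X):
    """Sketch of the scoring procedure common to both Algorithm 1 listings.

    X has shape (n_samples, n_features); returns one score per row.
    """
    X = np.asarray(X, dtype=float)

    # First loop (over columns), vectorized: tail ECDFs evaluated at every sample.
    # Entry [i, k, j] compares X[k, j] against X[i, j]; averaging over k gives
    # the ECDF of column j at X[i, j]. Values are >= 1/n, so the logs are finite.
    left = (X[None, :, :] <= X[:, None, :]).mean(axis=1)
    right = (X[None, :, :] >= X[:, None, :]).mean(axis=1)

    # Per-column skewness as the 3rd standardized moment (population form).
    # A constant column would divide by zero here; this sketch does not handle it.
    centered = X - X.mean(axis=0)
    skew = (centered**3).mean(axis=0) / (centered**2).mean(axis=0) ** 1.5

    # Assumption: negative skew selects the left tail, otherwise the right tail.
    chosen = np.where(skew < 0, left, right)

    # Second loop (over rows), vectorized: negative log sums over the columns.
    o_left = -np.log(left).sum(axis=1)
    o_right = -np.log(right).sum(axis=1)
    o_skew = -np.log(chosen).sum(axis=1)

    # Final score per row: the maximum of the three aggregations.
    return np.maximum.reduce([o_left, o_right, o_skew])


X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [100.0, 100.0]])
scores = tail_scores(X_demo)
print(scores)
```

Note that each column contributes to the score only through its own marginal ECDF, so nothing in this computation depends on the joint behaviour of the columns.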
COPOD's return statement indicates a different output format from the one defined in section II.D. and doesn't make sense in the context of the loop before it, so I'm chalking that up to a mistake. Thus, the algorithms are the same.
On top of this, COPOD's Table I and Table II seem to match ECOD's Table 4 and Table 5 respectively, apart from different rounding and the additional comparison algorithms in ECOD's tables. I'll leave you to take a look at these tables yourself.
Assuming we've established that these algorithms are the same, which should be deprecated? I suggest COPOD, since 1) ECOD is newer, 2) ECOD's paper is more refined in my opinion, with fewer mistakes and an added runtime evaluation section, and 3) the use of copulas is misleading in my opinion.
Thanks,
Sam