Description
I have uploaded the sample sequences here: https://hershey.dbi.udel.edu/wangy/CD-HIT/
With -s 0.8, CD-HIT-2D does not assign any members to the clusters.
OPTIONS="-s 0.8 -c 0.5 -n 3 -i seed_UP000267492.fa -i2 proteome_UP001178061.fa -o test.cd_hit"
${CDHIT2D} ${OPTIONS} > cd_hit.log
This produces the clustering result below, even though every expected member has over 50% identity to its seed and is more than 80% of the seed's length.
Cluster 0
0 1941aa, >UPI00071F4AE4|1941... *
Cluster 1
0 395aa, >UPI002377C42E|395... *
Cluster 2
0 294aa, >UPI0007200FDE|294... *
The correct clustering result, which we get with -s 0.5, is:
Cluster 0
0 1941aa, >UPI00071F4AE4|1941... *
1 1939aa, >UNG44331|1939... at 82.77%
Cluster 1
0 395aa, >UPI002377C42E|395... *
1 388aa, >UNG44330|388... at 50.26%
Cluster 2
0 294aa, >UPI0007200FDE|294... *
1 259aa, >UNG44329|259... at 66.41%
On further investigation, we found that the behavior flips between these two values:
-s 0.629251700680272196
-s 0.629251700680272197
Once -s crosses that point, all three clusters lose their members at once. This is clearly not the intended behavior of the -s option: the three clusters in our example have different member-to-seed length ratios, so each should lose its member at a different cutoff.
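As a sanity check, the length ratios can be worked out from the cluster listing above. This is only a minimal sketch, assuming that -s compares the member length to the length of the cluster representative (the seed), which is how the CD-HIT documentation describes the option. The names PAIRS, length_ratio and passes_cutoff are made up for illustration and are not part of CD-HIT.

```python
# Lengths in aa, copied from the -s 0.5 cluster listing: (seed, member).
PAIRS = {
    "Cluster 0": (1941, 1939),
    "Cluster 1": (395, 388),
    "Cluster 2": (294, 259),
}


def length_ratio(seed_len, member_len):
    """Member length as a fraction of the seed (representative) length."""
    return member_len / seed_len


def passes_cutoff(seed_len, member_len, s):
    """True if the member is at least s times the seed length.

    Assumption: this mirrors the documented meaning of -s, not the
    actual CD-HIT-2D implementation.
    """
    return length_ratio(seed_len, member_len) >= s


if __name__ == "__main__":
    for name, (seed, member) in PAIRS.items():
        ratio = length_ratio(seed, member)
        ok = passes_cutoff(seed, member, 0.8)
        print(f"{name}: ratio {ratio:.4f}, passes -s 0.8: {ok}")
```

The ratios come out to roughly 0.999, 0.982 and 0.881, so under this reading all three members should survive -s 0.8, and each cluster would be expected to drop its member at a different cutoff rather than at one shared value near 0.629.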