Commit 97f8f90

v4.0.1
1 parent 7ef4a02 commit 97f8f90

20 files changed, +164 -242 lines changed

doc/Release.html

Lines changed: 28 additions & 1 deletion
@@ -13,7 +13,7 @@
1313

1414
There have been many improvements, but the following provides some recent critical points.
1515
<table>
16-
<tr><td><a href=#v343>v4.0.0</a><td>22-Mar-22<td><ttp>Assorted<td>Tidy up somethings in order to start v4.
16+
<tr><td><a href=#v343>v4.0.1</a><td>27-Apr-22<td><ttp>Random<td>Small improvements
1717
<tr><td><a href=#v339>v3.3.9</a><td>27-Dec-21<td><ttp><ttp>runAS</ttp></ttp><td>Improved interface for full subset UniProt; updated Demo.
1818
<tr><td><a href=#v335>v3.3.5</a><td>04-Nov-21<td><ttp>viewSingle</ttp><td>log2FC TPM analysis
1919
<tr><td><a href=#v334>v3.3.4</a><td>18-Oct-21<td><ttp>runSingle</ttp><td><ttl>ORF finder</ttl> improvements
@@ -27,6 +27,33 @@
2727
<tr><td><a href=#early>Earlier</a><td>2020<td>&nbsp;<td>&nbsp;
2828
</table>
2929

30+
<a name=v400>
31+
<h4>v4.0.1 27-Apr-2022</h4>
32+
33+
<ttp>runMultiTCW</ttp>
34+
<ul>
35+
<li>The cluster score for a Sum-of-Pairs has been changed to the sum-of-comparisons/#comparisons,
36+
where the #comparisons = (nSeqs*(nSeqs-1)/2) * nCols
37+
<li>Fixed a potential bug: if the MSA consensus sequence was very long, a MySQL error would occur.
38+
</ul>
39+
<ttp>viewMultiTCW</ttp>
40+
<ul>
41+
<li>Fixed a potential bug: the pre-computed MSAdb would not display if score1&lt;0.
42+
</ul>
43+
<ttp>runSingleTCW</ttp>
44+
<ul>
45+
<li>Description prune update: (1) Was taking the one with the most GOs, even if the bitScore was
46+
less; now only takes the one with GOs if the other has none and bitscores are close.
47+
(2) If there was a "{...}" at end of description,
48+
it was not being removed before finding unique descriptions.
49+
</ul>
50+
51+
<p>Scripts
52+
<ul>
53+
<li>Added/changed a few scripts that are used for results in the next version of the
54+
BioRxiv publication
55+
</ul>
56+
3057
<a name=v343>
3158
<h4>v4.0.0 22-Mar-2022</h4>
3259

doc/mtcw/UserGuide.html

Lines changed: 11 additions & 3 deletions
@@ -583,11 +583,16 @@ <h4>Details on running <ttp>KaKs_calculator</ttp></h4>
583583

584584
<a id=msa></a>
585585
<h4>Details on MSA score and using <ttp>MstatX</ttp></h4>
586-
The default <u>Score1</u> and <u>Score2</u> are built-in methods. Score1 is Sum-of-Pairs,
587-
which are normalized between 0-1, where 1 is the best score.
588-
Score2 is "Wentropy", which is copied
586+
587+
By default, Score1 is the Sum-of-Pairs. The Sum-of-Pairs score compares every pair of characters in a column,
588+
where there are 22 possible characters (20 amino acids, gap '-', and leading/trailing space ' ').
589+
The comparison scores are: (aa,aa) is the BLOSUM68 score, (aa,'-') is -4, (aa,' ') is -1, ('-','-') is 0, (' ',' ') is 0.
590+
The cluster value is the sum-of-pairs/#comparisons, where a higher score is a better score (see note below).
591+
592+
<p> Score2 is "Wentropy", which is copied
589593
directly from the <ttp>MstatX</ttp><sup>4</sup> with the exception that the scores are (1-score) so that
590594
1 is the best score.
595+
591596
<p>The <ttp>MstatX</ttp> executable is in the <ttx>/Ext</ttx> directory and can be used for computing the scores.
592597
From the appropriate <ttx>/Ext</ttx> sub-directory, run <ttx>./mstatx -h</ttx> to view the scoring methods available.
593598
The method used by <ttp>runMultiTCW</ttp> can be changed from the command line, as follows:
@@ -606,6 +611,9 @@ <h4>Details on MSA score and using <ttp>MstatX</ttp></h4>
606611
be updated with the new scores.
607612
<p>For developers: you can add your own method to the <ttp>MstatX</ttp> program and use it in <ttp>mTCW</ttp>.
608613

614+
<p><font color=red>NOTE: </font> The Wentropy or one of the other mStatX statistics should be valid to use in a
615+
publication. Do not use the Sum-of-Pairs score in publication unless cleared with a statistician. The Sum-of-pairs
616+
score is useful on the MSA display in <ttp>viewMultiTCW</ttp> when viewing column scores.
609617
<!-- ========== Details ============= -->
610618
<a id=details></a>
611619
<table style="width: 100%"><tr><td style="text-align: left">
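
The Sum-of-Pairs description added to the User Guide above can be sketched in a few lines of Java. This is a minimal illustration, not the code from `ScoreMulti.java`: the class and method names are hypothetical, `aaScore` is a stand-in (+4 match, -1 mismatch) for the substitution-matrix lookup, and scoring a ('-',' ') pair as 0 is an assumption because the guide does not list that case. Unlike `scoreSumOfPairs`, the sketch has no consensus row to skip.

```java
public class SumOfPairsSketch {
    static final char GAP = '-';   // internal gap
    static final char HANG = ' ';  // leading/trailing space

    // Hypothetical stand-in for the substitution-matrix lookup.
    static int aaScore(char a, char b) {
        return (a == b) ? 4 : -1;
    }

    // Pair scores as listed in the guide: (aa,'-') = -4, (aa,' ') = -1,
    // ('-','-') = 0, (' ',' ') = 0. ('-',' ') = 0 is an assumption.
    static int pairScore(char a, char b) {
        boolean aRes = (a != GAP && a != HANG);
        boolean bRes = (b != GAP && b != HANG);
        if (aRes && bRes) return aaScore(a, b);
        if (aRes || bRes) {
            char other = aRes ? b : a;
            return (other == GAP) ? -4 : -1;
        }
        return 0;
    }

    // Cluster score = sum of all pair scores / #comparisons,
    // where #comparisons = (nSeqs*(nSeqs-1)/2) * nCols.
    static double clusterScore(String[] seqs) {
        int nSeqs = seqs.length;
        int nCols = seqs[0].length();
        long sum = 0;
        for (int c = 0; c < nCols; c++) {
            for (int r = 0; r < nSeqs - 1; r++) {
                for (int x = r + 1; x < nSeqs; x++) {
                    sum += pairScore(seqs[r].charAt(c), seqs[x].charAt(c));
                }
            }
        }
        double nCmp = ((double) nSeqs * (nSeqs - 1) / 2.0) * nCols;
        return (nCmp > 0) ? sum / nCmp : 0.0;
    }

    public static void main(String[] args) {
        String[] msa = { "AC-", "ACD", "AGD" };
        System.out.println(clusterScore(msa));
    }
}
```

With the three-sequence example, the column sums are 12, 2, and -4, so the cluster score is 10 divided by 9 comparisons. Dividing by the number of comparisons rather than by the number of columns keeps the score on the scale of a single pair score, which fits the mDemo overview values moving from the 0-1 range to roughly 3-4.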

doc/ov/mDemo.html

Lines changed: 7 additions & 7 deletions
@@ -2,7 +2,7 @@
22
<h2>mTCW overview for ex</h2>
33
<table width=600 border=1><tr><td><pre>Project: ex Cluster: 1.0k Pairs: 846 Seqs: 707 Hits: 4.0k GOs: 8.0k PCC Stats KaKs Multi
44

5-
Created: 15-Mar-22 v4.0.0 Last Update: 20-Mar-22 v4.0.0
5+
Created: 21-Mar-22 v4.0.0 Last Update: 04-Apr-22 v4.0.1
66

77
DATASETS: 3
88
Type #Seq #annotated #annoDB Created Remark
@@ -13,11 +13,11 @@ <h2>mTCW overview for ex</h2>
1313
CLUSTER SETS: 5
1414
Statistics
1515
Prefix Method conLen sdLen Score1 SD Score2 SD
16-
CL Closure 579.80 58.50 0.48 0.08 0.78 0.15
17-
OM orthoMCL.OM-4 606.32 85.26 0.45 0.10 0.75 0.16
18-
B12 BBH bar,fly 540.22 62.63 0.50 0.07 0.72 0.18
19-
B13 BBH bar,foo 558.75 34.11 0.53 0.07 0.79 0.15
20-
B23 BBH fly,foo 559.57 70.82 0.47 0.08 0.80 0.21
16+
CL Closure 579.75 58.50 3.72 1.06 0.78 0.15
17+
OM orthoMCL.OM-4 606.38 85.26 3.30 1.40 0.75 0.16
18+
B12 BBH bar,fly 540.22 62.63 3.58 1.10 0.72 0.18
19+
B13 BBH bar,foo 558.75 34.11 4.09 0.82 0.79 0.15
20+
B23 BBH fly,foo 559.57 70.82 3.98 1.28 0.80 0.21
2121

2222
Sizes
2323
Prefix =2 =3 4-5 6-10 11-15 16-20 21-25 >25 Total #Seqs
@@ -72,7 +72,7 @@ <h2>mTCW overview for ex</h2>
7272
foo 6 16 18
7373
----------------------------------------------------
7474
PROCESSING:
75-
AA: diamond --masking 0 --query-cover 25 --subject-cover 25
75+
AA: diamond --masking 0 --query-cover 25 --subject-cover 25
7676
NT: blastn -evalue 1e-05 -max_hsps 1 -max_target_seqs 25
7777

7878
MSA: Score1=Sum-of-Pairs Score2=Wentropy

java/src/cmp/align/MultiAlignData.java

Lines changed: 14 additions & 3 deletions
@@ -292,7 +292,7 @@ private void scoreMSA() {
292292
glScore1 = smObj.scoreMstatX(msaScoreName1, outAlgnFile, resultFile);
293293
}
294294
else
295-
glScore1 = smObj.scoreSumOfPairs(grpName, alignSeq, isRun);
295+
glScore1 = smObj.scoreSumOfPairs(grpName, alignSeq);
296296

297297
double [] score1 = smObj.getScores();
298298
String [] tScore1 = smObj.getStrSc();
@@ -316,14 +316,25 @@ private void scoreMSA() {
316316
}
317317

318318
/** Comma delimited list for saving and for MSA View **/
319+
// CAS401 remove space after comma, check for >65000
319320
strColScores1=strColScores2=null;
320321
for (double d : score1) {
321322
if (strColScores1==null) strColScores1 = String.format("%.3f", d);
322-
else strColScores1 += String.format(", %.3f", d);
323+
else strColScores1 += String.format(",%.3f", d);
323324
}
324325
for (double d : score2) {
325326
if (strColScores2==null) strColScores2 = String.format("%.3f", d);
326-
else strColScores2 += String.format(", %.3f", d);
327+
else strColScores2 += String.format(",%.3f", d);
328+
}
329+
if (strColScores1.length()>=65000) {
330+
Out.PrtErr("TCW error: too many columns to store score1 for "
331+
+ grpName + "(" + strColScores1.length() + ")");
332+
strColScores1 = "error";
333+
}
334+
if (strColScores2.length()>=65000) {
335+
Out.PrtErr("TCW error: too many columns to store score2 for "
336+
+ grpName + "(" + strColScores2.length() + ")");
337+
strColScores2 = "error";
327338
}
328339
}
329340
catch(Exception e) {ErrorReport.reportError(e, "Write Scores");}
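
The column-score serialization in the hunk above (comma-delimited `%.3f` values, replaced by `"error"` once the string reaches 65000 characters) could also be written with a `StringBuilder`, which avoids repeated string concatenation for long alignments. The following is only a sketch: the class and method names are hypothetical, and it differs from the diff in two ways that are labeled in the comments (an empty score array yields an empty string rather than `null`, and `Locale.ROOT` is used so the decimal separator is always a period).

```java
import java.util.Locale;

public class ColScoreListSketch {
    // Threshold taken from the diff above.
    static final int MAX_LEN = 65000;

    // Build "s1,s2,..." with three decimals per score and no space after the comma.
    // Assumption: returns "" for an empty array (the diff would leave the string null).
    static String joinScores(double[] scores) {
        StringBuilder sb = new StringBuilder();
        for (double d : scores) {
            if (sb.length() > 0) sb.append(',');
            // Locale.ROOT is an addition; the diff uses the default locale.
            sb.append(String.format(Locale.ROOT, "%.3f", d));
        }
        // Same fallback value as the diff when the list is too long to store.
        return (sb.length() >= MAX_LEN) ? "error" : sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinScores(new double[] { 1.0, -0.5, 3.14159 }));
        System.out.println(joinScores(new double[15000]));
    }
}
```

Dropping the space after each comma, as the commit does, saves one character per column, so a given alignment length stays further below the threshold.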

java/src/cmp/align/ScoreMulti.java

Lines changed: 13 additions & 89 deletions
@@ -9,8 +9,6 @@
99
import java.io.FileReader;
1010
import java.util.Vector;
1111
import java.util.HashSet;
12-
import java.util.HashMap;
13-
import java.util.TreeSet;
1412

1513
import cmp.compile.runMTCWMain;
1614
import cmp.database.Globals;
@@ -22,12 +20,11 @@
2220
import util.methods.Out;
2321
import util.methods.RunCmd;
2422
import util.methods.Static;
25-
import util.methods.Stats;
2623
import util.methods.TCWprops;
2724

2825
public class ScoreMulti {
2926
public boolean bTest = false;
30-
public boolean SoP_NORM = runMTCWMain.bNSoP; // CAS313 default true
27+
public boolean SoP_NORM = runMTCWMain.bNSoP; // CAS313; CAS401 makes average /nCols (was doing min-max normalization)
3128

3229
private final int bothGAP = 0;
3330
private final int hangGAP = 0;
@@ -44,17 +41,16 @@ public ScoreMulti() {
4441
}
4542

4643
/**************************************************************
47-
* Avg of columns sum of Sum of Pairs
44+
* Sum of Sum of Pairs
4845
// www.info.univ-angers.fr/~gh/Idas/Wphylog/guidetree.pdf
4946
// Its averaged by the number of columns, otherwise, bigger clusters will likely have bigger scores
5047
*****************************************************************/
5148
private final char BORDER=Globalx.hangCh; // leading or trailing gap
52-
public double scoreSumOfPairs(String grpID, String [] alignedSeq, boolean isRun) {
49+
public double scoreSumOfPairs(String grpID, String [] alignedSeq) {
5350
try {
5451
int nRows = alignedSeq.length;
5552
int nCols = alignedSeq[0].length();
5653
dScores = new double [nCols];
57-
String [] comp = new String [nCols];
5854
strScores = null;
5955

6056
if (nRows>maxRow) {
@@ -85,102 +81,31 @@ public double scoreSumOfPairs(String grpID, String [] alignedSeq, boolean isRun)
8581
}
8682

8783
for (int c=0; c<nCols; c++) {
88-
int col_stat = 0;
84+
dScores[c] = 0;
8985

9086
for (int r=1; r<nRows-1; r++) { // first is consensus
9187
char a = seqs[r][c];
9288

9389
for (int x=r+1; x<nRows; x++)
94-
col_stat += scoreCh(a, seqs[x][c]);
95-
}
96-
dScores[c] = col_stat;
97-
98-
if (!isRun) { // duplicate of what is in MultiAlignPanel - write to text file
99-
HashMap <Character, Integer> aaMap = new HashMap <Character, Integer> ();
100-
TreeSet <String> prtSet = new TreeSet <String> ();
101-
102-
for (int r=1; r<nRows; r++) {
103-
char a = seqs[r][c];
104-
if (aaMap.containsKey(a)) aaMap.put(a, aaMap.get(a)+1);
105-
else aaMap.put(a, 1);
106-
}
107-
108-
for (char a : aaMap.keySet())
109-
prtSet.add(String.format("%02d:%c", aaMap.get(a), a)); // leading zero makes it sort right
110-
111-
comp[c] = null;
112-
for (String info : prtSet) {
113-
if (info.startsWith("0")) info = info.substring(1);
114-
if (comp[c]==null) comp[c] = info;
115-
else comp[c] = info + ", " + comp[c];
116-
}
117-
}
118-
}
119-
if (SoP_NORM) {
120-
double [] tScore = dScores.clone();
121-
scoreSoP_norm(grpID);
122-
123-
if (!isRun) {
124-
strScores = new String [tScore.length+2];
125-
for (int i=0; i<tScore.length; i++) {
126-
String x=" ";
127-
if (tScore[i]<q1x) x="<";
128-
else if (tScore[i]>q3x) x=">";
129-
strScores[i] = String.format("%3d. %.3f %3d%s %s",
130-
i, dScores[i], (int) tScore[i], x, comp[i]);
131-
}
132-
strScores[tScore.length] = "";
133-
strScores[tScore.length+1] = String.format("Q1 Box %.1f Q3 Box %.1f", q1x, q3x);
90+
dScores[c] += scoreCh(a, seqs[x][c]);
13491
}
13592
}
13693

137-
// Though there are (nRows*(nRows-1)/2) * nCols comparisons
138-
// The average is on the nCols since its the column sum that is relevant
139-
double sum=0;
140-
for (double d : dScores) sum+= d;
141-
double score = (sum!=0) ? (Math.abs(sum)/(double)nCols) : 0; // CAS312
94+
// #cmp = (nRows*(nRows-1)/2) * nCols
95+
// CAS401 -- was a pseudo min-max normalization, changed to /#cmp
96+
nRows--;
97+
double n = (SoP_NORM) ? nCols : ((nRows*(nRows-1)/2) * nCols);
98+
double sum = 0.0;
99+
for (double x : dScores) sum += x;
100+
101+
double score = (sum!=0) ? (Math.abs(sum)/n) : 0; // CAS312
142102
if (sum<0) score = -score;
143103

144104
return score;
145105
}
146106
catch(Exception e) {ErrorReport.reportError(e, "scoreAvgSumOfPairs " + grpID);}
147107
return Globalx.dNoScore;
148108
}
149-
// Min-max Normalization of column scores:
150-
// Find the largest positive and smallest negative number.
151-
// Add the absolute value of the smallest to each number
152-
// Divide the result by max-min
153-
// Z - X-min(X)/max(X)-min(X)
154-
// Check for outliers
155-
private void scoreSoP_norm(String grpID) {
156-
try {
157-
double [] qrt = Stats.setQuartiles(dScores);
158-
double q1 = qrt[0];
159-
double q3 = qrt[2];
160-
double iqr = (q3-q1)*1.5;
161-
double min = qrt[3];
162-
double max = qrt[4];
163-
164-
double diff = max-min;
165-
166-
q1x = q1-iqr;
167-
q3x = q3+iqr;
168-
169-
// get rid of outliers
170-
int cntL=0, cntH=0;
171-
for (int i=0; i<dScores.length; i++) {
172-
double d = dScores[i];
173-
if (d<q1x) cntL++;
174-
if (d>q3x) cntH++;
175-
dScores[i] = (d-min)/diff;
176-
}
177-
if (bTest && cntL+cntH>0) {
178-
Out.PrtSpCntMsgNz(1, cntL, "Low outlier for " + grpID + String.format(" (q1 %5.1f, Box %5.1f) Col %d", q1, q1x, dScores.length));
179-
Out.PrtSpCntMsgNz(1, cntH, "High outlier for " + grpID + String.format(" (q3 %5.1f, Box %5.1f) Col %d", q3, q3x, dScores.length));
180-
}
181-
}
182-
catch(Exception e) {ErrorReport.reportError(e, "normalize SoP ");}
183-
}
184109

185110
private double scoreCh(char c1, char c2) {
186111
if (c1==Share.gapCh && c2==Share.gapCh) return bothGAP;
@@ -403,5 +328,4 @@ public double scoreMstatX(String type, String alignedFile, String resultFile) {
403328

404329
private double [] dScores;
405330
private String [] strScores;
406-
private double q1x=0.0, q3x=0.0;
407331
}
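
One detail of the new divisor in `scoreSumOfPairs` is that `(nRows*(nRows-1)/2) * nCols` is evaluated in `int` arithmetic before it is widened to `double`. Whether that can overflow in practice depends on `maxRow` and the longest alignment, neither of which is visible in this diff, so the following is only a hedged sketch of how the count could be computed in wider types. The class and method names are hypothetical.

```java
public class ComparisonCountSketch {
    // #comparisons = (nSeqs*(nSeqs-1)/2) * nCols, computed in long/double
    // so that large clusters with long alignments cannot wrap around.
    static double nComparisons(int nSeqs, int nCols) {
        long pairs = (long) nSeqs * (nSeqs - 1) / 2;
        return pairs * (double) nCols;
    }

    // Average pair score; dividing the signed sum directly is equivalent to
    // dividing its absolute value and restoring the sign afterwards.
    static double normalize(double sum, int nSeqs, int nCols) {
        double n = nComparisons(nSeqs, nCols);
        return (n > 0) ? sum / n : 0.0;
    }

    public static void main(String[] args) {
        int nSeqs = 1000, nCols = 5000;
        int asInt = (nSeqs * (nSeqs - 1) / 2) * nCols; // wraps past Integer.MAX_VALUE
        System.out.println("int arithmetic:  " + asInt);
        System.out.println("long arithmetic: " + nComparisons(nSeqs, nCols));
    }
}
```

For 1000 sequences and 5000 columns the true count is 2,497,500,000, which exceeds `Integer.MAX_VALUE`, so the `int` expression yields a negative number while the widened version does not.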

java/src/cmp/compile/MultiStats.java

Lines changed: 5 additions & 4 deletions
@@ -194,8 +194,8 @@ private String redoAllScoresSave() {
194194
infoObj.updateInfoKey("MSAscore2", score[1]);
195195

196196
// clear current scores
197-
mDB.tableDelete("pog_scores"); // CAS342
198-
mDB.executeUpdate("update pog_groups set score1=" + Globalx.dNoVal + ", score2=" + Globalx.dNoVal);
197+
mDB.tableDelete("pog_scores"); // CAS342; CAS401 change dNoVal to dNoScore
198+
mDB.executeUpdate("update pog_groups set score1=" + Globalx.dNoScore + ", score2=" + Globalx.dNoScore);
199199

200200
PreparedStatement psG = mDB.prepareStatement(
201201
"update pog_groups set score1=?, score2=? where PGid=?");
@@ -240,7 +240,7 @@ private String redoAllScoresSave() {
240240
Out.PrtSpCntMsgNz(1, cntBad, "Too many gaps - no SoP score");
241241
return "Complete scoring of " + cnt + " clusters";
242242
}
243-
catch(Exception e) {ErrorReport.die(e, "run align");}
243+
catch(Exception e) {ErrorReport.die(e, "redo align ");}
244244
return "Error";
245245
}
246246
private boolean saveScores(int grpid, MultiAlignData multiObj) {
@@ -252,7 +252,8 @@ private boolean saveScores(int grpid, MultiAlignData multiObj) {
252252
", score1='" + strScore1 + "', score2='" + strScore2 + "'");
253253
return true;
254254
}
255-
catch(Exception e) {ErrorReport.die(e, "run align"); return false;}
255+
catch(Exception e) {ErrorReport.die(e, "run align save " + multiObj.getColScores1()
256+
+ " for " + grpid); return false;}
256257
}
257258
/*********************************************************/
258259
//-- load group sequences --/

java/src/cmp/compile/runMTCWMain.java

Lines changed: 7 additions & 7 deletions
@@ -46,7 +46,7 @@ public class runMTCWMain {
4646
static public int BI = Globals.MSA.BUILTIN_SCORE;
4747
static public int msaScore1=BI, msaScore2=BI; // defaults repeated in MultiAlignData
4848
static public String strScore1=Globals.MSA.SoP, strScore2=Globals.MSA.Wep;
49-
static public boolean bNSoP=true; // if true, normalize (default)
49+
static public boolean bNSoP=false; // if true, normalize (default); CAS401 change to F
5050

5151
public static void main(String[] args) {
5252
try {
@@ -69,16 +69,16 @@ static void printHelp(String [] args) {
6969
System.out.println(" The 'Closure' seeding with BBH algorithm can be replaced with:");
7070
System.out.println(" -BHwop #Use Bron_Kerbosch Without Pivot");
7171
System.out.println(" -BHwp #Use Bron_Kerbosch With Pivot");
72-
//This is not fully supported, so only for testing
73-
//System.out.println(" Sum-of-Pairs:"); // CAS313 add
74-
//System.out.println(" -nSoP Do not normalize SoP scores (default Score1).");
72+
//System.out.println(" Sum-of-Pairs:"); // CAS313 add - my use only
73+
//System.out.println(" -sp use #columns for average instead of #comparisons");
7574
System.out.println(" MSA options:");
7675
System.out.println(" The MSA score1 of built-in Sum-of-Pairs can be replaced with:"); // CAS312 add
7776
System.out.println(" -M1 <string>, where <string> is a valid MstatX method.");
7877
System.out.println(" The MSA score2 of built-in Wentropy can be replaced with:"); // CAS312 add
7978
System.out.println(" -M2 <string>, where <string> is a valid MstatX method.");
8079
System.out.println(" MstatX methods: trident, wentropy, mvector, jensen, kabat, gap");
8180
System.out.println(" See agcol.arizona.edu/software/TCW/doc/mtcw/UserGuide.html for more info");
81+
System.out.println(" If score1 is Sum-of-pairs:");
8282
System.exit(0);
8383
}
8484
}
@@ -121,9 +121,9 @@ else if (hasArg(args, "-BHwp")) {
121121
isMstatX(strScore2);
122122
}
123123
/* CAS313 SoP */
124-
if (hasArg(args, "-nSoP")) {
125-
bNSoP=false;
126-
System.out.println("Do not normalize SoP scores ");
124+
if (hasArg(args, "-sp")) {
125+
bNSoP=true;
126+
System.out.println("Use #columns for average SoP scores instead of the number of comparisons ");
127127
}
128128
}
129129
static boolean hasArg(String [] args, String arg) {

java/src/cmp/database/Schema.java

Lines changed: 2 additions & 2 deletions
@@ -245,8 +245,8 @@ public void loadSchema() {
245245
"conLen int default 0, " + // consensus length
246246
"sdLen float default " + Globalx.dNoVal + ", " +// stddev from conLen
247247
// dNoVal is -2
248-
"score1 float default " + Globalx.dNoVal + ", " + // sum of sum of pairs CAS313 0-1
249-
"score2 float default " + Globalx.dNoVal + ", " + // trident -1 to 1
248+
"score1 float default " + Globalx.dNoScore + ", " + // CAS401 any value
249+
"score2 float default " + Globalx.dNoScore + ", " + // -100000
250250

251251
// dynamic summed counts for each sTCWdb is added
252252
"index idx1(PGstr)," +
