Commit 4d0d262

Add stats for the number of sites matched in the GT-vs-GT, GT-vs-PL, etc modes.
This information is important for interpretation of the discordance score, because only the GT-vs-GT matching can be interpreted as the number of mismatching genotypes.
1 parent 0773541 commit 4d0d262

6 files changed: 96 additions and 36 deletions

NEWS

Lines changed: 6 additions & 0 deletions
@@ -45,6 +45,12 @@ Changes affecting specific commands:
         is newly printed as
             3_prime_utr&NMD_transcript|PCGF3|ENST00000430644|NMD

+* bcftools gtcheck
+
+    - Add stats for the number of sites matched in the GT-vs-GT, GT-vs-PL, etc modes. This
+      information is important for interpretation of the discordance score, as only the
+      GT-vs-GT matching can be interpreted as the number of mismatching genotypes.
+
 * bcftools +mendelian2

     - Fix in command line argument parsing, the `-p` and `-P` options were not

doc/bcftools.1

Lines changed: 28 additions & 10 deletions
@@ -2,12 +2,12 @@
 .\" Title: bcftools
 .\" Author: [see the "AUTHOR(S)" section]
 .\" Generator: Asciidoctor 2.0.15.dev
-.\" Date: 2023-05-30
+.\" Date: 2023-06-02
 .\" Manual: \ \&
 .\" Source: \ \&
 .\" Language: English
 .\"
-.TH "BCFTOOLS" "1" "2023-05-30" "\ \&" "\ \&"
+.TH "BCFTOOLS" "1" "2023-06-02" "\ \&" "\ \&"
 .ie \n(.g .ds Aq \(aq
 .el .ds Aq '
 .ss \n[.ss] 0
@@ -51,7 +51,7 @@ standard input (stdin) and outputs to the standard output (stdout). Several
 commands can thus be combined with Unix pipes.
 .SS "VERSION"
 .sp
-This manual page was last updated \fB2023\-05\-30 09:18 BST\fP and refers to bcftools git version \fB1.17\-50\-ga8249495+\fP.
+This manual page was last updated \fB2023\-06\-02 11:27 BST\fP and refers to bcftools git version \fB1.17\-52\-g0773541c+\fP.
 .SS "BCF1"
 .sp
 The obsolete BCF1 format output by versions of samtools <= 0.1.19 is \fBnot\fP
@@ -2478,6 +2478,10 @@ option is given, the identity of samples from \fIquery.vcf.gz\fP
 is checked against the samples in the \fB\-g\fP file.
 Without the \fB\-g\fP option, multi\-sample cross\-check of samples in \fIquery.vcf.gz\fP is performed.
 .sp
+Note that the interpretation of the discordance score depends on the options provided (specifically \fB\-e\fP and
+\fB\-u\fP) and on the available annotations (FORMAT/PL vs FORMAT/GT).
+The discordance score can be interpreted as the number of mismatching genotypes if only GT\-vs\-GT matching is performed.
+.sp
 \fB\-\-distinctive\-sites\fP \fINUM[,MEM[,DIR]]\fP
 .RS 4
 Find sites that can distinguish between at least NUM sample pairs. If the number is smaller or equal to 1,
@@ -2496,11 +2500,18 @@ Stop after first record to estimate required time.
 Interpret genotypes and genotype likelihoods probabilistically. The value of \fIINT\fP
 represents genotype quality when GT tag is used (e.g. Q=30 represents one error in 1,000 genotypes and
 Q=40 one error in 10,000 genotypes) and is ignored when PL tag is used (in that case an arbitrary
-non\-zero integer can be provided). See also the \fB\-u, \-\-use\fP option below. If set to 0,
-the discordance equals to the number of mismatching genotypes when GT vs GT is compared.
-Note that the values with and without \fB\-e\fP are not comparable, only values generated
-with \fB\-e 0\fP correspond to mismatching genotypes.
-If performance is an issue, set to 0 for faster run but less accurate results.
+non\-zero integer can be provided).
+\~
+.br
+\~
+.br
+If \fB\-e\fP is set to 0, the discordance score can be interpreted as the number of mismatching genotypes,
+but only in the GT\-vs\-GT matching mode. See the \fB\-u, \-\-use\fP option below for additional notes and caveats.
+\~
+.br
+\~
+.br
+If performance is an issue, set \fB\-e 0\fP for faster run times but less accurate results.
 .RE
 .sp
 \fB\-g, \-\-genotypes\fP \fIFILE\fP
@@ -2581,8 +2592,15 @@ see \fBCommon Options\fP
 \fB\-u, \-\-use\fP \fITAG1\fP[,\fITAG2\fP]
 .RS 4
 specifies which tag to use in the query file (\fITAG1\fP) and the \fB\-g\fP (\fITAG2\fP) file.
-By default, the PL tag is used in the query file and GT in the \fB\-g\fP file when
-available.
+By default, the PL tag is used in the query file and, when available, the GT tags in the
+\fB\-g\fP file.
+\~
+.br
+\~
+.br
+Note that when the requested tag is not available, the program will attempt to use
+the other tag. The output includes the number of sites that were matched by the four
+possible mode (for example GT\-vs\-GT or GT\-vs\-PL).
 .RE
 .sp
 \fBExamples:\fP

doc/bcftools.html

Lines changed: 22 additions & 9 deletions
@@ -50,7 +50,7 @@ <h2 id="_description">DESCRIPTION</h2>
 <div class="sect2">
 <h3 id="_version">VERSION</h3>
 <div class="paragraph">
-<p>This manual page was last updated <strong>2023-05-30 09:18 BST</strong> and refers to bcftools git version <strong>1.17-50-ga8249495+</strong>.</p>
+<p>This manual page was last updated <strong>2023-06-02 11:27 BST</strong> and refers to bcftools git version <strong>1.17-52-g0773541c+</strong>.</p>
 </div>
 </div>
 <div class="sect2">
@@ -2178,6 +2178,11 @@ <h3 id="gtcheck">bcftools gtcheck [<em>OPTIONS</em>] [<strong>-g</strong> <em>ge
 is checked against the samples in the <strong>-g</strong> file.
 Without the <strong>-g</strong> option, multi-sample cross-check of samples in <em>query.vcf.gz</em> is performed.</p>
 </div>
+<div class="paragraph">
+<p>Note that the interpretation of the discordance score depends on the options provided (specifically <strong>-e</strong> and
+<strong>-u</strong>) and on the available annotations (FORMAT/PL vs FORMAT/GT).
+The discordance score can be interpreted as the number of mismatching genotypes if only GT-vs-GT matching is performed.</p>
+</div>
 <div class="dlist">
 <dl>
 <dt class="hdlist1"><strong>--distinctive-sites</strong> <em>NUM[,MEM[,DIR]]</em></dt>
@@ -2196,11 +2201,14 @@ <h3 id="gtcheck">bcftools gtcheck [<em>OPTIONS</em>] [<strong>-g</strong> <em>ge
 <p>Interpret genotypes and genotype likelihoods probabilistically. The value of <em>INT</em>
 represents genotype quality when GT tag is used (e.g. Q=30 represents one error in 1,000 genotypes and
 Q=40 one error in 10,000 genotypes) and is ignored when PL tag is used (in that case an arbitrary
-non-zero integer can be provided). See also the <strong>-u, --use</strong> option below. If set to 0,
-the discordance equals to the number of mismatching genotypes when GT vs GT is compared.
-Note that the values with and without <strong>-e</strong> are not comparable, only values generated
-with <strong>-e 0</strong> correspond to mismatching genotypes.
-If performance is an issue, set to 0 for faster run but less accurate results.</p>
+non-zero integer can be provided).
+&#160;<br>
+&#160;<br>
+If <strong>-e</strong> is set to 0, the discordance score can be interpreted as the number of mismatching genotypes,
+but only in the GT-vs-GT matching mode. See the <strong>-u, --use</strong> option below for additional notes and caveats.
+&#160;<br>
+&#160;<br>
+If performance is an issue, set <strong>-e 0</strong> for faster run times but less accurate results.</p>
 </dd>
 <dt class="hdlist1"><strong>-g, --genotypes</strong> <em>FILE</em></dt>
 <dd>
@@ -2274,8 +2282,13 @@ <h3 id="gtcheck">bcftools gtcheck [<em>OPTIONS</em>] [<strong>-g</strong> <em>ge
 <dt class="hdlist1"><strong>-u, --use</strong> <em>TAG1</em>[,<em>TAG2</em>]</dt>
 <dd>
 <p>specifies which tag to use in the query file (<em>TAG1</em>) and the <strong>-g</strong> (<em>TAG2</em>) file.
-By default, the PL tag is used in the query file and GT in the <strong>-g</strong> file when
-available.</p>
+By default, the PL tag is used in the query file and, when available, the GT tags in the
+<strong>-g</strong> file.
+&#160;<br>
+&#160;<br>
+Note that when the requested tag is not available, the program will attempt to use
+the other tag. The output includes the number of sites that were matched by the four
+possible mode (for example GT-vs-GT or GT-vs-PL).</p>
 </dd>
 </dl>
 </div>
@@ -5257,7 +5270,7 @@ <h2 id="_copying">COPYING</h2>
 </div>
 <div id="footer">
 <div id="footer-text">
-Last updated 2023-05-30 09:18:06 +0100
+Last updated 2023-06-02 11:27:10 +0100
 </div>
 </div>
 </body>

doc/bcftools.txt

Lines changed: 19 additions & 7 deletions
@@ -1624,6 +1624,10 @@ option is given, the identity of samples from 'query.vcf.gz'
 is checked against the samples in the *-g* file.
 Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' is performed.

+Note that the interpretation of the discordance score depends on the options provided (specifically *-e* and
+*-u*) and on the available annotations (FORMAT/PL vs FORMAT/GT).
+The discordance score can be interpreted as the number of mismatching genotypes if only GT-vs-GT matching is performed.
+
 *--distinctive-sites* 'NUM[,MEM[,DIR]]'::
     Find sites that can distinguish between at least NUM sample pairs. If the number is smaller or equal to 1,
     it is interpreted as the fraction of pairs. The optional MEM string sets the maximum memory used for
@@ -1637,11 +1641,14 @@ Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' i
     Interpret genotypes and genotype likelihoods probabilistically. The value of 'INT'
     represents genotype quality when GT tag is used (e.g. Q=30 represents one error in 1,000 genotypes and
     Q=40 one error in 10,000 genotypes) and is ignored when PL tag is used (in that case an arbitrary
-    non-zero integer can be provided). See also the *-u, --use* option below. If set to 0,
-    the discordance equals to the number of mismatching genotypes when GT vs GT is compared.
-    Note that the values with and without *-e* are not comparable, only values generated
-    with *-e 0* correspond to mismatching genotypes.
-    If performance is an issue, set to 0 for faster run but less accurate results.
+    non-zero integer can be provided).
+    {nbsp} +
+    {nbsp} +
+    If *-e* is set to 0, the discordance score can be interpreted as the number of mismatching genotypes,
+    but only in the GT-vs-GT matching mode. See the *-u, --use* option below for additional notes and caveats.
+    {nbsp} +
+    {nbsp} +
+    If performance is an issue, set *-e 0* for faster run times but less accurate results.

 *-g, --genotypes* 'FILE'::
     VCF/BCF file with reference genotypes to compare against
@@ -1696,8 +1703,13 @@ Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' i

 *-u, --use* 'TAG1'[,'TAG2']::
     specifies which tag to use in the query file ('TAG1') and the *-g* ('TAG2') file.
-    By default, the PL tag is used in the query file and GT in the *-g* file when
-    available.
+    By default, the PL tag is used in the query file and, when available, the GT tags in the
+    *-g* file.
+    {nbsp} +
+    {nbsp} +
+    Note that when the requested tag is not available, the program will attempt to use
+    the other tag. The output includes the number of sites that were matched by the four
+    possible mode (for example GT-vs-GT or GT-vs-PL).

 *Examples:*
 ----

test/gtcheck.5.1.out

Lines changed: 4 additions & 0 deletions
@@ -5,4 +5,8 @@ INFO sites-skipped-monoallelic 1
 INFO sites-skipped-no-data 1
 INFO sites-skipped-GT-not-diploid 1
 INFO sites-skipped-PL-not-diploid 1
+INFO sites-used-PL-vs-PL 0
+INFO sites-used-PL-vs-GT 1
+INFO sites-used-GT-vs-PL 0
+INFO sites-used-GT-vs-GT 1
 DC A A 3.000150e-04 4.605170e+01 2
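
The four new `sites-used-*` counters can be checked mechanically before the discordance column is read. The following is a minimal sketch in C, assuming gtcheck output in the whitespace-separated `INFO <key> <count>` form shown above. The type `used_sites_t` and the functions `parse_used_line` and `only_gt_vs_gt` are hypothetical helpers written for illustration; they are not part of bcftools.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type (not part of bcftools): per-mode site counters
   taken from the "INFO sites-used-*" lines of gtcheck output. */
typedef struct
{
    unsigned pl_vs_pl, pl_vs_gt, gt_vs_pl, gt_vs_gt;
}
used_sites_t;

/* Parse one output line. Returns 1 if the line was one of the four
   sites-used counters and was stored, 0 for any other line. */
static int parse_used_line(const char *line, used_sites_t *used)
{
    char key[64];
    unsigned n;
    assert(line && used);
    if ( sscanf(line, "INFO %63s %u", key, &n) != 2 ) return 0;
    if ( !strcmp(key,"sites-used-PL-vs-PL") ) used->pl_vs_pl = n;
    else if ( !strcmp(key,"sites-used-PL-vs-GT") ) used->pl_vs_gt = n;
    else if ( !strcmp(key,"sites-used-GT-vs-PL") ) used->gt_vs_pl = n;
    else if ( !strcmp(key,"sites-used-GT-vs-GT") ) used->gt_vs_gt = n;
    else return 0;
    return 1;
}

/* Returns 1 only if every compared site was matched in GT-vs-GT mode. */
static int only_gt_vs_gt(const used_sites_t *used)
{
    assert(used);
    return used->gt_vs_gt > 0 && !used->pl_vs_pl && !used->pl_vs_gt && !used->gt_vs_pl;
}
```

Applied to the test output above, which reports one PL-vs-GT site and one GT-vs-GT site, `only_gt_vs_gt` returns 0, so the discordance there reads as an abstract score. According to the man page text in this commit, `-e 0` is also required for the mismatch-count reading; the sketch does not check that.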

vcfgtcheck.c

Lines changed: 17 additions & 10 deletions
@@ -1,6 +1,6 @@
 /*  vcfgtcheck.c -- Check sample identity.

-    Copyright (C) 2013-2021 Genome Research Ltd.
+    Copyright (C) 2013-2023 Genome Research Ltd.

     Author: Petr Danecek <[email protected]>

@@ -59,6 +59,7 @@ typedef struct
     int argc, gt_samples_is_file, qry_samples_is_file, regions_is_file, targets_is_file, pair_samples_is_file;
     int regions_overlap, targets_overlap;
     int qry_use_GT,gt_use_GT, nqry_smpl,ngt_smpl, *qry_smpl,*gt_smpl;
+    int nused[2][2];
     double *pdiff, *qry_prob, *gt_prob;
     uint32_t *ndiff,*ncnt,ncmp, npairs;
     int32_t *qry_arr,*gt_arr, nqry_arr,ngt_arr;
@@ -309,7 +310,7 @@ static void init_data(args_t *args)
         init_samples(args->qry_samples, args->qry_samples_is_file, &args->qry_smpl, &args->nqry_smpl, args->qry_hdr, args->qry_fname);
     }
     if ( args->gt_samples )
-    {
+    {
         init_samples(args->gt_samples, args->gt_samples_is_file, &args->gt_smpl, &args->ngt_smpl,
             args->gt_hdr ? args->gt_hdr : args->qry_hdr,
             args->gt_fname ? args->gt_fname : args->qry_fname);
@@ -377,7 +378,7 @@ static void init_data(args_t *args)
     args->gt_prob = args->cross_check ? args->qry_prob : (double*) malloc(3*args->ngt_smpl*sizeof(*args->gt_prob));

     // dsg2prob: the first index is bitmask of 8 possible dsg combinations (only 1<<0,1<<2,1<<3 are set, accessing
-    // anything else indicated an error, this is just to reuse gt_to_dsg()); the second index are the corresponding
+    // anything else indicated an error, this is just to reuse gt_to_dsg()); the second index are the corresponding
     // probabilities of 0/0, 0/1, and 1/1 genotypes
     for (i=0; i<8; i++)
         for (j=0; j<3; j++)
@@ -555,7 +556,9 @@ static void process_line(args_t *args)
         args->gt_arr = args->qry_arr;
     }

+    // stats: number of compared sites, and used tags
     args->ncmp++;
+    args->nused[qry_use_GT][gt_use_GT]++;

     double af,hwe_dsg[8];
     if ( args->calc_hwe_prob )
@@ -636,7 +639,7 @@ static void process_line(args_t *args)
             gt_dsg = gt_use_GT ? gt_to_prob(args,ptr,gt_prob) : pl_to_prob(args,ptr,gt_prob);
             if ( !gt_dsg ) continue; // missing value
             if ( args->hom_only && !(gt_dsg&5) ) continue; // not a hom
-
+
             ptr = args->qry_arr + args->pairs[i].iqry*nqry1;
             qry_dsg = qry_use_GT ? gt_to_prob(args,ptr,qry_prob) : pl_to_prob(args,ptr,qry_prob);
             if ( !qry_dsg ) continue; // missing value
@@ -797,11 +800,15 @@ static void report(args_t *args)
     fprintf(args->fp,"INFO\tsites-skipped-no-data\t%u\n",args->nskip_no_data);
     fprintf(args->fp,"INFO\tsites-skipped-GT-not-diploid\t%u\n",args->nskip_dip_GT);
     fprintf(args->fp,"INFO\tsites-skipped-PL-not-diploid\t%u\n",args->nskip_dip_PL);
+    fprintf(args->fp,"INFO\tsites-used-PL-vs-PL\t%u\n",args->nused[0][0]);
+    fprintf(args->fp,"INFO\tsites-used-PL-vs-GT\t%u\n",args->nused[0][1]);
+    fprintf(args->fp,"INFO\tsites-used-GT-vs-PL\t%u\n",args->nused[1][0]);
+    fprintf(args->fp,"INFO\tsites-used-GT-vs-GT\t%u\n",args->nused[1][1]);
     fprintf(args->fp,"# DC, discordance:\n");
     fprintf(args->fp,"# - query sample\n");
     fprintf(args->fp,"# - genotyped sample\n");
-    fprintf(args->fp,"# - discordance (number of mismatches; smaller is better)\n");
-    fprintf(args->fp,"# - negative log of HWE probability at matching sites (rare genotypes mataches are more informative, bigger is better)\n");
+    fprintf(args->fp,"# - discordance (either an abstract score or number of mismatches, see -e/-u in the man page for details; smaller is better)\n");
+    fprintf(args->fp,"# - negative log of HWE probability at matching sites (rare genotypes matches are more informative, bigger is better)\n");
     fprintf(args->fp,"# - number of sites compared (bigger is better)\n");
     fprintf(args->fp,"#DC\t[2]Query Sample\t[3]Genotyped Sample\t[4]Discordance\t[5]-log P(HWE)\t[6]Number of sites compared\n");

@@ -1023,7 +1030,7 @@ static int is_input_okay(args_t *args, int nmatch)
     return 1;

 not_okay:
-    fprintf(stderr,"INFO: skipping %s:%"PRIhts_pos", %s. (This is printed only once.)\n",
+    fprintf(stderr,"INFO: skipping %s:%"PRIhts_pos", %s. (This is printed only once.)\n",
         bcf_seqname(hdr,rec),rec->pos+1,msg);
     return 0;
 }
@@ -1097,7 +1104,7 @@ int main_vcfgtcheck(int argc, char *argv[])
     args->es_max_mem = strdup("500M");

     // In simulated sample swaps the minimum error was 0.3 and maximum intra-sample error was 0.23
-    // - min_inter: pairs with smaller err value will be considered identical
+    // - min_inter: pairs with smaller err value will be considered identical
     // - max_intra: pairs with err value bigger than abs(max_intra_err) will be considered
     // different. If negative, the cutoff may be heuristically lowered
     args->min_inter_err = 0.23;
@@ -1169,7 +1176,7 @@ int main_vcfgtcheck(int argc, char *argv[])
             case 3 : args->calc_hwe_prob = 0; break;
             case 4 : error("The option -S, --target-sample has been deprecated\n"); break;
             case 5 : args->dry_run = 1; break;
-            case 6 :
+            case 6 :
                 args->distinctive_sites = strtod(optarg,&tmp);
                 if ( *tmp )
                 {
@@ -1202,7 +1209,7 @@ int main_vcfgtcheck(int argc, char *argv[])
                 else if ( !strncasecmp("qry:",optarg,4) ) args->qry_samples = optarg+4;
                 else error("Which one? Query samples (qry:%s) or genotype samples (gt:%s)?\n",optarg,optarg);
                 break;
-            case 'S':
+            case 'S':
                 if ( !strncasecmp("gt:",optarg,3) ) args->gt_samples = optarg+3, args->gt_samples_is_file = 1;
                 else if ( !strncasecmp("qry:",optarg,4) ) args->qry_samples = optarg+4, args->qry_samples_is_file = 1;
                 else error("Which one? Query samples (qry:%s) or genotype samples (gt:%s)?\n",optarg,optarg);
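
The bookkeeping added in `process_line()` and `report()` can be viewed in isolation in the sketch below. It assumes, as the `fprintf` labels in the diff suggest, that the first index is the tag used in the query file and the second the tag used in the `-g` file, with 0 meaning PL and 1 meaning GT. The names `mode_counts_t`, `count_site` and `report_used` are hypothetical and do not exist in bcftools.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the nused[2][2] member added to args_t:
   first index = tag used in the query file, second = tag used in the -g file,
   0 = PL, 1 = GT (an assumption read off the fprintf labels in the diff). */
typedef struct
{
    int nused[2][2];
}
mode_counts_t;

/* Count one compared site; any non-zero flag is treated as "GT used". */
static void count_site(mode_counts_t *cnt, int qry_use_GT, int gt_use_GT)
{
    assert(cnt);
    cnt->nused[qry_use_GT ? 1 : 0][gt_use_GT ? 1 : 0]++;
}

/* Print the four counters in the same order as the new INFO lines:
   PL-vs-PL, PL-vs-GT, GT-vs-PL, GT-vs-GT. */
static void report_used(FILE *fp, const mode_counts_t *cnt)
{
    static const char *tag[2] = { "PL", "GT" };
    int q, g;
    assert(fp && cnt);
    for (q=0; q<2; q++)
        for (g=0; g<2; g++)
            fprintf(fp, "INFO\tsites-used-%s-vs-%s\t%d\n", tag[q], tag[g], cnt->nused[q][g]);
}
```

The sketch prints with `%d` because the counters are declared `int`, and it normalises the flags to 0 or 1 before indexing. Whether the flags in `vcfgtcheck.c` can ever hold another non-zero value is not visible in this diff, so that guard is a precaution rather than a known fix.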
