Commit 7b37be1

Clarify the multiple incompatible uses of "phred-scale".
Sometimes this refers to $-10 \log_{10}(p)$, sometimes to $-10 \log_{10}(1-p)$, and sometimes to something normalised so that $p$ isn't really a probability at all. Note that CNL, CNP and CNQ don't mention phred anywhere in their short descriptions, and only CNQ mentions Phred in its long description, so I applied the same logic to PL, PP (is this correct?) and PQ. I also clarified the "VCF tag naming conventions" part. I changed phred-scale in one place there to phred-true-scale. I'm not entirely happy with that, but as it's immediately followed by the formula I think it's clear.
1 parent c236c44 commit 7b37be1

1 file changed: +9 -7 lines changed

VCFv4.3.tex

Lines changed: 9 additions & 7 deletions
@@ -427,9 +427,9 @@ \subsubsection{Genotype fields}
 GT & 1 & String & Genotype \\
 HQ & 2 & Integer & Haplotype quality \\
 MQ & 1 & Integer & RMS mapping quality \\
-PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
-PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
-PQ & 1 & Integer & Phasing quality \\
+PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer\\
+PP & G & Integer & $-10 log_{10}$ scaled genotype posterior probabilities rounded to the closest integer\\
+PQ & 1 & Integer & Phred-scaled phasing quality\\
 PS & 1 & Integer & Phase set \\
 \end{longtable}
@@ -515,8 +515,8 @@ \subsubsection{Genotype fields}
 
 \item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
 \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
-\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
-\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
+\item PL (Integer): The $log_{10}$ scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
+\item PP (Integer): The $log_{10}$ scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
 \item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
 We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality.
 \item PS (non-negative 32-bit Integer): Phase set, defined as a set of phased genotypes to which this genotype belongs.
@@ -544,13 +544,14 @@ \subsection{VCF tag naming conventions}
 \begin{itemize}
 \item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
 Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
-The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL).
+The likelihood can be also represented in some cases as a phred-true scale ($-10 \log_{10}(probability\_of\_being\_correct)$) in a separate tag (e.g.\ PL).
+In this case they may be normalised so the most likely event has a score of 0.
 
 \item The `P' suffix means \emph{probability} as linear-scale probability in the posterior distribution, which is $\Pr(\mathrm{Model}|\mathrm{Data})$. Examples are GP, CNP.
 
 \item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$, where the model is the most likely genotype that appears in the GT field.
 Examples are GQ, CNQ.
-The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
+The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number with $QUAL = -10 \log_{10}(probability\_of\_being\_incorrect)$).
 \end{itemize}
@@ -2085,6 +2086,7 @@ \section{List of changes}
 \subsection{Changes to VCFv4.3}
 
 \begin{itemize}
+\item Clarify distinction between Phred ($-10 log_{10}(p\_of\_incorrect)$) and $-10 log_{10}(p\_of\_correct)$.
 \item More strict language: ``should'' replaced with ``must'' where appropriate
 \item Tables with Type and Number definitions for INFO and FORMAT reserved keys
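
The distinction this commit draws can be illustrated with a small Python sketch. It is not part of the commit or of the specification: the function names are hypothetical, and the normalisation order in `gl_to_pl` (subtract the best log-likelihood, then scale and round) is an assumption about how the "most likely event has a score of 0" wording might be applied.

```python
import math


def phred_from_incorrect(p_incorrect: float) -> float:
    """Phred scale as used for QUAL: -10 * log10(probability of being incorrect)."""
    return -10.0 * math.log10(p_incorrect)


def neg10log10_from_correct(p_correct: float) -> float:
    """The other convention: -10 * log10(probability of being correct)."""
    return -10.0 * math.log10(p_correct)


def gl_to_pl(gl: list[float]) -> list[int]:
    """Convert log10 genotype likelihoods (GL) to integer PL-style values.

    Assumption: values are normalised so the most likely genotype scores 0,
    by subtracting the largest log10 likelihood before scaling by -10.
    Note that Python's round() sends exact .5 ties to the nearest even integer.
    """
    best = max(gl)
    return [round(-10.0 * (g - best)) for g in gl]


if __name__ == "__main__":
    # The same event (99.9% chance of being correct) on the two scales.
    print(phred_from_incorrect(0.001))     # large value means high confidence
    print(neg10log10_from_correct(0.999))  # value near 0 means high confidence
    print(gl_to_pl([-0.5, -3.2, -7.9]))
```

On the Phred scale a larger number means higher confidence, whereas on the $-10 \log_{10}(p\_of\_correct)$ scale the most likely outcome has the smallest value, which is why the two cannot share one unqualified "phred-scaled" label.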
