The QC labelled "36" is clearly very different to the other QCs. In STATegra this QC was removed, so we will exclude it here as well. This corresponds to QC H1. STATegra also excluded QC samples measured immediately after a blank, which we will also do here.
```{r}
C = pca_scores_plot(factor_name = 'sample_type',label_factor = 'order',points_to_label = 'all')
# plot
chart_plot(C,MS[7])
```

Now we will plot the QC samples in context with the samples. There are several possible approaches; here we will apply PCA to the full dataset including the QCs. We will exclude the blanks, as they are likely to dominate the plot if not removed. All samples from batch 12 were excluded from STATegra and we will replicate that here.

```{r}
# prepare model sequence
MS = filter_smeta(
        mode = 'exclude',
        levels='Blank',
        factor_name = 'sample_type') +

    filter_smeta(
        mode = 'exclude',
        levels='12',
        factor_name = 'biol.batch') +

    filter_by_name(
        mode = 'exclude',
        dimension='sample',
        names = c('1358BZU_0001QC_H1',
                  '1358BZU_0001QC_A1',
                  '1358BZU_0001QC_G1')) +

    knn_impute(
        neighbours=5) +

    vec_norm() +

    log_transform(
        base = 10) +

    mean_centre() +

    PCA(
        number_components = 3)

# apply model sequence
MS = model_apply(MS, DE)

# PCA scores plots
C = pca_scores_plot(factor_name = 'sample_type')
# plot
chart_plot(C,MS[8])

```

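To make the numerical steps of the sequence above more concrete, here is a minimal base R sketch of normalisation, log transform, mean centring and PCA on a small made-up matrix. It is an illustration only, not the structToolbox implementation: the toy data are hypothetical, k-nearest-neighbour imputation is omitted because the toy matrix has no missing values, and vector normalisation is assumed here to mean scaling each sample to unit Euclidean length.

```r
# Hypothetical toy data: 6 samples (rows) x 4 features (columns), all positive
X <- matrix(c(10, 20, 30, 40,
              12, 18, 33, 41,
              50, 60, 70, 80,
              55, 58, 72, 79,
               5,  9, 14, 21,
               6, 10, 15, 19),
            nrow = 6, byrow = TRUE)

# vector normalisation (assumption: each sample scaled to unit Euclidean length)
X_norm <- X / sqrt(rowSums(X^2))

# log10 transform
X_log <- log10(X_norm)

# mean centre each feature (column)
X_mc <- scale(X_log, center = TRUE, scale = FALSE)

# PCA keeping 3 components; the data are already centred
pca <- prcomp(X_mc, center = FALSE, scale. = FALSE, rank. = 3)
scores <- pca$x
```

The `scores` matrix has one row per sample and one column per component, which is the kind of information a scores plot displays.
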
The QCs appear to be representative of the samples, but there are strong clusters in the data, including among the QC samples, which have no biological variation. There are likely to be a number of 'low quality' features that should be excluded, so we will do that now, and use more sophisticated normalisation (PQN) and scaling (glog) methods.

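As a rough illustration of two of the ideas mentioned above, the following base R sketch removes features with a high relative standard deviation (RSD) in the QCs and then applies probabilistic quotient normalisation (PQN) using the QC median as the reference. The toy matrix, the 20% threshold and the variable names are assumptions for illustration; the structToolbox `rsd_filter` and `pqn_norm` objects may differ in their details.

```r
# Hypothetical toy data: rows are samples, columns are features
X <- matrix(c(100, 50, 10,
              102, 51, 30,
               98, 49, 20,
              200, 80, 15,
               50, 30, 25),
            nrow = 5, byrow = TRUE,
            dimnames = list(NULL, c("f1", "f2", "f3")))
sample_type <- c("QC", "QC", "QC", "Sample", "Sample")
is_qc <- sample_type == "QC"

# RSD (%) of each feature within the QC samples
qc <- X[is_qc, , drop = FALSE]
rsd <- 100 * apply(qc, 2, sd) / colMeans(qc)

# keep features whose QC RSD is below the (assumed) 20% threshold
keep <- rsd < 20
X_f <- X[, keep, drop = FALSE]

# PQN: reference profile = median of the QCs for each retained feature
ref <- apply(X_f[is_qc, , drop = FALSE], 2, median)

# quotient of every sample against the reference, feature by feature
quot <- sweep(X_f, 2, ref, "/")

# normalisation coefficient = median quotient per sample
coef <- apply(quot, 1, median)

# divide each sample (row) by its coefficient
X_pqn <- X_f / coef
```

In this toy example the third feature varies strongly between QCs and is removed, and after PQN the three QC rows become identical because they differed only by an overall dilution factor.
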
```{r,fig.height=10,fig.width=10}
MS = filter_smeta(
        mode = 'exclude',
        levels = '12',
        factor_name = 'biol.batch') +

    filter_by_name(
        mode = 'exclude',
        dimension='sample',
        names = c('1358BZU_0001QC_H1',
                  '1358BZU_0001QC_A1',
                  '1358BZU_0001QC_G1')) +

    blank_filter(
        fold_change = 20,
        qc_label = 'QC',
        factor_name = 'sample_type') +

    filter_smeta(
        mode='exclude',
        levels='Blank',
        factor_name='sample_type') +

    mv_feature_filter(
        threshold = 80,
        qc_label = 'QC',
        factor_name = 'sample_type',
        method = 'QC') +

    mv_feature_filter(
        threshold = 50,
        factor_name = 'sample_type',
        method='across') +

    rsd_filter(
        rsd_threshold=20,
        qc_label='QC',
        factor_name='sample_type') +

    mv_sample_filter(
        mv_threshold = 50) +

    pqn_norm(
        qc_label='QC',
        factor_name='sample_type') +

    knn_impute(
        neighbours=5,
        by='samples') +

    glog_transform(
        qc_label = 'QC',
        factor_name = 'sample_type') +

    mean_centre() +

    PCA(
        number_components = 10)

# apply model sequence
MS = model_apply(MS, DE)

# PCA plots using different factors
g=list()
for (k in c('order','biol.batch','time.point','condition')) {
    C = pca_scores_plot(factor_name = k,ellipse='none')
```

We can see now that the QCs are tightly clustered. This indicates that the biological variance of the remaining high quality features is much greater than the technical variance represented by the QCs.

There does not appear to be a trend by measurement order (A), which is an important indicator that instrument drift throughout the run is not a large source of variation in this dataset.

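A visual check for drift can be complemented by a simple numerical one. The base R sketch below correlates component scores with run order for two made-up score vectors, one with a drift term and one without. The data and names are hypothetical and are not taken from this dataset.

```r
# Hypothetical run order for 20 injections
run_order <- 1:20

# made-up PC1 scores: alternating values with no dependence on run order
scores_no_drift <- rep(c(1, -1), times = 10)

# made-up PC1 scores: the same pattern plus a linear drift with run order
scores_drift <- 0.5 * run_order + scores_no_drift

# Pearson correlation between run order and scores
r_no_drift <- cor(run_order, scores_no_drift)
r_drift <- cor(run_order, scores_drift)
```

A correlation close to zero is consistent with the absence of an order trend, whereas a value close to 1 or -1 would suggest that drift contributes to that component.
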
There does not appear to be strong clustering related to biological batch (B).

There does not appear to be a strong trend with time (C) but this is likely to be a more subtle variation and might be masked by other sources of variance at this stage.

There is some clustering related to condition (D) but with overlap.

To further explore any trends with time, we will split the data by the condition factor and only explore the Ikaros group. Removing the condition factor variation will potentially make it easier to spot any more subtle trends. We will extract the glog transformed matrix from the previous model sequence and continue from there.

Colouring by groups (A) makes the time point trend difficult to see, but by adding a `ggplot` continuous colour scale "viridis" (B) the trend with time along PC1 becomes much clearer.