## Mapping of RBP binding sites: analysis {.smaller}
97
96
98
-
::: columns
99
-
::: {.column width="50%" .nonincremental}
100
-
97
+
::::: columns
98
+
::: {.column .nonincremental width="50%"}
101
99
Most CLIP-seq approaches have single-nucleotide resolution information. However, they vary in the frequency of that information and the efficiency of the procedure.
102
100
103
101
The basic concept for **calling peaks/binding sites** from CLIP-seq:
@@ -107,62 +105,51 @@ The basic concept for **calling peaks/binding sites** from CLIP-seq:
107
105
- Use nucleotide level information to de/refine position of RBP-binding sites
108
106
109
107
In this class we will be working with PAR-CLIP data. I will also show you how to access ENCODE eCLIP data, and you should easily be able to apply what you learn to those data.
110
-
111
108
:::
112
109
113
110
::: {.column width="50%"}
114
111

115
112
:::
116
-
:::
117
-
113
+
:::::
118
114
119
115
## Analysis overview {.smaller}
120
116
121
-
::: columns
117
+
::::: columns
122
118
::: {.column width="50%"}
123
-
124
119

125
-
126
120
:::
127
-
::: {.column width="50%" .nonincremental}
128
121
129
-
1. Filter out low quality or short reads (<18 for larger genomes)
The pattern of T = \> C conversions, coupled with read density, can thus provide a strong signal to generate a high-resolution map of confident RNA-protein interaction sites.
152
143
153
-
The pattern of T = > C conversions, coupled with read density, can thus provide a strong signal to generate a high-resolution map of confident RNA-protein interaction sites.
154
-
155
-
A non-parametric kernel-density estimate is used to identify the RNA-protein interaction sites from a combination of T = > C conversions and read density.
144
+
A non-parametric kernel-density estimate is used to identify the RNA-protein interaction sites from a combination of T = \> C conversions and read density.
156
145
157
146
See [PARalyzer](https://pubmed.ncbi.nlm.nih.gov/21851591/) for more information.
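
To make the idea concrete, here is a minimal base R sketch that compares a kernel-density estimate of T-to-C conversion positions against one for unconverted T positions, and keeps the positions where the conversion signal dominates. The positions, the bandwidth, and the variable names are all made up for illustration; this is a simplified picture of the concept, not PARalyzer's actual implementation.

```{r}
# Toy positions along a 100-nt read cluster (hypothetical values).
conv_pos    <- c(48, 49, 49, 50, 50, 50, 51, 51, 52, 53)  # T-to-C conversions
nonconv_pos <- seq(5, 95, by = 10)                        # unconverted T's

# Gaussian kernel-density estimates evaluated on the same grid (1..100).
# The bandwidth of 3 nt is an arbitrary choice for this sketch.
d_conv <- density(conv_pos,    bw = 3, from = 1, to = 100, n = 100)
d_non  <- density(nonconv_pos, bw = 3, from = 1, to = 100, n = 100)

# Candidate interaction site: positions where conversion density
# exceeds non-conversion density.
site <- d_conv$x[d_conv$y > d_non$y]
range(site)
```

In this toy example the conversions cluster around position 50, so the retained positions form a single window around that cluster.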
158
-
159
147
:::
160
-
::: {.column width="50%"}
161
148
149
+
::: {.column width="50%"}
162
150

163
151
:::
164
-
:::
165
-
152
+
:::::
166
153
167
154
## Today's menu {.smaller}
168
155
@@ -174,18 +161,11 @@ We will be starting with position of the binding sites in the genome (the output
174
161
175
162
#### 2. Perform motif analysis accounting for the background sequence regions.
176
163
177
-
178
164

179
165
180
166
## Annotation of binding sites {.smaller}
181
167
182
-
183
-
Where are the binding sites?
184
-
- Which genes?
185
-
- What region of those genes?
186
-
- How many binding sites per region?
187
-
- How many binding sites per gene?
188
-
- How many binding sites per gene by region?
168
+
Where are the binding sites? - Which genes? - What region of those genes? - How many binding sites per region? - How many binding sites per gene? - How many binding sites per gene by region?
189
169
190
170
We will use `annotatr` and `GRanges` to answer these questions.
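
Once each binding site carries a gene and a region label, the counting questions above reduce to cross-tabulation. The sketch below uses a small hand-made data frame in base R; the column names (`site_id`, `gene`, `region`) and all values are hypothetical and are not the actual `annotatr` output format.

```{r}
# Toy annotated binding sites: one row per site (hypothetical values).
sites <- data.frame(
  site_id = c("s1", "s2", "s3", "s4", "s5"),
  gene    = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"),
  region  = c("3UTR", "intron", "3UTR", "3UTR", "CDS"),
  stringsAsFactors = FALSE
)

# How many binding sites per region?
sites_per_region <- table(sites$region)

# How many binding sites per gene?
sites_per_gene <- table(sites$gene)

# How many binding sites per gene by region?
sites_per_gene_region <- table(sites$gene, sites$region)
sites_per_gene_region
```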
It looks like HuR prefers binding to 3' UTRs and introns. That is a bit of a surprise given the model above indicating 3' UTR binding. Let's take a step back and frame our expectations using what we know about the genome.
@@ -398,7 +364,6 @@ In this case, how many basepairs are introns and 3' UTRs in the genome?
398
364
399
365
## Binding region length biases {.smaller}
400
366
401
-
402
367
```{r}
403
368
#| eval: true
404
369
#| echo: true
@@ -427,14 +392,12 @@ for (i in 1:length(my_hg19_annots)) {
427
392
barplot(mylengths[1:4], las = 2, main = "total bases per category", log = "y")
428
393
```
429
394
430
-
431
395
## Control for CLIP-binding sites {.smaller}
432
396
433
397
We need a way to figure out a null model OR background expectation.
434
398
435
399
What if we were to take our HuR binding sites, randomize their positions, and then repeat the annotation on the randomized binding sites?
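
The logic of this null model can be sketched in base R on a toy example. Everything here is an assumption made for illustration: a single 1,000-nt transcript with three regions, eight made-up binding-site positions, and uniform shuffling. An observed/expected ratio of 1 means a region has as many sites as expected by chance.

```{r}
set.seed(1)

# Hypothetical 1,000-nt transcript split into three regions:
# 5UTR = 1-100, CDS = 101-600, 3UTR = 601-1000.
region_breaks <- c(0, 100, 600, 1000)
region_names  <- c("5UTR", "CDS", "3UTR")

# Assign each position to a region and count sites per region.
count_regions <- function(pos) {
  as.vector(table(cut(pos, breaks = region_breaks, labels = region_names)))
}

# Made-up "observed" binding-site positions.
observed_pos <- c(300, 650, 700, 720, 810, 905, 950, 980)
observed <- count_regions(observed_pos)

# Shuffle: draw the same number of random positions many times and
# average the per-region counts to get the expected counts.
n_shuffles <- 1000
shuffled <- replicate(n_shuffles,
                      count_regions(sample.int(1000, length(observed_pos))))
expected <- rowMeans(shuffled)

obs_vs_exp <- setNames(observed / expected, region_names)
obs_vs_exp
```

Because the expected counts scale with region length, this ratio corrects for the fact that longer regions collect more sites by chance alone.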
436
400
437
-
438
401
```{r}
439
402
#| eval: true
440
403
#| echo: true
@@ -511,11 +474,9 @@ ggplot(
511
474
) +
512
475
geom_bar(stat = "identity") +
513
476
ylab("Observed vs Expected") +
514
-
theme_cowplot()
477
+
theme_cowplot() + geom_hline(yintercept = 1)
515
478
```
516
479
517
-
518
-
519
480
## 5 MINUTE BREAK
520
481
521
482
## What sequence does HuR bind to? {.smaller}
@@ -524,9 +485,9 @@ Is it just *AUUUA*?
524
485
525
486
**Different transcript regions have different nucleotide composition.**
526
487
527
-
- 5' UTRs are more GC-rich
488
+
-5' UTRs are more GC-rich
528
489
529
-
- 3' UTRs are more AU-rich
490
+
-3' UTRs are more AU-rich
530
491
531
492

532
493
@@ -538,13 +499,12 @@ Steps to determine k-mer composition (we use 6mers) for any set of intervals
538
499
539
500
We'll do it for the HuR binding sites and then compare the result to background sequences.
540
501
541
-
1. Create a `GRanges` object for a given annotation category.
542
-
2. Remove duplicated intervals (from different transcript IDs) with `reduce`.
543
-
3. Retrieve sequences using `getSeq`
544
-
4. Create a dataframe containing the count and frequency of each 6mer.
545
-
502
+
1. Create a `GRanges` object for a given annotation category.
503
+
2. Remove duplicated intervals (from different transcript IDs) with `reduce`.
504
+
3. Retrieve sequences using `getSeq`
505
+
4. Create a dataframe containing the count and frequency of each 6mer.
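
The last step, counting 6mers and converting counts to frequencies, can be illustrated with a small base R helper. This is only a sketch on toy strings: `count_kmers` is a hypothetical helper written for this example, and it stands in for whatever Bioconductor function the class code uses (Biostrings offers `oligonucleotideFrequency` for this purpose).

```{r}
# Count all overlapping k-mers in a character vector of sequences and
# return a data frame with the count and frequency of each k-mer.
count_kmers <- function(seqs, k = 6) {
  kmers <- unlist(lapply(seqs, function(s) {
    n <- nchar(s)
    if (n < k) return(character(0))      # sequence too short for any k-mer
    starts <- seq_len(n - k + 1)
    substring(s, starts, starts + k - 1)
  }))
  if (length(kmers) == 0) {
    return(data.frame(kmer = character(0), count = integer(0),
                      freq = numeric(0), stringsAsFactors = FALSE))
  }
  counts <- table(kmers)
  data.frame(
    kmer  = names(counts),
    count = as.vector(counts),
    freq  = as.vector(counts) / sum(counts),
    stringsAsFactors = FALSE
  )
}

# Toy sequences (genomic DNA, so U appears as T).
toy_seqs <- c("ATTTATTTA", "TTTTTTT")
kmer_df <- count_kmers(toy_seqs)
kmer_df
```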
546
506
547
-
## Calculate 6mers in HuR sites {.smaller}
507
+
## Calculate 6mers in HuR sites {.smaller}
548
508
549
509
Since HuR preferentially binds to 3' UTRs, that is the region we will focus on.
Next, we will calculate 6mer frequencies in 3' UTRs. This will serve as a null model or background that we can compare with the HuR binding site 6mers.
550
+
Next, we will calculate 6mer frequencies in 3' UTRs. This will serve as a null model or background that we can compare with the HuR binding site 6mers.
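
One simple way to compare the two sets of 6mers is a log2 ratio of binding-site frequency to background frequency. The sketch below uses made-up counts for three 6mers, and the column names are hypothetical; it shows the comparison step only and is not necessarily the exact statistic used in class.

```{r}
# Made-up 6mer counts in HuR binding sites and in background 3' UTRs.
site_kmers <- data.frame(kmer  = c("TTTTTT", "ATTTAT", "GCGCGC"),
                         count = c(50, 30, 2),
                         stringsAsFactors = FALSE)
bg_kmers   <- data.frame(kmer  = c("TTTTTT", "ATTTAT", "GCGCGC"),
                         count = c(400, 300, 300),
                         stringsAsFactors = FALSE)

# Convert counts to frequencies within each set.
site_kmers$freq <- site_kmers$count / sum(site_kmers$count)
bg_kmers$freq   <- bg_kmers$count / sum(bg_kmers$count)

# Join on the 6mer and compute enrichment over background.
cmp <- merge(site_kmers, bg_kmers, by = "kmer", suffixes = c("_site", "_bg"))
cmp$log2_enrichment <- log2(cmp$freq_site / cmp$freq_bg)

# Most enriched 6mers first.
cmp <- cmp[order(-cmp$log2_enrichment), ]
cmp
```

A positive value means the 6mer is more frequent in binding sites than in the background. With real data, 6mers absent from one set would need a pseudocount before taking the ratio.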