Commit dad6f76

scope: add more scope details to tutorial
1 parent eba0417 commit dad6f76

1 file changed: +103 -13 lines changed

tutorials/data_exploration_scope.ipynb

Lines changed: 103 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -17,6 +17,96 @@
1717
"---\n"
1818
]
1919
},
20+
{
21+
"cell_type": "markdown",
22+
"id": "f6c25706-251c-438c-9915-e8002647eb94",
23+
"metadata": {},
24+
"source": [
25+
"### Understanding [SCOPe](https://scop.berkeley.edu/) and [PDB](https://www.rcsb.org/) \n",
26+
"\n",
27+
"\n",
28+
"1. **Protein domains form chains.** \n",
29+
"2. **Chains form complexes** (protein complexes or structures). \n",
30+
"3. These **complexes are the entries in PDB**, represented by unique identifiers like `\"1A3N\"`. \n",
31+
"\n",
32+
"---\n",
33+
"\n",
34+
"#### **Protein Domain** \n",
35+
"A **protein domain** is a **structural and functional unit** of a protein. \n",
36+
"\n",
37+
"\n",
38+
"##### Key Characteristics:\n",
39+
"- **Domains are part of a protein chain.** \n",
40+
"- A domain can span: \n",
41+
" 1. **The entire chain** (single-domain protein): \n",
42+
" - In this case, the protein domain is equivalent to the chain itself. \n",
43+
" - Example: \n",
44+
" - All chains of the **PDB structure \"1A3N\"** are single-domain proteins. \n",
45+
" - Each chain has a SCOPe domain identifier. \n",
46+
" - For example, Chain **A**: \n",
47+
" - Domain identifier: `d1a3na_` \n",
48+
" - Breakdown of the identifier: \n",
49+
" - `d`: Denotes domain. \n",
50+
" - `1a3n`: Refers to the PDB protein structure identifier. \n",
51+
" - `a`: Specifies the chain within the structure. (`_` for None and `.` for multiple chains)\n",
52+
" - `_`: Indicates the domain spans the entire chain (single-domain protein). \n",
53+
" - Example: [PDB Structure 1A3N - Chain A](https://www.rcsb.org/sequence/1A3N#A)\n",
54+
" 2. **A specific portion of the chain** (multi-domain protein): \n",
55+
" - Here, a single chain contains multiple domains. \n",
56+
" - Example: Chain **A** of the **PDB structure \"1PKN\"** contains three domains: `d1pkna1`, `d1pkna2`, `d1pkna3`. \n",
57+
" - Example: [PDB Structure 1PKN - Chain A](https://www.rcsb.org/annotations/1PKN). \n",
58+
"\n",
59+
"---\n",
60+
"\n",
61+
"#### **Protein Chain** \n",
62+
"A **protein chain** refers to the entire **polypeptide chain** observed in a protein's 3D structure (as described in PDB files). \n",
63+
"\n",
64+
"##### Key Points:\n",
65+
"- A chain can consist of **one or multiple domains**:\n",
66+
" - **Single-domain chain**: The chain and domain are identical. \n",
67+
" - Example: Myoglobin. \n",
68+
" - **Multi-domain chain**: Contains several domains, each with distinct structural and functional roles. \n",
69+
"- Chains assemble to form **protein complexes** or **structures**. \n",
70+
"\n",
71+
"\n",
72+
"---\n",
73+
"\n",
74+
"#### **Key Observations About SCOPe** \n",
75+
"- The **fundamental classification unit** in SCOPe is the **protein domain**, not the entire protein. \n",
76+
"- _**The taxonomy in SCOPe is not for the entire protein (i.e., the full-length amino acid sequence as encoded by a gene) but for protein domains, which are smaller, structurally and functionally distinct regions of the protein.**_\n",
77+
"\n",
78+
"\n",
79+
"--- \n",
80+
"\n",
81+
"**SCOPe 2.08 Data Analysis:**\n",
82+
"\n",
83+
"The current SCOPe version (2.08) includes the following statistics based on analysis for relevant data:\n",
84+
"\n",
85+
"- **Classes**: 12\n",
86+
"- **Folds**: 1485\n",
87+
"- **Superfamilies**: 2368\n",
88+
"- **Families**: 5431\n",
89+
"- **Proteins**: 13,514\n",
90+
"- **Species**: 30,294\n",
91+
"- **Domains**: 344,851\n",
92+
"\n",
93+
"For more detailed statistics, please refer to the official SCOPe website:\n",
94+
"\n",
95+
"- [SCOPe 2.08 Statistics](https://scop.berkeley.edu/statistics/ver=2.08)\n",
96+
"- [SCOPe 2.08 Release](https://scop.berkeley.edu/ver=2.08)\n",
97+
"\n",
98+
"---\n",
99+
"\n",
100+
"## SCOPe Labeling \n",
101+
"\n",
102+
"- Use SCOPe labels for protein domains.\n",
103+
"- Map them back to their **protein-chain** sequences (protein sequence label = sum of all domain labels).\n",
104+
"- Train on protein sequences.\n",
105+
"- This pretraining task would be comparable to GO-based training.\n",
106+
"\n",
107+
"--- "
108+
]
109+
},
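The identifier breakdown in the added markdown cell can be sketched in code. This is a minimal sketch and not part of the commit: `parse_sid`, `ScopeDomainId`, and the regular expression are hypothetical names introduced here, and the pattern assumes only the seven-character layout described in the cell (`d`, a four-character PDB ID, one chain character, one domain character).

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "d" + 4-character PDB ID + chain character + domain character.
# Only the simple form shown in the tutorial cell is handled here.
SID_PATTERN = re.compile(
    r"^d(?P<pdb>[0-9][a-z0-9]{3})(?P<chain>[A-Za-z0-9_.])(?P<domain>[A-Za-z0-9_])$"
)


class ScopeDomainId(NamedTuple):
    pdb_id: str            # e.g. "1a3n"
    chain: Optional[str]   # None if no chain is given ("_"); "." means multiple chains
    domain: Optional[str]  # None if the domain spans the whole chain ("_")


def parse_sid(sid: str) -> ScopeDomainId:
    """Split a SCOPe domain identifier such as 'd1a3na_' into its parts."""
    match = SID_PATTERN.match(sid)
    if match is None:
        raise ValueError(f"not a recognised SCOPe domain identifier: {sid!r}")
    chain = match["chain"]
    domain = match["domain"]
    return ScopeDomainId(
        pdb_id=match["pdb"],
        chain=None if chain == "_" else chain,
        domain=None if domain == "_" else domain,
    )


single = parse_sid("d1a3na_")   # single-domain chain A of 1A3N
multi = parse_sid("d1pkna2")    # second domain on chain A of 1PKN
print(single)
print(multi)
```

Returning `None` for the `_` placeholders keeps "no chain given" and "whole chain" distinguishable from real chain or domain characters; other encodings would work equally well.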
20110
{
21111
"cell_type": "code",
22112
"execution_count": 1,
@@ -171,25 +261,25 @@
171261
"text": [
172262
"Checking for processed data in data\\SCOPe\\version_2.08\\SCOPe2000\\processed\n",
173263
"Missing processed data file (`data.pkl` file)\n",
174-
"Missing PDB raw data, Downloading PDB sequence data....\n",
175-
"Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
176-
"Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
177-
"Unzipping the file....\n",
264+
"Missing PDB raw data, Downloading PDB sequence data....\n",
265+
"Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
266+
"Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
267+
"Unzipping the file....\n",
178268
"Unpacked and saved to data\\SCOPe\\pdb_sequences.txt\n",
179269
"Removed temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
180-
"Missing Scope: cla.txt raw data, Downloading...\n"
181-
]
182-
},
183-
{
184-
"name": "stderr",
270+
"Missing Scope: cla.txt raw data, Downloading...\n"
271+
]
272+
},
273+
{
274+
"name": "stderr",
185275
"output_type": "stream",
186276
"text": [
187277
"G:\\anaconda3\\envs\\env_chebai\\lib\\site-packages\\urllib3\\connectionpool.py:1099: InsecureRequestWarning: Unverified HTTPS request is being made to host 'scop.berkeley.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
188278
"warnings.warn(\n"
189-
]
190-
},
191-
{
192-
"name": "stdout",
279+
]
280+
},
281+
{
282+
"name": "stdout",
193283
"output_type": "stream",
194284
"text": [
195285
"Missing Scope: hie.txt raw data, Downloading...\n",
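The "SCOPe Labeling" steps in the added markdown cell (label domains, then map the labels back to their chains) could look roughly like the sketch below. It is not taken from the repository: `chain_labels` is a hypothetical helper, the label strings are invented placeholders rather than real SCOPe classifications, and reading "sum of all domain labels" as a set union is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

ChainKey = Tuple[str, str]  # (pdb_id, chain)


def chain_labels(
    domain_labels: Iterable[Tuple[str, str, str]],
) -> Dict[ChainKey, Set[str]]:
    """Collect, per (pdb_id, chain), the union of the labels of all its domains.

    Each input item is (pdb_id, chain, label) for one domain-level label.
    """
    labels: Dict[ChainKey, Set[str]] = defaultdict(set)
    for pdb_id, chain, label in domain_labels:
        labels[(pdb_id, chain)].add(label)
    return dict(labels)


# Placeholder data: the label names are made up for illustration only.
toy_domain_labels = [
    ("1pkn", "a", "label_1"),  # first domain on chain A
    ("1pkn", "a", "label_2"),  # second domain on chain A
    ("1pkn", "a", "label_1"),  # third domain shares a label with the first
    ("1a3n", "a", "label_3"),  # single-domain chain
]

result = chain_labels(toy_domain_labels)
print(result)
```

Using sets means a label shared by several domains of one chain is counted once, which matches a multi-label target for training on chain sequences.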
