|
17 | 17 | "---\n" |
18 | 18 | ] |
19 | 19 | }, |
| 20 | + { |
| 21 | + "cell_type": "markdown", |
| 22 | + "id": "f6c25706-251c-438c-9915-e8002647eb94", |
| 23 | + "metadata": {}, |
| 24 | + "source": [ |
| 25 | + "### Understanding [SCOPe](https://scop.berkeley.edu/) and [PDB](https://www.rcsb.org/) \n", |
| 26 | + "\n", |
| 27 | + "\n", |
| 28 | + "1. **Protein domains form chains.** \n", |
| 29 | + "2. **Chains form complexes** (protein complexes or structures). \n", |
| 30 | + "3. These **complexes are the entries in PDB**, represented by unique identifiers like `\"1A3N\"`. \n", |
| 31 | + "\n", |
| 32 | + "---\n", |
| 33 | + "\n", |
| 34 | + "#### **Protein Domain** \n", |
| 35 | + "A **protein domain** is a **structural and functional unit** of a protein. \n", |
| 36 | + "\n", |
| 37 | + "\n", |
| 38 | + "##### Key Characteristics:\n", |
| 39 | + "- **Domains are part of a protein chain.** \n", |
| 40 | + "- A domain can span: \n", |
| 41 | + " 1. **The entire chain** (single-domain protein): \n", |
| 42 | + " - In this case, the protein domain is equivalent to the chain itself. \n", |
| 43 | + " - Example: \n", |
| 44 | + " - All chains of the **PDB structure \"1A3N\"** are single-domain proteins. \n", |
| 45 | + " - Each chain has a SCOPe domain identifier. \n", |
| 46 | + " - For example, Chain **A**: \n", |
| 47 | + " - Domain identifier: `d1a3na_` \n", |
| 48 | + " - Breakdown of the identifier: \n", |
| 49 | + " - `d`: Denotes domain. \n", |
| 50 | + " - `1a3n`: Refers to the PDB protein structure identifier. \n", |
| 51 | + "               - `a`: Specifies the chain within the structure (`_` if no chain is specified, `.` if the domain spans multiple chains).\n", |
| 52 | + " - `_`: Indicates the domain spans the entire chain (single-domain protein). \n", |
| 53 | + " - Example: [PDB Structure 1A3N - Chain A](https://www.rcsb.org/sequence/1A3N#A)\n", |
| 54 | + " 2. **A specific portion of the chain** (multi-domain protein): \n", |
| 55 | + " - Here, a single chain contains multiple domains. \n", |
| 56 | + " - Example: Chain **A** of the **PDB structure \"1PKN\"** contains three domains: `d1pkna1`, `d1pkna2`, `d1pkna3`. \n", |
| 57 | + " - Example: [PDB Structure 1PKN - Chain A](https://www.rcsb.org/annotations/1PKN). \n", |
| 58 | + "\n", |
| 59 | + "---\n", |
| 60 | + "\n", |
| 61 | + "#### **Protein Chain** \n", |
| 62 | + "A **protein chain** refers to the entire **polypeptide chain** observed in a protein's 3D structure (as described in PDB files). \n", |
| 63 | + "\n", |
| 64 | + "##### Key Points:\n", |
| 65 | + "- A chain can consist of **one or multiple domains**:\n", |
| 66 | + " - **Single-domain chain**: The chain and domain are identical. \n", |
| 67 | + " - Example: Myoglobin. \n", |
| 68 | + " - **Multi-domain chain**: Contains several domains, each with distinct structural and functional roles. \n", |
| 69 | + "- Chains assemble to form **protein complexes** or **structures**. \n", |
| 70 | + "\n", |
| 71 | + "\n", |
| 72 | + "---\n", |
| 73 | + "\n", |
| 74 | + "#### **Key Observations About SCOPe** \n", |
| 75 | + "- The **fundamental classification unit** in SCOPe is the **protein domain**, not the entire protein. \n", |
| 76 | + "- _**The taxonomy in SCOPe is not for the entire protein (i.e., the full-length amino acid sequence as encoded by a gene) but for protein domains, which are smaller, structurally and functionally distinct regions of the protein.**_\n", |
| 77 | + "\n", |
| 78 | + "\n", |
| 79 | + "--- \n", |
| 80 | + "\n", |
| 81 | + "**SCOPe 2.08 Data Analysis:**\n", |
| 82 | + "\n", |
| 83 | + "The counts below for the current SCOPe version (2.08) are based on our analysis of the relevant data:\n", |
| 84 | + "\n", |
| 85 | + "- **Classes**: 12\n", |
| 86 | + "- **Folds**: 1,485\n", |
| 87 | + "- **Superfamilies**: 2,368\n", |
| 88 | + "- **Families**: 5,431\n", |
| 89 | + "- **Proteins**: 13,514\n", |
| 90 | + "- **Species**: 30,294\n", |
| 91 | + "- **Domains**: 344,851\n", |
| 92 | + "\n", |
| 93 | + "For more detailed statistics, please refer to the official SCOPe website:\n", |
| 94 | + "\n", |
| 95 | + "- [SCOPe 2.08 Statistics](https://scop.berkeley.edu/statistics/ver=2.08)\n", |
| 96 | + "- [SCOPe 2.08 Release](https://scop.berkeley.edu/ver=2.08)\n", |
| 97 | + "\n", |
| 98 | + "---\n", |
| 99 | + "\n", |
| 100 | + "## SCOPe Labeling \n", |
| 101 | + "\n", |
| 102 | + "- Use SCOPe labels for protein domains.\n", |
| 103 | + "- Map the domain labels back to their **protein-chain** sequences (the label set of a chain is the union of the labels of all its domains).\n", |
| 104 | + "- Train on protein sequences.\n", |
| 105 | + "- This pretraining task would be comparable to GO-based training.\n", |
| 106 | + "\n", |
| 107 | + "--- " |
| 108 | + ] |
| 109 | + }, |
20 | 110 | { |
21 | 111 | "cell_type": "code", |
22 | 112 | "execution_count": 1, |
|
171 | 261 | "text": [ |
172 | 262 | "Checking for processed data in data\\SCOPe\\version_2.08\\SCOPe2000\\processed\n", |
173 | 263 | "Missing processed data file (`data.pkl` file)\n", |
174 | | - "Missing PDB raw data, Downloading PDB sequence data....\n", |
175 | | - "Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n", |
176 | | - "Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n", |
177 | | - "Unzipping the file....\n", |
| 264 | + "Missing PDB raw data, Downloading PDB sequence data....\n", |
| 265 | + "Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n", |
| 266 | + "Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n", |
| 267 | + "Unzipping the file....\n", |
178 | 268 | "Unpacked and saved to data\\SCOPe\\pdb_sequences.txt\n", |
179 | 269 | "Removed temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n", |
180 | | - "Missing Scope: cla.txt raw data, Downloading...\n" |
181 | | - ] |
182 | | - }, |
183 | | - { |
184 | | - "name": "stderr", |
| 270 | + "Missing Scope: cla.txt raw data, Downloading...\n" |
| 271 | + ] |
| 272 | + }, |
| 273 | + { |
| 274 | + "name": "stderr", |
185 | 275 | "output_type": "stream", |
186 | 276 | "text": [ |
187 | 277 | "G:\\anaconda3\\envs\\env_chebai\\lib\\site-packages\\urllib3\\connectionpool.py:1099: InsecureRequestWarning: Unverified HTTPS request is being made to host 'scop.berkeley.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n", |
188 | 278 | "warnings.warn(\n" |
189 | | - ] |
190 | | - }, |
191 | | - { |
192 | | - "name": "stdout", |
| 279 | + ] |
| 280 | + }, |
| 281 | + { |
| 282 | + "name": "stdout", |
193 | 283 | "output_type": "stream", |
194 | 284 | "text": [ |
195 | 285 | "Missing Scope: hie.txt raw data, Downloading...\n", |
|
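The identifier breakdown and the chain-level labeling described in the added markdown cell can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this repository: `ScopeSid`, `parse_sid`, and `chain_labels` are hypothetical names, the label strings are placeholders rather than real SCOPe classifications, and the parser assumes the 7-character layout from the cell (`d` + 4-character PDB ID + chain character + domain character).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class ScopeSid:
    """Hypothetical container for the parts of a SCOPe domain identifier."""
    pdb_id: str            # e.g. "1a3n"
    chain: Optional[str]   # None when the chain character is "_" or "."
    multi_chain: bool      # True when the chain character is "."
    domain: Optional[str]  # None when the domain spans the entire chain ("_")


def parse_sid(sid: str) -> ScopeSid:
    """Split an identifier such as 'd1a3na_' into its components."""
    if len(sid) != 7 or sid[0] != "d":
        raise ValueError(f"Unexpected SCOPe domain identifier: {sid!r}")
    pdb_id, chain_char, domain_char = sid[1:5], sid[5], sid[6]
    return ScopeSid(
        pdb_id=pdb_id,
        chain=None if chain_char in "_." else chain_char,
        multi_chain=chain_char == ".",
        domain=None if domain_char == "_" else domain_char,
    )


def chain_labels(domain_labels: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Union the labels of all domains that belong to the same (PDB ID, chain)."""
    merged: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for sid, labels in domain_labels.items():
        parsed = parse_sid(sid)
        if parsed.chain is None:
            # Domains without a single chain cannot be mapped to one chain sequence.
            continue
        merged[(parsed.pdb_id, parsed.chain)] |= labels
    return dict(merged)


# Placeholder labels, chosen only to show the union; not real SCOPe classes.
example = {
    "d1pkna1": {"label_x"},
    "d1pkna2": {"label_y"},
    "d1pkna3": {"label_x", "label_z"},
}
print(parse_sid("d1a3na_"))
print(chain_labels(example))
```

Keying the merged labels by `(pdb_id, chain)` mirrors the idea that a chain's label set is the union over its domains. How multi-chain domains (chain character `.`) should be handled is left open here, so the sketch simply skips them.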