Commit 4777e05

update max pos embedding for special tokens
1 parent a0e266e commit 4777e05

4 files changed: 10 additions & 6 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -170,3 +170,4 @@ electra_pretrained.ckpt
 .jupyter
 .virtual_documents
 .isort.cfg
+.vscode

chebai_proteins/preprocessing/datasets/deepGO/go_uniprot.py

Lines changed: 3 additions & 3 deletions
@@ -105,9 +105,9 @@ class _GOUniProtDataExtractor(_DynamicDataset, ABC):
     # TODO: should we be really allowing all branches for single dataset?
     _ALL_GO_BRANCHES: str = "all"
     _GO_BRANCH_NAMESPACE: Dict[str, str] = {
-        "BP": "biological_process",
-        "MF": "molecular_function",
-        "CC": "cellular_component",
+        "BP": "biological_process",  # Huge branch, with 20,000+ GO terms
+        "MF": "molecular_function",  # smaller branch, with 6000+ GO terms
+        "CC": "cellular_component",  # smallest branch, with 2,000+ GO terms
     }
 
     def __init__(self, go_branch: str, max_sequence_len: int = 1002, **kwargs):
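
Note: the hunk above only adds comments, so the mapping's behavior is unchanged. As a minimal sketch of how such a mapping might be used to resolve a `go_branch` argument, assuming `"all"` selects every namespace: `namespaces_for_branch` is a hypothetical helper written for illustration and is not shown anywhere in this commit.

```python
from typing import Dict, List

# Copied from the diff above; the constants live on _GOUniProtDataExtractor.
_ALL_GO_BRANCHES: str = "all"
_GO_BRANCH_NAMESPACE: Dict[str, str] = {
    "BP": "biological_process",
    "MF": "molecular_function",
    "CC": "cellular_component",
}


def namespaces_for_branch(go_branch: str) -> List[str]:
    """Hypothetical helper: map a branch key to the GO namespaces it covers."""
    if go_branch == _ALL_GO_BRANCHES:
        # Assumption: "all" means every namespace in the mapping.
        return list(_GO_BRANCH_NAMESPACE.values())
    if go_branch not in _GO_BRANCH_NAMESPACE:
        valid = sorted(_GO_BRANCH_NAMESPACE) + [_ALL_GO_BRANCHES]
        raise ValueError(f"Unknown GO branch {go_branch!r}; expected one of {valid}")
    return [_GO_BRANCH_NAMESPACE[go_branch]]
```

The real class may validate `go_branch` differently; the sketch only illustrates the lookup implied by the two constants.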

configs/model/electra.yml

Lines changed: 5 additions & 2 deletions
@@ -3,8 +3,11 @@ init_args:
   optimizer_kwargs:
     lr: 1e-3
   config:
-    vocab_size: 31
-    max_position_embeddings: 1002
+    vocab_size: 31 # 21 unique + embedding offset (10)
+    # For classification:[Maximum sequence length (1002) (padding will be also upto 1002)] + 1 for CLS token
+    # For pretraining: [Maximum sequence length (1002) (padding will be also upto 1002)] + 10 embedding offset (includes all special tokens)
+    # Hence, use max of (classification, pretraining): max_position_embeddings = 1002 + 10 = 1012
+    max_position_embeddings: 1012
     num_attention_heads: 8
     num_hidden_layers: 6
     type_vocab_size: 1
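
Note: the arithmetic in the added YAML comments can be checked with a short sketch. The constant and function names below are hypothetical and chosen for illustration. The numbers (1002, 10, 21, and one CLS token) are taken directly from the comments in this hunk.

```python
# Values taken from the comments in configs/model/electra.yml; names are illustrative.
MAX_SEQUENCE_LEN = 1002   # maximum sequence length (sequences are padded up to this)
EMBEDDING_OFFSET = 10     # embedding offset, which includes all special tokens
NUM_UNIQUE_TOKENS = 21    # unique tokens, per the vocab_size comment
CLS_TOKENS = 1            # one CLS token for classification


def required_position_embeddings() -> int:
    """Return the larger of the classification and pretraining requirements."""
    classification = MAX_SEQUENCE_LEN + CLS_TOKENS      # 1002 + 1
    pretraining = MAX_SEQUENCE_LEN + EMBEDDING_OFFSET   # 1002 + 10
    return max(classification, pretraining)


def required_vocab_size() -> int:
    """Unique tokens plus the embedding offset."""
    return NUM_UNIQUE_TOKENS + EMBEDDING_OFFSET


print(required_position_embeddings())  # 1012
print(required_vocab_size())           # 31
```

Taking the maximum lets a single config value serve both the classification and pretraining setups, which matches the reasoning stated in the comments.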

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 name = "chebai-proteins"
 version = "0.0.2"
 description = "Repository for protein prediction and classification, built on top of the python-chebai codebase"
-authors = [{name="", email=""}]
+authors = []
 readme = "README.md"
 license = { text = "AGPL-3.0" }
 requires-python = ">=3.9, <3.13"
