Gold standard dataset: some protein IDs exist only in the negative dataset #8

@JudithBernett

This happens due to the redundancy filtering. The issue is here:

with open(f'Datasets_PPIs/Hippiev2.3/Intra{block}_pos.txt', 'r') as f:
    for line in f:
        pos_interactions += 1
        prot_a, prot_b = line.strip().split(' ')
        if prot_a not in redundant_proteins and prot_b not in redundant_proteins and prot_a not in intra_sims and prot_b not in intra_sims:
            block_pos.add((prot_a, prot_b))
print(f'Positives: {len(block_pos)} / {pos_interactions} remained! Filtered {pos_interactions - len(block_pos)} PPIs ...')
neg_interactions = 0
with open(f'Datasets_PPIs/Hippiev2.3/Intra{block}_neg.txt', 'r') as f:
    for line in f:
        neg_interactions += 1
        prot_a, prot_b = line.strip().split(' ')
        if prot_a not in redundant_proteins and prot_b not in redundant_proteins and prot_a not in intra_sims and prot_b not in intra_sims:
            block_neg.add((prot_a, prot_b))

If a protein interacts only with redundant proteins in the positive dataset but with non-redundant proteins in the negative dataset, all of its positive edges are filtered out while its negative edges survive, so it ends up with negative edges only (example: Q9NR71).
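The effect can be reproduced on toy data. The sketch below is only an illustration: the protein IDs (`P1`, `R1`, ...) and the helper names `filter_pairs` and `negative_only_proteins` are made up, and the `intra_sims` check is left out for brevity.

```python
def filter_pairs(pairs, redundant):
    """Keep only pairs in which neither protein is redundant."""
    return {(a, b) for a, b in pairs if a not in redundant and b not in redundant}


def negative_only_proteins(block_pos, block_neg):
    """Return proteins that appear in negative pairs but in no positive pair."""
    pos_proteins = {p for pair in block_pos for p in pair}
    neg_proteins = {p for pair in block_neg for p in pair}
    return neg_proteins - pos_proteins


# Toy data (hypothetical IDs): P1's only positive partner is the redundant R1.
redundant_proteins = {"R1"}
pos = {("P1", "R1"), ("P2", "P3")}
neg = {("P1", "P2"), ("P3", "R1")}

block_pos = filter_pairs(pos, redundant_proteins)
block_neg = filter_pairs(neg, redundant_proteins)

# P1 loses its only positive edge but keeps a negative edge.
leaked = negative_only_proteins(block_pos, block_neg)
print(leaked)
```

Running the same check on the real `block_pos` and `block_neg` sets should list every affected protein, not just the example above.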

Solution:
Why did I even sample the negative dataset before the redundancy reduction? Turn the steps around:

  1. Make the partitioning
  2. Kick out redundant proteins within and between the blocks
  3. Sample the negatives by expected degree sampling
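A minimal sketch of steps 2 and 3 in the new order, assuming the partitioning from step 1 has already produced the positive pairs of one block. The function names, the toy IDs, and the sampling scheme are assumptions: negatives are drawn with probability proportional to each protein's degree in the filtered positives, which is a simple stand-in for expected degree sampling and not necessarily the procedure used in this repository.

```python
import random
from collections import Counter


def filter_positives(pos_pairs, redundant):
    """Step 2: drop every positive pair that touches a redundant protein."""
    return {(a, b) for a, b in pos_pairs if a not in redundant and b not in redundant}


def sample_negatives(block_pos, n_neg, seed=0, max_tries=100_000):
    """Step 3: draw negatives only from proteins that survived the filtering.

    Proteins are picked with probability proportional to their positive degree
    (a simplified, assumed stand-in for expected degree sampling).
    """
    rng = random.Random(seed)
    degree = Counter(p for pair in block_pos for p in pair)
    proteins = sorted(degree)
    weights = [degree[p] for p in proteins]
    known = {frozenset(pair) for pair in block_pos}

    negatives = set()
    tries = 0
    while len(negatives) < n_neg and tries < max_tries:
        tries += 1
        a, b = rng.choices(proteins, weights=weights, k=2)
        # Skip self-pairs and pairs that are known positives.
        if a == b or frozenset((a, b)) in known:
            continue
        negatives.add(tuple(sorted((a, b))))
    return negatives


# Toy data (hypothetical IDs): R1 is redundant.
pos = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "R1")}
block_pos = filter_positives(pos, {"R1"})
block_neg = sample_negatives(block_pos, n_neg=2)

pos_proteins = {p for pair in block_pos for p in pair}
neg_proteins = {p for pair in block_neg for p in pair}
print(neg_proteins <= pos_proteins)
```

Because the negatives are sampled from the proteins left in the filtered positive set, a protein cannot end up with negative edges only.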
