Gold standard dataset: some protein IDs exist only in the negative dataset #8

This happens due to the redundancy filtering. The issue is here: 
https://github.com/daisybio/data-leakage-ppi-prediction/blob/227ea4cbb1c50fca5fa322e15c526da57a77bf2d/create_gold_standard.py#L162-L175

If a protein interacts only with redundant proteins in the positive dataset but with non-redundant proteins in the negative dataset, all of its positive edges are filtered out while its negative edges survive, so it ends up with negative edges only (example: Q9NR71).
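
A minimal sketch of how this can happen, and of a check that would catch it. Apart from Q9NR71, the protein names are made-up placeholders, and `filter_pairs` and `negative_only_proteins` are hypothetical helpers, not functions from the repository:

```python
def filter_pairs(pairs, redundant):
    """Keep only pairs in which neither protein is flagged as redundant."""
    return {(a, b) for a, b in pairs if a not in redundant and b not in redundant}


def proteins_in(pairs):
    """Return the set of all protein IDs that occur in any pair."""
    return {protein for pair in pairs for protein in pair}


def negative_only_proteins(pos_pairs, neg_pairs):
    """Proteins that occur in the negatives but in no positive pair."""
    return proteins_in(neg_pairs) - proteins_in(pos_pairs)


# Toy data: the only positive partner of Q9NR71 is redundant,
# but its negative partner is not.
positives = {("Q9NR71", "REDUNDANT_1"), ("KEPT_1", "KEPT_2")}
negatives = {("Q9NR71", "KEPT_1"), ("KEPT_2", "REDUNDANT_1")}
redundant = {"REDUNDANT_1"}

pos_filtered = filter_pairs(positives, redundant)
neg_filtered = filter_pairs(negatives, redundant)

orphans = negative_only_proteins(pos_filtered, neg_filtered)
print(orphans)  # {'Q9NR71'}
```

Running `negative_only_proteins` on the final gold standard files would be a cheap sanity check: the result should be empty.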

Solution: 
Why did I even sample the negative dataset before the redundancy reduction? Turn the steps around: 
1. Make the partitioning
2. Kick out redundant proteins within and between the blocks
3. Sample the negatives by expected degree sampling
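
The reordered steps could look roughly like the sketch below. It assumes that the partitioning has already produced the positive pairs of one block, and it uses a simple degree-weighted sampler as a stand-in for the expected degree sampling; the actual sampling code in the repository may differ. All function names and the toy data are hypothetical:

```python
import random


def filter_redundant(pos_pairs, redundant):
    """Step 2: drop every positive pair that touches a redundant protein."""
    return {(a, b) for a, b in pos_pairs if a not in redundant and b not in redundant}


def sample_negatives(pos_pairs, n_neg, seed=0, max_tries=100_000):
    """Step 3: draw negative pairs, picking each protein with a probability
    proportional to its degree in the (already filtered) positive set."""
    rng = random.Random(seed)
    degree = {}
    for a, b in pos_pairs:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    proteins = sorted(degree)
    weights = [degree[p] for p in proteins]
    known_positives = {frozenset(pair) for pair in pos_pairs}

    negatives = set()
    tries = 0
    while len(negatives) < n_neg and tries < max_tries:
        tries += 1
        a, b = rng.choices(proteins, weights=weights, k=2)
        if a == b or frozenset((a, b)) in known_positives:
            continue  # skip self-pairs and known interactions
        negatives.add(tuple(sorted((a, b))))
    return negatives


# Step 1 (partitioning) is assumed to be done; these are one block's positives.
block_pos = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A"), ("A", "R")}
redundant_proteins = {"R"}

block_pos_filtered = filter_redundant(block_pos, redundant_proteins)
block_neg = sample_negatives(block_pos_filtered, n_neg=3)
print(sorted(block_neg))
```

Because the negatives are drawn only from proteins that still have at least one positive edge after filtering, no protein can end up in the negative set alone.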

	# Positives: keep a pair only if neither protein is redundant or in intra_sims
	with open(f'Datasets_PPIs/Hippiev2.3/Intra{block}_pos.txt', 'r') as f:
	    for line in f:
	        pos_interactions += 1
	        prot_a, prot_b = line.strip().split(' ')
	        if prot_a not in redundant_proteins and prot_b not in redundant_proteins and prot_a not in intra_sims and prot_b not in intra_sims:
	            block_pos.add((prot_a, prot_b))
	print(f'Positives: {len(block_pos)} / {pos_interactions} remained! Filtered {pos_interactions - len(block_pos)} PPIs ...')
	neg_interactions = 0
	# Negatives: the same filter is applied independently of the positives,
	# so a protein can lose all positive edges and still keep negative ones
	with open(f'Datasets_PPIs/Hippiev2.3/Intra{block}_neg.txt', 'r') as f:
	    for line in f:
	        neg_interactions += 1
	        prot_a, prot_b = line.strip().split(' ')
	        if prot_a not in redundant_proteins and prot_b not in redundant_proteins and prot_a not in intra_sims and prot_b not in intra_sims:
	            block_neg.add((prot_a, prot_b))
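
Since the two loops above are identical except for the file name, they could be folded into one helper. The following is only a sketch: `load_filtered_pairs` is a hypothetical name, the `excluded` argument is assumed to be the union of `redundant_proteins` and `intra_sims`, and the toy file stands in for the `Intra{block}_pos.txt` / `Intra{block}_neg.txt` files:

```python
import os
import tempfile


def load_filtered_pairs(path, excluded):
    """Read space-separated protein pairs and keep those in which
    neither protein is in `excluded`. Returns (kept_pairs, total_read)."""
    total = 0
    kept = set()
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            total += 1
            prot_a, prot_b = line.split(" ")
            if prot_a not in excluded and prot_b not in excluded:
                kept.add((prot_a, prot_b))
    return kept, total


# Toy input file with three pairs, one of which touches an excluded protein.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_pos.txt")
    with open(toy_path, "w") as f:
        f.write("A B\nA R\nC D\n")
    excluded = {"R"}  # assumed: set(redundant_proteins) | set(intra_sims)
    kept, total = load_filtered_pairs(toy_path, excluded)

print(f"Positives: {len(kept)} / {total} remained! Filtered {total - len(kept)} PPIs ...")
```

With the reordered steps, this helper would only be needed for the positive files, because the negatives would be sampled afterwards from the filtered positives.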
