Skip to content

Commit 96bc75a

Browse files
authored
Merge pull request #420 from ATOMScience-org/bug_remove_outlier_replicates
fix remove_outlier_replicates to dismiss NaN rows from response column. Also uploads code coverage reports as one large merged report at the end of the workflow. This more correctly reports coverage after direct and indirect changes.
2 parents de14203 + bed149a commit 96bc75a

File tree

4 files changed

+210
-64
lines changed

4 files changed

+210
-64
lines changed

.github/workflows/pytest.yml

Lines changed: 171 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -53,10 +53,34 @@ jobs:
5353
env:
5454
ENV: test
5555

56-
- name: Upload coverage reports to Codecov
57-
uses: codecov/codecov-action@v4
58-
env:
59-
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
56+
- name: Debug - Find all coverage files
57+
run: |
58+
echo "=== Current directory ==="
59+
pwd
60+
echo ""
61+
echo "=== Contents of current directory ==="
62+
ls -la
63+
echo ""
64+
echo "=== Searching for ALL .coverage files recursively ==="
65+
find . -name ".coverage*" -type f 2>/dev/null || echo "No .coverage files found"
66+
echo ""
67+
echo "=== Contents of test directory ==="
68+
ls -la atomsci/ddm/test/integrative/ || echo "Directory not found"
69+
echo ""
70+
echo "=== Checking if coverage was even run ==="
71+
which coverage
72+
coverage --version || echo "Coverage not installed"
73+
74+
- name: Save coverage
75+
uses: actions/upload-artifact@v4
76+
with:
77+
name: coverage-pytest-unit
78+
path: |
79+
atomsci/ddm/test/unit/.coverage*
80+
atomsci/modac/test/unit/.coverage*
81+
include-hidden-files: true
82+
if-no-files-found: error
83+
6084
pytest-integrative-1:
6185
runs-on: ubuntu-24.04
6286
strategy:
@@ -100,10 +124,32 @@ jobs:
100124
env:
101125
ENV: test
102126

103-
- name: Upload coverage reports to Codecov
104-
uses: codecov/codecov-action@v4
105-
env:
106-
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
127+
- name: Debug - Find all coverage files
128+
run: |
129+
echo "=== Current directory ==="
130+
pwd
131+
echo ""
132+
echo "=== Contents of current directory ==="
133+
ls -la
134+
echo ""
135+
echo "=== Searching for ALL .coverage files recursively ==="
136+
find . -name ".coverage*" -type f 2>/dev/null || echo "No .coverage files found"
137+
echo ""
138+
echo "=== Contents of test directory ==="
139+
ls -la atomsci/ddm/test/integrative/ || echo "Directory not found"
140+
echo ""
141+
echo "=== Checking if coverage was even run ==="
142+
which coverage
143+
coverage --version || echo "Coverage not installed"
144+
145+
- name: Save coverage
146+
uses: actions/upload-artifact@v4
147+
with:
148+
name: coverage-pytest-integrative-1
149+
path: atomsci/ddm/test/integrative/**/.coverage*
150+
include-hidden-files: true
151+
if-no-files-found: error
152+
107153
pytest-integrative-2:
108154
runs-on: ubuntu-24.04
109155
strategy:
@@ -147,10 +193,32 @@ jobs:
147193
env:
148194
ENV: test
149195

150-
- name: Upload coverage reports to Codecov
151-
uses: codecov/codecov-action@v4
152-
env:
153-
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
196+
- name: Debug - Find all coverage files
197+
run: |
198+
echo "=== Current directory ==="
199+
pwd
200+
echo ""
201+
echo "=== Contents of current directory ==="
202+
ls -la
203+
echo ""
204+
echo "=== Searching for ALL .coverage files recursively ==="
205+
find . -name ".coverage*" -type f 2>/dev/null || echo "No .coverage files found"
206+
echo ""
207+
echo "=== Contents of test directory ==="
208+
ls -la atomsci/ddm/test/integrative/ || echo "Directory not found"
209+
echo ""
210+
echo "=== Checking if coverage was even run ==="
211+
which coverage
212+
coverage --version || echo "Coverage not installed"
213+
214+
- name: Save coverage
215+
uses: actions/upload-artifact@v4
216+
with:
217+
name: coverage-pytest-integrative-2
218+
path: atomsci/ddm/test/integrative/**/.coverage*
219+
include-hidden-files: true
220+
if-no-files-found: error
221+
154222
pytest-integrative-3:
155223
runs-on: ubuntu-24.04
156224
strategy:
@@ -194,10 +262,32 @@ jobs:
194262
env:
195263
ENV: test
196264

197-
- name: Upload coverage reports to Codecov
198-
uses: codecov/codecov-action@v4
199-
env:
200-
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
265+
- name: Debug - Find all coverage files
266+
run: |
267+
echo "=== Current directory ==="
268+
pwd
269+
echo ""
270+
echo "=== Contents of current directory ==="
271+
ls -la
272+
echo ""
273+
echo "=== Searching for ALL .coverage files recursively ==="
274+
find . -name ".coverage*" -type f 2>/dev/null || echo "No .coverage files found"
275+
echo ""
276+
echo "=== Contents of test directory ==="
277+
ls -la atomsci/ddm/test/integrative/ || echo "Directory not found"
278+
echo ""
279+
echo "=== Checking if coverage was even run ==="
280+
which coverage
281+
coverage --version || echo "Coverage not installed"
282+
283+
- name: Save coverage
284+
uses: actions/upload-artifact@v4
285+
with:
286+
name: coverage-pytest-integrative-3
287+
path: atomsci/ddm/test/integrative/**/.coverage*
288+
include-hidden-files: true
289+
if-no-files-found: error
290+
201291
pytest-integrative-4:
202292
runs-on: ubuntu-24.04
203293
strategy:
@@ -241,7 +331,70 @@ jobs:
241331
env:
242332
ENV: test
243333

244-
- name: Upload coverage reports to Codecov
334+
- name: Debug - Find all coverage files
335+
run: |
336+
echo "=== Current directory ==="
337+
pwd
338+
echo ""
339+
echo "=== Contents of current directory ==="
340+
ls -la
341+
echo ""
342+
echo "=== Searching for ALL .coverage files recursively ==="
343+
find . -name ".coverage*" -type f 2>/dev/null || echo "No .coverage files found"
344+
echo ""
345+
echo "=== Contents of test directory ==="
346+
ls -la atomsci/ddm/test/integrative/ || echo "Directory not found"
347+
echo ""
348+
echo "=== Checking if coverage was even run ==="
349+
which coverage
350+
coverage --version || echo "Coverage not installed"
351+
352+
- name: Save coverage
353+
uses: actions/upload-artifact@v4
354+
with:
355+
name: coverage-pytest-integrative-4
356+
path: atomsci/ddm/test/integrative/**/.coverage*
357+
include-hidden-files: true
358+
if-no-files-found: error
359+
360+
coverage-merge:
361+
runs-on: ubuntu-24.04
362+
needs: [pytest-unit, pytest-integrative-1, pytest-integrative-2, pytest-integrative-3, pytest-integrative-4]
363+
steps:
364+
- uses: actions/checkout@v3
365+
- name: Set up Python
366+
uses: actions/setup-python@v4
367+
with:
368+
python-version: "3.9"
369+
370+
- name: Install coverage
371+
run: pip install coverage
372+
373+
- name: Download all coverage artifacts
374+
uses: actions/download-artifact@v5
375+
with:
376+
path: coverage-reports
377+
378+
- name: Merge coverage reports
379+
run: |
380+
# List directory structure for debugging
381+
ls -la
382+
ls -la coverage-reports/ || echo "coverage-reports directory not found"
383+
384+
# Find and combine all coverage files
385+
find coverage-reports -name ".coverage*" -type f
386+
387+
# Combine all coverage files
388+
coverage combine $(find coverage-reports -name ".coverage*" -type f -print)
389+
390+
# Generate XML report for codecov
391+
coverage xml
392+
393+
- name: Upload merged coverage to Codecov
245394
uses: codecov/codecov-action@v4
395+
with:
396+
files: ./coverage.xml
397+
flags: unittests
398+
name: codecov-umbrella
246399
env:
247-
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
400+
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

atomsci/ddm/test/integrative/curation_funcs/test_curation_funcs.py

Lines changed: 30 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -33,42 +33,48 @@ def write_to_file(filt_df, filt_file):
3333
print(f"Wrote outlier-filtered data to {filt_file}")
3434
return filt_df
3535

36-
def test_remove_outlier_replicates():
37-
"""Test outlier removal using curate_data.remove_outlier_replicates"""
36+
def create_raw_and_filt_file():
37+
"""Create filtered file for testing aggregation function"""
3838
raw_df = get_raw_data()
39-
print(f"Raw data has {len(raw_df)} rows, {len(set(raw_df.base_rdkit_smiles.values))} unique compounds")
4039
filt_df = curate_data.remove_outlier_replicates(raw_df, response_col='log_efflux_ratio', id_col='base_rdkit_smiles',
4140
max_diff_from_median=0.5)
41+
write_to_file(filt_df, filt_file)
42+
43+
return raw_df, filt_df
44+
45+
def test_remove_outlier_replicates(capsys):
46+
"""Test outlier removal using curate_data.remove_outlier_replicates"""
47+
# Clean up old files
48+
clean()
49+
50+
raw_df, filt_df = create_raw_and_filt_file()
51+
52+
captured = capsys.readouterr()
53+
assert 'Removed 1 rows with missing log_efflux_ratio values' in captured.out, "Error: expected message about removed rows with missing values"
4254
n_filt_rows = len(filt_df)
4355
n_filt_cmpds = len(set(filt_df.base_rdkit_smiles.values))
4456
print(f"Filtered data has {n_filt_rows} rows, {n_filt_cmpds} unique compounds")
4557
assert (n_filt_rows == 1093), "Error: expected 1093 rows in filtered data"
4658
assert (n_filt_cmpds == 803), "Error: expected 803 unique compounds in filtered data"
4759
n_removed = len(raw_df) - n_filt_rows
48-
assert (n_removed == 7), f"Error: {n_removed} rows were removed, expected 7"
49-
50-
write_to_file(filt_df, filt_file)
60+
assert (n_removed == 8), f"Error: {n_removed} rows were removed, expected 8"
5161

5262
def test_aggregate_assay_data():
5363
"""Test curate_data.aggregate_assay_data, the preferred function for averaging replicate values over compounds"""
54-
if not os.path.exists(filt_file):
55-
test_remove_outlier_replicates()
56-
else:
57-
try:
58-
filt_df = pd.read_csv(filt_file)
59-
agg_df = curate_data.aggregate_assay_data(filt_df, value_col='log_efflux_ratio', label_actives=False,
60-
id_col='compound_id', smiles_col='base_rdkit_smiles', relation_col='relation')
61-
n_agg_rows = len(agg_df)
62-
n_agg_cmpds = len(set(agg_df.base_rdkit_smiles.values))
63-
print(f"Aggregated data has {n_agg_rows} rows, {n_agg_cmpds} unique compounds")
64-
assert (n_agg_rows == 803), "Error: expected 803 rows in aggregated data"
65-
assert (n_agg_cmpds == 803), "Error: expected 803 unique compounds in aggregated data"
66-
67-
agg_file = f"{script_path}/{test_file_prefix}-aggregated.csv"
68-
agg_df.to_csv(agg_file, index=False)
69-
print(f"Wrote aggregated data to {agg_file}")
70-
except Exception as e:
71-
pytest.fail(f"Could not read file {filt_file}: {e}")
64+
# Clean up old files
65+
clean()
66+
raw_df, filt_df = create_raw_and_filt_file()
67+
agg_df = curate_data.aggregate_assay_data(filt_df, value_col='log_efflux_ratio', label_actives=False,
68+
id_col='compound_id', smiles_col='base_rdkit_smiles', relation_col='relation')
69+
n_agg_rows = len(agg_df)
70+
n_agg_cmpds = len(set(agg_df.base_rdkit_smiles.values))
71+
print(f"Aggregated data has {n_agg_rows} rows, {n_agg_cmpds} unique compounds")
72+
assert (n_agg_rows == 803), "Error: expected 803 rows in aggregated data"
73+
assert (n_agg_cmpds == 803), "Error: expected 803 unique compounds in aggregated data"
74+
75+
agg_file = f"{script_path}/{test_file_prefix}-aggregated.csv"
76+
agg_df.to_csv(agg_file, index=False)
77+
print(f"Wrote aggregated data to {agg_file}")
7278

7379
def test_average_and_remove_duplicates():
7480
"""Test outlier removal and averaging using deprecated curation function"""
@@ -90,23 +96,3 @@ def test_average_and_remove_duplicates():
9096
curated_df.to_csv(curated_file, index=False)
9197
print(f"Wrote curated data to {curated_file}")
9298

93-
94-
def test():
95-
"""Test data curation functions"""
96-
97-
# Clean up old files
98-
clean()
99-
100-
# Filter out outliers (preferred method)
101-
test_remove_outlier_replicates()
102-
103-
# Average replicate values per compound (preferred method)
104-
test_aggregate_assay_data()
105-
106-
# Remove outliers and average over replicates (old method)
107-
test_average_and_remove_duplicates()
108-
109-
110-
111-
if __name__ == '__main__':
112-
test()

atomsci/ddm/test/test_datasets/pGP_MDCK_efflux_ratio_chembl29.csv

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1099,3 +1099,4 @@ activity_id,compound_id,base_rdkit_smiles,relation,efflux_ratio,log_efflux_ratio
10991099
20668378,CHEMBL3892073,NC(=O)c1cc(-c2ccc(F)cc2)c2ccc(CN3C(=O)CCC3=O)cc2n1,=,7.0,0.8450980400142568
11001100
20679457,CHEMBL4644751,C[C@H](C(=O)Nc1ccc(F)cc1)C12CC(NC(=O)c3ccc(F)c(F)c3)(C1)C2,=,1.3,0.11394335230683678
11011101
20717398,CHEMBL4643105,COc1cc(/C=C2\CCCN3C2=NO[C@H](c2cc(F)cc(F)c2)[C@@H]3C)ccc1-n1cnc(C)c1,=,2.46,0.39093510710337914
1102+
20717398,CHEMBL4643105,COc1cc(/C=C2\CCCN3C2=NO[C@H](c2cc(F)cc(F)c2)[C@@H]3C)ccc1-n1cnc(C)c1,=,,

atomsci/ddm/utils/curate_data.py

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -519,10 +519,16 @@ def remove_outlier_replicates(df, response_col='pIC50', id_col='compound_id', ma
519519
max_diff_from_median (float): Maximum absolute difference from median value allowed for retained replicates.
520520
521521
Returns:
522-
result_df (DataFrame): Filtered data frame with outlier replicates removed.
522+
result_df (DataFrame): Filtered data frame with outlier replicates removed. Rows with NaN values in the
523+
response column are removed as a preprocessing step before outlier detection.
523524
524525
"""
525-
526+
527+
prev_len = len(df)
528+
df = df.dropna(subset=[response_col])
529+
if prev_len != len(df):
530+
print(f"Removed {prev_len - len(df)} rows with missing {response_col} values")
531+
526532
fr_df = freq_table(df, id_col, min_freq=2)
527533
rep_ids = fr_df[id_col].values.tolist()
528534
has_rep_df = df[df[id_col].isin(rep_ids)]

0 commit comments

Comments
 (0)