Commit f3f28d7

Add scientific data sources and methodology references
- Reference to Arning et al. (2021) pubMLST study
- Dataset characteristics and source distribution
- Link to public data repository
- Scientific foundation for training methodology
1 parent b4e991d commit f3f28d7

1 file changed: +12 -0 lines changed

README.md

Lines changed: 12 additions & 0 deletions
@@ -116,6 +116,18 @@ The current model (`modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl`) features:

 The repository includes `data/sample_data.csv` with 10 test samples for immediate experimentation.

+## Data Sources
+
+The training approach is based on the pubMLST database methodology described in:
+
+> Arning N, Sheppard SK, Bayliss S, Clifton DA, Wilson DJ (2021) **Machine learning to predict the source of campylobacteriosis using whole genome data.** *PLoS Genet* 17(10): e1009436. https://doi.org/10.1371/journal.pgen.1009436
+
+**Dataset characteristics:**
+- 5,799 C. jejuni and C. coli genomes from pubMLST
+- Sources: chicken (4,147), cattle (716), sheep (584), wild bird (212), environment (140)
+- cgMLST approach with 1,343 core genes for enhanced accuracy
+- Public data available at: https://pubmlst.org/bigsdb?db=pubmlst_campylobacter_isolates&page=query&project_list=102&submit=1
+
 ## Credits

 **Developer**: Laura Di Egidio (Master's thesis project)
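
The source counts added in this commit are heavily skewed toward chicken, which matters for any classifier trained on them. The sketch below checks that the five counts sum to the stated 5,799 genomes and derives per-source shares. It also computes inverse-frequency weights of the form `n_samples / (n_classes * class_count)`. The names `SOURCE_COUNTS`, `class_shares`, and `balanced_weights` are hypothetical, and the weighting scheme is an assumption for illustration: nothing in this commit says the repository's model applies class weights.

```python
# Source counts copied from the "Dataset characteristics" list in the diff.
SOURCE_COUNTS = {
    "chicken": 4147,
    "cattle": 716,
    "sheep": 584,
    "wild bird": 212,
    "environment": 140,
}

# Total number of genomes stated in the README text.
STATED_TOTAL = 5799


def class_shares(counts):
    """Return each source's fraction of the whole dataset."""
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Hypothetical helper; whether the actual training code weights
    classes at all is not stated in the commit.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {source: total / (n_classes * n) for source, n in counts.items()}


if __name__ == "__main__":
    total = sum(SOURCE_COUNTS.values())
    print(f"sum of source counts: {total} (stated: {STATED_TOTAL})")

    shares = class_shares(SOURCE_COUNTS)
    weights = balanced_weights(SOURCE_COUNTS)
    for source in SOURCE_COUNTS:
        print(f"{source:12s} share={shares[source]:.1%} weight={weights[source]:.2f}")
```

With these weights, rare sources such as environment and wild bird get proportionally more influence per sample than chicken. That is one common way to offset imbalance, not necessarily the approach used here.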
