# How do the affiliations of researchers influence the diversity of engineering research topics?

## Data Preparation:
0. `main.py` <br />
• Running this file will automatically run all the other files in `Data Prep`. <br />
• Input the starting folder (the root folder of the dataset), the folder you want to extract the data to, and the folder where you want to store the imputed data.

2. `change_extension.py` <br />
• This file converts the given Scopus dataset into `.json` files.
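
A minimal sketch of what an extension-conversion step like this might look like; the function name and the assumption that conversion just means renaming each file to carry a `.json` extension are illustrative, not the project's actual implementation:

```python
import os

def change_extensions(src_dir: str) -> list[str]:
    """Give every file under src_dir (recursively) a .json extension."""
    renamed = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            base, ext = os.path.splitext(name)
            if ext != ".json":
                old_path = os.path.join(root, name)
                new_path = os.path.join(root, base + ".json")
                os.rename(old_path, new_path)  # rename in place
                renamed.append(new_path)
    return renamed
```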

3. `data_extraction.py` <br />
• This file loops through each year of the Scopus dataset and combines the data into a single file while removing unnecessary data.
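
A rough sketch of this loop-and-combine step, assuming one folder per year of `.json` records; the `KEEP_FIELDS` list and folder layout are assumptions, not the project's real schema:

```python
import json
import os

# Hypothetical fields to keep; the real script's field list may differ.
KEEP_FIELDS = ("title", "year", "affiliation")

def extract_records(root_dir):
    """Walk per-year folders, keep only the wanted fields, return one combined list."""
    combined = []
    for year in sorted(os.listdir(root_dir)):      # one sub-folder per year
        year_dir = os.path.join(root_dir, year)
        if not os.path.isdir(year_dir):
            continue
        for fname in os.listdir(year_dir):
            if not fname.endswith(".json"):
                continue
            with open(os.path.join(year_dir, fname)) as f:
                record = json.load(f)
            # drop unnecessary keys, keep only the fields of interest
            combined.append({k: record.get(k) for k in KEEP_FIELDS})
    return combined
```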

4. `impute_missing_value.py` <br />
• This file imputes any missing values in the dataset.
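
One common way to impute such gaps, shown here as a hedged sketch rather than the script's actual strategy: fill numeric columns with the column median and text columns with a placeholder.

```python
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median, text gaps with 'unknown'."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna("unknown")
    return df
```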

5. `remove_duplicates.py` <br />
• This file drops duplicate papers from the dataset.
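
With pandas this step can be a one-liner; using the paper title as the identity key is an assumption here, and the real script may deduplicate on a different column:

```python
import pandas as pd

def drop_duplicate_papers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row for each title, dropping later duplicates."""
    return df.drop_duplicates(subset=["title"], keep="first").reset_index(drop=True)
```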

## Web Scraping:
0. `main.py` <br />
• Use this file to run `web_scraping.py`.

2. `join_json.py` <br />
• This file is used to join the JSON files obtained from web scraping into one file. <br />
• You can then use the result of this step to impute the missing values using `impute_missing_value.py`.
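
A minimal sketch of this joining step, assuming each scraped file holds either a single record or a list of records; the function name and output format are illustrative:

```python
import glob
import json
import os

def join_json(in_dir, out_path):
    """Read every .json file in in_dir and write the combined records to out_path."""
    combined = []
    for path in sorted(glob.glob(os.path.join(in_dir, "*.json"))):
        with open(path) as f:
            data = json.load(f)
        # flatten: accept both a single record and a list of records
        combined.extend(data if isinstance(data, list) else [data])
    with open(out_path, "w") as f:
        json.dump(combined, f)
    return combined
```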

## Model Training:
1. `Model.ipynb` <br />
• This file trains the model on the data collected from Scopus and from web scraping. <br />
• We used `Latent Dirichlet Allocation (LDA)` and `K-Means`.
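
The LDA-then-K-Means pipeline can be sketched with scikit-learn as below; the abstracts, topic count, and cluster count are illustrative placeholders, not the notebook's actual settings:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_clusters(abstracts, n_topics=2, n_clusters=2, seed=0):
    """LDA gives each document a topic distribution; K-Means clusters in topic space."""
    counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(counts)          # shape: (n_docs, n_topics)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(doc_topics)               # one cluster id per document
```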

2. `combine_csv.py` <br />
• This file is used to combine all the CSV files into a single file.
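
A sketch of this combining step with pandas, assuming the CSV files share the same columns; the folder-scanning approach is an assumption about how the script finds its inputs:

```python
import glob
import os
import pandas as pd

def combine_csvs(in_dir):
    """Concatenate every CSV in in_dir into one DataFrame with a fresh index."""
    paths = sorted(glob.glob(os.path.join(in_dir, "*.csv")))
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```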

## Data Visualisation:
1. `data_visualisation.py` <br />
• This file visualises the data from `Data Prep` as well as the model's fitted data. <br />
• The data is visualised using Streamlit. <br />
• Three files are used here (`main_data.csv`, `cluster_data.csv`, `calculated_map_data.csv`).

2. `calculate_map.py` <br />
• This logic was originally part of `data_visualisation.py`, but it was too computationally heavy. <br />
• It was therefore split off, and its output is stored in a file that the visualisation loads instead.
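
The compute-once, load-later pattern described above can be sketched as follows; the function name and JSON cache format are assumptions for illustration only:

```python
import json
import os

def get_map_data(cache_path, compute):
    """Return cached data if it exists; otherwise compute it once and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)        # fast path: load the precomputed result
    data = compute()                   # expensive step, run only on a cache miss
    with open(cache_path, "w") as f:
        json.dump(data, f)
    return data
```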
