|
1 | | -# Data Science Project |
| 1 | +# How do the affiliations of researchers influence the diversity of engineering research topics? |
2 | 2 |
|
3 | | -## Data Prep: |
| 3 | +## Data Preparation: |
4 | 4 | 0. `main.py` <br /> |
5 | 5 | • Running this file will automatically run all the other files in `Data Prep`. <br /> |
6 | 6 | • Input the starting folder (the root folder of the dataset), the folder you want to extract the data to, and the folder you want to store the imputed data. |
7 | 7 |
|
8 | 8 | 2. `change_extension.py` <br /> |
9 | | -• This file turns the given Scorpus dataset into `.json` files. |
| 9 | +• This file turns the given Scopus dataset into `.json` files. |
10 | 10 |
|
11 | 11 | 3. `data_extraction.py` <br /> |
12 | | -• This file loops through each year of the Scorpus data set and combine it into 1 single file while removing unncessary data. |
| 12 | +• This file loops through each year of the Scopus dataset and combines the data into a single file while removing unnecessary data. |
13 | 13 |
|
14 | 14 | 4. `impute_missing_value.py` <br /> |
15 | 15 | • This file imputes any missing values in the dataset. |
16 | 16 |
|
| 17 | +5. `remove_duplicates.py` <br /> |
| 18 | +• This file drops duplicate papers from the file. |
| 19 | + |
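The duplicate-removal step can be sketched roughly as follows. This is only an illustration, not the contents of `remove_duplicates.py`: the function name `drop_duplicate_papers` and the `title` key are assumptions, and the real script may identify duplicates by a different field.

```python
def drop_duplicate_papers(papers, key="title"):
    """Return the papers with later duplicates removed (first occurrence wins).

    `papers` is a list of dicts; `key` is a hypothetical field used to
    decide whether two records describe the same paper.
    """
    seen = set()
    unique = []
    for paper in papers:
        value = paper.get(key)
        if value is None:
            # Without the key we cannot tell whether it is a duplicate, so keep it.
            unique.append(paper)
            continue
        # Normalise case and whitespace so trivial differences do not hide duplicates.
        marker = str(value).strip().lower()
        if marker in seen:
            continue
        seen.add(marker)
        unique.append(paper)
    return unique
```

Keeping the first occurrence preserves the original ordering of the dataset, which makes the result easier to compare against the input.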
17 | 20 | ## Web Scraping: |
18 | 21 | 0. `main.py` <br /> |
19 | 22 | • Use this file to run `web_scraping.py` |
|
26 | 29 | 2. `join_json.py` <br /> |
27 | 30 | • This file is used to join the json files obtained from web scraping into one file. <br /> |
28 | 31 | • You can then use the result of this function to impute the missing values using `impute_missing_value.py`. |
| 32 | + |
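Joining the scraped JSON files might look like the sketch below. It assumes each file holds either a JSON list of records or a single record; the function name `join_json_files` and its parameters are hypothetical and not taken from `join_json.py`.

```python
import json
from pathlib import Path


def join_json_files(input_dir, output_path):
    """Merge every .json file in input_dir into one JSON list at output_path.

    Returns the number of records written.
    """
    out = Path(output_path).resolve()
    merged = []
    for path in sorted(Path(input_dir).glob("*.json")):
        if path.resolve() == out:
            # Skip a merged file left over from a previous run.
            continue
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, list):
            merged.extend(data)   # file holds many records
        else:
            merged.append(data)   # file holds a single record
    with open(out, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)
```

The merged file could then serve as the input when imputing missing values, as described above.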
| 33 | +## Model Training: |
| 34 | +1. `Model.ipynb` <br /> |
| 35 | +• This file trains the model from the data collected from Scopus and from web scraping. <br /> |
| 36 | +• We used `Latent Dirichlet Allocation (LDA)` and `K Means`. |
| 37 | + |
| 38 | +2. `combine_csv.py` <br /> |
| 39 | +• This file is used for combining all the CSV files together into 1 file. |
| 40 | + |
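A minimal sketch of combining CSV files, assuming all inputs share the same header row. The name `combine_csv_files` and its parameters are illustrative only; `combine_csv.py` may work differently (for example with pandas).

```python
import csv


def combine_csv_files(input_paths, output_path):
    """Concatenate CSV files with identical headers into one file.

    Returns the number of data rows written (the header is not counted).
    """
    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to copy
                if header is None:
                    # The first non-empty file defines the header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Raising on a header mismatch is a deliberate choice in this sketch: silently stacking files with different columns would produce misaligned data.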
| 41 | +## Data Visualisation: |
| 42 | +1. `data_visualisation.py` <br /> |
| 43 | +• This file visualises the data from Data Prep as well as the model's fitted data. <br /> |
| 44 | +• The data is visualised using Streamlit. <br /> |
| 45 | +• There are 3 files which are used here (`main_data.csv`, `cluster_data.csv`, `calculated_map_data.csv`). |
| 46 | + |
| 47 | +2. `calculate_map.py` <br /> |
| 48 | +• This code was originally in `data_visualisation.py`, but it was too computationally heavy to run there. <br /> |
| 49 | +• It was therefore split off, and its output is stored in a file to be loaded instead. |
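The idea of computing the heavy result once and loading it from a file afterwards can be sketched as below. The helper `load_or_compute` and its arguments are hypothetical and are not part of `calculate_map.py`; the sketch only illustrates the compute-once, load-later pattern.

```python
import csv
import os


def load_or_compute(cache_path, compute_rows):
    """Return rows from cache_path if it exists, otherwise compute and store them.

    `compute_rows` is a callable returning a list of dicts with identical keys.
    Note that values read back from CSV are always strings.
    """
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = compute_rows()  # the expensive step, run only when no cache exists
    if rows:
        with open(cache_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

With this pattern the visualisation script only pays the file-reading cost at start-up, and the cache can be refreshed by deleting the file and rerunning the computation.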