jeewonkimm2/Prediction_of_integration_and_abolition_for_university

Business Analytics Term Project

Overview

  • This is a term project for the year 3 module, Business Analytics. The assignment was to find a problem in a business domain that can be solved with data mining techniques.

  • Topic : Prediction of integration and abolition of university departments.

  • Goal : To predict which departments might be integrated or abolished, using the competition rate of each department at each school.

1. Background

  • The school-age population and the number of students eligible for admission are both decreasing.

image

  • Integration and abolition of university departments are increasing in South Korea.

image

As the number of students has decreased, universities have been restructured intensively.

  • Expected Effect
    • For students : The predictions can be taken into account when choosing a university.
    • For universities : With the predictions, universities can see where they stand and prepare for the changes ahead.
  1. Filtered government sponsorship university list
  • This dataset has 3 columns: Year, University, and Gov_Sponsorship.
  • Our team manually collected separate datasets from the Korea Development Institute and merged them into one file.
    • Year : The year of evaluation
    • University : The name of university
    • Gov_Sponsorship
      • 1 : Government sponsorship
      • 0 : No government sponsorship
  1. Progress dataset
  • Year : The admission year
  • Region : The region of each school
  • Departmental Affiliation : The major classification of departmental affiliations
  • The number to be admitted : The number of students each department plans to admit
  • Candidate : The number of candidates (applicants)
  • The number of admissions : The number of students who were admitted
  • Government financial support
    • 1 : Receives government financial support
    • 0 : Does not receive government financial support
  • School & Department : The school name and department name merged into one field
  • Competition rate : The number of applicants divided by the number to be admitted
  • Total fertility rate : The fertility rate of each region
  • Net mover : The net number of movers in each region
  • Immigration : The number of immigrants
  1. Future dataset
  • File Name : future.csv
  • The number to be admitted : The number of students each department plans to admit
  • Candidate : The number of candidates (applicants)
  • Government financial support
    • 1 : Receives government financial support
    • 0 : Does not receive government financial support
  • Awareness : Students' awareness of each departmental affiliation
  • Total fertility rate : The fertility rate of each region
  • Net mover : The net number of movers in each region
  • Immigration : The number of immigrants
  • Schoolage : The school-age population of each year

Q. How are the future features obtained?

'Schoolage' column : Taken from a Statistics Korea dataset.
'The number to be admitted', 'Government financial support', 'Awareness' columns : Kept at their 2021 values, so that the change in competition rate is measured under the same conditions.
'Total fertility rate', 'Net mover', 'Immigration' columns : Take the last 3 years of data (2019 to 2021), average the two year-over-year differences, and subtract that average from the most recent value.
  • File Name : future features 2022/2023/2024/2025/2026.ipynb

    • Total fertility rate
    fertility_rate_dif = (float(fertility_rate_2019['Total_fertility_rate']) - float(fertility_rate_2020['Total_fertility_rate'])) + (float(fertility_rate_2020['Total_fertility_rate']) - float(fertility_rate_2021['Total_fertility_rate']))
    avg_ft.append(fertility_rate_dif / 2)
    • Net mover
    net_mover_dif = (float(net_mover_2019['Net_Mover']) - float(net_mover_2020['Net_Mover'])) + (float(net_mover_2020['Net_Mover']) - float(net_mover_2021['Net_Mover']))
    avg_net.append(net_mover_dif / 2) 
    • Immigration
    immigration_dif = (float(immigration_2019['immigration']) - float(immigration_2020['immigration'])) + (float(immigration_2020['immigration']) - float(immigration_2021['immigration']))
    avg_imm.append(immigration_dif / 2)
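The three snippets above share one pattern, so it can be sketched once as a pair of helpers. This is a minimal illustration and not the notebooks' code: the function names and the sample values are hypothetical. Note that the two differences telescope, so the average decrease is simply (value in 2019 - value in 2021) / 2.

```python
def average_yearly_decrease(v_first, v_middle, v_last):
    """Mean of the two year-over-year decreases across three consecutive years."""
    return ((v_first - v_middle) + (v_middle - v_last)) / 2


def extrapolate_next(v_first, v_middle, v_last):
    """Estimate the next year's value by subtracting the average decrease from the last value."""
    return v_last - average_yearly_decrease(v_first, v_middle, v_last)


# Hypothetical fertility-rate values for one region in 2019, 2020, 2021
# (made up for illustration, not taken from the dataset).
avg_decrease = average_yearly_decrease(1.00, 0.90, 0.80)
forecast_2022 = extrapolate_next(1.00, 0.90, 0.80)
```

A value that has been rising gives a negative average decrease, so the same subtraction then pushes the forecast upward.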
  1. Final dataset

3717 * 7(years) = 26019 Instances

  • Year : The admission year
  • Region : The region of each school
  • Departmental Affiliation (major series) : The major classification of departmental affiliations
  • Departmental Affiliation (middle series) : The middle classification of departmental affiliations
  • The number to be admitted : The number of students each department plans to admit
  • Candidate : The number of candidates (applicants)
  • The number of admissions : The number of students who were admitted
  • Government financial support
    • 1 : Receives government financial support
    • 0 : Does not receive government financial support
  • Awareness : Students' awareness of each departmental affiliation, by year
  • School & Department : The school name and department name merged into one field
  • Competition rate : The number of applicants divided by the number to be admitted
  • Total fertility rate : The fertility rate of each region
  • Net mover : The net number of movers in each region
  • Immigration : The number of immigrants
  • Schoolage : The school-age population of each year

Preprocessing

  • File Name : future features 2022.ipynb

    • Using only the school-department names ('학교_학과명') that appear in every year from 2015 to 2021
    b = a_2015.intersection(a_2016)
    b = b.intersection(a_2017)
    b = b.intersection(a_2018)
    b = b.intersection(a_2019)
    b = b.intersection(a_2020)
    b = b.intersection(a_2021)
    bl = list(b)
    
    main_check = main.loc[main['학교_학과명'].isin(bl)]
    • Handling duplicated data: keep the first row for each school-department name within each year
    main_2015 = main_2015.drop_duplicates(['학교_학과명'], keep='first')
    main_2016 = main_2016.drop_duplicates(['학교_학과명'], keep='first')
    main_2017 = main_2017.drop_duplicates(['학교_학과명'], keep='first')
    main_2018 = main_2018.drop_duplicates(['학교_학과명'], keep='first')
    main_2019 = main_2019.drop_duplicates(['학교_학과명'], keep='first')
    main_2020 = main_2020.drop_duplicates(['학교_학과명'], keep='first')
    main_2021 = main_2021.drop_duplicates(['학교_학과명'], keep='first')
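The two preprocessing steps above can also be written without one line per year. The following is a sketch under assumptions: the tiny `main` table is made up, only the column names follow the README, and the order of the two steps is a guess, since the notebook's order is not shown here.

```python
import pandas as pd

KEY = "학교_학과명"  # school & department name column, as in the snippets above
YEAR = "연도"        # year column

# Made-up stand-in for `main`; values are illustrative only.
main = pd.DataFrame({
    YEAR: [2015, 2015, 2015, 2016, 2016, 2016],
    KEY: ["A_math", "A_math", "B_art", "A_math", "C_law", "C_law"],
    "지원자_전체_계": [10, 12, 5, 11, 7, 8],
})

# 1) Collect the set of keys seen in each year, then intersect across all years.
keys_per_year = [set(group[KEY]) for _, group in main.groupby(YEAR)]
common_keys = set.intersection(*keys_per_year)

# 2) Keep only rows whose key appears in every year.
main_check = main.loc[main[KEY].isin(common_keys)]

# 3) Within each year, keep the first row per key
#    (equivalent to keep='first' applied to each yearly frame).
main_clean = (
    main_check.drop_duplicates(subset=[YEAR, KEY], keep="first")
    .reset_index(drop=True)
)
```

Deduplicating on the pair (year, key) matters here: dropping duplicates on the key alone over the combined table would keep only one year per department.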
  1. Model training and predicting result
  • File Name : main_training.ipynb
  • Independent variables(X) : '연도' (year), '시도' (region), '대계열' (major series), '중계열' (middle series), '모집인원_학부_계' (number to be admitted), 'Gov_Spnsorship', 'awareness', 'Total_fertility_rate', 'Net_Mover', 'immigration', 'schoolage'
  • Target variable(y) : '지원자_전체_계' (total number of applicants)
  • Model Selection : GridSearchCV with KFold (n_splits=5), and train_test_split to create train, validation, and test sets
  • Preprocessing with ColumnTransformer
    • Numeric features : '모집인원_학부_계','Total_fertility_rate','Net_Mover','immigration','schoolage'
      • We used PolynomialFeatures with degree 2 and StandardScaler.
    • Categorical features : '연도','시도','대계열','중계열','Gov_Spnsorship','awareness'
      • We used OneHotEncoder.
  • Pipeline combining the preprocessor with a final estimator step named 'fit'
    1. Linear regression
      • r2 score : 0.622863334546101
    2. Lasso regression
      • r2 score : 0.6232647986187436
    3. Ridge regression
      • r2 score : 0.6228460532502517
    4. GradientBoostingRegressor
      • r2 score : 0.6961511258186199
    5. RandomForestRegressor
      • r2 score : 0.6447984110664
    6. Support vector regression
      • r2 score : 0.6116517939646415
    7. Kernel ridge regression
      • r2 score : 0.6363528761071908
  • Conclusion from model training
    • GradientBoostingRegressor performed the best on our dataset with hyperparameters {'fit__loss': 'ls', 'fit__max_depth': 7, 'fit__max_features': 'sqrt', 'fit__min_samples_split': 6}
  • Applying the best model to the test dataset
    • r2 score : 0.9192784570251001
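The training setup described above (ColumnTransformer preprocessing, a pipeline whose last step is named `fit`, and GridSearchCV with 5-fold KFold) might look roughly like the sketch below. It runs on synthetic data with shortened, hypothetical column names, uses a much smaller grid than a real search would, and is not the notebook's code. The `'ls'` loss reported above is called `'squared_error'` in newer scikit-learn releases, so the sketch leaves the loss at its default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in data; column names and the target formula are hypothetical.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "quota": rng.integers(20, 200, n),
    "fertility": rng.normal(1.0, 0.1, n),
    "region": rng.choice(["Seoul", "Incheon", "Jeju"], n),
    "gov_support": rng.integers(0, 2, n),
})
y = 5 * X["quota"] + 50 * X["gov_support"] + rng.normal(0, 20, n)

numeric = ["quota", "fertility"]
categorical = ["region", "gov_support"]

# Numeric columns: degree-2 polynomial expansion, then standardization.
# Categorical columns: one-hot encoding.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("poly", PolynomialFeatures(degree=2)),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# The final step is named "fit", so grid keys look like 'fit__max_depth'.
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("fit", GradientBoostingRegressor(random_state=0)),
])

param_grid = {
    "fit__max_depth": [3, 7],
    "fit__max_features": ["sqrt"],
    "fit__min_samples_split": [2, 6],
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2")
search.fit(X_train, y_train)

cv_r2 = search.best_score_              # mean cross-validated R^2 of the best setting
test_r2 = search.score(X_test, y_test)  # R^2 of the refit best model on held-out data
```

Swapping the `fit` step for another regressor, with a matching `param_grid`, reuses the same preprocessing, which is one way to compare several models on equal terms.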
  1. Predicting future admission
  • File Name : main_training.ipynb
  • Phase
    • Imported the future dataset to supply the future independent variables.
    • Applied the best model and hyperparameters obtained in the previous step.
    • Converted the predicted target variable ('지원자_전체_계') to integers, because it counts people.
    • Obtained 'competition_rate', our final goal, by dividing '지원자_전체_계' by '모집인원_학부_계'.
    • Extracted the data into excel for visualization.
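The last two computation steps can be illustrated with a small helper. This is a sketch with hypothetical names and made-up numbers. It rounds the prediction to the nearest integer, which is an assumption: the notebook may truncate instead. The zero-quota guard is also an addition for safety.

```python
def competition_rate(predicted_applicants, quota):
    """Applicants per available seat, using a whole-number applicant count."""
    applicants = int(round(predicted_applicants))  # people are counted in whole numbers
    if quota <= 0:
        return None  # the rate is undefined when there are no seats
    return applicants / quota


# (department, predicted applicants, number to be admitted); illustrative values only.
rows = [("Dept A", 412.6, 100), ("Dept B", 35.2, 50)]
rates = {name: competition_rate(pred, quota) for name, pred, quota in rows}
```

A rate below 1 means fewer predicted applicants than seats, which is the kind of signal the project uses to flag departments at risk.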

conclusion_image

Average competitiveness by region in 2026

The larger the size of the circle on the map, the higher the mean. You can check the area and figures according to the color on the right side of the map.

conclusion_image

Additional analysis of Jeju with the lowest average competitiveness and Incheon with the highest average competitiveness

We analyzed in detail Incheon, which has the highest average, and Jeju, which has the lowest. While 65% of all universities on Jeju Island belong to the bottom 20% group, no universities in Incheon belong to that group. In addition, 74.2% of universities in Incheon are in the top 40% group. We think Jeju's geographical disadvantage is the cause.

conclusion_image

The list of departments with poor competitiveness and the number of times selected for each year

The departments on this list are in danger of being eliminated due to a lack of competitiveness. In particular, the special physical education department at Gyeongju University has a very high risk of disappearing, so we recommend against applying there.
