Skip to content

Latest commit

 

History

History
349 lines (275 loc) · 8.95 KB

File metadata and controls

349 lines (275 loc) · 8.95 KB

Regression and Other Stories: Earnings data

Andrew Gelman, Jennifer Hill, Aki Vehtari 2021-04-20

Tidyverse version by Bill Behrman.

Read in and prepare earnings data. See Appendix A in Regression and Other Stories.

Source: Ross, Catherine E. Work, Family, and Well-Being in the United States, 1990. 1996-06-10. https://doi.org/10.3886/ICPSR06666.v1


# Packages
library(tidyverse)

# Parameters
  # Earnings data
file_earnings <- here::here("Earnings/data/wfw90.dat")
  # Common code
file_common <- here::here("_common.R")

#===============================================================================

# Run common code
source(file_common)

A Computing in R

A.6 Working with messy data

Reading in survey data, one question at a time

Read in survey data.

earnings <- 
  file_earnings %>% 
  read_fwf(
    col_positions = 
      fwf_cols(
        height_feet = 144,
        height_inches = c(145, 146),
        weight = c(147, 149),
        earn_exact = c(203, 208),
        earn2 = c(209, 210),
        sex = 219
      ),
    col_types = cols(.default = col_double())
  )

Cleaning data within R

  1. Look at the data.
  2. Identify errors or missing data.
earnings
#> # A tibble: 2,031 x 6
#>    height_feet height_inches weight earn_exact earn2   sex
#>          <dbl>         <dbl>  <dbl>      <dbl> <dbl> <dbl>
#>  1           5             6    140         NA    90     2
#>  2           5             4    150         NA    90     1
#>  3           6             2    210      50000    NA     1
#>  4           5             6    125      60000    NA     2
#>  5           5             4    126      30000    NA     2
#>  6           5             5    200         NA    25     2
#>  7           5             3    110      50000    NA     2
#>  8           5             8    165         NA    62     2
#>  9           5             3    190      51000    NA     2
#> 10           5             4    125       9000    NA     2
#> # … with 2,021 more rows

From the first rows, it appears as though for each observation only one of earn_exact or earn2 is non-NA. Let’s see if this holds for all observations.

earnings %>% 
  count(!is.na(earn_exact), !is.na(earn2))
#> # A tibble: 2 x 3
#>   `!is.na(earn_exact)` `!is.na(earn2)`     n
#>   <lgl>                <lgl>           <int>
#> 1 FALSE                TRUE              651
#> 2 TRUE                 FALSE            1380

It does.

summary(earnings)
#>   height_feet   height_inches      weight      earn_exact         earn2     
#>  Min.   :4.00   Min.   : 0.0   Min.   : 80   Min.   :     0   Min.   : 1    
#>  1st Qu.:5.00   1st Qu.: 3.0   1st Qu.:130   1st Qu.:  6000   1st Qu.:15    
#>  Median :5.00   Median : 5.0   Median :150   Median : 16450   Median :25    
#>  Mean   :5.14   Mean   : 5.5   Mean   :174   Mean   : 20290   Mean   :46    
#>  3rd Qu.:5.00   3rd Qu.: 8.0   3rd Qu.:180   3rd Qu.: 28000   3rd Qu.:90    
#>  Max.   :9.00   Max.   :99.0   Max.   :999   Max.   :400000   Max.   :96    
#>                                              NA's   :651      NA's   :1380  
#>       sex      
#>  Min.   :1.00  
#>  1st Qu.:1.00  
#>  Median :2.00  
#>  Mean   :1.63  
#>  3rd Qu.:2.00  
#>  Max.   :2.00  
#> 

height_feet has values of 9 feet, height_inches has values of 99 inches, and weight has values of 999 pounds. From the data codebook, we see that the missing data codes are:

  • Don’t know: 8, 98, 998
  • No response: 9, 99, 999

We’ll recode these values as NAs.

earnings <- 
  earnings %>% 
  mutate(
    height_feet = height_feet %>% na_if(8) %>% na_if(9),
    height_inches = height_inches %>% na_if(98) %>% na_if(99),
    weight = weight %>% na_if(998) %>% na_if(999)
  )

earnings %>% 
  select(height_feet, height_inches, weight) %>% 
  summary()
#>   height_feet   height_inches       weight   
#>  Min.   :4.00   Min.   : 0.00   Min.   : 80  
#>  1st Qu.:5.00   1st Qu.: 3.00   1st Qu.:130  
#>  Median :5.00   Median : 5.00   Median :150  
#>  Mean   :5.12   Mean   : 5.09   Mean   :156  
#>  3rd Qu.:5.00   3rd Qu.: 8.00   3rd Qu.:180  
#>  Max.   :7.00   Max.   :11.00   Max.   :342  
#>  NA's   :8      NA's   :8       NA's   :42

Let’s look at the observations with height_feet of 7 feet.

earnings %>% 
  filter(height_feet == 7)
#> # A tibble: 1 x 6
#>   height_feet height_inches weight earn_exact earn2   sex
#>         <dbl>         <dbl>  <dbl>      <dbl> <dbl> <dbl>
#> 1           7             7    110         NA    25     2

The observation of someone 7 feet 7 inches tall who weighs 110 pounds is probably a data error, so we will recode the height_feet value of 7 as missing.

earnings <- 
  earnings %>% 
  mutate(height_feet = height_feet %>% na_if(7))
  1. Transform or combine raw data into summaries of interest.

Using the codebook, we’ll first recode sex.

earnings <- 
  earnings %>% 
   mutate(
    sex = 
      case_when(
        sex == 1 ~ "Male",
        sex == 2 ~ "Female",
        TRUE ~ NA_character_
      )
  )

We’ll next create a combined variable height.

earnings <- 
  earnings %>% 
  mutate(height = 12 * height_feet + height_inches)

Using the codebook, we’ll create a new variable earnings_approx from earn2.

earnings <- 
  earnings %>% 
  mutate(
    earn_approx =
      case_when(
        earn2 >= 90 ~ NA_real_,
        earn2 == 1 ~ 
          median(earn_exact[earn_exact > 100000], na.rm = TRUE) / 1000,
        TRUE ~ earn2
      )
  )

We’ll finally create a combined earnings variable earn.

earnings <- 
  earnings %>% 
  mutate(earn = if_else(!is.na(earn_exact), earn_exact, 1000 * earn_approx))

The new earn variable still has 237 missing values (out of 2031 respondents in total) and is imperfect in various ways, but we have to make some choices when working with real data.

Looking at the data

Sex distribution.

earnings %>% 
  ggplot(aes(sex)) +
  geom_bar() +
  labs(title = "Sex distribution")

We have a greater number of women than men in the data, and a greater proportion than in the general adult population.

Height distributions.

earnings %>% 
  drop_na(height) %>% 
  ggplot(aes(height)) +
  geom_bar() +
  facet_grid(rows = vars(sex)) +
  scale_x_continuous(breaks = scales::breaks_width(2)) +
  labs(title = "Height distributions")

There appears to be an excess of women who at 5 feet (60 inches) and 5 feet 6 inches (66 inches), and an excess of men who are 6 feet (72 inches), probably due to respondents rounding.

Weight distributions.

earnings %>% 
  drop_na(weight) %>% 
  ggplot(aes(weight)) +
  geom_histogram(binwidth = 10, boundary = 0) +
  facet_grid(rows = vars(sex)) +
  scale_x_continuous(breaks = scales::breaks_width(50)) +
  labs(title = "Weight distributions")

Nothing remarkable.

Earnings distributions.

earnings %>% 
  drop_na(earn) %>% 
  ggplot(aes(earn)) +
  geom_histogram(binwidth = 5e3, boundary = 0) +
  coord_cartesian(xlim = c(NA, 70e3)) +
  facet_grid(rows = vars(sex)) +
  scale_x_continuous(breaks = scales::breaks_width(10e3)) +
  labs(title = "Earnings distributions")

There are spikes in the lowest bin. Let’s check to see if these are due to respondents with no earnings.

v <- 
  earnings %>% 
  filter(earn == 0) %>% 
  count(sex)

v
#> # A tibble: 2 x 2
#>   sex        n
#> * <chr>  <int>
#> 1 Female   172
#> 2 Male      15

187 respondents had no earnings, with 172 of these women.

Earnings distributions.

earnings %>% 
  drop_na(earn) %>% 
  ggplot(aes(sex, earn)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(NA, 70e3)) +
  labs(title = "Earnings distributions")

Men have a higher median earnings.

v <- 
  earnings %>% 
  group_by(sex) %>% 
  summarize(earn_median = median(earn, na.rm = TRUE))

v
#> # A tibble: 2 x 2
#>   sex    earn_median
#> * <chr>        <dbl>
#> 1 Female       15000
#> 2 Male         25000

The median earnings of men is 10000 higher.