This repository was archived by the owner on Mar 11, 2022. It is now read-only.

Commit 7bed71e

Add example datasets nsw and mpdta (#18)
1 parent 67f5352 commit 7bed71e

14 files changed, 178 additions and 3334 deletions

Project.toml

Lines changed: 2 additions & 0 deletions
@@ -5,6 +5,7 @@ version = "0.2.1"
 
 [deps]
 CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+CodecZlib = "944b1d66-785c-5afd-91f1-9de20f533193"
 Combinatorics = "861a8166-3701-5b0c-9a16-15d98fcdc6aa"
 DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
 MacroTools = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
@@ -17,6 +18,7 @@ Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
 
 [compat]
 CSV = "0.8"
+CodecZlib = "0.7"
 Combinatorics = "1"
 DataAPI = "1.6"
 DataFrames = "0.22"

data/README.md

Lines changed: 13 additions & 6 deletions
@@ -2,23 +2,30 @@
 
 A collection of data files are provided here for the ease of testing and illustrations.
 The included data are modified from the original sources
-and stored in `.csv` files.
-See [`make.py`](src/make.py) for the source code
-that generates these files from the original data.
+and stored in compressed CSV (`.csv.gz`) files.
+See [`data/src/make.jl`](src/make.jl) for the source code
+that generates these files from original data.
 
 [DiffinDiffsBase.jl](https://github.com/JuliaDiffinDiffs/DiffinDiffsBase.jl)
 provides methods for looking up and loading these example data.
 Call `exampledata()` for a name list of the available datasets.
-To load one of them into a `DataFrame`, use the method `exampledata(name)`.
+To load one of them, call `exampledata(name)`
+where `name` is the `Symbol` of filename without extension (e.g., `:hrs`).
 
 ## Sources and Licenses
 
 | Name | Source | File Link | License | Note |
 | :--- | :----: | :-------: | :-----: | :--- |
-| hrs | [Dobkin et al. (2018)](#DobkinFK18E) | [HRS_long.dta](https://doi.org/10.3886/E116186V1-73160) | [CC BY 4.0](https://doi.org/10.3886/E116186V1-73120) | Data are processed as in [Sun and Abraham (2020)](#SunA20) |
+| hrs | [Dobkin et al. (2018)](https://doi.org/10.1257/aer.20161038) | [HRS_long.dta](https://doi.org/10.3886/E116186V1-73160) | [CC BY 4.0](https://doi.org/10.3886/E116186V1-73120) | Data are processed as in [Sun and Abraham (2020)](https://doi.org/10.1016/j.jeconom.2020.09.006) |
+| nsw | [Diamond and Sekhon (2013)](https://doi.org/10.1162/REST_a_00318) | [ec675_nsw.tab](https://doi.org/10.7910/DVN/23407/DYEWLO) | [CC0 1.0](https://dataverse.org/best-practices/harvard-dataverse-general-terms-use) | Data are rearranged in a long format as in the R package [DRDID](https://github.com/pedrohcgs/DRDID/blob/master/data-raw/nsw.R) |
+| mpdta | [Callaway and Sant'Anna (2020)](https://doi.org/10.1016/j.jeconom.2020.12.001) | [mpdta.rda](https://github.com/bcallaway11/did/blob/master/data/mpdta.rda) | [GPL-2](https://cran.r-project.org/web/licenses/GPL-2) | |
 
 ## References
 
-<a name="DobkinFK18E">**Dobkin, Carlos, Finkelstein, Amy, Kluender, Raymond, and Notowidigdo, Matthew J.** 2018. "Replication data for: The Economic Consequences of Hospital Admissions." *American Economic Association* [publisher], Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/E116186V1.</a>
+<a name="CallawayS20">**Callaway, Brantly, and Pedro H. C. Sant'Anna.** 2020. "Difference-in-Differences with Multiple Time Periods." *Journal of Econometrics*, forthcoming.</a>
+
+<a name="DiamondS13G">**Diamond, Alexis and Jasjeet S. Sekhon.** 2013. "Replication data for: Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies." *MIT Press* [publisher], Harvard Dataverse [distributor]. https://doi.org/10.7910/DVN/23407/DYEWLO.</a>
+
+<a name="DobkinFK18E">**Dobkin, Carlos, Amy Finkelstein, Raymond Kluender, and Matthew J. Notowidigdo.** 2018. "Replication data for: The Economic Consequences of Hospital Admissions." *American Economic Association* [publisher], Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/E116186V1.</a>
 
 <a name="SunA20">**Sun, Liyang, and Sarah Abraham.** 2020. "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." *Journal of Econometrics*, forthcoming.</a>
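
The README change above ties each dataset name to a `.csv.gz` file name with the extension removed. The sketch below illustrates that convention only; `list_datasets` is a hypothetical helper written for this note and is not the package's `exampledata()`, whose implementation is not part of this diff.

```julia
# Hypothetical helper (an assumption, not the package API): list dataset names
# found in a directory by stripping the `.csv.gz` suffix from each file name.
function list_datasets(dir::AbstractString)
    suffix = ".csv.gz"
    # readdir returns file names in sorted order
    files = filter(f -> endswith(f, suffix), readdir(dir))
    # "hrs.csv.gz" becomes :hrs
    return [Symbol(chop(f; tail=length(suffix))) for f in files]
end
```

Under this assumption, a directory holding `hrs.csv.gz`, `mpdta.csv.gz`, and `nsw.csv.gz` would yield `[:hrs, :mpdta, :nsw]`, and files with other extensions would be ignored.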

data/hrs.csv

Lines changed: 0 additions & 3281 deletions
This file was deleted.

data/hrs.csv.gz

51.7 KB
Binary file not shown.

data/mpdta.csv.gz

29.4 KB
Binary file not shown.

data/nsw.csv.gz

299 KB
Binary file not shown.

data/src/Project.toml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+[deps]
+CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+CodecBzip2 = "523fee87-0ab8-5b00-afb7-3ecf72e48cfd"
+CodecZlib = "944b1d66-785c-5afd-91f1-9de20f533193"
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+DataValues = "e7dc6d0d-1eca-5fa6-8ad6-5aecde8b7ea5"
+FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
+RData = "df47a6cb-8c03-5eed-afd8-b6050d6c41da"
+ReadStat = "d71aba96-b539-5138-91ee-935c3ee1374c"
+
+[compat]
+CSV = "0.8"
+CodecBzip2 = "0.7"
+CodecZlib = "0.7"
+DataFrames = "0.22"
+DataValues = "0.4"
+FileIO = "< 1.6"
+RData = "0.7"
+ReadStat = "1"
+julia = "1.3"

data/src/make.jl

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# Generate example datasets as compressed CSV files
2+
3+
# See data/README.md for the sources of the input data files
4+
# To regenerate the .csv.gz files:
5+
# 1) Have all input files ready in the data folder
6+
# 2) Instantiate the package environment for data/src
7+
# 3) Run this script and call `make()` with the root folder as working directory
8+
9+
using CSV, CodecBzip2, CodecZlib, DataFrames, DataValues, RData, ReadStat
10+
11+
function _to_array(d::DataValueArray{T}) where T
12+
a = Array{T}(undef, size(d))
13+
hasmissing = false
14+
@inbounds for i in eachindex(d)
15+
v = d[i]
16+
if hasvalue(v)
17+
a[i] = v.value
18+
elseif !hasmissing
19+
a = convert(Array{Union{T,Missing}}, a)
20+
hasmissing = true
21+
a[i] = missing
22+
else
23+
a[i] = missing
24+
end
25+
end
26+
return a
27+
end
28+
29+
function _get_columns(data::ReadStatDataFrame, names::Vector{Symbol})
30+
lookup = Dict(data.headers.=>keys(data.headers))
31+
cols = Vector{AbstractVector}(undef, length(names))
32+
for (i, n) in enumerate(names)
33+
col = data.data[lookup[n]]
34+
cols[i] = _to_array(col)
35+
end
36+
return cols
37+
end
38+
+# The steps for preparing data follow Sun and Abraham (2020)
+function hrs()
+    raw = read_dta("data/HRS_long.dta")
+    names = [:hhidpn, :wave, :wave_hosp, :evt_time, :oop_spend, :riearnsemp, :rwthh,
+        :male, :spouse, :white, :black, :hispanic, :age_hosp]
+    cols = _get_columns(raw, names)
+    df = dropmissing!(DataFrame(cols, names), [:wave, :age_hosp, :evt_time])
+    df = df[(df.wave.>=7).&(df.age_hosp.<=59), :]
+    # Must count wave after the above selection
+    transform!(groupby(df, :hhidpn), nrow=>:nwave, :evt_time => minimum => :evt_time)
+    df = df[(df.nwave.==5).&(df.evt_time.<0), :]
+    transform!(groupby(df, :hhidpn), :wave_hosp => minimum∘skipmissing => :wave_hosp)
+    select!(df, Not([:nwave, :evt_time, :age_hosp]))
+    for n in (:male, :spouse, :white, :black, :hispanic)
+        df[!, n] .= ifelse.(df[!, n].==100, 1, 0)
+    end
+    for n in propertynames(df)
+        if !(n in (:oop_spend, :riearnsemp, :rwthh))
+            df[!, n] .= convert(Array{Int}, df[!, n])
+        end
+    end
+    # Replace the original hh index with enumeration
+    ids = IdDict{Int,Int}()
+    hhidpn = df.hhidpn
+    newid = 0
+    for i in 1:length(hhidpn)
+        oldid = hhidpn[i]
+        id = get(ids, oldid, 0)
+        if id === 0
+            newid += 1
+            ids[oldid] = newid
+            hhidpn[i] = newid
+        else
+            hhidpn[i] = id
+        end
+    end
+    open(GzipCompressorStream, "data/hrs.csv.gz", "w") do stream
+        CSV.write(stream, df)
+    end
+end
+
# Produce a subset of nsw_long from the DRDID R package
81+
function nsw()
82+
df = DataFrame(CSV.File("data/ec675_nsw.tab", delim='\t'))
83+
df = df[(isequal.(df.treated, 0)).|(df.sample.==2), Not([:dwincl, :early_ra])]
84+
df.experimental = ifelse.(ismissing.(df.treated), 0, 1)
85+
select!(df, Not([:treated, :sample]))
86+
df.id = 1:nrow(df)
87+
# Convert the data to long format
88+
df = stack(df, [:re75, :re78])
89+
df.year = ifelse.(df.variable.=="re75", 1975, 1978)
90+
select!(df, Not(:variable))
91+
rename!(df, :value=>:re)
92+
sort!(df, :id)
93+
open(GzipCompressorStream, "data/nsw.csv.gz", "w") do stream
94+
CSV.write(stream, df)
95+
end
96+
end
97+
98+
# Convert mpdta from the did R package to csv format
99+
function mpdta()
100+
df = load("data/mpdta.rda")["mpdta"]
101+
df.first_treat = convert(Vector{Int}, df.first_treat)
102+
select!(df, Not(:treat))
103+
open(GzipCompressorStream, "data/mpdta.csv.gz", "w") do stream
104+
CSV.write(stream, df)
105+
end
106+
end
107+
108+
function make()
109+
hrs()
110+
nsw()
111+
mpdta()
112+
end
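
The `hrs()` function in `make.jl` replaces the original household identifiers with consecutive integers in order of first appearance. As a minimal standalone sketch of that idea, the function below does the same renumbering on a plain vector. It is a variant written for illustration: the name `reenumerate` is hypothetical, it returns a new vector instead of overwriting the column, and it uses `Dict` with `get!` where the commit uses `IdDict` with `get`.

```julia
# Hypothetical illustration (not the commit's code): map each distinct id to
# 1, 2, 3, ... in the order in which it first appears.
function reenumerate(ids::AbstractVector{<:Integer})
    lookup = Dict{eltype(ids),Int}()
    out = Vector{Int}(undef, length(ids))
    for (i, old) in enumerate(ids)
        # The default `length(lookup) + 1` is stored only when `old` is new;
        # otherwise the previously assigned index is returned.
        out[i] = get!(lookup, old, length(lookup) + 1)
    end
    return out
end

reenumerate([1003, 1003, 2001, 1003, 4000])  # returns [1, 1, 2, 1, 3]
```

Renumbering this way keeps rows of the same unit together under one identifier while dropping the original index values from the published file.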

data/src/make.py

Lines changed: 0 additions & 35 deletions
This file was deleted.

src/DiffinDiffsBase.jl

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 module DiffinDiffsBase
 
-using CSV: File
+using CSV
+using CodecZlib: GzipDecompressorStream
 using Combinatorics: combinations
 using DataAPI: refarray, refpool
 using MacroTools: @capture, isexpr, postwalk
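
The new `GzipDecompressorStream` import suggests that the package now decompresses the `.csv.gz` files before parsing them. The loading code itself is not shown in this diff, so the following is only a sketch under that assumption: `read_gz_csv` is a hypothetical helper, and it mirrors the `open(GzipCompressorStream, ...) do stream` pattern that `make.jl` uses for writing.

```julia
using CSV
using CodecZlib: GzipDecompressorStream

# Hypothetical helper (an assumption, not the package's `exampledata`):
# decompress a gzip-compressed CSV file and parse it with CSV.jl.
function read_gz_csv(path::AbstractString)
    bytes = open(GzipDecompressorStream, path) do stream
        read(stream)  # decompress the whole file into a Vector{UInt8}
    end
    return CSV.File(bytes)
end

# Possible usage, assuming the repository root is the working directory:
# hrs = read_gz_csv(joinpath("data", "hrs.csv.gz"))
```

Reading the decompressed bytes into memory first keeps the sketch independent of how a given CSV.jl version handles streaming input; the example files here are small (at most a few hundred KB compressed), so this costs little.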
