
Commit c1a3739

add Titanic dataset (#91)
1 parent 4985172 commit c1a3739

5 files changed: +1072 -0 lines changed

src/MLDatasets.jl

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ include("download.jl")
 include("BostonHousing/BostonHousing.jl")
 include("Iris/Iris.jl")
 include("Mutagenesis/Mutagenesis.jl")
+include("Titanic/Titanic.jl")
 
 # Vision
 include("CIFAR10/CIFAR10.jl")

src/Titanic/Titanic.jl

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
+export Titanic
+"""
+Titanic Dataset
+
+The titanic and titanic2 data frames describe the survival status of individual passengers on the Titanic.
+
+The titanic data frame does not contain information about the crew, but it does contain actual ages for half of the passengers.
+The principal source for data about Titanic passengers is the Encyclopedia Titanica.
+The datasets used here were begun by a variety of researchers.
+One of the original sources is Eaton & Haas (1994),
+Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.
+
+The variables in our extracted dataset are pclass, survived, name, age, embarked, home.dest, room, ticket, boat, and sex.
+pclass refers to passenger class (1st, 2nd, 3rd), and is a proxy for socio-economic class.
+Age is in years, and some infants have fractional values.
+The titanic2 data frame has no missing data and includes records for the crew, but age is dichotomized at adult vs. child.
+These data were obtained from Robert Dawson, Saint Mary's University.
+The variables are pclass, age, sex, survived.
+These data frames are useful for demonstrating many of the functions in Hmisc as well as
+demonstrating binary logistic regression analysis using the Design library.
+For more details and references see Simonoff, Jeffrey S (1997): The "unusual episode" and a second statistics course.
+J Statistics Education, Vol. 5 No. 1. Thomas Cason of UVa has greatly updated and improved the titanic data frame
+using the Encyclopedia Titanica and created a new dataset called titanic3.
+These datasets reflect the state of data available as of 2 August 1999.
+Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created.
+
+# Interface
+
+- [`Titanic.features`](@ref)
+- [`Titanic.targets`](@ref)
+- [`Titanic.feature_names`](@ref)
+
+DATASET specs
+
+NAME: titanic3
+TYPE: Census
+SIZE: 1309 Passengers, 14 Variables
+
+DESCRIPTIVE ABSTRACT:
+
+The titanic3 data frame describes the survival status of individual passengers on the Titanic.
+The titanic3 data frame does not contain information for the crew, but it does contain actual and estimated ages for almost 80% of the passengers.
+
+SOURCES:
+
+Hind, Philip. Encyclopedia Titanica. Online-only resource. Retrieved 01Feb2012 from http://www.encyclopedia-titanica.org/
+
+VARIABLE DESCRIPTIONS
+
+pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
+survived    Survival (0 = No; 1 = Yes)
+name        Name
+sex         Sex
+age         Age
+sibsp       Number of Siblings/Spouses Aboard
+parch       Number of Parents/Children Aboard
+ticket      Ticket Number
+fare        Passenger Fare (British pound)
+cabin       Cabin
+embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
+boat        Lifeboat
+body        Body Identification Number
+home.dest   Home/Destination
+
+
+
+
+SPECIAL NOTES
+
+Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
+
+Age is in years; fractional if age is less than one (1). If the age is estimated, it is in the form xx.5.
+
+Fare is in pre-1970 British pounds (£).
+Conversion factors: £1 = 20s = 240d and 1s = 12d.
+
+
+With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored.
+The following are the definitions used for sibsp and parch.
+
+Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
+Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
+Parent: Mother or Father of Passenger Aboard Titanic
+Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
+
+Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
+Some children travelled only with a nanny, therefore parch=0 for them.
+As well, some travelled with very close friends or neighbors in a village; however, the definitions do not support such relations.
+
+
+An interesting result may be obtained using functions from the Hmisc library.
+
+    attach(titanic3)
+    plsmo(age, survived, group=sex, datadensity=T)  # or group=pclass
+    plot(naclus(titanic3))  # study patterns of missing values
+    summary(survived ~ age + sex + pclass + sibsp + parch, data=titanic3)
+"""
+module Titanic
+
+using DataDeps
+using DelimitedFiles
+
+export features, targets, feature_names
+
+const DATA = joinpath(@__DIR__, "titanic.csv")
+
+"""
+    targets(; dir = nothing)
+
+Get the targets for the Titanic dataset as a 1×891 matrix
+holding the target value of each of the 891 examples.
+
+```jldoctest
+julia> using MLDatasets: Titanic
+
+julia> target = Titanic.targets();
+
+julia> summary(target)
+"1×891 Matrix{Float64}"
+```
+"""
+function targets(; dir = nothing)
+    titanic_data = readdlm(DATA, ',')
+    # Column 2 holds the target and row 1 is the CSV header; convert to Float64 to match the documented type.
+    reshape(Float64.(titanic_data[2:end, 2]), (1, 891))
+end
+
+"""
+    feature_names()
+
+Return the names of the features provided in the dataset.
+"""
+function feature_names()
+    # One name per row of `features()`; the target column is returned by `targets()`.
+    ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
+end
+
+"""
+    features()
+
+Return the features of the Titanic dataset as an 11×891 matrix with one column per example.
+The rows follow the order given by `feature_names()`.
+Because the columns mix numbers and strings, the element type is `Any`.
+
+```jldoctest
+julia> using MLDatasets: Titanic
+
+julia> features = Titanic.features();
+
+julia> summary(features)
+"11×891 Matrix{Any}"
+```
+"""
+function features()
+    titanic_data = readdlm(DATA, ',')
+    # permutedims transposes the 891×11 table; reshape would only reinterpret the storage and scramble the entries.
+    permutedims(hcat(titanic_data[2:end, 1], titanic_data[2:end, 3:12]))
+end
+
+end # module
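
The `(11, 891)` shape used in `features` suggests a layout with one column per passenger. In Julia, `reshape` only reinterprets the existing column-major storage, while `permutedims` actually swaps rows and columns. A minimal sketch of the difference, using a made-up 3×2 table (the name `table` and its values are hypothetical and not taken from `titanic.csv`):

```julia
# Toy stand-in for a passengers × features table (hypothetical values).
table = Any[1 "a"; 2 "b"; 3 "c"]        # 3 passengers, 2 features

reshaped   = reshape(table, (2, 3))     # same storage order, new shape
transposed = permutedims(table)         # true transpose: one column per passenger

@assert size(reshaped) == size(transposed) == (2, 3)
@assert transposed[:, 1] == Any[1, "a"] # passenger 1 keeps both of its features
@assert reshaped[:, 1] == Any[1, 2]     # mixes passengers 1 and 2 instead
```

Both results have the same size, so a shape check alone would not reveal the difference.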
