+ export Titanic
+ """
+ Titanic Dataset
+
+ The titanic and titanic2 data frames describe the survival status of individual passengers on the Titanic.
+
+ The titanic data frame does not contain information for the crew, but it does contain actual ages for half of the passengers.
+ The principal source for data about Titanic passengers is the Encyclopedia Titanica.
+ The datasets used here were begun by a variety of researchers.
+ One of the original sources is Eaton & Haas (1994),
+ Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.
+
+ The variables in our extracted dataset are pclass, survived, name, age, embarked, home.dest, room, ticket, boat, and sex.
+ pclass refers to passenger class (1st, 2nd, 3rd) and is a proxy for socio-economic class.
+ Age is in years, and some infants have fractional values.
+ The titanic2 data frame has no missing data and includes records for the crew, but age is dichotomized at adult vs. child.
+ These data were obtained from Robert Dawson, Saint Mary's University, E-mail.
+ The variables are pclass, age, sex, survived.
+ These data frames are useful for demonstrating many of the functions in Hmisc as well as
+ demonstrating binary logistic regression analysis using the Design library.
+ For more details and references see Simonoff, Jeffrey S (1997): The "unusual episode" and a second statistics course.
+ J Statistics Education, Vol. 5 No. 1. Thomas Cason of UVa has greatly updated and improved the titanic data frame
+ using the Encyclopedia Titanica and created a new dataset called titanic3.
+ This dataset reflects the state of data available as of 2 August 1999.
+ Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created.
+
+ # Interface
+
+ - [`Titanic.features`](@ref)
+ - [`Titanic.targets`](@ref)
+ - [`Titanic.feature_names`](@ref)
+
+ DATASET specs
+
+ NAME: titanic3
+ TYPE: Census
+ SIZE: 1309 Passengers, 14 Variables
+
+ DESCRIPTIVE ABSTRACT:
+
+ The titanic3 data frame describes the survival status of individual passengers on the Titanic.
+ The titanic3 data frame does not contain information for the crew, but it does contain actual and estimated ages for almost 80% of the passengers.
+
+ SOURCES:
+
+ Hind, Philip. Encyclopedia Titanica. Online-only resource. Retrieved 01Feb2012 from http://www.encyclopedia-titanica.org/
+
+ VARIABLE DESCRIPTIONS
+
+ | Variable  | Description |
+ |:----------|:------------|
+ | Pclass    | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
+ | survival  | Survival (0 = No; 1 = Yes) |
+ | name      | Name |
+ | sex       | Sex |
+ | age       | Age |
+ | sibsp     | Number of Siblings/Spouses Aboard |
+ | parch     | Number of Parents/Children Aboard |
+ | ticket    | Ticket Number |
+ | fare      | Passenger Fare (British pound) |
+ | cabin     | Cabin |
+ | embarked  | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
+ | boat      | Lifeboat |
+ | body      | Body Identification Number |
+ | home.dest | Home/Destination |
+
+ SPECIAL NOTES
+
+ Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
+
+ Age is in years; fractional if age is less than one (1). If the age is estimated, it is in the form xx.5.
+
+ Fare is in pre-1970 British pounds (£).
+ Conversion factors: £1 = 20s = 240d and 1s = 12d.
+
+ With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored.
+ The following are the definitions used for sibsp and parch.
+
+ - Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
+ - Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
+ - Parent: Mother or Father of Passenger Aboard Titanic
+ - Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
+
+ Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
+ Some children travelled only with a nanny, therefore parch=0 for them.
+ Some passengers also travelled with very close friends or neighbors in a village; however, the definitions do not support such relations.
+
+ An interesting result may be obtained in R using functions from the Hmisc library.
+
+     attach(titanic3)
+     plsmo(age, survived, group=sex, datadensity=T)
+     # or group=pclass
+     plot(naclus(titanic3))   # study patterns of missing values
+     summary(survived ~ age + sex + pclass + sibsp + parch, data=titanic3)
+
+ """
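The fare note in the docstring refers to pre-decimal British currency, in which one pound is 20 shillings and one shilling is 12 pence (so 240 pence to the pound). As a rough illustration only, a fare stored in decimal pounds could be split into pounds, shillings and pence as sketched below. The helper name `to_lsd` is hypothetical and is not part of this PR or of MLDatasets.

```julia
# Hypothetical helper (not part of the PR): split a fare given in decimal
# pounds into pre-decimal pounds, shillings and pence (£1 = 20s, 1s = 12d).
function to_lsd(fare::Real)
    total_pence = round(Int, fare * 240)          # 240 pence per pound, rounded to the nearest penny
    pounds, rest = divrem(total_pence, 240)       # whole pounds and leftover pence
    shillings, pence = divrem(rest, 12)           # 12 pence per shilling
    return (pounds = pounds, shillings = shillings, pence = pence)
end

to_lsd(7.25)   # (pounds = 7, shillings = 5, pence = 0)
```

Working in whole pence avoids accumulating floating-point error across the two divisions.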
+ module Titanic
+
+ using DataDeps
+ using DelimitedFiles
+
+ export features, targets, feature_names
+
+ const DATA = joinpath(@__DIR__, "titanic.csv")
+
+ """
+     targets(; dir = nothing)
+
+ Get the targets for the Titanic dataset,
+ a 1×891 matrix listing the target for each example.
+
+ ```jldoctest
+ julia> using MLDatasets: Titanic
+
+ julia> target = Titanic.targets();
+
+ julia> summary(target)
+ "1×891 Matrix{Any}"
+ ```
+ """
+ function targets(; dir = nothing)
+     # `dir` is currently unused; the data is read from the CSV next to this file.
+     titanic_data = readdlm(DATA, ',')
+     # Row 1 is the header; column 2 holds the target.
+     reshape(Vector(titanic_data[2:end, 2]), (1, 891))
+ end
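When a delimited file mixes text and numbers, `readdlm` returns a heterogeneous matrix, which determines the element type of anything sliced from it. The sketch below illustrates this with a made-up three-line CSV held in memory; it does not read the real `titanic.csv`, and the sample rows are assumptions chosen only for illustration.

```julia
using DelimitedFiles

# Made-up sample in the same shape as the first few columns of the file.
csv = """
PassengerId,Survived,Pclass,Name
1,0,3,"Braund, Mr. Owen Harris"
2,1,1,"Cumings, Mrs. John Bradley"
"""

# Quoted fields may contain the delimiter; the header row makes the result heterogeneous.
data = readdlm(IOBuffer(csv), ',')

# Skip the header row, take column 2, and lay the labels out as a 1×N row.
labels = reshape(data[2:end, 2], 1, :)

eltype(data)    # Any
size(labels)    # (1, 2)
```

Using `:` for the second dimension lets `reshape` infer the number of examples instead of hard-coding it.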
+
+ """
+     feature_names()
+
+ Return the names of the features provided in the dataset.
+ """
+ function feature_names()
+     ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
+ end
137
+
138
+ """
139
+ features()
140
+
141
+ Return the features of the Boston Housing dataset. This is a 13x506 Matrix of Float64 datatypes.
142
+ The values are in the order ["crim","zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","b","lstat"].
143
+ It has 506 examples.
144
+
145
+ ```jldoctest
146
+ julia> using MLDatasets: BostonHousing
147
+
148
+ julia> features = BostonHousing.features();
149
+
150
+ julia> summary(features)
151
+ "13×506 Matrix{Float64}"
152
+ ```
153
+ """
154
+
155
+ function features ()
156
+ titanic_data = readdlm (DATA, ' ,' )
157
+ reshape (Matrix (hcat (titanic_data[2 : end , 1 ], titanic_data[2 : end , 3 : 12 ])),(11 ,891 ))
158
+ end
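Julia arrays are stored in column-major order, so `reshape` reinterprets the existing element order rather than transposing. Turning a row-per-example table into a column-per-example matrix therefore calls for `permutedims`. A minimal sketch with made-up values (not taken from the dataset) shows the difference:

```julia
# Toy table: 3 examples (rows) × 2 features (columns); values are made up.
table = Any[1 "a";
            2 "b";
            3 "c"]

reshaped   = reshape(table, (2, 3))   # keeps column-major element order, so rows get mixed
transposed = permutedims(table)       # true transpose: one column per example

reshaped[:, 1]     # Any[1, 2]    two ids, not one example
transposed[:, 1]   # Any[1, "a"]  id and label of the first example
```

`permutedims` is used here rather than `transpose` because it works on arrays of arbitrary element types, including strings.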
+
+ end # module