Commit 1f130d4

author
maechler
committed
add penguin(_raw) data sets
git-svn-id: https://svn.r-project.org/R/trunk@87776 00db46b3-68df-0310-9c12-caf00c1e9a41
1 parent 502603f commit 1f130d4

5 files changed

+375
-2
lines changed

doc/NEWS.Rd

Lines changed: 3 additions & 0 deletions
@@ -202,6 +202,9 @@
202202

203203
\item New dataset \code{gait} thanks to \I{Heather Turner} and
204204
\I{Ella Kaye}, used in examples.
205+
206+
\item New datasets \code{penguins} and \code{penguins_raw} thanks to
207+
\I{Ella Kaye}, \I{Heather Turner}, and \I{Kristen Gorman}.
205208
}
206209
}
207210

11.6 KB
Binary file not shown.
Lines changed: 158 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,158 @@
1+
\name{penguins}
2+
\encoding{UTF-8}
3+
\docType{data}
4+
\title{Measurements of Penguins near Palmer Station, Antarctica}
5+
\alias{penguins}
6+
\alias{penguins_raw}
7+
\description{
8+
Data on adult penguins covering three species found on three islands in the
9+
Palmer Archipelago, Antarctica, including their size
10+
(flipper length, body mass, bill dimensions), and sex.
11+
12+
The columns of \code{penguins} are a subset of the more extensive
13+
\code{penguins_raw} data frame, which includes nesting observations
14+
and blood isotope data. There are differences in the column names
15+
and data types. See the \sQuote{Format} section for details.
16+
}
17+
\usage{
18+
penguins
19+
penguins_raw
20+
}
21+
\format{
22+
\code{penguins} is a data frame with 344 rows and 8 variables:
23+
\describe{
24+
\item{\code{species}}{\code{\link{factor}}, with levels
25+
\code{Adelie}, \code{Chinstrap}, and \code{Gentoo}}
26+
\item{\code{island}}{\code{factor},
27+
with levels \code{Biscoe}, \code{Dream}, and \code{Torgersen}}
28+
\item{\code{bill_len}}{\code{\link{numeric}}, bill length (millimeters)}
29+
\item{\code{bill_dep}}{\code{numeric}, bill depth (millimeters)}
30+
\item{\code{flipper_len}}{\code{\link{integer}}, flipper length (millimeters)}
31+
\item{\code{body_mass}}{\code{integer}, body mass (grams)}
32+
\item{\code{sex}}{\code{factor}, with levels \code{female} and \code{male}}
33+
\item{\code{year}}{\code{integer}, study year: 2007, 2008, or 2009}
34+
}
35+
36+
\code{penguins_raw} is a data frame with 344 rows and 17 variables.
37+
8 columns correspond to columns in \code{penguins},
38+
though with different variable names and/or classes:
39+
\describe{
40+
\item{\code{Species}}{\code{character}}
41+
\item{\code{Island}}{\code{character}}
42+
\item{\code{Culmen Length (mm)}}{\code{numeric}, bill length}
43+
\item{\code{Culmen Depth (mm)}}{\code{numeric}, bill depth}
44+
\item{\code{Flipper Length (mm)}}{\code{numeric}, flipper length}
45+
\item{\code{Body Mass (g)}}{\code{numeric}, body mass}
46+
\item{\code{Sex}}{\code{character}}
47+
\item{\code{Date Egg}}{\code{\link{Date}}, date on which the study nest was observed with 1 egg.
48+
The year component is the \code{year} column in \code{penguins}}
49+
}
50+
51+
There are 9 further columns in \code{penguins_raw}:
52+
\describe{
53+
\item{\code{studyName}}{\code{character}, expedition during which the data was collected}
54+
\item{\code{Sample Number}}{\code{numeric}, continuous numbering sequence for each sample}
55+
\item{\code{Region}}{\code{character}, the region of Palmer LTER sampling grid}
56+
\item{\code{Stage}}{\code{character}, denoting reproductive stage at sampling}
57+
\item{\code{Individual ID}}{\code{character}, unique ID for each individual in dataset}
58+
\item{\code{Clutch Completion}}{\code{character},
59+
whether the study nest was observed with a full clutch, i.e., 2 eggs}
60+
\item{\code{Delta 15 N (o/oo)}}{\code{numeric}, the ratio of stable isotopes 15N:14N}
61+
\item{\code{Delta 13 C (o/oo)}}{\code{numeric}, the ratio of stable isotopes 13C:12C}
62+
\item{\code{Comments}}{\code{character}, additional relevant information}
63+
}
64+
}
65+
\source{
66+
\describe{
67+
\item{\enc{Adélie}{Adelie} penguins:}{Palmer Station Antarctica LTER and K. Gorman (2020).
68+
Structural size measurements and isotopic signatures of foraging
69+
among adult male and female \enc{Adélie}{Adelie} penguins (Pygoscelis adeliae)
70+
nesting along the Palmer Archipelago near Palmer Station, 2007-2009
71+
ver 5. Environmental Data Initiative, \doi{10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f}.}
72+
73+
\item{Gentoo penguins:}{Palmer Station Antarctica LTER and K. Gorman (2020).
74+
\doi{10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689}.}
75+
76+
\item{Chinstrap penguins:}{Palmer Station Antarctica LTER and K. Gorman (2020).
77+
\doi{10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e}.}
78+
}
79+
80+
The title naming convention for the source for the Gentoo and Chinstrap
81+
data is the same as for \enc{Adélie}{Adelie} penguins.
82+
}
83+
\references{
84+
Gorman, K. B., Williams, T. D. and Fraser, W. R. (2014)
85+
Ecological Sexual Dimorphism and Environmental Variability within a
86+
Community of Antarctic Penguins (Genus Pygoscelis).
87+
\emph{PLoS ONE} \bold{9}, 3, e90081; \doi{10.1371/journal.pone.0090081}.
88+
89+
Horst, A. M., Hill, A. P. and Gorman, K. B. (2022)
90+
Palmer Archipelago Penguins Data in the palmerpenguins R Package
91+
- An Alternative to Anderson's Irises.
92+
\emph{R Journal} \bold{14}, 1; \doi{10.32614/RJ-2022-020}.
93+
94+
Kaye, E., Turner, H., Gorman, K. B., Horst, A. M. and Hill, A. P. (2025)
95+
Preparing the Palmer Penguins Data for the \pkg{datasets} Package in R.
96+
\doi{10.5281/zenodo.14902740}.
97+
}
98+
\details{
99+
\bibcite{Gorman \abbr{et al.}\sspace(2014)}
100+
used the data to study sexual dimorphism separately for the three species.
101+
102+
\bibcite{Horst \abbr{et al.}\sspace(2022)} popularized the data as an illustration
103+
for different statistical methods, as an alternative to the \code{\link{iris}} data.
104+
105+
\bibcite{Kaye \abbr{et al.}\sspace(2025)} provide the scripts used to create
106+
these data sets from the original source data,
107+
and a notebook reproducing results from \bibcite{Gorman \abbr{et al.}\sspace(2014)}.
108+
}
109+
\note{
110+
These data sets are also available in the \CRANpkg{palmerpenguins} package.
111+
See the \href{https://allisonhorst.github.io/palmerpenguins/}{package website}
112+
for further details and resources.
113+
114+
The \code{penguins} data has some shorter variable names than the \bold{palmerpenguins} version,
115+
for compact code and data display.
116+
}
117+
\examples{
118+
## view summaries
119+
summary(penguins)
120+
summary(penguins_raw) # not useful for character vectors
121+
## convert character vectors to factors first
122+
dFactor <- function(dat) {
123+
dat[] <- lapply(dat, \(.) if (is.character(.)) as.factor(.) else .)
124+
dat
125+
}
126+
summary(dFactor(penguins_raw))
127+
128+
## visualise distribution across factors
129+
plot(island ~ species, data = penguins)
130+
plot(sex ~ interaction(island, species, sep = "\n"), data = penguins)
131+
132+
## bill depth vs. length by species (color) and sex (symbol):
133+
## positive correlations for all species, males tend to have bigger bills
134+
sym <- c(1, 16)
135+
pal <- c("darkorange","purple","cyan4")
136+
plot(bill_dep ~ bill_len, data = penguins, pch = sym[sex], col = pal[species])
137+
138+
## simplified sex dimorphism analysis for Adelie species:
139+
## proportion of males increases with several size measurements
140+
adelie <- subset(penguins, species == "Adelie")
141+
plot(sex ~ bill_len, data = adelie)
142+
plot(sex ~ bill_dep, data = adelie)
143+
plot(sex ~ body_mass, data = adelie)
144+
m <- glm(sex ~ bill_len + bill_dep + body_mass, data = adelie, family = binomial)
145+
summary(m)
146+
147+
## Produce the long variable names as from {palmerpenguins} pkg:
148+
long_nms <- sub("len", "length_mm",
149+
sub("dep","depth_mm",
150+
sub("mass", "mass_g", colnames(penguins))))
151+
## compare long and short names:
152+
noquote(rbind(long_nms, nms = colnames(penguins)))
153+
154+
\dontrun{ # << keeping shorter 'penguins' names in this example:
155+
colnames(penguins) <- long_nms
156+
}
157+
}
158+
\keyword{datasets}
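
The `dFactor()` helper in the examples above depends on the `dat[] <- lapply(...)` idiom: assigning into `dat[]` replaces the column contents while keeping the data frame class and the (non-syntactic) column names. A minimal sketch of that behaviour, assuming a small toy data frame; the `toy` object and its values are hypothetical stand-ins rather than rows of `penguins_raw`, so the code also runs on R versions that do not yet ship the new data sets:

```r
## Toy stand-in for penguins_raw (hypothetical values, not the real data)
toy <- data.frame(
  Species = c("Adelie", "Gentoo", "Adelie"),
  `Body Mass (g)` = c(3750, 5000, NA),
  Sex = c("MALE", "FEMALE", NA),
  check.names = FALSE,       # keep names such as "Body Mass (g)" unchanged
  stringsAsFactors = FALSE
)

## Same idiom as dFactor() in the help page, written with function()
## instead of the \(.) shorthand so it also parses on R < 4.1
dFactor <- function(dat) {
  dat[] <- lapply(dat, function(col) if (is.character(col)) as.factor(col) else col)
  dat
}

toy_f <- dFactor(toy)
sapply(toy_f, class)   # character columns are now factors, numeric is untouched
summary(toy_f)         # factor columns are tabulated by level
```

Writing `dat <- lapply(...)` without the `[]` would return a plain list instead of a data frame, which is presumably why the help page uses the bracketed form.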

tests/Examples/datasets-Ex.Rout.save

Lines changed: 183 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11

2-
R Under development (unstable) (2025-02-15 r87721) -- "Unsuffered Consequences"
2+
R Under development (unstable) (2025-02-20 r87772) -- "Unsuffered Consequences"
33
Copyright (C) 2025 The R Foundation for Statistical Computing
44
Platform: x86_64-pc-linux-gnu
55

@@ -2968,6 +2968,187 @@ Warning: not plotting observations with leverage one:
29682968
>
29692969
>
29702970
> cleanEx()
2971+
> nameEx("penguins")
2972+
> ### * penguins
2973+
>
2974+
> flush(stderr()); flush(stdout())
2975+
>
2976+
> ### Name: penguins
2977+
> ### Title: Measurements of Penguins near Palmer Station, Antarctica
2978+
> ### Aliases: penguins penguins_raw
2979+
> ### Keywords: datasets
2980+
>
2981+
> ### ** Examples
2982+
>
2983+
> ## view summaries
2984+
> summary(penguins)
2985+
species island bill_len bill_dep
2986+
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
2987+
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
2988+
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
2989+
Mean :43.92 Mean :17.15
2990+
3rd Qu.:48.50 3rd Qu.:18.70
2991+
Max. :59.60 Max. :21.50
2992+
NA's :2 NA's :2
2993+
flipper_len body_mass sex year
2994+
Min. :172.0 Min. :2700 female:165 Min. :2007
2995+
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
2996+
Median :197.0 Median :4050 NA's : 11 Median :2008
2997+
Mean :200.9 Mean :4202 Mean :2008
2998+
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
2999+
Max. :231.0 Max. :6300 Max. :2009
3000+
NA's :2 NA's :2
3001+
> summary(penguins_raw) # not useful for character vectors
3002+
studyName Sample Number Species Region
3003+
Length:344 Min. : 1.00 Length:344 Length:344
3004+
Class :character 1st Qu.: 29.00 Class :character Class :character
3005+
Mode :character Median : 58.00 Mode :character Mode :character
3006+
Mean : 63.15
3007+
3rd Qu.: 95.25
3008+
Max. :152.00
3009+
3010+
Island Stage Individual ID Clutch Completion
3011+
Length:344 Length:344 Length:344 Length:344
3012+
Class :character Class :character Class :character Class :character
3013+
Mode :character Mode :character Mode :character Mode :character
3014+
3015+
3016+
3017+
3018+
Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
3019+
Min. :2007-11-09 Min. :32.10 Min. :13.10 Min. :172.0
3020+
1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
3021+
Median :2008-11-09 Median :44.45 Median :17.30 Median :197.0
3022+
Mean :2008-11-27 Mean :43.92 Mean :17.15 Mean :200.9
3023+
3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
3024+
Max. :2009-12-01 Max. :59.60 Max. :21.50 Max. :231.0
3025+
NA's :2 NA's :2 NA's :2
3026+
Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
3027+
Min. :2700 Length:344 Min. : 7.632 Min. :-27.02
3028+
1st Qu.:3550 Class :character 1st Qu.: 8.300 1st Qu.:-26.32
3029+
Median :4050 Mode :character Median : 8.652 Median :-25.83
3030+
Mean :4202 Mean : 8.733 Mean :-25.69
3031+
3rd Qu.:4750 3rd Qu.: 9.172 3rd Qu.:-25.06
3032+
Max. :6300 Max. :10.025 Max. :-23.79
3033+
NA's :2 NA's :14 NA's :13
3034+
Comments
3035+
Length:344
3036+
Class :character
3037+
Mode :character
3038+
3039+
3040+
3041+
3042+
> ## convert character vectors to factors first
3043+
> dFactor <- function(dat) {
3044+
+ dat[] <- lapply(dat, \(.) if (is.character(.)) as.factor(.) else .)
3045+
+ dat
3046+
+ }
3047+
> summary(dFactor(penguins_raw))
3048+
studyName Sample Number Species
3049+
PAL0708:110 Min. : 1.00 Adelie Penguin (Pygoscelis adeliae) :152
3050+
PAL0809:114 1st Qu.: 29.00 Chinstrap penguin (Pygoscelis antarctica): 68
3051+
PAL0910:120 Median : 58.00 Gentoo penguin (Pygoscelis papua) :124
3052+
Mean : 63.15
3053+
3rd Qu.: 95.25
3054+
Max. :152.00
3055+
3056+
Region Island Stage Individual ID
3057+
Anvers:344 Biscoe :168 Adult, 1 Egg Stage:344 N13A1 : 3
3058+
Dream :124 N13A2 : 3
3059+
Torgersen: 52 N18A1 : 3
3060+
N18A2 : 3
3061+
N21A1 : 3
3062+
N21A2 : 3
3063+
(Other):326
3064+
Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm)
3065+
No : 36 Min. :2007-11-09 Min. :32.10 Min. :13.10
3066+
Yes:308 1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60
3067+
Median :2008-11-09 Median :44.45 Median :17.30
3068+
Mean :2008-11-27 Mean :43.92 Mean :17.15
3069+
3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70
3070+
Max. :2009-12-01 Max. :59.60 Max. :21.50
3071+
NA's :2 NA's :2
3072+
Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo)
3073+
Min. :172.0 Min. :2700 FEMALE:165 Min. : 7.632
3074+
1st Qu.:190.0 1st Qu.:3550 MALE :168 1st Qu.: 8.300
3075+
Median :197.0 Median :4050 NA's : 11 Median : 8.652
3076+
Mean :200.9 Mean :4202 Mean : 8.733
3077+
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.: 9.172
3078+
Max. :231.0 Max. :6300 Max. :10.025
3079+
NA's :2 NA's :2 NA's :14
3080+
Delta 13 C (o/oo) Comments
3081+
Min. :-27.02 Nest never observed with full clutch.: 34
3082+
1st Qu.:-26.32 Not enough blood for isotopes. : 7
3083+
Median :-25.83 Sexing primers did not amplify. : 4
3084+
Mean :-25.69 No blood sample obtained for sexing. : 2
3085+
3rd Qu.:-25.06 No blood sample obtained. : 2
3086+
Max. :-23.79 (Other) : 5
3087+
NA's :13 NA's :290
3088+
>
3089+
> ## visualise distribution across factors
3090+
> plot(island ~ species, data = penguins)
3091+
> plot(sex ~ interaction(island, species, sep = "\n"), data = penguins)
3092+
>
3093+
> ## bill depth vs. length by species (color) and sex (symbol):
3094+
> ## positive correlations for all species, males tend to have bigger bills
3095+
> sym <- c(1, 16)
3096+
> pal <- c("darkorange","purple","cyan4")
3097+
> plot(bill_dep ~ bill_len, data = penguins, pch = sym[sex], col = pal[species])
3098+
>
3099+
> ## simplified sex dimorphism analysis for Adelie species:
3100+
> ## proportion of males increases with several size measurements
3101+
> adelie <- subset(penguins, species == "Adelie")
3102+
> plot(sex ~ bill_len, data = adelie)
3103+
> plot(sex ~ bill_dep, data = adelie)
3104+
> plot(sex ~ body_mass, data = adelie)
3105+
> m <- glm(sex ~ bill_len + bill_dep + body_mass, data = adelie, family = binomial)
3106+
> summary(m)
3107+
3108+
Call:
3109+
glm(formula = sex ~ bill_len + bill_dep + body_mass, family = binomial,
3110+
data = adelie)
3111+
3112+
Coefficients:
3113+
Estimate Std. Error z value Pr(>|z|)
3114+
(Intercept) -85.088438 17.811711 -4.777 1.78e-06 ***
3115+
bill_len 0.840401 0.237656 3.536 0.000406 ***
3116+
bill_dep 1.305989 0.423911 3.081 0.002064 **
3117+
body_mass 0.007790 0.001969 3.957 7.59e-05 ***
3118+
---
3119+
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3120+
3121+
(Dispersion parameter for binomial family taken to be 1)
3122+
3123+
Null deviance: 202.399 on 145 degrees of freedom
3124+
Residual deviance: 54.024 on 142 degrees of freedom
3125+
(6 observations deleted due to missingness)
3126+
AIC: 62.024
3127+
3128+
Number of Fisher Scoring iterations: 8
3129+
3130+
>
3131+
> ## Produce the long variable names as from {palmerpenguins} pkg:
3132+
> long_nms <- sub("len", "length_mm",
3133+
+ sub("dep","depth_mm",
3134+
+ sub("mass", "mass_g", colnames(penguins))))
3135+
> ## compare long and short names:
3136+
> noquote(rbind(long_nms, nms = colnames(penguins)))
3137+
[,1] [,2] [,3] [,4] [,5]
3138+
long_nms species island bill_length_mm bill_depth_mm flipper_length_mm
3139+
nms species island bill_len bill_dep flipper_len
3140+
[,6] [,7] [,8]
3141+
long_nms body_mass_g sex year
3142+
nms body_mass sex year
3143+
>
3144+
> ## Not run:
3145+
> ##D # << keeping shorter 'penguins' names in this example:
3146+
> ##D colnames(penguins) <- long_nms
3147+
> ## End(Not run)
3148+
>
3149+
>
3150+
>
3151+
> cleanEx()
29713152
> nameEx("precip")
29723153
> ### * precip
29733154
>
@@ -3813,7 +3994,7 @@ c0 -5.14
38133994
> cleanEx()
38143995
> options(digits = 7L)
38153996
> base::cat("Time elapsed: ", proc.time() - base::get("ptime", pos = 'CheckExEnv'),"\n")
3816-
Time elapsed: 1.091 0.066 1.158 0 0
3997+
Time elapsed: 2.05 0.135 2.209 0 0
38173998
> grDevices::dev.off()
38183999
null device
38194000
1
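
The degrees of freedom and the AIC in the `glm` summary recorded in the reference output above can be cross-checked by hand. The sketch below only redoes the arithmetic with numbers copied from that output; it assumes the usual relation for a 0/1 response, where the residual deviance equals minus twice the log-likelihood, and the variable names are illustrative:

```r
## Numbers copied from the reference output above
n_adelie  <- 152      # Adelie rows reported by summary(penguins)
n_dropped <- 6        # "6 observations deleted due to missingness"
k         <- 4        # intercept + bill_len + bill_dep + body_mass
resid_dev <- 54.024   # residual deviance

n_used   <- n_adelie - n_dropped   # observations actually used in the fit
null_df  <- n_used - 1             # intercept-only model
resid_df <- n_used - k             # full model

## For a binary response the saturated log-likelihood is 0, so
## AIC = -2 * logLik + 2 * k = residual deviance + 2 * k
aic <- resid_dev + 2 * k

c(n_used = n_used, null_df = null_df, resid_df = resid_df, aic = aic)
```

The results agree with the 145 and 142 degrees of freedom and the AIC of 62.024 shown in the summary.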
