RepData_PeerAssessment1/PA1_template.Rmd at master · fennekit/RepData_PeerAssessment1 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
title: "Reproducible Research: Peer Assessment 1"
author: Erik Konijnenburg
output:
  html_document:
    keep_md: true
---


## Loading and preprocessing the data

First we download the zipfile from the website if it does not exist on the disk. Then we unzip the file. Then we read in the csvDataFile. As a last step we convert the *date* column to a Date type.
```{r}
zipfile<-"./activity.zip"
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
if(!file.exists(zipfile)) {
    download.file(url, zipfile, method="libcurl")
}

unzip(zipfile)
csvDataFile<-"./activity.csv"
activity<-read.csv(csvDataFile, header = TRUE, sep=",")

library(lubridate)
activity$date<-ymd(activity$date)

```

## What is mean total number of steps taken per day?
Next the total number of steps per day is determined. From this result the mean and median number of steps is calculated.

```{r}
library(dplyr)
stepsPerDay <- activity %>%
               group_by(date) %>%
               summarize(totalsteps = sum(steps, na.rm=TRUE))


hist(x=stepsPerDay$totalsteps, breaks = 10, xlab="Total steps per day. ",
        main = "Histogram of total steps per day")

meanStepsPerDay <- mean(stepsPerDay$totalsteps)
meanStepsPerDay

medianStepsPerDay <- median(stepsPerDay$totalsteps)
medianStepsPerDay

```

## What is the average daily activity pattern?

To determine the average number of steps per interval the data is grouped  by interval and  the mean of all steps in each group is taken.

```{r}
stepsPerInterval <- activity %>%
               group_by(interval) %>%
               summarize(averageSteps = mean(steps, na.rm=TRUE))

plot(stepsPerInterval$interval,stepsPerInterval$averageSteps,type="l",
            xlab="Interval", ylab="Number of steps", main="Average number of steps per interval")

```

Next the interval with the maximum number of steps is determined
```{r}
maxIndex <- stepsPerInterval$averageSteps==max(stepsPerInterval$averageSteps)
stepsPerInterval[maxIndex,]$interval
```


## Imputing missing values
The initial dataset contains NA values. Let's calculate the amount of NA values
```{r}
naIndex <- is.na(activity$steps)
sum(naIndex)
```

For imputing the missing values a simple strategy is choosen. The strategy is to take the average step value of that interval (averaged over all days) if the value is missing.

```{r}
funcImputeValue <- function(x) {
    stepsPerInterval[stepsPerInterval["interval"] == x,]$averageSteps
}

imputedActivities <- activity %>%
    mutate(steps = ifelse(
        is.na(steps), funcImputeValue(interval), steps))

stepsPerDayWithImpute <- imputedActivities %>%
               group_by(date) %>%
               summarize(totalsteps = sum(steps))


hist(x=stepsPerDayWithImpute$totalsteps, breaks = 10, xlab="Total steps per day (with imputed NA values). ",
        main = "Histogram of total steps per day")

meanStepsPerDayImputed <- mean(stepsPerDayWithImpute$totalsteps)
meanStepsPerDayImputed

medianStepsPerDayImputed <- median(stepsPerDayWithImpute$totalsteps)
medianStepsPerDayImputed

```

By imputing values the both median and mean values change compared to the initial calculation of these values without imputing missing values. Imputing results in a median that is closer to the average.

## Are there differences in activity patterns between weekdays and weekends?

First a weekend/weekday label is assigned to the imputed dataset. Then the average of the steps per interval and weekday is calculated.

```{r}

imputedActivities <- imputedActivities %>%
        mutate(weekday=
                as.factor(ifelse(weekdays(date) == "Saturday" | weekdays(date) == "Sunday",
                "weekend", "weekday" ) ))

stepsPerIntervalWithWeekday <- imputedActivities %>%
               group_by(interval, weekday) %>%
               summarize(meanstep = mean(steps))


```

Finally the steps as function of the interval is plotted as for both weekdays and weekend
```{r}
library(lattice)
xyplot(meanstep ~ interval | factor(weekday),
       data=stepsPerIntervalWithWeekday, type="l", layout=c(1,2), xlab = "Number of steps",
       ylab="Steps" )
```