---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---

The data for this report come from the [Activity monitoring data](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip) set.

## Loading and preprocessing the data

After downloading and unzipping the data, we load it into R and store it in a variable called `data`.
```{r, cache = TRUE}
setwd("D:\\working folder\\RepData_PeerAssessment1")
data <- read.csv("activity.csv")
head(data)
```
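
How `read.csv()` types the `date` column depends on the R version (a factor by default before R 4.0, a character vector afterwards). As a hedged sketch, the column types can be declared explicitly with `colClasses`. The temporary file and its two rows below are hypothetical stand-ins for `activity.csv`, and the names prefixed with `toy` are not part of the analysis.

```{r}
# Hypothetical stand-in for activity.csv, written to a temporary file
toycsv <- tempfile(fileext = ".csv")
writeLines(c('"steps","date","interval"',
             'NA,"2012-10-01",0',
             '3,"2012-10-01",5'), toycsv)

# Declare the column types instead of relying on version-dependent defaults
toyread <- read.csv(toycsv, colClasses = c("integer", "Date", "integer"))
str(toyread)
```

With `date` stored as a `Date`, later grouping and weekday calculations do not depend on factor handling.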

Note that there are `NA` values in `data$steps`, so we first remove the rows that contain them and store the new data frame in a variable called `dataclean`.
```{r, cache=TRUE}
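# Keep only the rows without any missing value
dataclean <- data[complete.cases(data), ]
# Rebuild the factor so that days with no remaining rows are dropped
dataclean$date <- factor(dataclean$date)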
```
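
Rebuilding the `date` factor matters when `date` was read as a factor: subsetting rows keeps the levels of days that no longer have any observations, and `tapply()` then returns `NA` for those days. A minimal sketch on made-up data (the `toyna` frame and its values are hypothetical):

```{r}
# Hypothetical toy data: every reading on the second day is missing
toyna <- data.frame(
  steps = c(5, 7, NA, NA),
  date  = factor(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"))
)
toynaclean <- toyna[complete.cases(toyna), ]

# The unused level "2012-10-02" survives subsetting and yields NA
stale <- tapply(toynaclean$steps, toynaclean$date, sum)

# Rebuilding the factor drops the unused level
fresh <- tapply(toynaclean$steps, factor(toynaclean$date), sum)
```

Here `stale` has two entries, one of them `NA`, so `mean(stale)` would also be `NA`, while `fresh` contains only the day that still has data.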

## What is mean total number of steps taken per day?

From `dataclean`, the total number of steps for each day can be easily calculated. The mean, median and histogram of these daily totals are presented below.
```{r, cache=TRUE}
daysteps <- tapply(dataclean$steps, dataclean$date, sum)
mean(daysteps)
median(daysteps)
hist(daysteps, main = "Histogram of total number of steps of each day", xlab = "total steps (/day)")
```

## What is the average daily activity pattern?

We first split the steps by interval, and then take the mean of each group across all days.
```{r, cache=TRUE}
meansteps <- sapply(split(dataclean$steps, dataclean$interval), mean)
plot(levels(factor(dataclean$interval)), meansteps, type = "l", xlab = "intervals (min)", ylab = "mean number of steps" )
```

The interval that corresponds to the maximum mean number of steps is
```{r, cache = TRUE}
maxM <- max(meansteps)
lvl = levels(factor(dataclean$interval))
lvl[meansteps == maxM]
```
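
As an alternative sketch, the same lookup can be written with `which.max()`, relying on the names that `sapply(split(...))` attaches to its result. The toy values are hypothetical. Unlike the comparison above, `which.max()` returns only the first maximum if several intervals tie.

```{r}
# Hypothetical toy data: three intervals, observed on two days each
toysteps    <- c(0, 10, 4, 2, 30, 6)
toyinterval <- c(0, 5, 10, 0, 5, 10)

# Mean steps per interval; the result is named by interval
toymeans <- sapply(split(toysteps, toyinterval), mean)

# which.max() gives the position of the first maximum; its name is the interval
peak <- names(toymeans)[which.max(toymeans)]
peak
```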

## Imputing missing values

The total number of rows with `NA`s is
```{r}
idx = is.na(data$steps)
sum(idx)
```

We fill in these `NA` values with the mean value of the corresponding interval. We create a new variable called `stepsnew` to store the steps with the missing data filled in.
```{r, cache=TRUE}
stepsnew <- data$steps
for (i in seq_along(stepsnew)[idx]) {
  stepsnew[i] <- meansteps[lvl == data$interval[i]]
}
```
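
The loop above looks up one interval mean per missing row. A vectorized sketch of the same idea uses `match()` to find all positions at once. It runs on hypothetical toy data, and all names prefixed with `toy` are illustrative only.

```{r}
# Hypothetical toy data: two intervals observed on three days
toyimp <- data.frame(
  steps    = c(2, NA, 4, 10, NA, NA),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Per-interval means over the observed values only
toyimpmeans <- tapply(toyimp$steps, toyimp$interval, mean, na.rm = TRUE)

# For each missing row, find the position of its interval among the means
toymiss <- is.na(toyimp$steps)
pos <- match(toyimp$interval[toymiss], as.numeric(names(toyimpmeans)))

# Replace the missing values with the matching interval means
toyfilled <- toyimp$steps
toyfilled[toymiss] <- toyimpmeans[pos]
toyfilled
```

The observed values are left untouched; only the rows flagged in `toymiss` are overwritten.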

We can calculate the number of missing values in `stepsnew` to show that all the missing values are filled in.
```{r}
sum(is.na(stepsnew))
```

We can create a data frame called `datanew` to store the original data with the missing values filled in. Again, we show the mean, median and histogram of the daily totals, as in the second step.
```{r, cache=TRUE}
datanew <- data
datanew$steps <- stepsnew
daystepsnew <- tapply(datanew$steps, datanew$date, sum)
mean(daystepsnew)
median(daystepsnew)
hist(daystepsnew, main = "Histogram of total number of steps of each day", xlab = "total steps (/day)")
```

The results are very similar, showing that we filled in the missing values without affecting the original statistics much.

## Are there differences in activity patterns between weekdays and weekends?

To study the differences in activity patterns between weekdays and weekends, we first separate the days into two categories, "weekdays" and "weekends". We label the data by adding a new column to `datanew`.
```{r, cache=TRUE}
t <- strptime(data$date, "%Y-%m-%d")
wkday <- weekdays(t, abbreviate = TRUE)
weekday = c("Mon","Tue","Wed","Thu","Fri")
datanew["Weekdays"] <- NA
datanew$Weekdays[wkday %in% weekday] <- "weekdays"
datanew$Weekdays[!(wkday %in% weekday)] <- "weekends"
```
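
`weekdays()` returns day names in the current locale, so matching against `"Mon"` to `"Fri"` assumes an English locale. A locale-independent sketch uses the numeric `wday` field of `as.POSIXlt()` instead. The sample dates and the `toy` names below are hypothetical.

```{r}
# Hypothetical sample dates: 2012-10-05 was a Friday
toydates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday counts days since Sunday: 0 = Sunday, 6 = Saturday
toywday <- as.POSIXlt(toydates)$wday

# Label Saturday and Sunday as weekends, everything else as weekdays
toydaytype <- ifelse(toywday %in% c(0, 6), "weekends", "weekdays")
toydaytype
```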

Now `datanew` has a column indicating whether each observation falls on a weekday or a weekend. We create a new data frame holding the mean number of steps for each combination of interval and day type.
```{r}
stepofday <- aggregate(datanew$steps, list(datanew$interval, factor(datanew$Weekdays)), mean)
names(stepofday) <- c("interval", "weekday", "meanstep")
library(lattice)
xyplot(meanstep~as.numeric(interval)|weekday, data = stepofday, layout = c(1,2),type = "l", xlab = "intervals (min)", ylab = "mean number of steps" , lwd = 1.5)
```

From the plot we see that, while the weekday and weekend patterns differ, the differences are not large.