---
title: "Data Analysis on Iris Flowers"
author: "Sridhar"
date: "7 January 2018"
output: html_document
---

## Load the data

```{r iris}
head(iris)
```

## Scatterplot

Scatterplots show how strongly pairs of measurements are related.

```{r pressure, echo=FALSE}
library('ggvis')
iris %>% ggvis(~Sepal.Length,~Sepal.Width,fill =~Species) %>% layer_points()
```

Sepal length and sepal width are only weakly correlated across the whole dataset. Setosa is
completely separated from the other species because its sepals are shorter but wider. The real
problem is that virginica and versicolor overlap, so we move on to the petal measurements.
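
The visual impression from the sepal plot can be checked against per-species means. The following is a minimal sketch using base R's `aggregate()`; the name `sepal_means` is introduced here for illustration and is not part of the original analysis.

```{r}
# Mean sepal length and width for each species (base R only).
sepal_means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                         data = iris, FUN = mean)
sepal_means
```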

```{r}
iris %>% ggvis(~Petal.Length,~Petal.Width,fill = ~Species) %>% layer_points()
```
This scatterplot is much better: the species separate clearly, and the points lie close to a
straight line, which indicates a strong positive correlation.

## Correlations
Let's check the numerical correlations between these pairs of measurements.
```{r}
print(cor(iris$Sepal.Length,iris$Sepal.Width))
print(cor(iris$Petal.Length,iris$Petal.Width))

```
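
`cor()` computes the Pearson correlation coefficient by default. As a sketch of what that number means, the block below recomputes it by hand for the petal measurements and compares the result with `cor()`. The variable names `x`, `y`, `dx`, `dy`, and `r_manual` are illustrative and not part of the original analysis.

```{r}
x <- iris$Petal.Length
y <- iris$Petal.Width

# Pearson r: sum of products of deviations, divided by the square root of
# the product of the two sums of squared deviations.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Should agree with the built-in result up to floating-point error.
all.equal(r_manual, cor(x, y))
```
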
## Correlation matrix

The correlation matrix of the four measurements is computed separately for each species: setosa, versicolor, and virginica.
```{r}
# Species names in factor-level order: setosa, versicolor, virginica
type <- levels(iris$Species)
print(type[1])
cor(iris[iris$Species==type[1],1:4])

print(type[2])
cor(iris[iris$Species==type[2],1:4])

print(type[3])
cor(iris[iris$Species==type[3],1:4])
```
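
Repeating the same call once per species is easy to get wrong when lines are copied. One alternative, shown here only as a sketch, is to split the numeric columns by species and apply `cor()` to each piece. The name `cor_by_species` is hypothetical.

```{r}
# Split the four numeric columns by species, then compute one
# correlation matrix per group. The result is a named list.
cor_by_species <- lapply(split(iris[, 1:4], iris$Species), cor)

names(cor_by_species)
cor_by_species$versicolor
```
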
## Knowing the data

```{r}
head(iris)
```
## Structure of the data

```{r}
str(iris)
```
## Tabulations

```{r}
table(iris$Species)
```

```{r}
round(prop.table(table(iris$Species)) * 100, digits = 1)

```

```{r}
summary(iris)
summary(iris[c("Petal.Width","Sepal.Width")])
```
## Normalization

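Normalization (feature scaling) is not strictly necessary here, because all four measurements are in centimetres and on similar scales, but it often helps distance-based classifiers such as k-NN. The min-max normalization below rescales every column to the range 0 to 1. Note that the training and testing sets in the following sections are drawn from the original `iris` columns, not from `iris_norm`.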

```{r}
library(class)
# Min-max scaling: maps the smallest value to 0 and the largest to 1
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}


iris_norm <- as.data.frame(lapply(iris[1:4], normalize))


summary(iris_norm)
```
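
A quick way to confirm that min-max scaling behaves as described is to inspect the range of every scaled column. This is a self-contained sketch: `normalize_minmax` and `scaled` are hypothetical names used so that nothing defined above is overwritten. The function assumes each column has at least two distinct values, since otherwise the denominator is zero.

```{r}
# Min-max scaling: (x - min) / (max - min).
normalize_minmax <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- as.data.frame(lapply(iris[1:4], normalize_minmax))

# Each column should now have minimum 0 and maximum 1.
sapply(scaled, range)

# Worked example on a small vector: 2 -> 0, 4 -> 0.5, 6 -> 1.
normalize_minmax(c(2, 4, 6))
```
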
## Training and Testing sets

The dataset is divided into two parts:

1) Training set: used to train the classifier; it contains roughly 2/3 of the dataset.
2) Testing set: used to test the classifier; it contains roughly 1/3 of the dataset.

The rows are assigned to the two sets at random with `sample()`, and `set.seed()` fixes the random number generator so that the split is reproducible.
```{r}
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
ind
```
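
Two properties of this split are worth checking: the same seed reproduces the same assignment, and the group sizes are only approximately 2/3 and 1/3, because each row is assigned independently with probabilities 0.67 and 0.33. The sketch below illustrates both; `first_draw` and `second_draw` are hypothetical names.

```{r}
# Same seed, same call: the two assignment vectors should be identical.
set.seed(1234)
first_draw <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
set.seed(1234)
second_draw <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
identical(first_draw, second_draw)

# Rows per group and the realised proportions of the split.
table(first_draw)
round(prop.table(table(first_draw)), 2)
```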

```{r}
iris.training <- iris[ind==1, 1:4]

head(iris.training)

iris.test <- iris[ind==2, 1:4]

head(iris.test)
```
The species labels are split in the same way, using the random assignment `ind` generated above.
```{r}
iris.trainLabels <- iris[ind==1,5]

print(iris.trainLabels)

iris.testLabels <- iris[ind==2, 5]

print(iris.testLabels)
```

## Classification

Here k-nearest neighbour classification is applied: `knn()` from the `class` package predicts the
species of each row in the testing set from the `k = 3` closest rows in the training set. The
predictions are then compared with the true labels of the testing set.

```{r}
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
iris_pred
```
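
To make the idea behind k-NN concrete, here is a minimal base R sketch that classifies a single query point by majority vote among its nearest neighbours. It is not how `class::knn()` is implemented: `knn_one` is a hypothetical helper, it uses plain Euclidean distance, and it breaks voting ties by factor-level order, whereas `knn()` breaks them at random.

```{r}
# Classify one query point by majority vote among its k nearest training rows.
knn_one <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training row.
  diffs <- sweep(as.matrix(train), 2, as.numeric(unlist(query)))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows, then the most frequent label among them.
  nearest <- labels[order(dists)[seq_len(k)]]
  names(which.max(table(nearest)))
}

# Example: classify the first iris row against the whole dataset.
knn_one(iris[, 1:4], iris$Species, query = iris[1, 1:4], k = 3)
```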

## Comparison
We need to check that the classifier has predicted the species correctly. To do that, we put the predicted species and the observed species side by side. The result shows something unusual.

```{r}
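# Put the predictions and the true labels side by side.
# (Named `comparison` so that it does not shadow base R's merge() function.)
comparison <- data.frame(iris_pred, iris.testLabels)

names(comparison) <- c("Predicted Species", "Observed Species")

comparison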
```
The classifier made a small mistake: it predicted virginica for a flower that is actually versicolor.
This k-NN classification is therefore not 100% accurate.
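
Instead of reading the comparison row by row, the accuracy can be computed directly as the share of matching predictions. The sketch below repeats the split and the `knn()` call with the same seed so that it runs on its own; all names ending in `_chk` are hypothetical, chosen so that the objects created above are not overwritten.

```{r}
library(class)

# Rebuild the same split and prediction as above, under separate names.
set.seed(1234)
ind_chk <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

train_chk  <- iris[ind_chk == 1, 1:4]
test_chk   <- iris[ind_chk == 2, 1:4]
train_lab_chk <- iris[ind_chk == 1, 5]
test_lab_chk  <- iris[ind_chk == 2, 5]

pred_chk <- knn(train = train_chk, test = test_chk, cl = train_lab_chk, k = 3)

# Share of test rows whose predicted species matches the observed one.
accuracy_chk <- mean(pred_chk == test_lab_chk)
accuracy_chk

# Rows: predicted species; columns: observed species.
table(Predicted = pred_chk, Observed = test_lab_chk)
```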

## Proper summary

```{r}
library(gmodels)
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
```