An Analysis of the Iris Dataset.

The iris dataset dates back almost a century to 1934, when it was collected by Edgar Anderson, an American botanist and geneticist. Anderson was trying to determine whether one species of iris could have evolved from another. On the Gaspé Peninsula in Canada, he collected 50 samples from each of two species of iris, the Setosa and the Versicolor. For each sample, he measured the length and width of the petal and of the sepal, hoping that similarities in these measurements would show whether one species had evolved from the other. Note that the current iris dataset contains a third species, the Virginica, which differs from the other two and was taken from a different colony. It is not clear whether Anderson himself collected this third sample.

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png

Around the time Anderson collected his samples, Sir Ronald Aylmer Fisher, a famous mathematician and statistician, was investigating linear discriminant analysis (LDA), a method of finding a linear combination of features that characterises or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. Anderson’s dataset had characteristics that made it suitable for this kind of analysis, and it was therefore chosen by Fisher for his classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems.
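To make the "linear combination of features" concrete, here is a minimal sketch of Fisher's two-class discriminant using only two features (petal length and petal width). It is not the code in analysis.py. The function names are hypothetical, and the ten rows are a small illustrative sample rather than the full dataset. The weights are computed as the inverse of the pooled within-class scatter matrix multiplied by the difference of the class means.

```python
from statistics import mean

# Illustrative (petal length, petal width) pairs in cm; not the full dataset.
versicolor = [(4.7, 1.4), (4.5, 1.5), (4.9, 1.5), (4.0, 1.3), (4.6, 1.5)]
virginica = [(6.0, 2.5), (5.1, 1.9), (5.9, 2.1), (5.6, 1.8), (5.8, 2.2)]


def centroid(points):
    """Mean of each of the two features."""
    return mean(x for x, _ in points), mean(y for _, y in points)


def scatter(points):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about the centroid."""
    cx, cy = centroid(points)
    sxx = sum((x - cx) ** 2 for x, _ in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    return sxx, sxy, syy


def fisher_direction(class_a, class_b):
    """Weights w = Sw^-1 (mean_a - mean_b) for two classes with two features."""
    # Pooled within-class scatter: element-wise sum of the two scatter matrices.
    sxx, sxy, syy = (a + b for a, b in zip(scatter(class_a), scatter(class_b)))
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    (ax, ay), (bx, by) = centroid(class_a), centroid(class_b)
    dx, dy = ax - bx, ay - by
    # Closed-form inverse of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    return (syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det


def project(points, weights):
    """Collapse each point to one score: the linear combination of its features."""
    return [weights[0] * x + weights[1] * y for x, y in points]


w = fisher_direction(virginica, versicolor)
versicolor_scores = project(versicolor, w)
virginica_scores = project(virginica, w)
print("weights:", w)
print("versicolor scores:", versicolor_scores)
print("virginica scores:", virginica_scores)
```

Each flower is reduced to a single score, which is the dimensionality reduction described above. A cut-off on that score then acts as a linear classifier. With all four features the same idea applies, but the matrix inverse is better left to a numerical library.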

Why is this dataset used?

The dataset includes three species of iris with 50 samples each, which makes it a convenient size to work with. Each sample has five attributes recorded: four are measurements (in centimetres) of the length and width of the sepal and of the petal, and the fifth is the species or class of iris. Previous studies have found that one species is linearly separable from the other two, but the other two are not linearly separable from each other.

Based on the combination of the four measurement features, Fisher developed a linear discriminant model to distinguish the iris species from each other. This is referred to as discriminant analysis: finding the best linear combination of independent variables that discriminates between the categories of the dependent variable, and determining whether significant differences exist among the groups. He showed that the difference between the Setosa and the Versicolor was substantially greater than the standard deviations of the compound measurements, while the difference between the Virginica and the Versicolor was less than four times the standard deviation of each species. He therefore concluded that, unlike the Setosa, the Virginica and the Versicolor are not as easily distinguished from one another based solely on the four measurements.
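The separability claim can be illustrated with a very simple check: if a single cut-off on one feature splits two groups with no overlap, the groups are linearly separable. This is a sufficient condition only, since groups that overlap on every single feature may still be separable by a combination of features. The sketch below uses hypothetical petal lengths chosen to mimic the pattern described, not the recorded measurements, and the function name is an assumption.

```python
from itertools import combinations

# Hypothetical petal lengths in cm, chosen to mimic the pattern described above;
# they are not the recorded measurements.
petal_length = {
    "setosa": [1.3, 1.4, 1.5, 1.7, 1.9],
    "versicolor": [3.0, 4.0, 4.5, 4.9, 5.1],
    "virginica": [4.5, 5.1, 5.6, 6.0, 6.9],
}


def separable_by_threshold(group_a, group_b):
    """True if one cut-off value splits the two groups with no overlap."""
    return max(group_a) < min(group_b) or max(group_b) < min(group_a)


# Compare every pair of species on this one feature.
results = {}
for first, second in combinations(petal_length, 2):
    results[(first, second)] = separable_by_threshold(
        petal_length[first], petal_length[second]
    )
    print(f"{first} vs {second}: separable = {results[(first, second)]}")
```

In this sample the Setosa values sit well apart from both other species, while the Versicolor and Virginica ranges overlap, so no single cut-off on petal length divides them cleanly.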

Conclusion of my Investigation

In my investigation of this iris dataset, I wanted to see which of the four features was the most effective in distinguishing between the three species/classes of iris. According to my statistical analysis, petal length was the most effective in determining the species of iris, followed closely by petal width.
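One possible way to rank features by how well they separate the species is the ratio of between-species variance to within-species variance (the F ratio from a one-way analysis of variance). This is a sketch under assumptions, not necessarily the method used in analysis.py. The names are hypothetical, and the fifteen rows are a small illustrative sample, so the ranking should be recomputed on all 150 rows before drawing conclusions.

```python
from statistics import mean

FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# A few illustrative rows per species (cm), not the full 150-row dataset.
SAMPLES = {
    "setosa": [
        (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
        (4.6, 3.1, 1.5, 0.2), (5.0, 3.6, 1.4, 0.2),
    ],
    "versicolor": [
        (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),
        (5.5, 2.3, 4.0, 1.3), (6.5, 2.8, 4.6, 1.5),
    ],
    "virginica": [
        (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
        (6.3, 2.9, 5.6, 1.8), (6.5, 3.0, 5.8, 2.2),
    ],
}


def f_ratio(groups):
    """Between-group variance divided by within-group variance for one feature."""
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    between_ss = 0.0
    within_ss = 0.0
    for group in groups:
        group_mean = mean(group)
        between_ss += len(group) * (group_mean - grand_mean) ** 2
        within_ss += sum((v - group_mean) ** 2 for v in group)
    between = between_ss / (len(groups) - 1)
    within = within_ss / (len(values) - len(groups))
    # No spread inside any group means the groups are perfectly distinct.
    return float("inf") if within == 0 else between / within


# Score each feature by pulling out its column for every species.
scores = {
    name: f_ratio([[row[i] for row in rows] for rows in SAMPLES.values()])
    for i, name in enumerate(FEATURES)
}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.1f}")
```

A larger ratio means the species means are far apart relative to the spread within each species. On this small sample the two petal measurements score highest, which is consistent with the conclusion above.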

This exploratory data analysis included visualising the iris dataset with scatter plots to identify the relationship between sepal lengths and widths and between petal lengths and widths, with histograms of each feature to understand its distribution, and with box plots to compare interquartile ranges and central tendencies. Finally, pair plots were used to visualise the pairwise relationships between all four features.
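As a small illustration of what a box plot summarises, the sketch below computes the median, quartiles and interquartile range for one feature using the standard library. The function name and the five values are illustrative assumptions. Quartile conventions differ between tools, so the numbers may not match a given plotting library exactly.

```python
from statistics import quantiles

# Illustrative petal lengths in cm for one species; not the full dataset.
petal_length_cm = [4.7, 4.5, 4.9, 4.0, 4.6]


def box_plot_summary(values):
    """Five-number summary plus the interquartile range (IQR)."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }


summary = box_plot_summary(petal_length_cm)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The box in a box plot spans q1 to q3, the line inside it marks the median, and the IQR is the height of the box. Comparing these values across species is what the box plots in this repository show graphically.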

Repository contents:

data
research
spec
.gitignore
A_Statistical_Analysis_of_Fishers_Iris_Dataset.docx
Fitted_Line_on_petal_length_vs_petal_width.png
Fitted_Line_on_sepal_length_vs_sepal_width.png
README.md
analysis.py
boxplot_of_petal_lengths.png
boxplot_of_petal_widths.png
boxplot_of_sepal_lengths.png
boxplot_of_sepal_widths.png
correlation_coefficients.txt
fitted_line_petal_lengths_vs_petal_widths.png
fitted_line_sepal_lengths_vs_sepal_widths.png
heatmap_of_correlation_coefficients.png
heatmap_of_iris_ correlation_coefficients.png
heatmap_of_setosa_correlation_coefficients.png
heatmap_of_versicolor_correlation_coefficients.png
heatmap_of_virginica_correlation_coefficients.png
histogram_of_petal_lengths.png
histogram_of_petal_widths.png
histogram_of_sepal_lengths.png
histogram_of_sepal_widths.png
pairplot.png
scatter_plot_petal_lengths_vs_petal_widths.png
scatter_plot_sepal_lengths_vs_sepal_widths.png
summary_statistics_petal_lengths.txt
summary_statistics_petal_widths.txt
summary_statistics_sepal_lengths.txt
summary_statistics_sepal_widths.txt
