# `book/05_data_and_types.md`
There are many different definitions for the term **data**. Here's one of them:
> Data ≈ values or findings obtained through observations, measurements, etc.
## History of Data
A characteristic of data is that it must be measured or recorded. Hence, data is as old as the oldest human systems for recording data. One of the earliest remnants of such an endeavor is the Ishango bone, estimated to be around 20,000 years old {cite}`pletser_ishango_1999`. This artifact, displaying systematic notches grouped in blocks, resembles an ancient tally stick, hinting at our ancestral drive to count or record.
The very essence of data collection goes hand in hand with the birth of writing and symbolic representation. Interestingly, the oldest instances of these systems were not used to create elaborate letters, prose, or captivating stories. Instead, they were wielded for what may seem mundane today: collecting and managing taxes and outstanding services. This fits a saying that is usually linked to Benjamin Franklin: *"...in this world nothing can be said to be certain, except death and taxes"*[^death-taxes].
While data has been collected and used for millennia, the term *"data"* itself is relatively recent. Originating from the Latin word *datum*, meaning "given", the English adoption of *data* is believed to date to the 1640s. The evolution of the term underscores our ever-growing understanding of, and reliance on, the structured representation of knowledge.
## Data Types
Before we dive deeper into how to acquire and process data, we first need to know some fundamental terms and their distinctions.
Conversely, *unstructured data* doesn't simply fit into a table.
**Features vs. Data Points and the Dimensionality of Data**
When we speak of data, especially in tables, we refer to *features* (depending on the field and context also called *variables* or *attributes*) and *data points*. Features are the distinct attributes or properties of the dataset. In tabular data, these often appear as columns. For instance, in a table cataloging books, features might include "Title", "Author", and "Publication Year".
On the other hand, *data points* are individual pieces of information, often represented as rows in tabular data. For example, each book listed in the aforementioned table would be a data point.
The concept of dimensionality arises from the number of features. A table with three features is 3-dimensional, while a table with 15 features is 15-dimensional. Understanding dimensionality is crucial, especially in domains like machine learning, where high dimensionality can lead to challenges such as the "curse of dimensionality".
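As a minimal sketch of these terms in code (using pandas; the book catalog below is made up for illustration):

```python
import pandas as pd

# Each column is a feature, each row a data point.
books = pd.DataFrame({
    "Title": ["Moby-Dick", "Dracula", "Emma"],
    "Author": ["Herman Melville", "Bram Stoker", "Jane Austen"],
    "Publication Year": [1851, 1897, 1815],
})

n_data_points, n_features = books.shape
print(n_data_points, "data points,", n_features, "features")
# 3 features -> this dataset is 3-dimensional.
```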
**Data vs. Metadata**
While *data* represents the core information we aim to analyze or utilize, *metadata* is the information about this data. It describes the data's context, quality, condition, origin, and other characteristics. If data is a book, metadata is the blurb on the back, providing insights about its content, author, and publication details.
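As a small illustration (the keys and values below are hypothetical; pandas offers the experimental `DataFrame.attrs` dictionary for such annotations):

```python
import pandas as pd

books = pd.DataFrame({"Title": ["Dracula"], "Publication Year": [1897]})

# Metadata describes the data's context and origin, not the data itself.
books.attrs = {
    "source": "library catalog export",  # hypothetical origin
    "retrieved": "2023-05-01",           # hypothetical collection date
    "license": "CC0",                    # hypothetical license
}
print(books.attrs)
```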
**Categorical vs. Numerical Data**
When it comes to data analysis, we often distinguish two main types of *data*: categorical and numerical.
*Categorical data* refers to data that falls into distinct groups or categories without any natural order or ranking among them. These categories are defined by qualitative characteristics that describe or identify traits or attributes. For example, the color of a shirt—whether it is blue, red, or green—represents categorical data.
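A quick sketch in pandas (the shirt colors are illustrative): categorical columns support counting and finding the most frequent value, but no arithmetic.

```python
import pandas as pd

# A categorical feature: distinct labels without any natural order.
shirts = pd.Series(["blue", "red", "green", "blue"], dtype="category")
print(shirts.value_counts())  # counting is meaningful; averaging is not
```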
Very often, the distinction between *categorical* and *numerical* data is not fine-grained enough. For later steps in a data science process, the following data scales are helpful (a code sketch after the summary table below illustrates them):
- **Nominal Scale**: This scale is used for categorizing data without implying any order. It applies to categorical data where the emphasis is on distinguishing between items based on names or labels. Typical examples are gender, country, ethnicity, or eye color.
- **Ordinal Scale**: Here, data is still categorical but with an inherent order or ranking among the categories. However, the differences between these ranks are not equal or standardized. A common example would be a rating system such as: poor, fair, good, excellent. Here, the scale implies order but not the magnitude of difference between adjacent rankings.
- **Interval Scale**: This scale applies to numerical data that has equal intervals between values but no true zero point, making ratios meaningless. Temperature measured in Celsius is a classic example, where the difference between 10°C and 20°C is the same as between 20°C and 30°C, such that we can talk about a 10°C difference in both cases. But 0°C does not denote an absence of temperature. As a consequence, 5°C is **not** five times as warm as 1°C.
- **Ratio Scale**: Ratio scales are similar to interval scales in that they feature equal spacing between values but also include a true zero point, allowing for meaningful ratios. Examples include measurements of length, weight, and age. A 10-year-old bottle of wine is indeed two times older than a 5-year-old bottle.
| Scale Type | Characteristics | Data Type | Operations | Examples |
|------------|-----------------|-----------|------------|----------|
| Nominal | Distinct categories, no order | Categorical | Counting, Mode | Gender, Country, Eye Color |
| Ordinal | Ordered categories, unequal intervals | Categorical | Counting, Mode, Median, Ranking | Ratings (poor, fair, good, excellent) |
| Interval | Equal intervals, no true zero | Numerical | Addition, Subtraction, Mean, Standard Deviation | Temperature (Celsius, Fahrenheit) |
| Ratio | Equal intervals, true zero, meaningful ratios | Numerical | All arithmetic operations | Height, Weight, Age, Income |
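To make the scales concrete, here is a minimal sketch using pandas (the rating values, temperatures, and ages are illustrative):

```python
import pandas as pd

# Ordinal scale: categories with an inherent order but unequal spacing.
ratings = pd.Categorical(
    ["good", "poor", "excellent", "fair"],
    categories=["poor", "fair", "good", "excellent"],
    ordered=True,
)
print(ratings.min(), "<", ratings.max())  # poor < excellent

# Interval scale: differences are meaningful, ratios are not.
celsius = pd.Series([1.0, 5.0])
print(celsius.iloc[1] - celsius.iloc[0])  # a 4 degree difference is meaningful
# celsius.iloc[1] / celsius.iloc[0] == 5.0, but 5°C is NOT five times as warm as 1°C.

# Ratio scale: a true zero point makes ratios meaningful.
age_years = pd.Series([5, 10])
print(age_years.iloc[1] / age_years.iloc[0])  # 2.0 -- genuinely twice as old
```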
## Big Data
Working in data science, there is no way to avoid dealing with the challenges, the promises, or even the (many) definitions of **Big Data**. Since this is not our core concern in this book, I will stick to a very simple definition, roughly following {cite}`russom2011big`, and say:
> **Big Data** ≈ Data that is too large, too complex, or too volatile to be evaluated using manual and traditional data processing methods.
Still, what does this mean? And why is there no sharp definition of which data is "big data" and which is not? In essence, there is simply no sharp boundary between big and "not big" data. The word "big" first makes us think of the sheer size of the data, or **volume**, say the number of giga-, tera-, or petabytes. This is, however, far too simple. Astrophysicists collecting super-high-resolution telescope pictures of the sky would probably never consider a dataset that fits on a USB stick to be *big* (high-resolution images take a lot of disk space!). But librarians or historians might have a very different view on what is big and what is not. To get a better intuition for such volumes: 6.5 million English Wikipedia articles require only 20GB but equate to about 3000 encyclopedia volumes, which easily makes them *big data*. And yet, ten high-resolution movies require the same storage but don't form a *big* dataset.
Beyond the volume-related discussions, other factors also contribute to whether or not we consider data as *big data*, which here means: things that further complicate the handling of the data. This can, for instance, be the **variety** of the data, but also the **velocity** with which the data needs to be processed or analyzed. Together, these are known as the **"3V's"** (Volume, Variety, Velocity) that make data count as *big data*. Over the years, people have added to this list, so that we now also have 4V's or 5V's ... but I will leave this for you to research if you want to know more about these definitions.
If you still feel like you have no idea what big data means, feel free to just go ahead to the next chapters with a basic first intuition of:
> Anything that can be done with basic Excel methods **is not Big Data**.
# `notebooks/11_correlation_analysis.ipynb`
"In everyday life, we say that something *correlates* when different events or measurements align very well. Or we refer to things such as co-incidence. In the following, however, we are looking for a metric that can numerically describe whether a correlation exists and, if so, how pronounced it is.\n",
"A first option for measuring correlation is through variance, which measures the spread of individual data points around the mean. Expanding from variance, we encounter **covariance**, a measure that extends the idea to two variables. Covariance quantifies how much two variables change together, but its value is scale-dependent, making it difficult to interpret directly.\n",
"## Correlation Matrix\n",
"The correlation matrix for data with the variables a, b, and c would thus be a matrix that contains the Pearson Correlation Coefficients for all possible combinations, i.e.:\n",
277
277
"\n",
"## Limitations of the (Pearson) Correlation Measure\n",
"Pronounced high (or low) correlation coefficients indicate actual correlations in the data, which -in the case of the Pearson correlation- usually means that there is a clear linear dependency between two features.\n",
"This approach, however, has several limitations that can complicate the interpretation of such correlation measures. In the following, the most common pitfalls will be presented.\n",
"## What Does a Correlation Tell Us?\n",
"If we discover correlations in our data, for example, through high or low Pearson Correlation Coefficients, what can we do with it?\n",
"## Correlation vs. Causality\n",
"Searching for correlations within data is often a pivotal step in the data science process. But why? What can we do with a high correlation (that isn't too high, i.e., equal to 1.0)?\n",