In Smalltalk despite the fact that many important analysis tools are already present (for example, in the [PolyMath](https://github.com/PolyMathOrg/PolyMath) library), we are still missing this essential part of the data science toolkit. These specialized data structures for tabular data sets can provide us with a simple and powerful API for summarizing, cleaning, and manipulating a wealth of data sources that are currently cumbersome to use. The DataFrame and DataSeries collections, stored in this repository, are specifically designed for working with structured data.
7
+
Data frames are an essential part of the data science toolkit. They are specialized data structures for tabular data sets that provide a simple and powerful API for summarizing, cleaning, and manipulating a wealth of data sources that are currently cumbersome to use. The DataFrame and DataSeries collections, stored in this repository, are specifically designed for working with structured data.
8
8
9
9
## Installation
10
-
The following script installs DataFrame and its dependencies in Pharo 6
10
+
The following script installs DataFrame and its dependencies into a Pharo image. Along with all the other code blocks in this tutorial, this script has been tested on Pharo-6.0 and Pharo64-6.0 for both Linux and OSX, and Pharo-6.0 for Windows.
11
11
12
12
```smalltalk
13
13
Metacello new
@@ -17,24 +17,53 @@ Metacello new
17
17
```
18
18
19
19
## Tutorial
20
-
There are two primary data structures in this package:
21
-
* `DataSeries` can be seen as an Ordered Collection that combines the properties of an Array and a Dictionary, while extending the functionality of both. Every DataSeries has a name and contains an array of data mapped to a corresponding array of keys (that are used as index values).
22
-
* `DataFrame` is a tabular data structure that can be seen as an ordered collection of columns. It works like a spreadsheet or a relational database with one row per subject and one column for each subject identifier, outcome variable, explanatory variable etc. A DataFrame has both row and column indices which can be changed if needed.
20
+
The DataFrame library consists of two primary data structures:
21
+
* `DataFrame` is a spreadsheet-like tabular data structure that works like a relational database by providing a simple and powerful API for querying the data. Each row represents an observation, and each column represents a feature. Rows and columns of a DataFrame have names (keys) by which they can be accessed.
22
+
* `DataSeries` is an array-like data structure used for working with specific rows or columns of a DataFrame. It has a name and contains an array of data mapped to a corresponding array of keys. DataSeries is a SequenceableCollection that combines the properties of an Array and a Dictionary, while extending the functionality of both with advanced messages for working with data, such as statistical summaries and visualizations.
23
23
24
24
### Creating DataSeries
25
-
The easiest way of creating a series is to convert another collection (for example, an Array) to DataSeries
25
+
A DataSeries can be created from an array of values:
26
+
27
+
```smalltalk
28
+
series := DataSeries fromArray: #(a b c).
29
+
```
30
+
31
+
By extending the Collection class, the DataFrame library provides a handy shortcut for converting any collection (e.g. an Array) to a DataSeries:
26
32
27
33
```smalltalk
28
34
series := #(a b c) asDataSeries.
29
35
```
30
36
31
-
The keys will be automatically set to the numeric sequence of the array indexes, which can be described as an interval (1 to: n), where n is the size of array. The name of the series at this point will remain empty. Both the name and the keys of a DataSeries can be changed later, as follows:
37
+
By default, the keys are initialized with the interval `(1 to: self size)`. The name of a newly created series is considered empty and is set to `nil` by default. You can always change the name and keys of a series using these messages:
32
38
33
39
```smalltalk
34
40
series name: 'letters'.
35
41
series keys: #(k1 k2 k3).
36
42
```
37
43
44
+
### Accessing elements of DataSeries
45
+
When accessing the elements of a DataSeries, you can think of it as an Array. The `at:` message lets you access elements by their index, and `at:put:` modifies the element at the given index.
46
+
47
+
```smalltalk
48
+
series at: 2. "b"
49
+
series at: 3 put: 'x'.
50
+
```
51
+
52
+
Besides the standard Array accessors, DataSeries provides additional operations for accessing elements by their keys:
53
+
54
+
```smalltalk
55
+
series atKey: #k2. "b"
56
+
series atKey: #k3 put: 'x'.
57
+
```
58
+
59
+
Enumeration messages such as `do:` or `withIndexDo:` work the same as in Array, and the `collect:` message creates a new DataSeries that preserves the name and keys of the receiver.
60
+
61
+
```smalltalk
62
+
newSeries := series collect: [ :each | each, 'x' ].
63
+
newSeries name. "letters"
64
+
newSeries atKey: #k1. "ax"
65
+
```
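
The enumeration messages mentioned above are not demonstrated in the example, so here is a minimal sketch of how they might be used. It assumes, as the text states, that `do:` and `withIndexDo:` behave on a DataSeries as they do on an Array. It uses only messages already shown in this tutorial plus the standard Pharo `Transcript`, and it has not been checked against the library itself.

```smalltalk
| series |
"Build a small series with custom keys, as in the earlier examples"
series := #(a b c) asDataSeries.
series keys: #(k1 k2 k3).

"do: visits each value in order (assumed to match Array behaviour)"
series do: [ :each |
    Transcript show: each printString; cr ].

"withIndexDo: also passes the 1-based position of each value"
series withIndexDo: [ :each :index |
    Transcript show: index printString, ' -> ', each printString; cr ].
```

Note that the block passed to `withIndexDo:` receives the numeric position, not the key; use `atKey:` when you need key-based access.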
66
+
38
67
### Creating a DataFrame
39
68
There are four ways of creating a data frame:
40
69
1. Creating an empty data frame, then filling it with data