Commit 9f42c17

update README.md
1 parent e9f653b commit 9f42c17

1 file changed: +72 -10 lines changed

README.md

Lines changed: 72 additions & 10 deletions
@@ -9,7 +9,7 @@ database specific SQL. The package also implements the same transforms for Panda
99
A good introduction can be found [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Introduction/data_algebra_Introduction.ipynb),
1010
and many worked examples are [here](https://github.com/WinVector/data_algebra/tree/main/Examples). A catalog of expression methods is found [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Methods/op_catalog.csv). The pydoc documentation is [here](https://winvector.github.io/data_algebra/). And the [README](https://github.com/WinVector/data_algebra/blob/main/README.md) is a good place to check for news or updates.
1111

12-
Currently the system is primarily adapted and testing for Pandas, Google BigQuery, PostgreSQL, SQLite, and Spark. Porting and extension is designed to be easy.
12+
Currently, the system is primarily adapted and testing for Pandas, Google BigQuery, PostgreSQL, SQLite, and Spark. Porting and extension is designed to be easy.
1313

1414
[This](https://github.com/WinVector/data_algebra) is to be the [`Python`](https://www.python.org) equivalent of the [`R`](https://www.r-project.org) packages [`rquery`](https://github.com/WinVector/rquery/), [`rqdatatable`](https://github.com/WinVector/rqdatatable), and [`cdata`](https://CRAN.R-project.org/package=cdata). This package supplies piped Codd-transform style notation that can perform data engineering in [`Pandas`](https://pandas.pydata.org) and generate [`SQL`](https://en.wikipedia.org/wiki/SQL) queries from the same specification.
1515

@@ -21,7 +21,6 @@ Install `data_algebra` with `pip install data_algebra`
2121

2222
This article introduces the [`data_algebra`](https://github.com/WinVector/data_algebra) project: a data processing tool family available in `R` and `Python`. These tools are designed to transform data either in-memory or on remote databases. For an example (with video) of using `data_algebra` to re-arrange data layout please see [here](https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/ranking_pivot_example.md). The key question is: what operators (or major steps) are supported by the data algebra, and what methods (operations on columns) are supported. The operators are documented [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Introduction/data_algebra_Introduction.ipynb), and which methods can be used in which contexts is linsted [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Methods/op_catalog.csv). Also, please check the [README](https://github.com/WinVector/data_algebra/blob/main/README.md) for news.
2323

24-
2524
In particular, we will discuss the `Python` implementation (also called `data_algebra`) and its relation to the mature `R` implementations (`rquery` and `rqdatatable`).
2625

2726
## Introduction
@@ -81,7 +80,7 @@ Let's start our `Python` example. First we import the packages we are going to
8180

8281
```python
8382
import pandas
84-
from data_algebra.data_ops import * # https://github.com/WinVector/data_algebra
83+
import data_algebra
8584
import data_algebra.BigQuery
8685

8786

@@ -91,7 +90,7 @@ data_algebra.__version__
9190

9291

9392

94-
'1.1.2'
93+
'1.4.4'
9594

9695

9796

@@ -115,6 +114,19 @@ d_local
115114

116115

117116
<div>
117+
<style scoped>
118+
.dataframe tbody tr th:only-of-type {
119+
vertical-align: middle;
120+
}
121+
122+
.dataframe tbody tr th {
123+
vertical-align: top;
124+
}
125+
126+
.dataframe thead th {
127+
text-align: right;
128+
}
129+
</style>
118130
<table border="1" class="dataframe">
119131
<thead>
120132
<tr style="text-align: right;">
@@ -176,7 +188,7 @@ db_handle = data_algebra.BigQuery.example_handle()
176188
print(db_handle)
177189
```
178190

179-
BigQuery_DBHandle(db_model=BigQueryModel, conn=<google.cloud.bigquery.client.Client object at 0x7ff3390df2e0>)
191+
BigQuery_DBHandle(db_model=BigQueryModel, conn=<google.cloud.bigquery.client.Client object at 0x7f9f10a152b0>)
180192

181193

182194

@@ -196,6 +208,19 @@ remote_table_description.head
196208

197209

198210
<div>
211+
<style scoped>
212+
.dataframe tbody tr th:only-of-type {
213+
vertical-align: middle;
214+
}
215+
216+
.dataframe tbody tr th {
217+
vertical-align: top;
218+
}
219+
220+
.dataframe thead th {
221+
text-align: right;
222+
}
223+
</style>
199224
<table border="1" class="dataframe">
200225
<thead>
201226
<tr style="text-align: right;">
@@ -253,15 +278,15 @@ Normally one does not read data back from a database, but instead materializes r
253278

254279
Now we continue our example by importing the `data_algebra` components we need.
255280

256-
Now we use the `data_algebra` to define our processing pipeline: `ops`. We are writing this pipeline using a [method chaining](https://en.wikipedia.org/wiki/Method_chaining) notation where we have placed `Python` method-dot at the end of lines using the `.\` notation. This notation will look *very* much like a [pipe](https://en.wikipedia.org/wiki/Pipeline_(Unix)) to `R`/[`magrittr`](https://CRAN.R-project.org/package=magrittr) users.
281+
Now we use the `data_algebra` to define our processing pipeline: `ops`. We are writing this pipeline using a [method chaining](https://en.wikipedia.org/wiki/Method_chaining) notation. This notation will look *very* much like a [pipe](https://en.wikipedia.org/wiki/Pipeline_(Unix)) to `R`/[`magrittr`](https://CRAN.R-project.org/package=magrittr) users.
257282

258283

259284

260285
```python
261286
scale = 0.237
262287

263288
ops = (
264-
data_algebra.data_ops.describe_table(d_local, 'd')
289+
data_algebra.descr(d=d_local)
265290
.extend({'probability': f'(assessmentTotal * {scale}).exp()'})
266291
.extend({'total': 'probability.sum()'},
267292
partition_by='subjectID')
@@ -281,8 +306,6 @@ We are deliberately writing a longer pipeline of simple steps, so we can use the
281306

282307
The intent is: the user can build up very sophisticated processing pipelines using a small number of primitive steps. The pipelines tend to be long, but can still be very efficient- as they are well suited for use with `Pandas` and with `SQL` query optimizers. Most of the heavy lifting is performed by the very powerful "window functions" (triggered by use of `partition_by` and `order_by`) available on the `extend()` step. Multiple statements can be combined into extend steps, but only when they have the same window-structure, and don't create and use the same value name in the same statement (except for replacement, which is shown in this example). Many conditions are checked and enforced during pipeline construction, making debugging very easy.
283308

284-
The question is: what operators (or major steps) are supported by the data algebra, and what methods (operations on columns) are supported. The operators are documented [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Introduction/data_algebra_Introduction.ipynb), and which methods can be used in which contexts is linsted [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Methods/op_catalog.csv). Also, please check the [README](https://github.com/WinVector/data_algebra/blob/main/README.md) for news.
285-
286309
For a more Pythonic way of writing the same pipeline we can show how the code would have been formatted by [`black`](https://github.com/psf/black).
287310

288311

@@ -347,7 +370,7 @@ print(sql)
347370
```
348371

349372
-- data_algebra SQL https://github.com/WinVector/data_algebra
350-
-- dialect: BigQueryModel
373+
-- dialect: BigQueryModel 1.4.4
351374
-- string quote: "
352375
-- identifier quote: `
353376
WITH
@@ -427,6 +450,19 @@ db_handle.read_query(sql)
427450

428451

429452
<div>
453+
<style scoped>
454+
.dataframe tbody tr th:only-of-type {
455+
vertical-align: middle;
456+
}
457+
458+
.dataframe tbody tr th {
459+
vertical-align: top;
460+
}
461+
462+
.dataframe thead th {
463+
text-align: right;
464+
}
465+
</style>
430466
<table border="1" class="dataframe">
431467
<thead>
432468
<tr style="text-align: right;">
@@ -475,6 +511,19 @@ ops.eval({'d': d_local})
475511

476512

477513
<div>
514+
<style scoped>
515+
.dataframe tbody tr th:only-of-type {
516+
vertical-align: middle;
517+
}
518+
519+
.dataframe tbody tr th {
520+
vertical-align: top;
521+
}
522+
523+
.dataframe thead th {
524+
text-align: right;
525+
}
526+
</style>
478527
<table border="1" class="dataframe">
479528
<thead>
480529
<tr style="text-align: right;">
@@ -517,6 +566,19 @@ ops.transform(d_local)
517566

518567

519568
<div>
569+
<style scoped>
570+
.dataframe tbody tr th:only-of-type {
571+
vertical-align: middle;
572+
}
573+
574+
.dataframe tbody tr th {
575+
vertical-align: top;
576+
}
577+
578+
.dataframe thead th {
579+
text-align: right;
580+
}
581+
</style>
520582
<table border="1" class="dataframe">
521583
<thead>
522584
<tr style="text-align: right;">
