Commit 635b74a

committed: edit, rebuild, and recheck
1 parent a87b18e

File tree

7 files changed: +41 −59 lines changed

Examples/data_schema/df_types.ipynb

Lines changed: 27 additions & 49 deletions
@@ -12,7 +12,9 @@
  " * Separate types for atomic columns (such as `int`, `bool`, and `float`) and columns of objects (such as `str`).\n",
  " * No out-of-band representation of missing values. Instead, missingness must be signaled by the insertion of a value representing missingness. This causes problems for types that don't have such a representation such as `int` and `bool`.\n",
  "\n",
- "To work around the above the Pandas data frame have a number of non-avoidable column type promotion rules and cell type promotion rules. Let's take a look at a problem data frame."
+ "To work around the above, the Pandas data frame has a number of unavoidable column type promotion rules and cell type promotion rules. These promotion rules can introduce their own complexity.\n",
+ "\n",
+ "Let's take a look at a Pandas data frame."
  ]
 },
 {
@@ -198,25 +200,22 @@
  "metadata": {},
  "outputs": [
   {
-   "data": {
-    "text/plain": [
-     "{'b': {int},\n",
-     " 'q': None,\n",
-     " 'r': {float},\n",
-     " 's': {float},\n",
-     " 'x': {float},\n",
-     " 'y': {str},\n",
-     " 'z': {bool, float, int}}"
-    ]
-   },
-   "execution_count": 5,
-   "metadata": {},
-   "output_type": "execute_result"
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "{'b': {<class 'int'>},\n",
+    " 'q': None,\n",
+    " 'r': {<class 'float'>},\n",
+    " 's': {<class 'float'>},\n",
+    " 'x': {<class 'float'>},\n",
+    " 'y': {<class 'str'>},\n",
+    " 'z': {<class 'bool'>, <class 'float'>, <class 'int'>}}\n"
+   ]
   }
  ],
  "source": [
  "# report non-null (not None, NaN, or NaT) types found in cells\n",
- "non_null_types_in_frame(d)"
+ "pprint(non_null_types_in_frame(d))"
  ]
 },
 {
@@ -299,7 +298,9 @@
  "If you are not sure all of your code base (and its dependencies) are consistently only using columns or only using the values attribute, you may experience incompatible mixed types even on uniform data. We know one is not supposed to use \"`.values`\" [from the Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html):\n",
  "\n",
  "\n",
- "\n",
+ "<dd>\n",
+ "pandas.DataFrame.values\n",
+ "property DataFrame.values\n",
  "<dd><p>Return a Numpy representation of the DataFrame.</p>\n",
  "<div class=\"admonition warning\">\n",
  "<p class=\"admonition-title\">Warning</p>\n",
@@ -311,11 +312,11 @@
  "<dt class=\"field-odd\">Returns<span class=\"colon\">:</span></dt>\n",
  "<dd class=\"field-odd\"><dl class=\"simple\">\n",
  "<dt>numpy.ndarray</dt><dd><p>The values of the DataFrame.</p>\n",
- "</dd></dd></dd>\n",
+ "</dd></dd></dd></dd>\n",
  "\n",
  "So, presumably, Pandas `.values` is not in fact the attribute it syntactically presents as, but rather a method interface.\n",
  "\n",
- "The type recommended `.to_numpy()` seems to return the same `numpy.float64`, which presumably is *not* what is inside the Pandas data frame columns are Series representations."
+ "The recommended method `.to_numpy()` seems to return the same `numpy.float64`, which presumably is *not* what is inside the Pandas data frame columns or Series representations. In any case, the type you see in a cell depends on what types are in related cells, and on what path you use to access the value."
  ]
 },
 {
@@ -342,7 +343,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Any and all of the above inconsistencies can be fairly hazardous to any system that tries to export Pandas to other type sensitive systems (such as databases, JSON, arrow and so on)."
+ "Any and all of the above inconsistencies can be fairly hazardous to any insufficiently careful system that tries to export Pandas data to other type-sensitive systems (such as databases, JSON, Arrow, and so on)."
  ]
 },
 {
@@ -351,38 +352,15 @@
  "metadata": {},
  "outputs": [
   {
-   "data": {
-    "text/plain": [
-     "'2.0.3'"
-    ]
-   },
-   "execution_count": 10,
-   "metadata": {},
-   "output_type": "execute_result"
-  }
- ],
- "source": [
-  "pd.__version__"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
-  {
-   "data": {
-    "text/plain": [
-     "'1.25.2'"
-    ]
-   },
-   "execution_count": 11,
-   "metadata": {},
-   "output_type": "execute_result"
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "{'np': '1.25.2', 'pd': '2.0.3'}\n"
+   ]
   }
  ],
  "source": [
- "np.__version__"
+ "pprint({'np': np.__version__, 'pd': pd.__version__})"
  ]
 }
 ],
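The promotion effects this notebook describes can be reproduced with plain Pandas. A minimal sketch (the frames `d` and `d2` here are illustrative stand-ins, not the notebook's own data):

```python
import numpy as np
import pandas as pd

# Missing values force column type promotion: Pandas int64 columns have
# no missing-value representation, so inserting None promotes to float64.
d = pd.DataFrame({"x": [1, 2, None]})
print(d["x"].dtype)  # float64

# The dtype you observe also depends on the access path: .values builds
# one NumPy array over all columns, promoting ints to the common dtype.
d2 = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(d2["a"].dtype)          # int64 when read as a column
print(d2.values.dtype)        # float64 when read via frame-wide .values
print(type(d2.values[0, 0]))  # <class 'numpy.float64'>
```

This is the sense in which the type seen in a cell depends both on neighboring cells and on the access path used.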

Examples/data_schema/schema_check.ipynb

Lines changed: 9 additions & 5 deletions
@@ -6,9 +6,9 @@
 "source": [
  "The Pandas data frame is one tool used to model tabular data in Python. It serves the role a relational database might serve for in-memory data, using methods instead of a relational query language.\n",
  "\n",
- "The difference of Pandas from relational tables is not fundamental, and can even extend Pandas to accept or different operator algebras, as we have done in with [the data algebra](https://github.com/WinVector/data_algebra). However, a missing component is: [data schema](https://en.wikipedia.org/wiki/Database_schema) definition and invariant enforcement.\n",
+ "The difference of Pandas from relational tables is not fundamental; one can even extend Pandas to accept different operator algebras, as we have done with [the data algebra](https://github.com/WinVector/data_algebra). However, a common missing component is: [data schema](https://en.wikipedia.org/wiki/Database_schema) definition and invariant enforcement.\n",
  "\n",
- "It turns out it is quite simple to add such functionality using Python decorators. This isn't particularly useful for general functions (such as `pd.merge()`), where the function is supposed to support arbitrary data schema. However, it can be very useful in adding checks and safety to specific applications and analysis workflows built on top such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV into enforced application invariants.\n",
+ "It turns out it is quite simple to add such functionality using Python decorators. This isn't particularly useful for general functions (such as `pd.merge()`), where the function is supposed to support arbitrary data schema. However, it can be *very* useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from schema documentation and enforcement.\n",
  "\n",
  "In this note I will demonstrate how to add schema documentation and enforcement to Python functions working over data frames using Python decorators.\n",
  "\n",
@@ -27,7 +27,7 @@
  "import pandas as pd\n",
  "import polars as pl\n",
  "import data_algebra as da\n",
- "from data_algebra.data_schema import non_null_types_in_frame, SchemaCheckSwitch"
+ "from data_algebra.data_schema import SchemaCheckSwitch"
  ]
 },
 {
@@ -769,7 +769,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "And that is a simple tool to add schemas to some of you data analysis functions."
+ "The `SchemaCheck` decoration is a simple and effective tool for adding schema documentation and enforcement to your analytics projects."
  ]
 },
 {
@@ -787,7 +787,11 @@
  ],
  "source": [
  "# show some relevant versions\n",
- "pprint({'pd': pd.__version__, 'pl': pl.__version__, 'np': np.__version__, 'da': da.__version__})"
+ "pprint({\n",
+ "    'pd': pd.__version__, \n",
+ "    'pl': pl.__version__, \n",
+ "    'np': np.__version__, \n",
+ "    'da': da.__version__})"
  ]
 }
 ],
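The `SchemaCheck` decorator itself is not shown in this diff. The decorator idea the notebook describes can be sketched roughly as follows (a hypothetical `check_schema` for illustration only, not the data_algebra `SchemaCheck` API):

```python
import functools
import pandas as pd

def check_schema(required):
    """Require that the first argument is a DataFrame containing at least
    the given column -> NumPy dtype-kind pairs (e.g. 'f' for float).
    A hypothetical illustration, not the data_algebra SchemaCheck API."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapped(d, *args, **kwargs):
            for col, kind in required.items():
                if col not in d.columns:
                    raise TypeError(f"{fn.__name__}: missing column {col!r}")
                if d[col].dtype.kind != kind:
                    raise TypeError(
                        f"{fn.__name__}: column {col!r} has dtype kind "
                        f"{d[col].dtype.kind!r}, expected {kind!r}")
            return fn(d, *args, **kwargs)
        return wrapped
    return deco

@check_schema({"x": "f"})  # document and enforce: a float column "x" is required
def total(d):
    return d["x"].sum()

print(total(pd.DataFrame({"x": [1.0, 2.0]})))  # 3.0
```

The decorator both documents the expected input schema at the definition site and fails fast, naming the offending function, when the invariant is violated.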

data_algebra.egg-info/requires.txt

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-numpy>=1.25.2
-pandas>=2.0.3
+numpy>=1.25.0
+pandas>=2.0.0
 lark
 
 [BigQuery]
-1 Bytes
Binary file not shown.

dist/data_algebra-1.6.9.tar.gz

1 Byte
Binary file not shown.

rebuild.bash

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ bash ./clean.bash
 # pytest --cov data_algebra tests > coverage.txt
 pip install --no-deps -e "$(pwd)"
 pytest --cov-report term-missing --cov data_algebra tests > coverage.txt
-pdoc3 -o ./docs ./data_algebra
+pdoc3 --force -o ./docs ./data_algebra
 cat coverage.txt
 python3 setup.py sdist bdist_wheel
 # pip install dist/data_algebra-*.tar.gz

setup.py

Lines changed: 2 additions & 2 deletions
@@ -27,8 +27,8 @@
     url='https://github.com/WinVector/data_algebra',
     packages=setuptools.find_packages(exclude=['tests', 'Examples']),
     install_requires=[
-        "numpy>=1.25.2",
-        "pandas>=2.0.3",
+        "numpy>=1.25.0",
+        "pandas>=2.0.0",
         "lark"
     ],
     extras_require={

0 commit comments