|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "The Pandas data frame is one tool used to model tabular data in Python. It serves the role an relational database might server for in-memory data, using methods instead of a relational query language.\n", |
| 7 | + "The Pandas data frame is the most popular tool used to model tabular data in Python. It serves the role a relational database might serve for in-memory data, using methods instead of a relational query language.\n",
8 | 8 | "\n", |
9 | | - "The difference of Pandas from relational tables is not fundamental, and can even extend Pandas to accept or different operator algebras, as we have done in with [the data algebra](https://github.com/WinVector/data_algebra). However, a common missing component is: [data schema](https://en.wikipedia.org/wiki/Database_schema) definition and invariant enforcement.\n", |
| 9 | + "The differences of Pandas from relational tables are not fundamental. One can even extend Pandas to accept different operator algebras, as we have done with [the data algebra](https://github.com/WinVector/data_algebra). However, a common missing component is [data schema](https://en.wikipedia.org/wiki/Database_schema) definition, documentation, and invariant enforcement.\n",
10 | 10 | "\n", |
11 | | - "It turns out it is quite simple to add such functionality using Python decorators. This isn't particularly useful for general functions (such as `pd.merge()`), where the function is supposed to support arbitrary data schema. However, it can be *very* useful in adding checks and safety to specific applications and analysis workflows built on top such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from schema documentation and enforcement.\n", |
| 11 | + "It turns out it is quite simple to add such functionality using Python decorators. This isn't particularly useful for general functions (such as `pd.merge()`), where the function is supposed to support arbitrary data schemas. However, it can be *very* useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV files into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from schema documentation and enforcement.\n",
12 | 12 | "\n", |
13 | 13 | "I propose simple check criteria, for both function signatures and data frames, that apply to both inputs and outputs:\n",
14 | 14 | "\n", |
15 | 15 | " * Data must have *at least* the set of argument names or column names specified.\n", |
16 | | - " * Each column must *no more* types (for non-null values) than the types specified.\n", |
| 16 | + " * Each column must have *no more* types (for non-null values) than the types specified.\n", |
17 | 17 | "\n", |
18 | 18 | "In this note I will demonstrate how to add schema documentation and enforcement to Python functions working over data frames using Python decorators.\n",
19 | 19 | "\n", |
|
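The two criteria above can be phrased directly as assertions on a frame. A small illustration (the frame and the type sets here are made-up example data, and assume Pandas):

```python
import pandas as pd

# illustrative example frame (not from the article's demo)
df = pd.DataFrame({"x": [1, 2], "y": [1.0, None]})

# criterion 1: the frame has *at least* the specified columns
assert set(["x", "y"]).issubset(df.columns)

# criterion 2: each column has *no more* non-null types than specified;
# .to_list() converts numpy scalars back to plain Python values, and the
# `v == v` test skips NaN (NaN is the one value not equal to itself)
assert all(type(v) in {int} for v in df["x"].to_list() if v is not None)
assert all(type(v) in {float} for v in df["y"].to_list()
           if v is not None and v == v)
```

Note the criteria are covariant: extra columns are allowed, and a column may use fewer types than declared.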
39 | 39 | "cell_type": "markdown", |
40 | 40 | "metadata": {}, |
41 | 41 | "source": [ |
42 | | - "We supply a lightweight implementation of a schema enforcement system.\n", |
| 42 | + "I will supply a lightweight implementation of a schema enforcement system.\n", |
43 | 43 | "\n", |
44 | | - "This is a bit different that having a type system. We are not interested what is and what is not a data frame. But instead interested in documenting that the data frames we work with have:\n", |
| 44 | + "This is a bit different than having a type system. We are interested in documenting that the data frames we work with have:\n",
45 | 45 | "\n", |
46 | 46 | " * At least the columns we expect.\n", |
47 | 47 | " * No types we don't expect in those columns.\n", |
48 | 48 | "\n", |
49 | | - "These two covariant constraints are what we want to ensure we can write the operations over columns (which we need to know exist) and not get unexpected results (from unexpected types). The idea is: can we document and enforce (at least partial) schemas both on function signatures and data frames? This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated value) files. Many of these sources themselves have data schemas and schema documentation.\n", |
| 49 | + "These two covariant constraints ensure we can write operations over columns (which we need to know exist) and not get unexpected results (from unexpected types). Instead of getting downstream signalling or non-signalling errors during column operations, we get useful assertions on columns and values relative to our documented data model. This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated value) files. Many of these sources themselves have data schemas and schema documentation.\n",
50 | 50 | "\n", |
51 | | - "I've started experimenting with a Python module to automate this task as debugging feature. As it is a debugging feature we want to be able to turn the feature on or off in an entire code base easily. To do this we define a indirect importer called [`schema_check.py`](https://github.com/WinVector/data_algebra/blob/main/Examples/data_schema/schema_check.py). It's code looks like the following:\n", |
| 51 | + "I've started experimenting with a Python module to automate this task as a debugging feature. As it is a debugging feature, we want to be able to turn the feature on or off in an entire code base easily. To do this we define an indirect importer called [`schema_check.py`](https://github.com/WinVector/data_algebra/blob/main/Examples/data_schema/schema_check.py). Its code looks like the following:\n",
52 | 52 | "\n", |
53 | 53 | "```\n", |
54 | 54 | " from data_schema import SchemaCheckSwitch\n", |
|
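The indirect-importer idea can be sketched as follows. This is a hypothetical minimal re-implementation, not the actual `data_schema` code: the module-level `CHECKS_ON` flag and the `schema_check` helper name are assumptions standing in for the real `SchemaCheckSwitch` mechanism.

```python
# Hypothetical sketch of a module-wide schema-check switch (the toggle
# mechanism here is an assumption, not the data_schema implementation).
import functools
import inspect

CHECKS_ON = True  # flip to False to disable all schema checking in one place


def schema_check(required_args):
    """Return a decorator enforcing that the named arguments are supplied."""
    def deco(f):
        if not CHECKS_ON:
            return f  # checking disabled: leave the function untouched

        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            # bind_partial tolerates missing args so we can report them ourselves
            bound = inspect.signature(f).bind_partial(*args, **kwargs)
            missing = [n for n in required_args if n not in bound.arguments]
            if missing:
                raise TypeError(
                    f"function {f.__name__}(), issues: missing args {missing}")
            return f(*args, **kwargs)

        return wrapped
    return deco


@schema_check(["a", "b"])
def add(a, b):
    return a + b
```

Because every client module imports the decorator through one such indirection point, a single edit to that module turns checking on or off for the whole code base.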
77 | 77 | "cell_type": "markdown", |
78 | 78 | "metadata": {}, |
79 | 79 | "source": [ |
80 | | - "What we have imported is a decorator that shows the types schemas of at least a subset of positional and named arguments. Declarations are either Python types, or sets of types. A special case is dictoinaries, which specify a subset of the column structure of data frames. \"return_spec\" is reserved to name the return schema of the function.\n" |
| 80 | + "\n", |
| 81 | + "The usual way to define a function in Python is as follows." |
81 | 82 | ] |
82 | 83 | }, |
83 | 84 | { |
|
97 | 98 | "metadata": {}, |
98 | 99 | "source": [ |
99 | 100 | "\n", |
100 | | - "Let's decorate a function with `SchemaCheck`. The details of this decorator are documented [here](https://github.com/WinVector/Examples/tree/main/arg_types#readme)." |
| 101 | + "Let's decorate the same function with `SchemaCheck`. The details of this decorator are documented [here](https://github.com/WinVector/Examples/tree/main/arg_types#readme)." |
101 | 102 | ] |
102 | 103 | }, |
103 | 104 | { |
|
122 | 123 | "cell_type": "markdown", |
123 | 124 | "metadata": {}, |
124 | 125 | "source": [ |
125 | | - "This declaring that `fn()` expects at least:\n", |
| 126 | + "The decorator defines the type schemas of at least a subset of positional and named arguments. Declarations are either Python types or sets of types. A special case is dictionaries, which specify a subset of the column structure of function signatures or data frames. \"return_spec\" is reserved to name the return schema of the function.\n",
| 127 | + "\n", |
| 128 | + "Our `fn()` expects at least:\n", |
126 | 129 | "\n", |
127 | 130 | " * an argument `a` of type `int`.\n", |
128 | 131 | " * an argument `b` of type `int` or `float`.\n", |
129 | 132 | " * an argument `c` that is a data frame (implied by the dictionary argument), and that data frame contains a column `x` that has no non-null elements of type other than `int`.\n", |
130 | | - " * The functions returns a data frame (indicated by the dictionary argument) that has at least a column `z` that contains no non-null elements of type other than `float`.\n", |
| 133 | + " * The function returns a data frame (indicated by the dictionary argument) that has at least a column `z` that contains no non-null elements of type other than `float`.\n", |
131 | 134 | "\n", |
132 | 135 | "This gives us some enforceable invariants that can improve our code.\n", |
133 | 136 | "\n", |
134 | | - "We can somewhat see this repeated back in the decorator altered `help()`." |
| 137 | + "We can see this repeated back in the decorator-altered `help()`."
135 | 138 | ] |
136 | 139 | }, |
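As a concrete illustration of what such declarations can enforce, here is a minimal sketch of a decorator implementing the two covariant rules stated earlier. It is not the actual `SchemaCheck` implementation (see the linked documentation for that); the helper names, the keyword-argument form of `return_spec`, and the checking details are all assumptions.

```python
# Minimal sketch of a SchemaCheck-style decorator (helper names and details
# are assumptions, not the actual data_schema implementation).
import functools
import inspect
import pandas as pd


def _check_frame(name, col_spec, value):
    """Covariant checks: at least these columns, no unexpected non-null types."""
    if not isinstance(value, pd.DataFrame):
        return [f"{name} is not a data frame"]
    problems = []
    for col, allowed in col_spec.items():
        if col not in value.columns:
            problems.append(f"{name} missing column {col}")
            continue
        allowed = allowed if isinstance(allowed, set) else {allowed}
        # .to_list() converts numpy scalars to plain Python values;
        # `v == v` is False only for NaN, which we treat as null
        for v in value[col].to_list():
            if v is not None and v == v and type(v) not in allowed:
                problems.append(
                    f"{name} column {col} has unexpected type {type(v).__name__}")
                break
    return problems


def schema_check(arg_specs, *, return_spec=None):
    def deco(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            bound = inspect.signature(f).bind_partial(*args, **kwargs)
            problems = []
            for arg_name, spec in arg_specs.items():
                if arg_name not in bound.arguments:
                    problems.append(f"expected arg {arg_name} missing")
                elif isinstance(spec, dict):  # dictionary: a data frame schema
                    problems.extend(
                        _check_frame(arg_name, spec, bound.arguments[arg_name]))
                else:  # a type, or a set of allowed types
                    allowed = spec if isinstance(spec, set) else {spec}
                    if type(bound.arguments[arg_name]) not in allowed:
                        problems.append(f"arg {arg_name} has unexpected type")
            if problems:
                raise TypeError(
                    f"function {f.__name__}(), issues: " + "; ".join(problems))
            result = f(*args, **kwargs)
            if return_spec is not None:
                ret_problems = _check_frame("return value", return_spec, result)
                if ret_problems:
                    raise TypeError(
                        f"function {f.__name__}(), issues: " + "; ".join(ret_problems))
            return result
        return wrapped
    return deco


@schema_check({"a": int, "b": {int, float}, "c": {"x": int}},
              return_spec={"z": float})
def fn(a, b, c):
    return pd.DataFrame({"z": [a + b + c["x"].sum()]})
```

Bad arguments or a bad return value raise a `TypeError` at the function boundary, rather than surfacing as a confusing error deep inside a column operation.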
137 | 140 | { |
|
168 | 171 | "cell_type": "markdown", |
169 | 172 | "metadata": {}, |
170 | 173 | "source": [ |
171 | | - "It is a learnable schema specification convention.\n", |
| 174 | + "This is a learnable schema specification convention.\n", |
172 | 175 | "\n", |
173 | 176 | "Let's see it catch an error. We show what happens if we call `fn()` with none of the expected arguments." |
174 | 177 | ] |
|
201 | 204 | "assert threw" |
202 | 205 | ] |
203 | 206 | }, |
204 | | - { |
205 | | - "cell_type": "markdown", |
206 | | - "metadata": {}, |
207 | | - "source": [ |
208 | | - "We can try with just one argument missing." |
209 | | - ] |
210 | | - }, |
211 | | - { |
212 | | - "cell_type": "code", |
213 | | - "execution_count": 7, |
214 | | - "metadata": {}, |
215 | | - "outputs": [ |
216 | | - { |
217 | | - "name": "stdout", |
218 | | - "output_type": "stream", |
219 | | - "text": [ |
220 | | - "\n", |
221 | | - "function fn(), issues:\n", |
222 | | - "expected arg c missing\n" |
223 | | - ] |
224 | | - } |
225 | | - ], |
226 | | - "source": [ |
227 | | - "# catch schema mismatch\n", |
228 | | - "try:\n", |
229 | | - " fn(1, 2)\n", |
230 | | - "except TypeError as e:\n", |
231 | | - " print(e)\n", |
232 | | - " threw = True\n", |
233 | | - "assert threw" |
234 | | - ] |
235 | | - }, |
236 | 207 | { |
237 | 208 | "cell_type": "markdown", |
238 | 209 | "metadata": {}, |
|
242 | 213 | }, |
243 | 214 | { |
244 | 215 | "cell_type": "code", |
245 | | - "execution_count": 8, |
| 216 | + "execution_count": 7, |
246 | 217 | "metadata": {}, |
247 | 218 | "outputs": [ |
248 | 219 | { |
|
270 | 241 | "cell_type": "markdown", |
271 | 242 | "metadata": {}, |
272 | 243 | "source": [ |
273 | | - "And we show that this checking pushes down into the structure of data frame arguments! \n", |
274 | | - "\n", |
275 | | - "These sort of checks are not for generic utility methods (such as `pd.merge()`), which are designed to work over a larger variety of schema. However, they are very useful near client interfaces, APIs, and database tables. This is a place where there is fixed schema information, and one can benefit from preserving it for just a bit longer. This technique and [the data algebra](https://github.com/WinVector/data_algebra) may naturally live near data sources.\n", |
276 | | - "\n", |
277 | | - "In data science the natural types are data frame schemas, knowing the type of the outer variables just isn't and interesting invariant." |
| 244 | + "And we show that this checking pushes down into the structure of data frame arguments! In our next example, the data frame argument is missing a required column.\n"
278 | 245 | ] |
279 | 246 | }, |
280 | 247 | { |
281 | 248 | "cell_type": "code", |
282 | | - "execution_count": 9, |
| 249 | + "execution_count": 8, |
283 | 250 | "metadata": {}, |
284 | 251 | "outputs": [ |
285 | 252 | { |
|
312 | 279 | }, |
313 | 280 | { |
314 | 281 | "cell_type": "code", |
315 | | - "execution_count": 10, |
| 282 | + "execution_count": 9, |
316 | 283 | "metadata": {}, |
317 | 284 | "outputs": [ |
318 | 285 | { |
|
340 | 307 | "cell_type": "markdown", |
341 | 308 | "metadata": {}, |
342 | 309 | "source": [ |
343 | | - "And we check return types." |
| 310 | + "And we can check return types." |
344 | 311 | ] |
345 | 312 | }, |
346 | 313 | { |
347 | 314 | "cell_type": "code", |
348 | | - "execution_count": 11, |
| 315 | + "execution_count": 10, |
349 | 316 | "metadata": {}, |
350 | 317 | "outputs": [ |
351 | 318 | { |
|
393 | 360 | "0 7.0" |
394 | 361 | ] |
395 | 362 | }, |
396 | | - "execution_count": 11, |
| 363 | + "execution_count": 10, |
397 | 364 | "metadata": {}, |
398 | 365 | "output_type": "execute_result" |
399 | 366 | } |
|
424 | 391 | "source": [ |
425 | 392 | "Notice the rejected return value is attached to the `TypeError` to help with diagnosis and debugging.\n", |
426 | 393 | "\n", |
| 394 | + "Again, these sorts of checks are not for generic utility methods (such as `pd.merge()`), which are designed to work over a larger variety of schemas. However, they are very useful near client interfaces, APIs, and database tables. These are places where there is fixed schema information, and one can benefit from preserving it for just a bit longer. This technique and [data algebra](https://github.com/WinVector/data_algebra) processing may naturally live near data sources. In data science the natural types are data frame schemas; knowing the type of the outer variables just isn't an interesting invariant.\n",
| 395 | + "\n", |
427 | 396 | "Now, let's show a successful call." |
428 | 397 | ] |
429 | 398 | }, |
430 | 399 | { |
431 | 400 | "cell_type": "code", |
432 | | - "execution_count": 12, |
| 401 | + "execution_count": 11, |
433 | 402 | "metadata": {}, |
434 | 403 | "outputs": [ |
435 | 404 | { |
|
470 | 439 | "0 7.0" |
471 | 440 | ] |
472 | 441 | }, |
473 | | - "execution_count": 12, |
| 442 | + "execution_count": 11, |
474 | 443 | "metadata": {}, |
475 | 444 | "output_type": "execute_result" |
476 | 445 | } |
|
492 | 461 | }, |
493 | 462 | { |
494 | 463 | "cell_type": "code", |
495 | | - "execution_count": 13, |
| 464 | + "execution_count": 12, |
496 | 465 | "metadata": {}, |
497 | 466 | "outputs": [], |
498 | 467 | "source": [ |
|
509 | 478 | }, |
510 | 479 | { |
511 | 480 | "cell_type": "code", |
512 | | - "execution_count": 14, |
| 481 | + "execution_count": 13, |
513 | 482 | "metadata": {}, |
514 | 483 | "outputs": [ |
515 | 484 | { |
|
550 | 519 | "0 7.0" |
551 | 520 | ] |
552 | 521 | }, |
553 | | - "execution_count": 14, |
| 522 | + "execution_count": 13, |
554 | 523 | "metadata": {}, |
555 | 524 | "output_type": "execute_result" |
556 | 525 | } |
|
570 | 539 | "source": [ |
571 | 540 | "The return value is missing the required `z` column, but with checks off the function is not interfered with.\n",
572 | 541 | "\n", |
573 | | - "The idea is: when checks are on failures are detected much closer to causes, making debugging and diagnosis much easier. Also the decorations are a easy way to document in human readable form some basics of the expected input and output schemas.\n", |
| 542 | + "When checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. Also, the decorations are an easy way to document, in human-readable form, some basics of the expected input and output schemas.\n",
574 | 543 | "\n", |
575 | | - "And the input and output schema are attached to the function as objects." |
| 544 | + "And, the input and output schemas are attached to the function as objects."
576 | 545 | ] |
577 | 546 | }, |
578 | 547 | { |
579 | 548 | "cell_type": "code", |
580 | | - "execution_count": 15, |
| 549 | + "execution_count": 14, |
581 | 550 | "metadata": {}, |
582 | 551 | "outputs": [ |
583 | 552 | { |
|
597 | 566 | }, |
598 | 567 | { |
599 | 568 | "cell_type": "code", |
600 | | - "execution_count": 16, |
| 569 | + "execution_count": 15, |
601 | 570 | "metadata": {}, |
602 | 571 | "outputs": [ |
603 | 572 | { |
|
617 | 586 | "cell_type": "markdown", |
618 | 587 | "metadata": {}, |
619 | 588 | "source": [ |
620 | | - "This makes the schema data available for other use, even some automatic checking of function composition conditions!\n", |
| 589 | + "This makes the schema data available for other uses.\n", |
621 | 590 | "\n", |
622 | | - "\n", |
623 | | - "The technique can run into what I call \"the first rule of meta-programming\". Meta-programming only works as long as it doesn't run into other meta-programming. That being said, I feel these decorators can be very valuable in Python data science projects.\n", |
| 591 | + "A downside is that the technique *can* run into what I call \"the first rule of meta-programming\": meta-programming only works as long as it doesn't run into other meta-programming (also called the \"it's only funny when I do it\" theorem). That being said, I feel these decorators can be very valuable in Python data science projects.\n",
624 | 592 | "\n", |
625 | 593 | "The implementation, documentation, and demo of this methodology can be found [here](https://github.com/WinVector/data_algebra/tree/main/Examples/data_schema).\n" |
626 | 594 | ] |
|
629 | 597 | "cell_type": "markdown", |
630 | 598 | "metadata": {}, |
631 | 599 | "source": [ |
632 | | - "Note, the system works about the same with Polars instead of Pandas as the data frame realization." |
| 600 | + "Note, the system also works with Polars instead of Pandas as the data frame realization."
633 | 601 | ] |
634 | 602 | }, |
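One way to see why the same checks can apply to both libraries: the covariant column check only needs a tiny surface from the frame library. The sketch below is a hypothetical mechanism, not the actual implementation; it uses only `.columns` and per-column `.to_list()`, which Pandas and Polars both provide, and is demonstrated here with Pandas.

```python
# Hypothetical duck-typed column check; works for any frame type exposing
# .columns and per-column .to_list() (both pandas and polars do).
import pandas as pd


def check_columns(df, col_spec):
    """Return a list of schema problems (empty means the frame passes)."""
    problems = []
    for col, allowed in col_spec.items():
        if col not in df.columns:
            problems.append(f"missing column {col}")
            continue
        allowed = allowed if isinstance(allowed, set) else {allowed}
        # .to_list() yields plain Python values in both libraries;
        # `v == v` is False only for NaN, which we treat as null
        bad = [v for v in df[col].to_list()
               if v is not None and v == v and type(v) not in allowed]
        if bad:
            problems.append(f"column {col} has unexpected types")
    return problems


issues = check_columns(pd.DataFrame({"x": [1, 2]}), {"x": int, "y": float})
```

Passing `polars.DataFrame({"x": [1, 2]})` instead would exercise exactly the same code path.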
635 | 603 | { |
636 | 604 | "cell_type": "code", |
637 | | - "execution_count": 17, |
| 605 | + "execution_count": 16, |
638 | 606 | "metadata": {}, |
639 | 607 | "outputs": [], |
640 | 608 | "source": [ |
|
644 | 612 | }, |
645 | 613 | { |
646 | 614 | "cell_type": "code", |
647 | | - "execution_count": 18, |
| 615 | + "execution_count": 17, |
648 | 616 | "metadata": {}, |
649 | 617 | "outputs": [ |
650 | 618 | { |
|
670 | 638 | }, |
671 | 639 | { |
672 | 640 | "cell_type": "code", |
673 | | - "execution_count": 19, |
| 641 | + "execution_count": 18, |
674 | 642 | "metadata": {}, |
675 | 643 | "outputs": [ |
676 | 644 | { |
|
703 | 671 | "└─────┘" |
704 | 672 | ] |
705 | 673 | }, |
706 | | - "execution_count": 19, |
| 674 | + "execution_count": 18, |
707 | 675 | "metadata": {}, |
708 | 676 | "output_type": "execute_result" |
709 | 677 | } |
|
730 | 698 | }, |
731 | 699 | { |
732 | 700 | "cell_type": "code", |
733 | | - "execution_count": 20, |
| 701 | + "execution_count": 19, |
734 | 702 | "metadata": {}, |
735 | 703 | "outputs": [ |
736 | 704 | { |
|
756 | 724 | "└─────┘" |
757 | 725 | ] |
758 | 726 | }, |
759 | | - "execution_count": 20, |
| 727 | + "execution_count": 19, |
760 | 728 | "metadata": {}, |
761 | 729 | "output_type": "execute_result" |
762 | 730 | } |
|
774 | 742 | "cell_type": "markdown", |
775 | 743 | "metadata": {}, |
776 | 744 | "source": [ |
777 | | - "The `SchemaCheck` decoration is simple and effective tool to add schema documentation and enforcement to your analytics projects." |
| 745 | + "And we also have simple \"types in data frame\" inspection tools [here](https://github.com/WinVector/data_algebra/blob/main/Examples/data_schema/df_types.ipynb)." |
| 746 | + ] |
| 747 | + }, |
| 748 | + { |
| 749 | + "cell_type": "markdown", |
| 750 | + "metadata": {}, |
| 751 | + "source": [ |
| 752 | + "In conclusion: the `SchemaCheck` decoration is a simple and effective tool to add schema documentation and enforcement to your analytics projects."
778 | 753 | ] |
779 | 754 | }, |
780 | 755 | { |
781 | 756 | "cell_type": "code", |
782 | | - "execution_count": 21, |
| 757 | + "execution_count": 20, |
783 | 758 | "metadata": {}, |
784 | 759 | "outputs": [ |
785 | 760 | { |
|
0 commit comments