[DRAFT] Create a POC writer test for discussion purposes #26
Conversation
wjones127 left a comment:
Thanks for drafting this. I used this to write down my thoughts on the different parts of the test.
```python
# append data to the reference table
rdd = spark.sparkContext.parallelize([("z", 9, 9.9)])
df = rdd.toDF(["letter", "number", "a_float"])
df.write.format("delta").mode("append").save("tmp/delta")
```
This will be one of the main challenges: how can we describe these operations in an engine-agnostic way? There are two possible levels to do this at:
- A machine-readable description that could be automatically translated. This would be nice, but I'm not confident it would be feasible.
- A human-readable description that would have to be hand-written by implementors. I think this would be sufficient, since its semantics are going to be checked by the expected result.
Yep, I vote for the "human-readable" approach as the first pass and then see if we can make something more clever down the road when we know the problem a bit better.
+1 for human-readable as a first pass.
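To make the human-readable approach concrete, here is one possible shape for a test-case descriptor. This is only a sketch for discussion; every field name in it is hypothetical and not part of any agreed format:

```python
# Hypothetical test-case descriptor. Each writer implementation reads the
# prose in "operation" and performs it with its own engine; the harness then
# compares the resulting table against "expected_table".
test_case = {
    "name": "append-basic",
    # Reference table the implementor starts from (shipped with the suite).
    "initial_table": "tables/reference",
    # Human-readable instructions, hand-written as discussed above.
    "operation": (
        "Append one row ('z', 9, 9.9) to the table, matching the existing "
        "columns (letter, number, a_float)."
    ),
    # Expected table state after the operation (also shipped with the suite).
    "expected_table": "tables/reference_after_append",
}
```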
```python
# make assertions & cleanup
actual_df = spark.read.format("delta").load("tmp/delta")
chispa.assert_df_equality(actual_df, expected_df, ignore_row_order=True)
```
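For context, `expected_df` isn't shown in the diff; a minimal sketch of how it might be built, where the pre-existing reference row is purely illustrative and not from the actual test data:

```python
# Hypothetical reconstruction of the expected table state: the assumed
# pre-existing reference row plus the row appended above.
expected_df = spark.createDataFrame(
    [("a", 1, 1.1),   # illustrative pre-existing row, not from the real table
     ("z", 9, 9.9)],  # the appended row
    ["letter", "number", "a_float"],
)
```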
Besides asserting the result is right, we also need to validate the log. First, there are probably general invariants we want to check. Ones that aren't specific to this operation, but should always be true after every write. A few examples:
- Only one new log entry was added.
- All log entries prior are still there and unchanged.
- Old data files referenced by the log are still there and unchanged.
- The new log entry is valid according to the protocol. (Perhaps we can generate a JSON schema for the delta log entries and use that as a first pass?)
Beyond that, we might also want operation-specific validation: for example, an "append" operation should have dataChange=True on all of its add actions and shouldn't contain any remove actions.
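To sketch what those checks could look like in the harness (assuming the log is newline-delimited JSON under `_delta_log/`; the function and parameter names here are made up for illustration):

```python
import json
from pathlib import Path

def validate_append_commit(log_dir: str, files_before: set) -> None:
    """Hypothetical post-write checks for an 'append'. `files_before` is the
    set of JSON log file names captured before the operation ran."""
    files_after = {p.name for p in Path(log_dir).glob("*.json")}
    # Invariant: exactly one new log entry was added.
    assert len(files_after - files_before) == 1
    # Invariant: all prior log entries are still present.
    # (A fuller check would also hash their contents to prove "unchanged".)
    assert files_before <= files_after

    new_entry = (files_after - files_before).pop()
    actions = [
        json.loads(line)
        for line in (Path(log_dir) / new_entry).read_text().splitlines()
        if line.strip()
    ]
    # Operation-specific checks for an append: every add action has
    # dataChange=True, and there are no remove actions at all.
    assert all(a["add"]["dataChange"] for a in actions if "add" in a)
    assert not any("remove" in a for a in actions)
```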
Yea, let's brainstorm how to write this code. Parsing the transaction log from scratch in Python would seem a little tedious. Perhaps I can use delta-rs for this?
Wouldn't using delta-rs in this way be circular? We could just compare the JSON files directly... some more thought is needed for the checkpoint files.
@nkarpov - yea, using delta-rs in that way would definitely be circular, especially for testing the delta-rs connector. Less so for testing the Flink, Hive, etc. connectors.
So I guess we just make a Python reader for parsing all the intricacies of the transaction log. That seems kind of complicated as well.
I really don't know the best path forward here.
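To illustrate the "compare the JSON files directly" idea from above, a minimal sketch that avoids both delta-rs and a full Python reader (checkpoint parquet files are deliberately left out, which is exactly where the open question remains):

```python
import json
from pathlib import Path

def log_actions(table_path: str) -> list:
    """Read every JSON commit file in order and return the raw actions.
    Checkpoint (.parquet) files are deliberately ignored here."""
    log_dir = Path(table_path) / "_delta_log"
    actions = []
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if line.strip():
                actions.append(json.loads(line))
    return actions

# A naive whole-log comparison; in practice, fields like timestamps and
# generated file names differ between runs, so a real check would normalize
# them or compare only invariant fields:
# assert log_actions("tmp/delta") == log_actions("tables/expected")
```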
This could be a follow-up, but we should also consider how to represent writes that should fail. There will be a lot of test cases where we want to make sure writers are enforcing rules and stopping writes when they are supposed to.
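As one possible shape for that follow-up, a hedged sketch of a negative test, assuming a pytest-style harness with a hypothetical `spark` fixture and assuming Delta's default schema enforcement rejects the extra column with an AnalysisException:

```python
import pytest
from pyspark.sql.utils import AnalysisException

def test_append_with_extra_column_fails(spark):
    # A conforming writer should reject a DataFrame whose schema adds a
    # column the table doesn't have (absent explicit schema evolution).
    bad_df = spark.createDataFrame(
        [("z", 9, 9.9, True)], ["letter", "number", "a_float", "extra_col"]
    )
    with pytest.raises(AnalysisException):
        bad_df.write.format("delta").mode("append").save("tmp/delta")
```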
This is a POC writer test for discussion purposes. I want to start brainstorming with the team about how we could structure a write test.