ken farmer edited this page Jul 29, 2015
Q1: Why test data at rest rather than before it gets to Hadoop?
Answer: data should be tested in both places. But testing at rest has specific advantages:
- Testing data distributions works no better in the ETL process: many bad distributions can't be detected until enough data has passed through, by which point some of it has already been loaded.
- Hadoop has more ways of expressing a test against data at rest than ETL processes have.
- What is at rest is what was ultimately loaded, so testing it can reveal problems hidden from an ETL-based detection method: data transformed but never loaded, data loaded twice, etc.
- Certain types of tests don't apply to ETL at all: age of statistics on tables, table naming conventions, table security, etc.
- Certain types of tests are difficult to write with good performance within ETL, such as foreign-key constraints across two tables.
- Certain types of tests are extremely difficult to write accurately, simply, and performantly within ETL. For example, testing for uniqueness of data within a fact table is a challenge if that table gets loaded many times a day, since the test may have to span batches of data.
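The cross-batch uniqueness point above can be sketched in a few lines. This is a minimal illustration with hypothetical field names, not the project's actual check: each batch passes a per-batch (ETL-side) uniqueness test, yet the table at rest contains a duplicate that only a check over all loaded data can find.

```python
from collections import Counter

def find_duplicate_keys(rows, key_fields=("cust_id", "order_date")):
    """Return key tuples that appear more than once across ALL rows.

    Against data at rest this is a single pass over the full table; an
    ETL-side check sees only its own batch and misses duplicates
    introduced by earlier loads.
    """
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key for key, n in counts.items() if n > 1}

# Two batches loaded separately: each is unique on its own,
# but the second repeats a key from the first.
batch_1 = [{"cust_id": 1, "order_date": "2015-07-01"},
           {"cust_id": 2, "order_date": "2015-07-01"}]
batch_2 = [{"cust_id": 1, "order_date": "2015-07-01"}]

assert find_duplicate_keys(batch_1) == set()    # per-batch check passes
assert find_duplicate_keys(batch_2) == set()    # per-batch check passes
assert find_duplicate_keys(batch_1 + batch_2) == {(1, "2015-07-01")}
```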
Q2: Why not write all test results directly into Hadoop rather than SQLite?
Answer: eventually it will. For now, for the sake of development speed, we're writing results to SQLite.
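As a rough sketch of what writing test results to SQLite involves, here is a minimal example using Python's stdlib `sqlite3`. The table and column names are illustrative assumptions, not the project's actual schema:

```python
import sqlite3

def init_results_db(path=":memory:"):
    """Create a small table to hold data-quality check outcomes."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS check_results (
            check_name    TEXT    NOT NULL,
            table_name    TEXT    NOT NULL,
            run_ts        TEXT    NOT NULL,
            rc            INTEGER NOT NULL,   -- 0 = pass, nonzero = fail
            violation_cnt INTEGER
        )""")
    return conn

def record_result(conn, check_name, table_name, run_ts, rc, violation_cnt=0):
    """Append one check outcome; results accumulate across runs."""
    conn.execute("INSERT INTO check_results VALUES (?, ?, ?, ?, ?)",
                 (check_name, table_name, run_ts, rc, violation_cnt))
    conn.commit()

conn = init_results_db()
record_result(conn, "rule_uniqueness", "fact_orders",
              "2015-07-29T12:00:00", 1, 3)
failures = conn.execute(
    "SELECT check_name, violation_cnt FROM check_results WHERE rc != 0"
).fetchall()
assert failures == [("rule_uniqueness", 3)]
```

Because the results live in a plain SQL table, migrating them into Hadoop later is mostly a matter of pointing the same INSERTs at a different backend.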