|
2032 | 2032 | "- __Data approximations__: Sometimes data are collected as ranges, e.g. `20-30`, or `<2`. This may be for data anonymisation (we deliberately bucket people into ranges to ensure that\n", |
2033 | 2033 | " individuals can't be identified), or because our data are simply not sufficiently accurate (limits of the mechanism by which we recorded the data). This, also, is information. Forcing\n", |
2034 | 2034 | " these ranges to an absolute point may satisfy a data check, but at the risk of losing information on the limits of the data gathering process.\n", |
2035 | | - "- __Date formats__: When is 10 April also 4 April? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges\n", |
| 2035 | + "- __Date formats__: When is 10 April also 4 October? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges\n", |
2036 | 2036 | " can also be a problem. Should you force `2007-2008` to be a specific year?\n", |
2037 | 2037 | "\n", |
2038 | 2038 | "This type of data validation is certainly critical _at the point of use_, but is it important _at the point of publication_? How far should you go in validating data for publication?\n", |
|
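The two bullets above (data approximations and date formats) can be illustrated with a minimal Python sketch. It is not part of the notebook: the helper names `parse_range` and `date_readings` are hypothetical, and the sketch assumes ranges are written as `low-high`, `<n` or `>n`, and dates as three hyphen-separated integers with the year last. It keeps a range as a pair of bounds rather than forcing it to a single point, and it lists every valid reading of a date rather than guessing between day-first and month-first.

```python
import re
from datetime import date
from typing import Optional, Set, Tuple

# Hypothetical helpers for illustration only; they are not from the notebook.
NUMBER = r"(\d+(?:\.\d+)?)"


def parse_range(value: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (low, high) bounds for a value such as '20-30', '<2' or '5'.

    None means the range is open on that side. Whether a bound is strict
    or inclusive is not recorded here; that is a simplifying assumption.
    """
    text = value.strip()
    match = re.fullmatch(NUMBER + r"\s*-\s*" + NUMBER, text)
    if match:
        return float(match.group(1)), float(match.group(2))
    match = re.fullmatch(r"<\s*" + NUMBER, text)
    if match:
        return None, float(match.group(1))
    match = re.fullmatch(r">\s*" + NUMBER, text)
    if match:
        return float(match.group(1)), None
    number = float(text)  # raises ValueError for anything unrecognised
    return number, number


def date_readings(value: str) -> Set[date]:
    """Return every valid date a 'a-b-yyyy' string could mean.

    Both day-month and month-day orders are tried; more than one result
    means the string is ambiguous and needs a documented convention.
    """
    first, second, year = (int(part) for part in value.split("-"))
    readings = set()
    for day, month in ((first, second), (second, first)):
        try:
            readings.add(date(year, month, day))
        except ValueError:
            pass  # this ordering is not a real calendar date
    return readings
```

For example, `parse_range("<2")` keeps the upper limit without inventing a lower one, and `date_readings("10-4-1990")` returns two dates, which flags the value as ambiguous instead of silently choosing one reading.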
2281 | 2281 | "\n", |
2282 | 2282 | "However, the hard work is getting data into a place where a few simple programmatic fixes can manipulate our data into any format that works for the user.\n", |
2283 | 2283 | "\n", |
| 2284 | + "<div class=\"alert alert-block alert-warning\">\n", |
| 2285 | + " <p><b>Never trust source data:</b> any data that come from outside your work environment cannot be trusted until proven otherwise. It does not matter if the publisher is <i>trustworthy</i> or claims their data <i>validate</i>. Until your own systems have proven that the data validate, do not import them untested.</p>\n", |
| 2286 | + " <p>A publisher supports their users' workflow by ensuring that data are machine-readable and well-structured, that all terms are clearly defined, and that there is metadata for everything. After that, trust, but verify.</p>\n", |
| 2287 | + "</div>\n", |
| 2288 | + "\n", |
2284 | 2289 | "As an exercise, describe what decisions you should make to either fix or leave these data as is.\n", |
2285 | 2290 | "\n", |
2286 | 2291 | "### 2.3.2 Data publication and citation\n", |
|
2387 | 2392 | "name": "python", |
2388 | 2393 | "nbconvert_exporter": "python", |
2389 | 2394 | "pygments_lexer": "ipython3", |
2390 | | - "version": "3.7.7" |
| 2395 | + "version": "3.6.8" |
2391 | 2396 | } |
2392 | 2397 | }, |
2393 | 2398 | "nbformat": 4, |
|