_posts/2024-11-29-basic-libraries-cleaning.md.md
13 additions & 14 deletions
@@ -1,6 +1,6 @@
 ---
-title: Cleaning the basic libraries dataset
-excerpt: Enhancing data on library locations
+title: Basic library cleaning
+excerpt: Cleaning and enhancing data on library locations
 categories:
 - Data
 tags:
@@ -20,17 +20,17 @@ A public library dataset that has been getting recent attention is the [basic da
 >
 > _Public libraries in 'crisis' as councils cut services_ - **BBC News**

-It really is the most basic of data - the locations of our libraries - but getting it right has been a challenge for over a decade. How do we collect this data and keep it up to date? An annual survey (like the Arts Council dataset) is useful but also time consuming, always out of date, and doesn't serve the public in tools like [LibraryOn](https://www.libraryon.org). Constantly updating the data is more efficient and less overall effort, but more of a challenge to coordinate and enforce.
+It really is the most basic of data - the locations of our libraries - but getting it right has been a challenge for over a decade. How do we collect this data and keep it up to date? An annual survey (like the Arts Council dataset) is useful but also time consuming, always out of date, and doesn't effectively serve the public in tools like [LibraryOn](https://www.libraryon.org). Constantly updating the data is more efficient and less overall effort, but more of a challenge to coordinate and enforce.

-Despite this, it's a credit to the quality of the data, and the Arts Council, that it is being used. It has always been difficult to prove the need for quality open data, without clear examples. A dataset that is published and seeing usage in important reports is a good message for the sector.
+Despite this, it's a credit to the quality of the data, and the Arts Council, that it is being used. It has always been difficult to prove the need for quality open data, without clear examples. A dataset that is published and seeing usage in important reports and applications is a good message for the sector.

 ## Cleaning and enhancing the data

-There are some issues with the data. That's not to throw any shade on the Arts Council - their job is tough enough sending requests and chasing over 150 library services, and they've done a lot of work tidying the data before publishing.
+There are some issues with the data. That's not to throw any shade on the Arts Council - their job is tough enough chasing over 150 library services, and they've done a lot of work tidying the data before publishing.

-Analysis from the ONS and BBC will have required effort to clean and enhance the data. A good example of data that often needs cleaning is postcodes. These are often manually typed - in this dataset there were many incorrect entries, and likely more that are harder to detect. Also the unique property reference numbers (UPRNs) were often missing or not correct. It may be that they're not a well understood identifier but they are mandated as a government standard for address/property data.
+Analysis from the ONS and BBC will have required effort to clean and enhance the data. A good example of data that often needs cleaning is postcodes. These are often manually typed - in this dataset there were many incorrect entries, and likely more that are harder to detect. Also the unique property reference numbers (UPRNs) were often missing or not correct. It may be that they're not a well understood identifier, but they are a government standard for address/property data.

-I've done that, plus the following changes to the data to make it more useful for others,and for linking to other datasets. Some of this is opinionated, but trying to keep to the spirit of the original data. This section is worth skipping if you find tedious data corrections a little boring.
+I've applied the following changes to the data to make it more useful for others,and for linking to other datasets. Some of this is opinionated, but trying to keep to the spirit of the original data. This section is worth skipping if you find data corrections a little boring.

 - Trimmed whitespace at either end of all data entries
 - Corrected mismatches between the 'Reporting Service' and 'Upper Tier Local Authority'. On a few occasions these are legitimately different, but generally not.
@@ -82,13 +82,12 @@ Using these, I have added 4 columns for coordinates in British National Grid (Ea

 Having a properly defined location for things gives lots of additional information: the population of the area, how rural/urban it is, deprivation levels, etc. There's too much to include in one dataset but a few key ones would be useful. I've added the following:
| Reporting authority code | A unique identifier for the reporting library service (an upper tier local authority). This uses the Government Statistical Service (GSS) code. |
-| Rural/urban classification code | A set of codes, from 2011, to classify areas by how urban/rural they are. |
-| Rural/urban classification description | A description for the rural/urban classification e.g. Urban Major Conurbation. |
-| Index of Multiple Deprivation rank | The rank of the area in the Index of Multiple Deprivation. 1 is the most deprived, 32,844 is the least deprived. |
-| Index of Multiple Deprivation decile | The decile of the Index of Multiple Deprivation. 1 will be among the most deprived, 10 among the least deprived. |
| Rural/urban classification code | A set of codes, from 2011, to classify areas by how urban/rural they are. |
+| Rural/urban classification description | A description for the rural/urban classification e.g. Urban Major Conurbation. |
+| Index of Multiple Deprivation rank | The rank of the area in the Index of Multiple Deprivation. 1 is the most deprived, 32,844 is the least deprived. |
+| Index of Multiple Deprivation decile | The decile of the Index of Multiple Deprivation. 1 will be among the most deprived, 10 among the least deprived. |

 These are taken from the [ONS Postcode Directory](https://geoportal.statistics.gov.uk/datasets/265778cd85754b7e97f404a1c63aea04/about) by matching with the library postcode. Because they are postcodes and inexact locations, they are 'best-fit' lookups. Using the UPRN coordinates would be better but quite a bit more hassle. Plus we don't have half the UPRNs anyway.
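
The postcode matching described in that last paragraph could be sketched roughly as below. This is a minimal illustration, not the script used for the post: the function names, the dictionary shapes, the column labels and the sample rows are all assumptions made up for the example, and a real ONS Postcode Directory extract would need loading into the lookup first. The regular expression only checks the general shape of a postcode, so it catches obvious typos but not well-formed postcodes that are simply wrong.

```python
import re

# Loose check on the general shape of a UK postcode (format only; it cannot
# tell whether a postcode really exists).
POSTCODE_SHAPE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}$")


def normalise_postcode(raw):
    """Return the postcode as 'OUTWARD INWARD' in upper case, or None if malformed."""
    compact = re.sub(r"\s+", "", raw or "").upper()
    if not POSTCODE_SHAPE.match(compact):
        return None
    return compact[:-3] + " " + compact[-3:]


def enrich(libraries, directory):
    """Add directory columns to each library row by postcode lookup."""
    enriched = []
    for library in libraries:
        postcode = normalise_postcode(library.get("Postcode"))
        # Rows with a malformed or unknown postcode are kept, just not enriched.
        extra = directory.get(postcode, {}) if postcode else {}
        row = dict(library)
        if postcode:
            row["Postcode"] = postcode
        row.update(extra)
        enriched.append(row)
    return enriched


# Made-up rows for illustration only; not taken from the real datasets.
libraries = [
    {"Library name": "Example Library", "Postcode": " ab1  2cd"},
    {"Library name": "Typo Library", "Postcode": "AB1 2C"},
]
directory = {"AB1 2CD": {"IMD rank": 12345, "IMD decile": 4}}

for row in enrich(libraries, directory):
    print(row)
```

Keeping unmatched rows rather than dropping them means the malformed postcodes stay visible for manual correction, which fits the cleaning approach described in the post.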