Commit 249957b

First finish of basic libraries cleaning dataset
1 parent 6a1e3b7 commit 249957b

1 file changed

+34
-34
lines changed

_posts/2024-11-29-basic-libraries-cleaning.md.md

Lines changed: 34 additions & 34 deletions
@@ -10,47 +10,47 @@ published: true
1010

1111
A public library dataset that has been getting recent attention is the [basic dataset for libraries](https://www.artscouncil.org.uk/supporting-arts-museums-and-libraries/supporting-libraries) published by the Arts Council. This is _'intended to capture permanent instances of libraries, local history libraries, and archives from 1 April 2010 to 31 December 2023'_.
1212

13-
- The BBC published a report on [public libraries in crisis](https://www.bbc.co.uk/news/articles/cn9lexplel5o), highlighting the number of closures and saying that closures were far more likely to happen in deprived areas.
14-
- The Office for National Statistics published [number of libraries in local areas, England and Wales](https://www.ons.gov.uk/peoplepopulationandcommunity/wellbeing/datasets/numberoflibrariesinlocalareasenglandandwales) - this used the data alongside other sources to analyse access to library services in different areas.
15-
- The data is also used in updating the library finder at [LibraryOn](https://libraryon.com/), the British Library's single digital presence project.
13+
- The BBC published a report on [public libraries in crisis](https://www.bbc.co.uk/news/articles/cn9lexplel5o), highlighting the number of closures and saying that closures had occurred more often in deprived areas.
14+
- The Office for National Statistics published [number of libraries in local areas, England and Wales](https://www.ons.gov.uk/peoplepopulationandcommunity/wellbeing/datasets/numberoflibrariesinlocalareasenglandandwales) - this used the dataset alongside other sources to analyse access to library services in different areas.
15+
- The dataset is also used to update the library finder at [LibraryOn](https://libraryon.com/), the British Library's single digital presence project.
1616

1717
> More than 180 council-run libraries have either closed or been handed over to volunteer groups in the UK since 2016, BBC analysis has found.
1818
>
1919
> More deprived communities were four times more likely to have lost a publicly-funded library in that time, while 2,000 jobs have also been lost.
2020
>
21-
> **Public libraries in 'crisis' as councils cut services - BBC News**
21+
> _Public libraries in 'crisis' as councils cut services_ - **BBC News**
2222
23-
It really is the most basic of data - the locations of our libraries - but getting it right has been a challenge for over a decade. How do we collect this data and keep it up to date? An annual survey (like the Arts Council dataset) is useful but time consuming, always out of date, and doesn't serve real-time tools like [LibraryOn](https://www.libraryon.org). Constantly updating the data is more efficient and less effort, but more of a challenge to coordinate and maintain.
23+
It really is the most basic of data - the locations of our libraries - but getting it right has been a challenge for over a decade. How do we collect this data and keep it up to date? An annual survey (like the Arts Council dataset) is useful but also time-consuming, is always out of date, and doesn't serve the public in tools like [LibraryOn](https://www.libraryon.org). Constantly updating the data is more efficient and less overall effort, but more of a challenge to coordinate and enforce.
2424

25-
However, it's a credit to the quality of the data, and the Arts Council, that it is being used. It has always been difficult to prove the need for quality open data without examples. The fact that a dataset is published and seeing clear usage in important reports is a good message for the sector.
25+
Despite this, it's a credit to the quality of the data, and the Arts Council, that it is being used. It has always been difficult to prove the need for quality open data without clear examples. A dataset that is published and seeing usage in important reports is a good message for the sector.
2626

2727
## Cleaning and enhancing the data
2828

29-
There were some issues with the data. That's not to throw any shade on the Arts Council - their job is to coordinate over 150 library services, and they still also need to do a lot of work to get the data tidied up before publishing.
29+
There are some issues with the data. That's not to throw any shade on the Arts Council - their job is tough enough sending requests and chasing over 150 library services, and they've done a lot of work tidying the data before publishing.
3030

31-
A good example of data that often needs cleaning is postcodes. These are often manually typed, so there were many changes to these, and likely more required.
31+
Analysis from the ONS and BBC will have required effort to clean and enhance the data. A good example of data that often needs cleaning is postcodes. These are often manually typed - in this dataset there were many incorrect entries, and likely more that are harder to detect. Also, the unique property reference numbers (UPRNs) were often missing or incorrect. They may not be a well-understood identifier, but they are mandated as a government standard for address/property data.
3232

33-
I've done that, plus the following list of changes to the data to make it more useful for processing and linking to other datasets. Some of this is opinionated, but in trying to keep the spirit of the original data.
33+
I've done that, plus the following changes to the data to make it more useful for others, and for linking to other datasets. Some of this is opinionated, but I've tried to keep to the spirit of the original data. This section is worth skipping if you find tedious data corrections a little boring.
3434

35-
- Trimmed extra whitespace at either end of all data entries
35+
- Trimmed whitespace at either end of all data entries
3636
- Corrected mismatches between the 'Reporting Service' and 'Upper Tier Local Authority'. On a few occasions these are legitimately different, but generally not.
3737
- Suffolk reported that the Prison Library HMP Bure was in Norwich upper tier local authority. The upper tier authority should be Norfolk, but it's correct that Suffolk libraries operate the prison library, and are the reporting service.
3838
- Standardised 10 of the names used in the 'Reporting service' column to more easily match them to unique identifiers
3939
- Standardised 10 of the names used in the 'Upper tier local authority' column to more easily match these to unique identifiers.
4040
- Cleared non-postcode text from the postcode column e.g. 'No registered public address'
41-
- Ensured the closed field has an entry for libraries that have otherwise been set to closed
4241
- Updated postcode entries to be uppercase
4342
- Updated invalid postcodes from closed libraries
4443
- Updated invalid postcodes from open libraries
45-
- Updated valid but incorrect postcodes
44+
- Updated postcodes that are valid but actually incorrect
4645
- Removed the leading zeros from unique property reference numbers.
4746
- Removed UPRNs that are not numbers
4847
- Removed UPRNs that are over 5 miles away from the postcode location (and likely wrong)
4948
- Standardised the Type column to go from 10 to 5 distinct variations
50-
- Removed entries that were too unclear e.g. old book drops where the current status is unknown
49+
- Removed a small number of entries that were too unclear e.g. old book drops where the current status is unknown
5150
- Ensured statutory fields are Yes or No
52-
- Ensured closed year is set for entries that have closed in the operation field
5351
- Ensured operation fields are one of 'LA', 'LAU', 'C', 'CR', 'ICL' or not set
52+
- Ensured closed year is set for entries that have closed in the operation field
53+
- Ensured the closed year has an entry for libraries that have otherwise been marked as closed
5454
- Ensured that if the closed year was completed it was a 4-digit year
5555
- Cleared some entries from the operating organisation column (e.g. 'N/A')
5656
- Standardised the 'No' entry for the new build question
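
A few of the postcode and UPRN steps in the list above can be sketched in Python. This is only an illustration, not the code used to produce the dataset: the function names are made up for the example, and the postcode pattern is a simplified format check that cannot tell whether a postcode actually exists (that still needs a lookup against something like the ONS Postcode Directory).

```python
import re

# Simplified UK postcode pattern (an assumption for this sketch): outward
# code, optional space, inward code. It checks the format only.
POSTCODE_PATTERN = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?) ?([0-9][A-Z]{2})$")


def clean_postcode(value):
    """Return an uppercase, single-spaced postcode, or None if not postcode-like."""
    if value is None:
        return None
    # Trim whitespace, uppercase, and collapse any repeated internal spaces.
    text = " ".join(value.strip().upper().split())
    match = POSTCODE_PATTERN.match(text)
    if match is None:
        return None  # clears text such as 'No registered public address'
    return f"{match.group(1)} {match.group(2)}"


def clean_uprn(value):
    """Return a UPRN without leading zeros, or None if it is not a number."""
    if value is None:
        return None
    text = value.strip()
    if not (text.isascii() and text.isdigit()):
        return None
    return text.lstrip("0") or None


print(clean_postcode("  nr10 5gb "))
print(clean_uprn("000100012345"))
```

Entries that come back as `None` are the ones that would still need checking by hand, which is where most of the effort goes.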
@@ -62,14 +62,14 @@ I've done that, plus the following list of changes to the data to make it more u
6262

6363
### Adding coordinates
6464

65-
There are no location coordinates in the original data. This is a good thing for data collection - there's no need to collect what can be added later.
65+
There are no location coordinates in the original data. This is a good thing for data collection - there's no need to request data that can be easily appended later.
6666

67-
There are two open data sources that can help here:
67+
There are two open data sources that can help:
6868

69-
- [ONS Postcode Directory](https://geoportal.statistics.gov.uk/datasets/265778cd85754b7e97f404a1c63aea04/about) - Coordinates and other various lookups for around 2.7 million postcodes (both current and historic)
69+
- [ONS Postcode Directory](https://geoportal.statistics.gov.uk/datasets/265778cd85754b7e97f404a1c63aea04/about) - Coordinates and other lookups for around 2.7 million postcodes (both current and historic)
7070
- [OS Open UPRN](https://www.ordnancesurvey.co.uk/products/os-open-uprn) - Coordinates for approximately 40 million addressable locations (unique property reference numbers) in Great Britain
7171

72-
Using these, I have added 4 columns. First trying to obtain coordinates from the UPRN, which will give the exact location of the library building. However, as many UPRNs aren't in the data, the next step is to use the postcode. This will be less accurate, being only the centre of the postcode. However, in the cases of libraries, they will often be small postcodes, or even have their own dedicated postcode.
72+
Using these, I have added 4 columns for coordinates in British National Grid (Easting/Northing) and the World Geodetic System (Longitude/Latitude). I first obtained coordinates from the UPRN, which gives the exact location of the library building. However, as many UPRNs aren't in the data, a fallback is to use the postcode. This will be less accurate, being the centre of the postcode. However, in the case of libraries, the postcode extent will often be small, or they'll even have their own dedicated postcode.
7373

7474
| Column name | Description |
7575
| ----------- | --------------------------------------- |
@@ -78,27 +78,27 @@ Using these, I have added 4 columns. First trying to obtain coordinates from the
7878
| Longitude | The longitude coordinate of the library |
7979
| Latitude | The latitude coordinate of the library |
8080

81-
This additional data changes attribution requirements. The licence can remain the [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/), but requires a few additional statements:
81+
### Additional location data
8282

83-
- Contains OS data © Crown copyright and database right 2024
84-
- Contains Royal Mail data © Royal Mail copyright and database right 2024
85-
- Source: Office for National Statistics licensed under the Open Government Licence v.3.0
86-
- Source: Arts Council England
83+
Having a properly defined location for things gives lots of additional information: the population of the area, how rural/urban it is, deprivation levels, etc. There's too much to include in one dataset but a few key ones would be useful. I've added the following:
8784

88-
### Additional location data
85+
| Column | Description |
86+
| -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
87+
| Reporting authority code | A unique identifier for the reporting library service (an upper tier local authority). This uses the Government Statistical Service (GSS) code. |
88+
| Rural/urban classification code | A set of codes, from 2011, to classify areas by how urban/rural they are. |
89+
| Rural/urban classification description | A description for the rural/urban classification e.g. Urban Major Conurbation. |
90+
| Index of Multiple Deprivation rank | The rank of the area in the Index of Multiple Deprivation. 1 is the most deprived, 32,844 is the least deprived. |
91+
| Index of Multiple Deprivation decile | The decile of the Index of Multiple Deprivation. 1 will be among the most deprived, 10 among the least deprived. |
8992

90-
Having a properly defined location for things gives lots of additional information: the population of the area, how rural/urban it is, deprivation levels, etc. There is too much to include in one dataset but a few key ones would be useful. I've added the following:
93+
These are taken from the [ONS Postcode Directory](https://geoportal.statistics.gov.uk/datasets/265778cd85754b7e97f404a1c63aea04/about) by matching with the library postcode. Because they are postcodes and inexact locations, they are 'best-fit' lookups. Using the UPRN coordinates would be better but quite a bit more hassle. Plus we don't have half the UPRNs anyway.
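
As a rough illustration of how the rank and decile columns relate, the sketch below derives a decile from a rank by splitting the 32,844 ranked areas into ten equal groups. That equal-groups rule is an assumption for the example, and `imd_decile` is a made-up name; where a source supplies its own decile, that value should be preferred.

```python
TOTAL_AREAS = 32_844  # number of ranked areas, taken from the table above


def imd_decile(rank, total=TOTAL_AREAS):
    """Return the decile (1-10) for a deprivation rank, assuming ten equal groups."""
    if not 1 <= rank <= total:
        raise ValueError(f"rank must be between 1 and {total}")
    # Integer ceiling of rank * 10 / total, avoiding floating point rounding.
    return (rank * 10 + total - 1) // total


print(imd_decile(1))       # most deprived end of the scale
print(imd_decile(32_844))  # least deprived end of the scale
```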
9194

92-
| Column | Description |
93-
| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
94-
| Reporting authority code | A unique identifier for the reporting library service (an upper tier local authority). This uses the Government Statistical Service (GSS) code |
95-
| Rural/urban classification code | A set of codes, from 2011, to classify areas by how urban/rural they are. |
96-
| Rural/urban classification description | A description for the rural/urban classification e.g. Urban Major Conurbation |
97-
| Index of Multiple Deprivation rank | The rank of the area in the Index of Multiple Deprivation. 1 is the most deprived, 32,844 is the least deprived. |
98-
| Index of Multiple Deprivation decile | The decile of the Index of Multiple Deprivation. 1 will be among the most deprived, 10 among the least deprived. |
95+
All this additional data means we need to acknowledge other sources for the dataset. The licence can remain the [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/), but will require a few additional statements:
9996

100-
These are taken from the [ONS Postcode Directory](https://geoportal.statistics.gov.uk/datasets/265778cd85754b7e97f404a1c63aea04/about) by matching with the library postcode. Because they are postcodes and inexact locations, they are 'best-fit' lookups. Using the UPRN coordinates would be more accurate but I couldn't really be bothered. Plus we don't have half the UPRNs anyway.
97+
- Contains OS data © Crown copyright and database right 2024
98+
- Contains Royal Mail data © Royal Mail copyright and database right 2024
99+
- Source: Office for National Statistics licensed under the Open Government Licence v.3.0
100+
- Source: Arts Council England
101101

102-
Enjoy! There will likely be mistakes and then further updates to this data but all being well it could be streamlined into a more automated annual process.
102+
Enjoy! If you use it, please feed back any issues or requests. There will likely be mistakes, followed by further updates to this data, but in future it could be streamlined into a more automated annual process.
103103

104104
Download [the basic libraries dataset - enhanced](/files/basic-dataset-for-libraries-2023-enhanced.csv) (CSV, 1.5MB)
