Commit c8ccb2b

V1dot2 (#38)

* Version 1.2.0
* Fix links and stats file.
* Fix link and clean up.
1 parent 94c95ee commit c8ccb2b

4 files changed, 13022 additions and 1 deletion

README.md

Lines changed: 63 additions & 1 deletion
@@ -10,6 +10,68 @@ Senzing has been working on updates of the libpostal data model for some time. T
We have included regular updates to the model in our [Senzing API] product.
We are now making our data models available to a broader audience.

## Version 1.2.0

This version is composed of 3 files:

- language_classifier.tar.gz - This is the same file used in version 1.1.0.
- libpostal_data.tar.gz - This file is also the same as used by version 1.1.0.
- parser.tar.gz - This is the new updated parser model.

To download, click the links below and the browser will download them for you:

- https://public-read-libpostal-data.s3.amazonaws.com/v1.1.0/language_classifier.tar.gz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.1.0/libpostal_data.tar.gz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/parser.tar.gz

or you can use curl:

```
curl https://public-read-libpostal-data.s3.amazonaws.com/v1.1.0/language_classifier.tar.gz -o language_classifier.tar.gz
curl https://public-read-libpostal-data.s3.amazonaws.com/v1.1.0/libpostal_data.tar.gz -o libpostal_data.tar.gz
curl https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/parser.tar.gz -o parser.tar.gz
```
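
For scripted setups, the three downloads can be wrapped in a small helper. This is a minimal sketch, not part of any Senzing tooling: the function names `model_url` and `download_models` are made up for illustration, and the `-f` and `-L` flags are added so that curl fails on HTTP errors and follows redirects.

```shell
#!/bin/sh
# Hypothetical helpers; only the bucket URL and file names come from the README.
BASE_URL="https://public-read-libpostal-data.s3.amazonaws.com"

# Print the download URL for a given version and file name.
model_url() {
  printf '%s/%s/%s\n' "$BASE_URL" "$1" "$2"
}

# Download the three files that make up version 1.2.0 into the current directory.
# Only parser.tar.gz is new in 1.2.0; the other two are still served from v1.1.0.
download_models() {
  curl -fL "$(model_url v1.1.0 language_classifier.tar.gz)" -o language_classifier.tar.gz &&
  curl -fL "$(model_url v1.1.0 libpostal_data.tar.gz)" -o libpostal_data.tar.gz &&
  curl -fL "$(model_url v1.2.0 parser.tar.gz)" -o parser.tar.gz
}
```

Calling `download_models` from the directory where the archives should be saved stops at the first download that fails.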

To install, extract the files in the libpostal data directory:

```
tar -zxvf language_classifier.tar.gz
tar -zxvf libpostal_data.tar.gz
tar -zxvf parser.tar.gz
```
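
The archives can also be extracted from any working directory by pointing tar at the data directory with `-C`. This is a minimal sketch, assuming the caller passes in the data directory; the function name `install_models` is hypothetical, and the real location of the data directory depends on how libpostal was configured.

```shell
#!/bin/sh
# Hypothetical helper: extract each given archive into the libpostal data directory.
# Usage: install_models <data_dir> <archive>...
install_models() {
  data_dir=$1
  shift
  # Refuse to continue if the target directory does not exist.
  [ -d "$data_dir" ] || { echo "no such directory: $data_dir" >&2; return 1; }
  for archive in "$@"; do
    # Stop at the first archive that fails to extract.
    tar -zxf "$archive" -C "$data_dir" || return 1
  done
}
```

For example, `install_models /path/to/libpostal/data language_classifier.tar.gz libpostal_data.tar.gz parser.tar.gz`, where the path is a placeholder for your own data directory.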

### Changes and improvements from version 1.1.0

- Latest data from all sources
- Improved parses for Chinese, Taiwanese and Korean addresses
- Corrected token labels that were truncated in certain cases

### Quality

We have overhauled our test data, both to correct errors and to increase variation. It now has 12982 addresses from 88 countries.
The overall parsing accuracy improved by 0.68 percentage points.
You can find a [statistical comparison between v1.1 and v1.2 here](./files/stats/v1.2.0/Parsing_comparison_v1_2_0.md).
The bulk of the [1.2 test data is located here](./files/tests/v1.2.0/test_data.csv). We removed just over 100 records from the test set because we don't have permission to publish them.

A spreadsheet with more details about the results can be downloaded here: [Parsing_comparison_1.1_vs_1.2.xlsx](https://github.com/Senzing/libpostal-data/blob/main/files/stats/v1.2.0/Parsing_comparison_1.1_vs_1.2.xlsx).

### Training data

The data we used to create the data model is available for download. To download, click the links below:

- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/training_data/formatted_addresses_tagged.tsv.tgz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/training_data/formatted_places_tagged.tsv.tgz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/training_data/formatted_ways_tagged.tsv.tgz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/training_data/geoplanet_formatted_addresses_tagged.tsv.tgz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/training_data/openaddresses_formatted_addresses_tagged.tsv.tgz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.2.0/training_data/senzing_formatted_random.tsv.tgz

Once downloaded, extract them with

```
tar -zxvf <file name>
```

## Version 1.1.0

This version is composed of 3 files
@@ -18,7 +80,7 @@ This version is composed of 3 files
- libpostal_data.tar.gz - This file is also the same as used by default libpostal datamodel.
- parser.tar.gz - This is the new updated parser model.

-To download click on the links below and the browser with download them for you:
+To download click on the links below and the browser will download them for you:

- https://public-read-libpostal-data.s3.amazonaws.com/v1.1.0/language_classifier.tar.gz
- https://public-read-libpostal-data.s3.amazonaws.com/v1.1.0/libpostal_data.tar.gz
23.1 KB
Binary file not shown.
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
|country|% improvement|Total records|Failures - v1.1.0|Percent_failed - v1.1.0|Failures - v1.2.0|Percent_failed - v1.2.0|
|---|---|---|---|---|---|---|
|ae|-6.00|50|10|20|13|26|
|am|0.00|50|2|4|2|4|
|ar|1.33|75|12|16|11|14.67|
|at|-0.44|227|12|5.29|13|5.73|
|au|-1.32|76|0|0|1|1.32|
|az|-10.00|50|7|14|12|24|
|ba|0.00|50|8|16|8|16|
|bd|-4.00|50|31|62|33|66|
|be|0.00|121|20|16.53|20|16.53|
|bg|-4.00|50|2|4|4|8|
|br|-1.96|51|3|5.88|4|7.84|
|by|20.00|50|22|44|12|24|
|ca|3.25|432|44|10.19|30|6.94|
|ch|3.68|136|13|9.56|8|5.88|
|cl|0.00|50|33|66|33|66|
|cn|44.32|88|74|84.09|35|39.77|
|co|3.70|54|13|24.07|11|20.37|
|cy|0.00|50|6|12|6|12|
|cz|0.00|50|0|0|0|0|
|de|0.84|2020|185|9.16|168|8.32|
|dk|0.00|303|0|0|0|0|
|ec|-2.00|50|12|24|13|26|
|ee|0.00|52|0|0|0|0|
|eg|4.00|50|5|10|3|6|
|es|4.54|88|19|21.59|15|17.05|
|fi|0.00|52|0|0|0|0|
|fo|-18.00|50|0|0|9|18|
|fr|-0.73|136|3|2.21|4|2.94|
|gb|-1.55|258|55|21.32|59|22.87|
|ge|-8.00|50|22|44|26|52|
|gr|0.00|50|1|2|1|2|
|gt|-6.00|50|26|52|29|58|
|hr|2.00|50|1|2|0|0|
|hu|0.00|101|4|3.96|4|3.96|
|id|7.85|51|20|39.22|16|31.37|
|ie|3.92|51|28|54.9|26|50.98|
|il|0.00|52|7|13.46|7|13.46|
|im|6.00|50|34|68|31|62|
|in|-1.86|54|35|64.81|36|66.67|
|is|-4.00|50|1|2|3|6|
|it|0.00|343|26|7.58|26|7.58|
|jm|2.00|50|3|6|2|4|
|jp|-5.77|52|36|69.23|39|75|
|kg|4.00|50|3|6|1|2|
|kr|23.08|52|38|73.08|26|50|
|kw|0.00|50|0|0|0|0|
|kz|-1.72|58|0|0|1|1.72|
|li|0.00|51|2|3.92|2|3.92|
|lt|-0.79|126|3|2.38|4|3.17|
|lu|0.00|50|6|12|6|12|
|lv|-46.00|50|2|4|25|50|
|md|-20.00|50|0|0|10|20|
|mk|0.00|50|2|4|2|4|
|mx|-3.51|57|16|28.07|18|31.58|
|my|1.85|54|22|40.74|21|38.89|
|nc|1.96|51|4|7.84|3|5.88|
|nl|-0.26|1151|2|0.17|5|0.43|
|no|0.00|287|0|0|0|0|
|nz|0.00|51|1|1.96|1|1.96|
|pe|2.00|50|6|12|5|10|
|ph|2.00|50|12|24|11|22|
|pk|0.00|50|24|48|24|48|
|pl|0.00|628|10|1.59|10|1.59|
|pr|-70.00|50|12|24|47|94|
|pt|0.00|50|2|4|2|4|
|re|2.00|50|4|8|3|6|
|ro|0.00|52|1|1.92|1|1.92|
|rs|-6.00|50|0|0|3|6|
|ru|0.00|531|15|2.82|15|2.82|
|se|-3.18|63|2|3.17|4|6.35|
|sg|0.00|54|2|3.7|2|3.7|
|si|0.00|50|1|2|1|2|
|sk|6.00|50|9|18|6|12|
|sm|0.00|50|0|0|0|0|
|sr|0.00|50|0|0|0|0|
|tj|0.00|50|12|24|12|24|
|tr|2.00|50|29|58|28|56|
|tt|2.00|50|2|4|1|2|
|tw|40.11|177|93|52.54|22|12.43|
|tz|0.00|50|1|2|1|2|
|ua|3.70|54|7|12.96|5|9.26|
|us|0.57|2270|117|5.15|104|4.58|
|uy|-1.13|88|1|1.14|2|2.27|
|uz|-2.00|50|5|10|6|12|
|vn|-2.00|50|28|56|29|58|
|xk|-2.00|50|48|96|49|98|
|za|1.89|53|19|35.85|18|33.96|
|total|0.68|12981|1398|10.77|1309|10.09|
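
The `% improvement` column appears to be the v1.1.0 failure percentage minus the v1.2.0 failure percentage, each rounded to two decimals (for `tw`, 52.54 - 12.43 = 40.11). The sketch below recomputes those values from the failure counts under that assumption. The function name `recompute_stats` is hypothetical, and results may differ from the published table in the last digit where the original rounding was done differently.

```shell
#!/bin/sh
# Hypothetical helper: read table rows on stdin and print
# "<country> <pct_failed_v1.1.0> <pct_failed_v1.2.0> <improvement>" per data row.
recompute_stats() {
  awk -F'|' '
    # Data rows have a numeric record count in the third table column
    # (awk field 4, because each row starts with a pipe); this skips
    # the header row and the separator row.
    $4 ~ /^[0-9]+$/ && $4 > 0 {
      before = sprintf("%.2f", 100 * $5 / $4)   # percent failed, v1.1.0
      after  = sprintf("%.2f", 100 * $7 / $4)   # percent failed, v1.2.0
      printf "%s %.2f %.2f %.2f\n", $2, before, after, before - after
    }'
}
```

For example, `recompute_stats < stats.md`, where `stats.md` is a placeholder for wherever the table above is saved.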
