The map is powered by pre-calculated pixels, not coordinates.
These tables only need to be created once...
CREATE TABLE IF NOT EXISTS wmde_wikidata_map.wikidata_map_item_pixels (
`id` string,
`posx` int,
`posy` int
)
PARTITIONED BY (
`snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
CREATE TABLE IF NOT EXISTS wmde_wikidata_map.wikidata_map_item_relation_pixels (
`forId` string,
`posx1` int,
`posy1` int,
`posx2` int,
`posy2` int
)
PARTITIONED BY (
`snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

Start a Spark SQL session:

spark3-sql --master yarn --executor-memory 16G --executor-cores 4 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64

You can read more about the WMF spark setup here.
SET hivevar:WIKIDATA_MAP_SNAPSHOT='2021-10-18';
SET hivevar:WIKIDATA_MAP_ITEM_COORD_TABLE=wmde_wikidata_map.wikidata_map_item_coordinates;
SET hivevar:WIKIDATA_MAP_ITEM_RELATION_TABLE=wmde_wikidata_map.wikidata_map_item_relations;

You also need to enable non-strict dynamic partitioning, as the inserts below write the snapshot partition dynamically:
SET hive.exec.dynamic.partition.mode=nonstrict;

From here we want to calculate pixel locations for a canvas, to avoid doing any computation on the client.
The primary target canvas size is 1920 x 1080.
The old "huge" map rendered at 8000 x 4000.
To reach a similar quality we multiply the target size by 4, giving 7680 x 4320.
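The coordinate-to-pixel mapping in the query below is a simple equirectangular projection onto that 7680 x 4320 canvas. As a minimal sketch of the same arithmetic (the input coordinates here are illustrative values, roughly Berlin):

```shell
# Mirror of the SQL pixel formulas below (illustrative input values).
# Longitude runs -180..180, latitude 90..-90 (y grows downwards on a canvas).
awk -v lon=13.3833 -v lat=52.5167 'BEGIN {
  posx = int((lon + 180) / 361 * 7680)  # matches the posx cast in the SQL
  posy = int((90 - lat) / 181 * 4320)   # abs((lat - 90) / 181) equals (90 - lat) / 181 for lat <= 90
  print posx "," posy
}'
# → 4114,894
```

The divisors 361 and 181 are taken from the query itself; they keep the maximum values just inside the canvas (longitude 180 maps to x = 7658, not 7680).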
- TODO add ids to the pixel entries, so they can be displayed on the map
- TODO the below query is for earth only...
INSERT INTO wmde_wikidata_map.wikidata_map_item_pixels
PARTITION(snapshot)
SELECT
id,
cast((cast(longitude as decimal(15, 10)) + 180) / 361 * 7680 as int) as posx,
cast(abs((cast(latitude as decimal(15, 10)) - 90) / 181 * 4320) as int) as posy,
snapshot
FROM ${WIKIDATA_MAP_ITEM_COORD_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
AND globe = "http://www.wikidata.org/entity/Q2";

Then figure out how the relations relate to our pixel map:
INSERT INTO wmde_wikidata_map.wikidata_map_item_relation_pixels
PARTITION(snapshot)
SELECT
x.forId as forId,
a.posx as posx1,
a.posy as posy1,
b.posx as posx2,
b.posy as posy2,
x.snapshot as snapshot
FROM (
SELECT fromId, toId, forId, snapshot
FROM ${WIKIDATA_MAP_ITEM_RELATION_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
) x
JOIN wmde_wikidata_map.wikidata_map_item_pixels a ON (a.id = x.fromId) AND a.snapshot=x.snapshot
JOIN wmde_wikidata_map.wikidata_map_item_pixels b ON (b.id = x.toId) AND b.snapshot=x.snapshot
WHERE x.snapshot=${WIKIDATA_MAP_SNAPSHOT}
GROUP BY
x.forId,
a.posx,
a.posy,
b.posx,
b.posy,
x.snapshot
LIMIT 100000000;

You should now have rows for the correct snapshot in all of the tables:
SELECT COUNT(*) FROM wmde_wikidata_map.wikidata_map_item_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};
SELECT COUNT(*) FROM wmde_wikidata_map.wikidata_map_item_relation_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};

If you accidentally insert duplicate rows, you can deduplicate the tables:
INSERT OVERWRITE TABLE wmde_wikidata_map.wikidata_map_item_pixels SELECT DISTINCT * FROM wmde_wikidata_map.wikidata_map_item_pixels;
INSERT OVERWRITE TABLE wmde_wikidata_map.wikidata_map_item_relation_pixels SELECT DISTINCT * FROM wmde_wikidata_map.wikidata_map_item_relation_pixels;

Exit the Spark SQL session for the remaining steps.
Set an environment variable with the snapshot date:
WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17" "P36" "P47" "P138" "P150" "P190" "P197" "P403")

And write the files:
- TODO tail -n +6?

`tail -n +2` removes the first line of output, which will be `PYSPARK_PYTHON=python3.7`, and `sed 's/[\t]/,/g'` turns the tab-separated output into a CSV.
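As a toy demonstration of that post-processing pipe, using a made-up two-line result set in place of the real spark3-sql output:

```shell
# Fake query output: a header row followed by one data row, tab separated.
printf 'posx\tposy\tcount\n3933\t898\t7144\n' \
  | tail -n +2 \
  | sed 's/[\t]/,/g'
# → 3933,898,7144
```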
spark3-sql --master yarn --executor-memory 16G --executor-cores 4 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx, posy, COUNT(*) as count FROM wmde_wikidata_map.wikidata_map_item_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' GROUP BY posx, posy ORDER BY count DESC LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g' > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
spark3-sql --master yarn --executor-memory 16G --executor-cores 4 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx1, posy1, posx2, posy2 FROM wmde_wikidata_map.wikidata_map_item_relation_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' AND forId = '${PROPERTY}' LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g' > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done

You should find the new files in your current working directory. You can check how many lines they have:
cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv | wc -l
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv | wc -l
done

And they will look something like this.

Pixel location and entity count:
posx,posy,count
3933,898,7144
3761,812,6263

Relations between pixel locations:
posx1,posy1,posx2,posy2
6111,1644,6126,1766
6078,1911,6126,1766

If everything went well, you are ready to publish the data.
Set an environment variable with the snapshot date:
WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17" "P36" "P47" "P138" "P150" "P190" "P197" "P403")

And move them into the /srv/published directory (making sure the directory exists first):
mkdir -p /srv/published/datasets/one-off/wikidata/wmde_wikidata_map
cp -v map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv /srv/published/datasets/one-off/wikidata/wmde_wikidata_map/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
cp -v map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv /srv/published/datasets/one-off/wikidata/wmde_wikidata_map/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done
published-sync

It can take a little while for the files to show up.
Make sure the files appear: https://analytics.wikimedia.org/published/datasets/one-off/wikidata/wmde_wikidata_map/