The map is powered by pre-calculated pixels, not coordinates.
These tables only need to be created once...
CREATE TABLE IF NOT EXISTS wmde_wikidata_map.wikidata_map_item_pixels (
`id` string,
`posx` int,
`posy` int
)
PARTITIONED BY (
`snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
CREATE TABLE IF NOT EXISTS wmde_wikidata_map.wikidata_map_item_relation_pixels (
`forId` string,
`posx1` int,
`posy1` int,
`posx2` int,
`posy2` int
)
PARTITIONED BY (
`snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

Start a Spark SQL session:

spark3-sql --master yarn --executor-memory 16G --executor-cores 4 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64

You can read more about the WMF spark setup here.
SET hivevar:WIKIDATA_MAP_SNAPSHOT='2021-10-18';
SET hivevar:WIKIDATA_MAP_ITEM_COORD_TABLE=wmde_wikidata_map.wikidata_map_item_coordinates;
SET hivevar:WIKIDATA_MAP_ITEM_RELATION_TABLE=wmde_wikidata_map.wikidata_map_item_relations;

You also need to enable non-strict dynamic partitioning, as the inserts below write the snapshot partition dynamically:
SET hive.exec.dynamic.partition.mode=nonstrict;

From here we want to calculate pixel locations for a canvas, to avoid doing any computation on the client.
The primary target canvas size is 1920 x 1080.
The old "huge" map rendered at 8000 x 4000.
To reach a similar quality we multiply the target size by 4, giving 7680 x 4320.
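The coordinate-to-pixel mapping in the query below is a simple equirectangular projection onto that 7680 x 4320 canvas. As a minimal sketch of the same arithmetic (the input coordinates here are illustrative values, roughly Berlin):

```shell
# Mirror of the SQL pixel formulas below (illustrative input values).
# Longitude runs -180..180, latitude 90..-90 (y grows downwards on a canvas).
awk -v lon=13.3833 -v lat=52.5167 'BEGIN {
  posx = int((lon + 180) / 361 * 7680)  # matches the posx cast in the SQL
  posy = int((90 - lat) / 181 * 4320)   # abs((lat - 90) / 181) equals (90 - lat) / 181 for lat <= 90
  print posx "," posy
}'
# → 4114,894
```

The divisors 361 and 181 are taken from the query itself; they keep the maximum values just inside the canvas (longitude 180 maps to x = 7658, not 7680).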
- TODO add ids to the pixel entries, so they can be displayed on the map
- TODO the below query is for earth only...
INSERT INTO wmde_wikidata_map.wikidata_map_item_pixels
PARTITION(snapshot)
SELECT
id,
cast((cast(longitude as decimal(15, 10)) + 180) / 361 * 7680 as int) as posx,
cast(abs((cast(latitude as decimal(15, 10)) - 90) / 181 * 4320) as int) as posy,
snapshot
FROM ${WIKIDATA_MAP_ITEM_COORD_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
AND globe = "http://www.wikidata.org/entity/Q2";

Then figure out how the relations relate to our pixel map:
INSERT INTO wmde_wikidata_map.wikidata_map_item_relation_pixels
PARTITION(snapshot)
SELECT
x.forId as forId,
a.posx as posx1,
a.posy as posy1,
b.posx as posx2,
b.posy as posy2,
x.snapshot as snapshot
FROM (
SELECT fromId, toId, forId, snapshot
FROM ${WIKIDATA_MAP_ITEM_RELATION_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
) x
JOIN wmde_wikidata_map.wikidata_map_item_pixels a ON (a.id = x.fromId) AND a.snapshot=x.snapshot
JOIN wmde_wikidata_map.wikidata_map_item_pixels b ON (b.id = x.toId) AND b.snapshot=x.snapshot
WHERE x.snapshot=${WIKIDATA_MAP_SNAPSHOT}
GROUP BY
x.forId,
a.posx,
a.posy,
b.posx,
b.posy,
x.snapshot
LIMIT 100000000;

You should now have rows for the correct snapshot in all of the tables:
SELECT COUNT(*) FROM wmde_wikidata_map.wikidata_map_item_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};
SELECT COUNT(*) FROM wmde_wikidata_map.wikidata_map_item_relation_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};

If you accidentally insert duplicate rows, you can deduplicate the tables:
INSERT OVERWRITE TABLE wmde_wikidata_map.wikidata_map_item_pixels SELECT DISTINCT * FROM wmde_wikidata_map.wikidata_map_item_pixels;
INSERT OVERWRITE TABLE wmde_wikidata_map.wikidata_map_item_relation_pixels SELECT DISTINCT * FROM wmde_wikidata_map.wikidata_map_item_relation_pixels;

Exit the Spark SQL session for the remaining steps.
Set an environment variable with the snapshot date:
WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17" "P36" "P47" "P138" "P150" "P190" "P197" "P403")

And write the files:
- TODO tail -n +6?

`tail -n +2` removes the first line of output, which will be `PYSPARK_PYTHON=python3.7`, and `sed 's/[\t]/,/g'` turns the tab-separated output into a CSV.
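As a toy demonstration of that post-processing pipe, using a made-up two-line result set in place of the real spark3-sql output:

```shell
# Fake query output: a header row followed by one data row, tab separated.
printf 'posx\tposy\tcount\n3933\t898\t7144\n' \
  | tail -n +2 \
  | sed 's/[\t]/,/g'
# → 3933,898,7144
```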
spark3-sql --master yarn --executor-memory 16G --executor-cores 4 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx, posy, COUNT(*) as count FROM wmde_wikidata_map.wikidata_map_item_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' GROUP BY posx, posy ORDER BY count DESC LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g' > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
spark3-sql --master yarn --executor-memory 16G --executor-cores 4 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx1, posy1, posx2, posy2 FROM wmde_wikidata_map.wikidata_map_item_relation_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' AND forId = '${PROPERTY}' LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g' > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done

You should find the new files in your current working directory. You can check how many lines they have:
cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv | wc -l
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv | wc -l
done

And they will look something like this.

Pixel location and entity count:
posx,posy,count
3933,898,7144
3761,812,6263

Relations between pixel locations:
posx1,posy1,posx2,posy2
6111,1644,6126,1766
6078,1911,6126,1766

If everything went well, you are ready to publish the data.
Set an environment variable with the snapshot date:
WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17" "P36" "P47" "P138" "P150" "P190" "P197" "P403")

And move them into the /srv/published directory (making sure the directory exists first):
mkdir -p /srv/published/datasets/one-off/wikidata/wmde_wikidata_map
cp -v map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv /srv/published/datasets/one-off/wikidata/wmde_wikidata_map/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
cp -v map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv /srv/published/datasets/one-off/wikidata/wmde_wikidata_map/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done
published-sync

It can take a little while for the files to show up.
Make sure the files appear: https://analytics.wikimedia.org/published/datasets/one-off/wikidata/wmde_wikidata_map/