wikibase-dump-filter needs to parse each entity's JSON object and re-stringify it, which can take a considerable amount of time on a full dump. A lot of time can thus be saved by prefiltering the dump with tools that operate on text patterns, such as grep: see the prefilter documentation.
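A hedged sketch of such a prefilter (the grep pattern is an assumption about how item values are serialized in the dump; any false positives it lets through are then removed by the exact --claim check):
# crude text prefilter: keep only lines that mention Q5 as an item value, then apply the exact claim filter
cat entities.json | grep '"numeric-id":5,' | wikibase-dump-filter --claim P31:Q5 > humans.ndjson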
By claims
- from a local file
cat entities.json | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
cat entities.json | wikibase-dump-filter --claim P18 > entities_with_an_image.ndjson
cat entities.json | wikibase-dump-filter --claim P31:Q5,Q6256 > humans_and_countries.ndjson
Each of these commands filters entities.json into a subset where every line is the JSON of an entity matching the claim filter: for instance, entities with Q5 among their P31 claims (ndjson stands for newline-delimited JSON).
- directly from a Wikidata dump
curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
This can be quite convenient when you don't have enough space to keep the whole decompressed dump on your disk: here you only write the desired subset.
Of course, this probably only makes sense if the number of entities you are looking for is somewhere above 2,000,000: below that level, it would probably be faster and more efficient to get the list of ids from your Wikibase Query Service (see Wikidata Query Service), then fetch the entities' data from the API (which can easily be done with wikibase-cli's wb data command).
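A hedged sketch of that alternative, assuming wikibase-cli is installed and configured for your Wikibase instance (the ids below are purely illustrative):
# fetch the data of a handful of entities from the API instead of parsing the whole dump
wb data Q42 Q1868 > subset.ndjson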
and
# operator: &
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50' > books_with_an_author.ndjson
or
# operator: |
cat entities.json | wikibase-dump-filter --claim 'P31:Q146|P31:Q144' > cats_and_dogs.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --claim 'P31:Q146,Q144' > cats_and_dogs.ndjson
# the 'or' operator has priority on the 'and' operator:
# this claim filter is equivalent to (P31:Q571 && (P50 || P110))
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50|P110' > books_with_an_author_or_an_illustrator.ndjson
not
# operator: ~
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&~P50' > books_without_author.ndjson
If your claim filter is too long and triggers an "Argument list too long" error, you can pass a file instead:
echo 'P31:Q5,Q6256' > ./claim
cat entities.json | wikibase-dump-filter --claim ./claim > humans_and_countries.ndjson
By sitelinks
Keep only entities with a certain sitelink
# entities with a page on Wikimedia Commons
cat entities.json | wikibase-dump-filter --sitelink commonswiki > subset.ndjson
# entities with a Dutch Wikipedia article
cat entities.json | wikibase-dump-filter --sitelink nlwiki > subset.ndjson
# entities with a Russian Wikipedia or Wikiquote article
cat entities.json | wikibase-dump-filter --sitelink 'ruwiki|ruwikiquote' > subset.ndjson
You can build even finer filters by combining conditions with & (AND) / | (OR).
# entities with Chinese and French Wikipedia articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki' > subset.ndjson
# entities with Chinese and French Wikipedia articles, or Chinese and Spanish articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki|eswiki' > subset.ndjson
NB: A&B|C is interpreted as A AND (B OR C)
By type
Default: item
cat entities.json | wikibase-dump-filter --type item
cat entities.json | wikibase-dump-filter --type property
cat entities.json | wikibase-dump-filter --type both
Need another kind of filter? Just ask for it in the issues, or make a pull request!
Wikidata entities have the following attributes: id, type, labels, descriptions, aliases, claims, sitelinks.
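A quick, hedged way to check those attributes on your own subset (it assumes jq is installed; real dump lines may also carry a few extra metadata fields):
# list the top-level attributes of the first entity in a previously filtered subset
cat humans.ndjson | head -1 | jq 'keys'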
All in all, this takes a lot of space and might not be needed in your use case: for instance, if your goal is to do full-text search on a subset of Wikidata, you just need to keep the labels, aliases, and descriptions, and you can omit the claims and sitelinks, which take up a lot of space.
This can be done with either the --keep or the --omit option:
cat entities.json | wikibase-dump-filter --omit claims,sitelinks > humans.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --keep id,type,labels,descriptions,aliases > humans.ndjson
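For the full-text-search use case above, the attribute filter can be combined with a claim filter; a hedged sketch, assuming both filters are passed in a single call (the output filename is illustrative):
# keep only the text attributes of humans
cat entities.json | wikibase-dump-filter --claim P31:Q5 --keep id,type,labels,descriptions,aliases > humans_text_only.ndjson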
Keep only the desired languages for labels, descriptions, aliases, and sitelinks.
cat entities.json | wikibase-dump-filter --languages en,fr,de,zh,eo > subset.ndjson
Uses the wikidata-sdk simplify.entity function to simplify the labels, descriptions, aliases, claims, and sitelinks.
# Default simplify options
cat entities.json | wikibase-dump-filter --simplify > simplified_dump.ndjson
# Custom options, see wdk.simplify.entity documentation https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_entities_data.md
# and specifically for claims options, see https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_claims.md#options
cat entities.json | wikibase-dump-filter --simplify '{"keepRichValues":"true","keepQualifiers":"true","keepReferences":"true"}' > simplified_dump.ndjson
# The options can also be passed in a lighter, urlencoded-like, key=value format
# that's simpler than typing all those JSON double quotes
cat entities.json | wikibase-dump-filter --simplify 'keepRichValues=true&keepQualifiers=true&keepReferences=true' > simplified_dump.ndjson
All the options (see the wbk.simplify.entity documentation for more details; a usage sketch follows the list):
- claims simplification options:
  - entityPrefix and propertyPrefix (string)
  - keepRichValues (boolean)
  - keepQualifiers (boolean)
  - keepReferences (boolean)
  - keepIds (boolean)
  - keepHashes (boolean)
  - keepNonTruthy (boolean)
- sitelinks simplification options:
  - addUrl (boolean)
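A hedged usage sketch picking two of the options above, using the key=value format shown earlier (the combination is illustrative):
# keep qualifiers on claims and add full URLs to sitelinks
cat entities.json | wikibase-dump-filter --simplify 'keepQualifiers=true&addUrl=true' > simplified_dump.ndjson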
-z, --include-size include original entity byte length in output like { ..., "size": 548 }
-h, --help output usage information
-p, --progress enable the progress bar
-q, --quiet disable the progress bar
-V, --version output the version number
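A hedged example combining some of the flags above with an earlier claim filter (the combination is illustrative):
# show a progress bar and include each entity's original byte length in the output
cat entities.json | wikibase-dump-filter --claim P31:Q5 --include-size --progress > humans.ndjson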