wikibase-dump-filter needs to parse each entity's JSON object and re-stringify it, which can take a considerable amount of time on a full dump. A lot of time can thus be saved by prefiltering the dump with tools that operate on text patterns, such as grep: see the prefilter documentation.
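A hedged sketch of such a prefilter (the grep pattern is an assumption about how item values are serialized in the dump; any false positives it lets through are then removed by the exact --claim check):
# crude text prefilter: keep only lines that mention Q5 as an item value, then apply the exact claim filter
cat entities.json | grep '"numeric-id":5,' | wikibase-dump-filter --claim P31:Q5 > humans.ndjson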
By claims
- from a local file
cat entities.json | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
cat entities.json | wikibase-dump-filter --claim P18 > entities_with_an_image.ndjson
cat entities.json | wikibase-dump-filter --claim P31:Q5,Q6256 > humans_and_countries.ndjson
Each of these commands filters entities.json into a subset where every line is the JSON of an entity matching the claim filter: for instance, entities with Q5 among their P31 claims (ndjson stands for newline-delimited JSON).
- directly from a Wikidata dump
curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
This can be quite convenient when you don't have enough space to keep the whole decompressed dump on your disk: here you only write the desired subset.
Of course, this probably only makes sense if the number of entities you are looking for is somewhere above 2,000,000: below that level, it would probably be faster and more efficient to get the list of ids from your Wikibase Query Service (see Wikidata Query Service), then fetch the entities' data from the API (which can easily be done with wikibase-cli's wb data command).
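A hedged sketch of that alternative, assuming wikibase-cli is installed and configured for your Wikibase instance (the ids below are purely illustrative):
# fetch the data of a handful of entities from the API instead of parsing the whole dump
wb data Q42 Q1868 > subset.ndjson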
and
# operator: &
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50' > books_with_an_author.ndjson
or
# operator: |
cat entities.json | wikibase-dump-filter --claim 'P31:Q146|P31:Q144' > cats_and_dogs.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --claim 'P31:Q146,Q144' > cats_and_dogs.ndjson
# the 'or' operator has priority on the 'and' operator:
# this claim filter is equivalent to (P31:Q571 && (P50 || P110))
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50|P110' > books_with_an_author_or_an_illustrator.ndjson
not
# operator: ~
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&~P50' > books_without_author.ndjson
If your claim filter is too long and triggers an "Argument list too long" error, you can pass a file instead:
echo 'P31:Q5,Q6256' > ./claim
cat entities.json | wikibase-dump-filter --claim ./claim > humans_and_countries.ndjson
By sitelinks
Keep only entities with a certain sitelink
# entities with a page on Wikimedia Commons
cat entities.json | wikibase-dump-filter --sitelink commonswiki > subset.ndjson
# entities with a Dutch Wikipedia article
cat entities.json | wikibase-dump-filter --sitelink nlwiki > subset.ndjson
# entities with a Russian Wikipedia or Wikiquote article
cat entities.json | wikibase-dump-filter --sitelink 'ruwiki|ruwikiquote' > subset.ndjson
You can build even finer filters by combining conditions with & (AND) / | (OR).
# entities with Chinese and French Wikipedia articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki' > subset.ndjson
# entities with Chinese and French Wikipedia articles, or Chinese and Spanish articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki|eswiki' > subset.ndjson
NB: A&B|C is interpreted as A AND (B OR C)
By type
Default: item
cat entities.json | wikibase-dump-filter --type item
cat entities.json | wikibase-dump-filter --type property
cat entities.json | wikibase-dump-filter --type both
Need another kind of filter? Just ask for it in the issues, or make a pull request!
Wikidata entities have the following attributes: id, type, labels, descriptions, aliases, claims, sitelinks.
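A quick, hedged way to check those attributes on your own subset (it assumes jq is installed; real dump lines may also carry a few extra metadata fields):
# list the top-level attributes of the first entity in a previously filtered subset
cat humans.ndjson | head -1 | jq 'keys'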
All in all, this takes a lot of space and might not be needed in your use case: for instance, if your goal is to do full-text search on a subset of Wikidata, you just need to keep the labels, aliases, and descriptions, and you can omit the claims and sitelinks, which take up a lot of space.
This can be done with either the --keep or the --omit option:
cat entities.json | wikibase-dump-filter --omit claims,sitelinks > humans.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --keep id,type,labels,descriptions,aliases > humans.ndjson
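For the full-text-search use case above, the attribute filter can be combined with a claim filter; a hedged sketch, assuming both filters are passed in a single call (the output filename is illustrative):
# keep only the text attributes of humans
cat entities.json | wikibase-dump-filter --claim P31:Q5 --keep id,type,labels,descriptions,aliases > humans_text_only.ndjson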
Keep only the desired languages for labels, descriptions, aliases, and sitelinks.
cat entities.json | wikibase-dump-filter --languages en,fr,de,zh,eo > subset.ndjson
Uses the wikidata-sdk simplify.entity function to simplify the labels, descriptions, aliases, claims, and sitelinks.
# Default simplify options
cat entities.json | wikibase-dump-filter --simplify > simplified_dump.ndjson
# Custom options, see wdk.simplify.entity documentation https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_entities_data.md
# and specifically for claims options, see https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_claims.md#options
cat entities.json | wikibase-dump-filter --simplify '{"keepRichValues":"true","keepQualifiers":"true","keepReferences":"true"}' > simplified_dump.ndjson
# The options can also be passed in a lighter, urlencoded-like, key=value format
# that's simpler than typing all those JSON double quotes
cat entities.json | wikibase-dump-filter --simplify 'keepRichValues=true&keepQualifiers=true&keepReferences=true' > simplified_dump.ndjson
All the options (see the wbk.simplify.entity documentation for more details; a usage sketch follows the list):
- claims simplification options:
  - entityPrefix and propertyPrefix (string)
  - keepRichValues (boolean)
  - keepQualifiers (boolean)
  - keepReferences (boolean)
  - keepIds (boolean)
  - keepHashes (boolean)
  - keepNonTruthy (boolean)
- sitelinks simplification options:
  - addUrl (boolean)
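A hedged usage sketch picking two of the options above, using the key=value format shown earlier (the combination is illustrative):
# keep qualifiers on claims and add full URLs to sitelinks
cat entities.json | wikibase-dump-filter --simplify 'keepQualifiers=true&addUrl=true' > simplified_dump.ndjson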
-z, --include-size include original entity byte length in output like { ..., "size": 548 }
-h, --help output usage information
-p, --progress enable the progress bar
-q, --quiet disable the progress bar
-V, --version output the version number
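A hedged example combining some of the flags above with an earlier claim filter (the combination is illustrative):
# show a progress bar and include each entity's original byte length in the output
cat entities.json | wikibase-dump-filter --claim P31:Q5 --include-size --progress > humans.ndjson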