Datasets

chengsijin0817 edited this page Jan 24, 2019 · 24 revisions

This document introduces the dataset used for the Linköping GraphQL Benchmark (LinGBM). The dataset is generated by the BSBM data generator, which is based on an e-commerce use case: a set of products is offered by different vendors, and consumers have posted reviews about these products. The dataset contains 9 classes and 8 relationships.

Graph model

The graph model has two representations: the RDF triple data model and the Named Graphs data model.

The diagram uses the following namespace prefixes:

| Prefix | Namespace |
| --- | --- |
| rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| rdfs: | http://www.w3.org/2000/01/rdf-schema# |
| foaf: | http://xmlns.com/foaf/0.1/ |
| dc: | http://purl.org/dc/elements/1.1/ |
| xsd: | http://www.w3.org/2001/XMLSchema# |
| rev: | http://purl.org/stuff/rev# |
| bsbm: | http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/ |
| bsbm-inst: | http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ |
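
As a small illustration of how these prefixes are used, the following sketch expands a prefixed name into a full IRI. The function name `expand` and the example names `rdf:type` and `bsbm:Product` are chosen here for illustration only.

```python
# Prefix-to-namespace mapping copied from the prefix list above.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rev": "http://purl.org/stuff/rev#",
    "bsbm": "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/",
    "bsbm-inst": "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'rdf:type' into a full IRI (hypothetical helper)."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return NAMESPACES[prefix] + local


print(expand("rdf:type"))
print(expand("bsbm:Product"))
```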

Relational model

The BSBM data generator can also output the benchmark dataset as a MySQL dump.

Relational schema:

Vendor (nr, label, comment, homepage, country, publisher, publishDate)

Offer (nr, product, producer, vendor, price, validFrom, validTo, deliveryDays, offerWebpage, publisher, publishDate)

Producer (nr, label, comment, homepage, country, publisher, publishDate)

Product (nr, label, comment, producer, propertyNum1, propertyNum2, propertyNum3, propertyNum4, propertyNum5, propertyNum6, propertyTex1, propertyTex2, propertyTex3, propertyTex4, propertyTex5, propertyTex6, publisher, publishDate)

Person (nr, name, mbox_sha1sum, country, publisher, publishDate)

Review (nr, product, producer, person, reviewDate, title, text, language, rating1, rating2, rating3, rating4, publisher, publishDate, ratingSite)

ProductFeature (nr, label, comment, publisher, publishDate)

ProductType (nr, label, comment, parent, publisher, publishDate)

ProductTypeProduct (product, productType)

ProductFeatureProduct (product, productFeature)

RatingSite (name, mbox_sha1sum)
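
To show how the tables relate to each other, here is a minimal sketch that builds a reduced version of three of the tables and joins them. It is not the generator's actual DDL: the column types, the choice of SQLite instead of MySQL, the subset of columns, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL dump (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Producer (nr INTEGER PRIMARY KEY, label TEXT, country TEXT);
    CREATE TABLE Product (
        nr INTEGER PRIMARY KEY,
        label TEXT,
        producer INTEGER REFERENCES Producer(nr)
    );
    CREATE TABLE Review (
        nr INTEGER PRIMARY KEY,
        product INTEGER REFERENCES Product(nr),
        title TEXT,
        rating1 INTEGER
    );
    """
)

# Made-up sample rows, not generator output.
conn.execute("INSERT INTO Producer VALUES (1, 'producer-1', 'US')")
conn.execute("INSERT INTO Product VALUES (1, 'product-1', 1)")
conn.executemany(
    "INSERT INTO Review VALUES (?, ?, ?, ?)",
    [(1, 1, "first review", 8), (2, 1, "second review", 6)],
)

# Product.producer and Review.product act as foreign keys.
rows = conn.execute(
    """
    SELECT p.label, pr.label, COUNT(r.nr), AVG(r.rating1)
    FROM Product p
    JOIN Producer pr ON p.producer = pr.nr
    LEFT JOIN Review r ON r.product = p.nr
    GROUP BY p.nr
    """
).fetchall()
print(rows)
```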

The data generator supports the following output formats:

| Format | Option |
| --- | --- |
| N-Triples | -s nt |
| Turtle | -s ttl |
| XML | -s xml |
| TriG | -s trig |
| (My-)SQL dump | -s sql |
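
The format options can be kept in a small lookup when scripting dataset generation. The sketch below only builds the `-s` argument pair listed above; the helper name `serializer_args` is hypothetical, and the generator executable and any other flags are deliberately left out.

```python
# Output format name -> value of the -s option, as listed above.
SERIALIZER_OPTIONS = {
    "N-Triples": "nt",
    "Turtle": "ttl",
    "XML": "xml",
    "TriG": "trig",
    "SQL": "sql",
}


def serializer_args(output_format: str) -> list[str]:
    """Return the '-s <value>' arguments for a format (hypothetical helper)."""
    if output_format not in SERIALIZER_OPTIONS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return ["-s", SERIALIZER_OPTIONS[output_format]]


print(serializer_args("Turtle"))
```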

Characteristics

The dataset can be scaled to different sizes through a scale factor, which is the number of products. The table below gives an overview of the characteristics of BSBM datasets for different scale factors:

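| Scale Factor | 1,000 | 2,000 | 5,000 | 10,000 |
| --- | --- | --- | --- | --- |
| Number of Triples | 27,886 | 377,241 | 1,620,320 | 3,421,251 |
| Number of Producers | 22 | 42 | 106 | 206 |
| Number of Vendors | 12 | 23 | 48 | 99 |
| Number of Offers | 20,000 | 40,000 | 100,000 | |
| Number of Persons | 503 | 1,017 | 2,553 | 5,102 |
| Number of Reviews | 10,000 | 20,000 | 50,000 | 100,000 |
| Number of Rating Sites | 1 | 2 | 5 | 11 |
| Number of Product Features | 4,745 | 4,745 | 10,519 | 10,519 |
| Number of Product Types | 151 | 151 | 329 | 329 |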

Generation Rules

Product

| Attribute | Data Type | Range | Note |
| --- | --- | --- | --- |
| Label | String | 1-3 words | |
| Comment | String | 50-150 words | |
| productPropertyTextualX | Literal | 3-15 words | |
| productPropertyNumericX | Integer | 1-2000 | normal distribution |
| productFeature | | | Every product has about 10-20 features |
| publishDate | date | 2000-09-20 to 2006-12-23 | |

Product Type

| Attribute | Data Type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| publishDate | date | 2000-05-20 to 2000-06-23 |

Product Features

| Attribute | Data Type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| publishDate | date | 2000-05-20 to 2000-06-23 |

Producer

| Attribute | Data Type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| foaf:homepage | URI | the namespace of the producer |
| country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
| publishDate | date | 2000-07-20 to 2005-06-23 |
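
The country distribution, which is also used for vendors and persons, amounts to a weighted random choice. The following is a minimal sketch of such a draw, not the BSBM generator's own code; the names `COUNTRY_WEIGHTS` and `sample_country` are hypothetical.

```python
import random

# Country code -> percentage, copied from the country row above.
COUNTRY_WEIGHTS = {
    "US": 40, "UK": 10, "JP": 10, "CN": 10,
    "DE": 5, "FR": 5, "ES": 5, "RU": 5, "KR": 5, "AT": 5,
}


def sample_country(rng: random.Random) -> str:
    """Draw one country code according to the listed percentages."""
    codes = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    return rng.choices(codes, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
print([sample_country(rng) for _ in range(5)])
```

The percentages sum to 100, so they can be passed to `random.choices` as weights without normalisation.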

Offer

| Attribute | Data Type | Range |
| --- | --- | --- |
| price | US-$ | 5 to 10000 |
| validFrom | date | 0-180 days before the publication date |
| validTo | date | 7-180 days after the publication date |
| deliveryDays | Integer | 1-21 |
| offerWebpage | URI | within the namespace of the producer |
| publishDate | date | (today - 97 days) to today |
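
The date rules for offers depend on each other: validFrom and validTo are both defined relative to the publication date. The sketch below shows one way to draw consistent values. It assumes uniform sampling within each range, which the rules above do not specify, and the function name `sample_offer_dates` is hypothetical.

```python
import random
from datetime import date, timedelta


def sample_offer_dates(rng: random.Random, today: date) -> dict:
    """Draw publishDate, validFrom, validTo and deliveryDays for one offer."""
    # publishDate lies between (today - 97 days) and today.
    publish_date = today - timedelta(days=rng.randint(0, 97))
    # validFrom is 0-180 days before the publication date.
    valid_from = publish_date - timedelta(days=rng.randint(0, 180))
    # validTo is 7-180 days after the publication date.
    valid_to = publish_date + timedelta(days=rng.randint(7, 180))
    return {
        "publishDate": publish_date,
        "validFrom": valid_from,
        "validTo": valid_to,
        "deliveryDays": rng.randint(1, 21),
    }


offer = sample_offer_dates(random.Random(1), date(2019, 1, 24))
print(offer)
```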

Vendor

| Attribute | Data Type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| foaf:homepage | URI | the namespace of the vendor |
| country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
| publishDate | date | 2000-09-20 to 2006-12-23 |

Person

| Attribute | Data Type | Range | Note |
| --- | --- | --- | --- |
| Name | String | 2-4 words | |
| mbox_sha1sum | Literal | random SHA-1 value | email address |
| country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% | |
| publishDate | date | 2000-09-20 to 2006-12-23 | |

Review

| Attribute | Data Type | Range | Note |
| --- | --- | --- | --- |
| Title | String | 4-15 words | |
| Text | String | 50-200 words | Language: EN 50%, JA 10%, ZH 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
| Review Date | date | Random date within the last year | |
| RatingX | Integer | 1 to 10 | A review may include up to 4 types of ratings; each rating type X is present with a likelihood of 70% |
| publishDate | date | Review Date to 2008-06-20 | |
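
The RatingX rule can be read as four independent draws per review. A minimal sketch under that reading follows; independence between the four rating types and uniform values from 1 to 10 are assumptions, and `sample_ratings` is a hypothetical name.

```python
import random


def sample_ratings(rng: random.Random) -> list:
    """Return [rating1, rating2, rating3, rating4]; None marks an absent rating."""
    ratings = []
    for _ in range(4):
        if rng.random() < 0.7:  # each rating type is present with likelihood 70%
            ratings.append(rng.randint(1, 10))
        else:
            ratings.append(None)
    return ratings


print(sample_ratings(random.Random(3)))
```

Absent ratings are modelled as `None` here, which would correspond to NULL in the rating1 to rating4 columns of the Review table.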

Relationship

| Relationship | Cardinality | Note |
| --- | --- | --- |
| Producer-Product | 1:N | One producer per product; 50 products per producer on average |
| Product-Review | 1:N | One product per review; 10 reviews per product on average; selection follows a normal distribution |
| Product-Offer | 1:N | One product per offer; 20 offers per product on average; selection follows a normal distribution |
| Person-Review | 1:N | One author per review; 20 reviews per person on average |
| RatingSite-Review | 1:N | Every review belongs to one rating site; 10,000 reviews per rating site on average |
| Vendor-Offer | 1:N | One vendor per offer; 2,000 offers per vendor on average |
| Product-ProductType | N:1 | One ProductType per product (leaves only) |
| Product-ProductFeature | M:N | 10-20 ProductFeatures per product |
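
The averages in the Relationship notes allow a rough estimate of entity counts for a given scale factor. The sketch below does only that arithmetic; it is not how the generator itself determines the counts, and `expected_counts` is a hypothetical name.

```python
def expected_counts(products: int) -> dict:
    """Estimate entity counts from the per-relationship averages."""
    offers = 20 * products   # 20 offers per product on average
    reviews = 10 * products  # 10 reviews per product on average
    return {
        "producers": products / 50,      # 50 products per producer
        "offers": offers,
        "vendors": offers / 2000,        # 2,000 offers per vendor
        "reviews": reviews,
        "persons": reviews / 20,         # 20 reviews per person
        "rating_sites": reviews / 10000, # 10,000 reviews per rating site
    }


print(expected_counts(1000))
```

For a scale factor of 1000 this gives 20,000 offers and 10,000 reviews, matching the Characteristics table, and about 20 producers, 10 vendors and 500 persons, close to the reported 22, 12 and 503.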