Datasets
This document introduces the dataset that is used for the Linköping GraphQL Benchmark (LinGBM). The dataset of LinGBM is generated by the BSBM data generator, which is based on an e-commerce use case. The dataset contains 9 classes and 8 relationships. A set of products is offered by different vendors and different consumers have posted reviews about products.
The graph model has two representations: the RDF triple data model and the Named Graphs data model.
The namespaces used in the diagram are as follows:
The BSBM data generator can also output the benchmark dataset as a MySQL dump. The relational schema consists of the following tables:
Vendor (nr, label, comment, homepage, country, publisher, publishDate)
Offer (nr, product, producer, vendor, price, validFrom, validTo, deliveryDays, offerWebpage, publisher, publishDate)
Producer (nr, label, comment, homepage, country, publisher, publishDate)
Product (nr, label, comment, producer, propertyNum1, propertyNum2, propertyNum3, propertyNum4, propertyNum5, propertyNum6, propertyTex1, propertyTex2, propertyTex3, propertyTex4, propertyTex5, propertyTex6, publisher, publishDate)
Person (nr, name, mbox_sha1sum, country, publisher, publishDate)
Review (nr, product, producer, person, reviewDate, title, text, language, rating1, rating2, rating3, rating4, publisher, publishDate, ratingSite)
ProductFeature (nr, label, comment, publisher, publishDate)
ProductType (nr, label, comment, parent, publisher, publishDate)
ProductTypeProduct (product, productType)
ProductFeatureProduct (product, productFeature)
RatingSite (name, mbox_sha1sum)
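
To illustrate how the join tables in this schema connect products to their features, here is a minimal sketch using Python's built-in `sqlite3` module. It covers only a subset of the tables and columns listed above. The column types, key constraints, and sample rows are assumptions made for illustration; the actual MySQL dump produced by the generator may declare them differently.

```python
import sqlite3

# Assumption: column types and constraints are illustrative only;
# the source lists column names but not their SQL types.
SCHEMA = """
CREATE TABLE Producer (nr INTEGER PRIMARY KEY, label TEXT, country TEXT);
CREATE TABLE Product (
    nr INTEGER PRIMARY KEY,
    label TEXT,
    producer INTEGER REFERENCES Producer(nr)
);
CREATE TABLE ProductFeature (nr INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE ProductFeatureProduct (
    product INTEGER REFERENCES Product(nr),
    productFeature INTEGER REFERENCES ProductFeature(nr),
    PRIMARY KEY (product, productFeature)
);
"""


def features_of_product(conn, product_nr):
    """Return the feature labels of one product via the join table."""
    rows = conn.execute(
        """
        SELECT f.label
        FROM ProductFeatureProduct AS pfp
        JOIN ProductFeature AS f ON f.nr = pfp.productFeature
        WHERE pfp.product = ?
        ORDER BY f.nr
        """,
        (product_nr,),
    ).fetchall()
    return [label for (label,) in rows]


# Hypothetical sample rows, not generator output.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO Producer VALUES (1, 'producer one', 'US')")
conn.execute("INSERT INTO Product VALUES (1, 'product one', 1)")
conn.executemany(
    "INSERT INTO ProductFeature VALUES (?, ?)",
    [(1, "feature a"), (2, "feature b")],
)
conn.executemany(
    "INSERT INTO ProductFeatureProduct VALUES (?, ?)",
    [(1, 1), (1, 2)],
)
print(features_of_product(conn, 1))  # ['feature a', 'feature b']
```

The same pattern would apply to `ProductTypeProduct`, which links products to product types.
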
The data generator supports the following output formats:
Format | Option |
---|---|
N-Triples | -s nt |
Turtle | -s ttl |
XML | -s xml |
TriG | -s trig |
(My-)SQL dump | -s sql |
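
When scripting dataset generation, the table above can be kept as a small lookup. The sketch below only builds the `-s` argument pair. The dictionary keys and the helper name `serializer_args` are hypothetical, and how the generator itself is launched is not covered here.

```python
# Option values copied from the table above; the keys are labels chosen for this sketch.
FORMAT_OPTIONS = {
    "N-Triples": "nt",
    "Turtle": "ttl",
    "XML": "xml",
    "TriG": "trig",
    "SQL dump": "sql",
}


def serializer_args(format_name):
    """Return the ['-s', value] argument pair for a given output format."""
    try:
        return ["-s", FORMAT_OPTIONS[format_name]]
    except KeyError:
        raise ValueError(f"unsupported output format: {format_name!r}") from None


print(serializer_args("Turtle"))  # ['-s', 'ttl']
```
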
The dataset can be scaled to different sizes via a scale factor, which is the number of products. The table below gives an overview of the characteristics of BSBM datasets at different scale factors:
Scale Factor | 1,000 | 2,000 | 5,000 | 10,000 |
---|---|---|---|---|
Number of Triples | 27,886 | 377,241 | 1,620,320 | 3,421,251 |
Number of Producers | 22 | 42 | 106 | 206 |
Number of Vendors | 12 | 23 | 48 | 99 |
Number of Offers | 20,000 | 40,000 | 100,000 | |
Number of Persons | 503 | 1,017 | 2,553 | 5,102 |
Number of Reviews | 10,000 | 20,000 | 50,000 | 100,000 |
Number of Rating Sites | 1 | 2 | 5 | 11 |
Number of Product Features | 4,745 | 4,745 | 10,519 | 10,519 |
Number of Product Types | 151 | 151 | 329 | 329 |
Attribute | Data Type | range | note |
---|---|---|---|
Label | String | 1-3 words | |
Comment | String | 50-150 words | |
productPropertyTextualX | Literal | 3-15 words | |
productPropertyNumericX | Integer | 1-2000 | normal distribution |
productFeature | | 10-20 features per product | |
publishDate | date | 2000-09-20 to 2006-12-23 | |
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
publishDate | date | 2000-05-20 to 2000-06-23 |
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
publishDate | date | 2000-05-20 to 2000-06-23 |
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
foaf:homepage | URI | the namespace of the producer |
country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
publishDate | date | 2000-07-20 to 2005-06-23 |
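
The country distribution used in these attribute tables can be reproduced with a weighted random choice. This is a minimal sketch and not the generator's own code. The function name `sample_countries` and the use of a seed are assumptions; the weights are the percentages from the `country` rows.

```python
import random

# Percentages taken from the country rows of the attribute tables.
COUNTRY_WEIGHTS = {
    "US": 40, "UK": 10, "JP": 10, "CN": 10,
    "DE": 5, "FR": 5, "ES": 5, "RU": 5, "KR": 5, "AT": 5,
}


def sample_countries(n, seed=None):
    """Draw n country codes according to the stated percentages."""
    rng = random.Random(seed)
    codes = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    return rng.choices(codes, weights=weights, k=n)


sample = sample_countries(10_000, seed=1)
# The share of "US" should be close to 0.40 for a large sample.
print(sample.count("US") / len(sample))
```
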
Attribute | Data Type | range |
---|---|---|
price | US-$ | 5 to 10000 |
validFrom | date | 0-180 days before the publication date |
validTo | date | 7-180 days after the publication date |
deliveryDays | Integer | 1-21 |
offerWebpage | URI | within the namespace of the producer |
publishDate | date | (today - 97 days) to today |
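
The date ranges of an offer are all defined relative to its publication date. The sketch below draws one set of values within the stated ranges. It assumes uniform sampling, which the table does not specify, and the function name `sample_offer` is hypothetical.

```python
import random
from datetime import date, timedelta


def sample_offer(today, rng):
    """Draw one offer's date and delivery attributes within the stated ranges.

    Assumption: values are sampled uniformly; the source gives only the ranges.
    """
    # publishDate lies between (today - 97 days) and today.
    publish_date = today - timedelta(days=rng.randint(0, 97))
    return {
        "publishDate": publish_date,
        # validFrom: 0-180 days before the publication date.
        "validFrom": publish_date - timedelta(days=rng.randint(0, 180)),
        # validTo: 7-180 days after the publication date.
        "validTo": publish_date + timedelta(days=rng.randint(7, 180)),
        # deliveryDays: 1-21.
        "deliveryDays": rng.randint(1, 21),
    }


offer = sample_offer(date.today(), random.Random(42))
print(offer["validFrom"] <= offer["publishDate"] < offer["validTo"])  # True
```
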
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
foaf:homepage | URI | the namespace of the vendor |
country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
publishDate | date | 2000-09-20 to 2006-12-23 |
Attribute | Data Type | range | note |
---|---|---|---|
Name | String | 2-4 words | |
mbox_sha1sum | literal | random sha1 value | email address |
country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% | |
publishDate | date | 2000-09-20 to 2006-12-23 | |
Attribute | Data Type | range | note |
---|---|---|---|
Title | String | 4-15 words | |
Text | String | 50-200 words | lang: (EN 50%, JA 10%, ZH 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5%) |
Review Date | date | Random date within the last year | |
RatingX | Integer | 1 to 10 | Reviews might include up to 4 types of ratings. The likelihood that a review has a rating of type X is 70%. |
publishDate | date | from Review Date to 2008-06-20 | |
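
The rating rule (up to 4 rating types, each present with 70% likelihood, values from 1 to 10) can be sketched as follows. The function name `sample_ratings`, the `None` marker for a missing rating, and the uniform choice of the rating value are assumptions for illustration.

```python
import random


def sample_ratings(rng, rating_types=4, presence=0.7):
    """Draw rating1..rating4 for one review; None marks a missing rating.

    Assumption: rating values are uniform over 1..10; the source gives only the range.
    """
    ratings = {}
    for x in range(1, rating_types + 1):
        if rng.random() < presence:
            ratings[f"rating{x}"] = rng.randint(1, 10)
        else:
            ratings[f"rating{x}"] = None
    return ratings


rng = random.Random(3)
reviews = [sample_ratings(rng) for _ in range(10_000)]
present = sum(r["rating1"] is not None for r in reviews)
# The share of reviews carrying rating1 should be close to 0.70.
print(present / len(reviews))
```
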
Relationship | Cardinality | Note |
---|---|---|
Producer-Product | 1:N | One producer per product; 50 products per producer on average |
Product-Review | 1:N | One product per review; 10 reviews per product on average; selection follows a normal distribution |
Product-Offer | 1:N | One product per offer; 20 offers per product on average; selection follows a normal distribution |
Person-Review | 1:N | One author per review; 20 reviews per person on average |
RatingSite-Review | 1:N | Every review belongs to one rating site; a rating site generates 10000 reviews on average |
Vendor-Offer | 1:N | Every offer belongs to one vendor; 2000 offers per vendor on average |
Product-ProductType | N:1 | One ProductType per product (leaf types only) |
Product-ProductFeature | M:N | 10-20 ProductFeatures per product |
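
The averages in this table allow a rough estimate of entity counts for a given scale factor. The sketch below is a back-of-the-envelope calculation, not the generator's algorithm. The function name `estimate_counts` is hypothetical, and rounding to the nearest integer is an assumption.

```python
def estimate_counts(products):
    """Estimate entity counts from the scale factor using the stated averages."""
    offers = 20 * products   # 20 offers per product on average
    reviews = 10 * products  # 10 reviews per product on average
    return {
        "producers": round(products / 50),        # 50 products per producer
        "offers": offers,
        "reviews": reviews,
        "persons": round(reviews / 20),           # 20 reviews per person
        "vendors": round(offers / 2000),          # 2000 offers per vendor
        "rating_sites": round(reviews / 10000),   # 10000 reviews per rating site
    }


print(estimate_counts(1000))
```

For a scale factor of 1000 this gives 20 producers, 10 vendors, and 500 persons, compared with 22, 12, and 503 in the scale factor overview, so the averages hold only approximately. The offer and review counts (20,000 and 10,000) match the overview exactly.
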