Datasets
This document describes the dataset used by the Linköping GraphQL Benchmark (LinGBM). The dataset is generated by the BSBM (Berlin SPARQL Benchmark) data generator, which models an e-commerce use case. It contains 9 entities and 8 relationships: producers produce products, vendors offer these products, and consumers post reviews about them.
This part shows the graph model of the dataset. For the cardinalities of the relationships, see Relationships.
The namespaces used in the diagram are as follows:
The dataset has two graph representations: an RDF triple representation and a Named Graphs representation. Examples of both representations can be found here.
The dataset can also be represented relationally. So that the graph model and the relational model carry the same semantics, the BSBM data generator can output the dataset as a MySQL dump. This dump uses the following entity-relationship model and relational schema:
Vendor (nr, label, comment, homepage, country, publisher, publishDate)
Offer (nr, product, producer, vendor, price, validFrom, validTo, deliveryDays, offerWebpage, publisher, publishDate)
Producer (nr, label, comment, homepage, country, publisher, publishDate)
Product (nr, label, comment, producer, propertyNum1, propertyNum2, propertyNum3, propertyNum4, propertyNum5, propertyNum6, propertyTex1, propertyTex2, propertyTex3, propertyTex4, propertyTex5, propertyTex6, publisher, publishDate)
Person (nr, name, mbox_sha1sum, country, publisher, publishDate)
Review (nr, product, producer, person, reviewDate, title, text, language, rating1, rating2, rating3, rating4, publisher, publishDate, ratingSite)
ProductFeature (nr, label, comment, publisher, publishDate)
ProductType (nr, label, comment, parent, publisher, publishDate)
ProductTypeProduct (product, productType)
ProductFeatureProduct (product, productFeature)
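
To illustrate how the link tables connect products to their features, here is a minimal sketch that builds a cut-down version of the schema in an in-memory SQLite database and joins a product to its features. The dump itself targets MySQL; the column types, the subset of columns, and the sample rows are assumptions made only for this example.

```python
import sqlite3

# In-memory database. Only the table and column names come from the schema
# above; the column types are assumptions and most columns are left out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (nr INTEGER PRIMARY KEY, label TEXT, producer INTEGER);
    CREATE TABLE ProductFeature (nr INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE ProductFeatureProduct (product INTEGER, productFeature INTEGER);
""")

# Made-up sample rows, purely for illustration.
conn.execute("INSERT INTO Product VALUES (1, 'example product', 7)")
conn.executemany("INSERT INTO ProductFeature VALUES (?, ?)",
                 [(10, "feature a"), (11, "feature b")])
conn.executemany("INSERT INTO ProductFeatureProduct VALUES (?, ?)",
                 [(1, 10), (1, 11)])

def features_of(product_nr):
    """Return the feature labels of one product via the M:N link table."""
    rows = conn.execute(
        """SELECT f.label
           FROM ProductFeatureProduct AS pfp
           JOIN ProductFeature AS f ON f.nr = pfp.productFeature
           WHERE pfp.product = ?
           ORDER BY f.nr""",
        (product_nr,),
    ).fetchall()
    return [label for (label,) in rows]

print(features_of(1))
```

The same pattern applies to ProductTypeProduct, which links products to product types.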
The data generator supports the following output formats (TODO: detailed description):
Format | Option |
---|---|
N-Triples | -s nt |
Turtle | -s ttl |
XML | -s xml |
TriG | -s trig |
(My-)SQL dump | -s sql |
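
As a small sketch of how a script might select one of these formats, the helper below maps a format name to the `-s` argument pair from the table. The dictionary keys and the function name `serializer_args` are hypothetical; only the option values come from the table, and how the generator itself is invoked is deliberately left out.

```python
# Serializer options copied from the table above; the keys are just labels
# chosen for this sketch.
SERIALIZER_OPTIONS = {
    "N-Triples": "nt",
    "Turtle": "ttl",
    "XML": "xml",
    "TriG": "trig",
    "SQL": "sql",
}

def serializer_args(fmt):
    """Return the ['-s', value] argument pair for a format name."""
    try:
        return ["-s", SERIALIZER_OPTIONS[fmt]]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None

print(serializer_args("Turtle"))  # ['-s', 'ttl']
```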
The dataset can be scaled to different sizes by a scale factor, which is the number of products. This table gives an overview of the characteristics of BSBM datasets at different scale factors:
Scale Factor | 1,000 | 2,000 | 5,000 | 10,000 |
---|---|---|---|---|
Number of Triples | 27,886 | 377,241 | 1,620,320 | 3,421,251 |
Number of Producers | 22 | 42 | 106 | 206 |
Number of Vendors | 12 | 23 | 48 | 99 |
Number of Offers | 20,000 | 40,000 | 100,000 | |
Number of Persons | 503 | 1,017 | 2,553 | 5,102 |
Number of Reviews | 10,000 | 20,000 | 50,000 | 100,000 |
Number of Rating Sites | 1 | 2 | 5 | 11 |
Number of Product Features | 4,745 | 4,745 | 10,519 | 10,519 |
Number of Product Types | 151 | 151 | 329 | 329 |
This part introduces the data generation rules that are used to populate the dataset according to the given scale factor.
Attribute | Data Type | range | note |
---|---|---|---|
Label | String | 1-3 words | |
Comment | String | 50-150 words | |
productPropertyTextualX | Literal | 3-15 words | |
productPropertyNumericX | Integer | 1-2000 | normal distribution |
productFeature | | about 10-20 features per product | |
publishDate | date | 2000-09-20 to 2006-12-23 | |
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
publishDate | date | 2000-05-20 to 2000-06-23 |
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
publishDate | date | 2000-05-20 to 2000-06-23 |
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
foaf:homepage | URI | the namespace of the producer |
country | ISO 3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
publishDate | date | 2000-07-20 to 2005-06-23 |
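
The country shares above amount to a weighted random choice. A minimal sketch of such a draw follows; it is not the generator's own code, and the fixed seed and sample size are assumptions chosen only to make the example repeatable.

```python
import random

# Country shares in percent, taken from the table above.
COUNTRY_WEIGHTS = {
    "US": 40, "UK": 10, "JP": 10, "CN": 10,
    "DE": 5, "FR": 5, "ES": 5, "RU": 5, "KR": 5, "AT": 5,
}

def sample_country(rng):
    """Pick one country code with probability proportional to its share."""
    codes = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    return rng.choices(codes, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
sample = [sample_country(rng) for _ in range(10_000)]
print(sample.count("US") / len(sample))  # roughly 0.40 for a large sample
```

The same distribution is used for the country attribute of the other entities that list it.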
Attribute | Data Type | range |
---|---|---|
price | US-$ | 5 to 10000 |
validFrom | date | 0-180 days before the publication date |
validTo | date | 7-180 days after the publication date |
deliveryDays | Integer | 1-21 |
offerWebpage | URI | within the namespace of the producer |
publishDate | date | (today - 97 days) to today |
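
The date ranges above are all relative, either to today or to the publication date. The sketch below shows one way to draw a consistent triple of dates. Uniform sampling is an assumption (the table gives only the ranges), and the value passed as `today` is arbitrary.

```python
import random
from datetime import date, timedelta

def sample_offer_dates(rng, today):
    """Draw publishDate, validFrom and validTo within the listed ranges.

    Uniform sampling is an assumption; the table only states the ranges.
    """
    publish = today - timedelta(days=rng.randint(0, 97))
    valid_from = publish - timedelta(days=rng.randint(0, 180))
    valid_to = publish + timedelta(days=rng.randint(7, 180))
    return publish, valid_from, valid_to

rng = random.Random(1)  # fixed seed only to make the sketch repeatable
publish, valid_from, valid_to = sample_offer_dates(rng, date(2008, 6, 20))
print(valid_from, "<=", publish, "<", valid_to)
```

Because validTo lies at least 7 days after the publication date, every offer drawn this way is valid for at least a week.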
Attribute | Data Type | range |
---|---|---|
Label | String | 1-3 words |
Comment | String | 20-50 words |
foaf:homepage | URI | the namespace of the vendor |
country | ISO 3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
publishDate | date | 2000-09-20 to 2006-12-23 |
Attribute | Data Type | range | note |
---|---|---|---|
Name | String | 2-4 words | |
mbox_sha1sum | literal | random sha1 value | email address |
country | ISO 3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% | |
publishDate | date | 2000-09-20 to 2006-12-23 | |
Attribute | Data Type | range | note |
---|---|---|---|
Title | String | 4-15 words | |
Text | String | 50-200 words | lang: EN 50%, JA 10%, ZH 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
Review Date | date | Random date within the last year | |
RatingX | Integer | 1 to 10 | Reviews might include up to 4 types of ratings. The likelihood that a review has a rating of type X is 70% |
publishDate | date | from Review Date to 2008-06-20 | |
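
The rating rule above means that a review carries 4 × 0.7 = 2.8 ratings on average. A minimal sketch of that rule follows. It assumes the four rating types are included independently and that values are drawn uniformly from 1 to 10; the table states neither.

```python
import random

def sample_ratings(rng, rating_types=4, likelihood=0.7):
    """Return a dict such as {'rating1': 8, 'rating3': 2}.

    Each rating type is included independently with the given likelihood.
    Independence and the uniform 1-10 draw are assumptions.
    """
    ratings = {}
    for x in range(1, rating_types + 1):
        if rng.random() < likelihood:
            ratings[f"rating{x}"] = rng.randint(1, 10)
    return ratings

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
reviews = [sample_ratings(rng) for _ in range(10_000)]
print(sum(len(r) for r in reviews) / len(reviews))  # close to 2.8
```

A review can end up with no rating at all; under these assumptions that happens with probability 0.3^4, or about 0.8%.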
Relationship | cardinalities | note |
---|---|---|
Producer-Product | 1:N | One producer per product; 50 products per producer on average |
Product-Review | 1:N | One product per review, product selection follows a normal distribution; 10 reviews per product on average |
Product-Offer | 1:N | One product per offer, product selection follows a normal distribution; 20 offers per product on average |
Person-Review | 1:N | One author per review; 20 reviews per person on average |
RatingSite-Review | 1:N | One rating site per review; 10,000 reviews per rating site on average |
Vendor-Offer | 1:N | One vendor per offer; 2,000 offers per vendor on average |
Product-ProductType | N:1 | One ProductType per product (leaves only) |
Product-ProductFeature | M:N | 10-20 ProductFeatures per product |
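
The averages in this table allow a rough estimate of the entity counts for a given scale factor. The sketch below does that arithmetic. The function name `estimate_counts` is hypothetical, and the results are estimates only: the generator's actual counts differ slightly (for example, 22 producers and 503 persons at scale factor 1,000).

```python
def estimate_counts(products):
    """Estimate entity counts for a scale factor from the averages above."""
    offers = products * 20    # 20 offers per product on average
    reviews = products * 10   # 10 reviews per product on average
    return {
        "offers": offers,
        "reviews": reviews,
        "producers": round(products / 50),      # 50 products per producer
        "vendors": round(offers / 2000),        # 2,000 offers per vendor
        "persons": round(reviews / 20),         # 20 reviews per person
        "rating_sites": round(reviews / 10000), # 10,000 reviews per site
    }

print(estimate_counts(1000))
```

For a scale factor of 1,000 this gives 20,000 offers, 10,000 reviews, 20 producers, 10 vendors, 500 persons and 1 rating site.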