
Datasets

chengsijin0817 edited this page Jan 25, 2019 · 24 revisions

This document describes the dataset used by the Linköping GraphQL Benchmark (LinGBM). The dataset is generated by the data generator of the Berlin SPARQL Benchmark (BSBM), which is based on an e-commerce use case. The dataset contains 9 entities and 8 relationships. In this use case, producers make products, vendors offer these products, and consumers post reviews about them.

Graph model

This part shows the graph model of the dataset. For the cardinalities of the relationships, see the Relationship section.

The diagram uses the following namespace prefixes:

| Prefix | Namespace |
| --- | --- |
| rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| rdfs: | http://www.w3.org/2000/01/rdf-schema# |
| foaf: | http://xmlns.com/foaf/0.1/ |
| dc: | http://purl.org/dc/elements/1.1/ |
| xsd: | http://www.w3.org/2001/XMLSchema# |
| rev: | http://purl.org/stuff/rev# |
| bsbm: | http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/ |
| bsbm-inst: | http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ |

There are two graph representations of the dataset: an RDF triple representation and a Named Graphs representation. Examples of both representations are available here.

Relational model

The dataset can also be represented relationally. So that the graph model and the relational model carry the same semantics, the BSBM data generator can output the dataset as a MySQL dump. This dump uses the following entity-relationship model and relational schema:

Relational schema:

Vendor (nr, label, comment, homepage, country, publisher, publishDate)
Offer (nr, product, producer, vendor, price, validFrom, validTo, deliveryDays, offerWebpage, publisher, publishDate)
Producer (nr, label, comment, homepage, country, publisher, publishDate)
Product (nr, label, comment, producer, propertyNum1, propertyNum2, propertyNum3, propertyNum4, propertyNum5, propertyNum6, propertyTex1, propertyTex2, propertyTex3, propertyTex4, propertyTex5, propertyTex6, publisher, publishDate)
Person (nr, name, mbox_sha1sum, country, publisher, publishDate)
Review (nr, product, producer, person, reviewDate, title, text, language, rating1, rating2, rating3, rating4, publisher, publishDate, ratingSite)
ProductFeature (nr, label, comment, publisher, publishDate)
ProductType (nr, label, comment, parent, publisher, publishDate)
ProductTypeProduct (product, productType)
ProductFeatureProduct (product, productFeature)
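
To illustrate how the relational schema can be queried, the following minimal sketch loads a trimmed subset of the tables into an in-memory SQLite database and counts reviews per producer. The column types, the reduced column lists, the sample rows, and the helper names `build_sample_db` and `reviews_per_producer` are assumptions made for this example; the actual MySQL dump may declare the tables differently.

```python
import sqlite3

# Only the table and column names come from the schema above.
# Column types and the reduced column lists are assumptions.
DDL = """
CREATE TABLE Producer (nr INTEGER PRIMARY KEY, label TEXT, country TEXT);
CREATE TABLE Product  (nr INTEGER PRIMARY KEY, label TEXT,
                       producer INTEGER REFERENCES Producer(nr));
CREATE TABLE Review   (nr INTEGER PRIMARY KEY,
                       product INTEGER REFERENCES Product(nr),
                       person INTEGER, title TEXT, rating1 INTEGER);
"""


def build_sample_db():
    """Create an in-memory database with a few hand-written sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute("INSERT INTO Producer VALUES (1, 'producer one', 'US')")
    conn.executemany(
        "INSERT INTO Product VALUES (?, ?, ?)",
        [(1, "product one", 1), (2, "product two", 1)],
    )
    conn.executemany(
        "INSERT INTO Review VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, 7, "first review", 8),
            (2, 1, 9, "second review", 6),
            (3, 2, 7, "third review", None),  # a review without rating1
        ],
    )
    return conn


def reviews_per_producer(conn):
    """Follow Producer -> Product -> Review and count reviews per producer."""
    query = """
        SELECT pr.label, COUNT(r.nr)
        FROM Producer AS pr
        JOIN Product AS p ON p.producer = pr.nr
        JOIN Review AS r ON r.product = p.nr
        GROUP BY pr.nr
    """
    return conn.execute(query).fetchall()
```

The join follows the Producer-Product and Product-Review relationships through the `producer` and `product` foreign-key columns of the schema.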

The data generator supports the following output formats (TODO: detailed description):

| Format | Option |
| --- | --- |
| N-Triples | -s nt |
| Turtle | -s ttl |
| XML | -s xml |
| TriG | -s trig |
| (My-)SQL dump | -s sql |
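
When the generator is driven from a script, the table above can be kept as a small lookup. The sketch below only covers the `-s` option listed in the table; the helper name `serializer_args` is hypothetical, and any other generator arguments are deliberately left out because they are not described here.

```python
# Mapping taken from the table above: output format -> value of the -s option.
SERIALIZER_OPTIONS = {
    "N-Triples": "nt",
    "Turtle": "ttl",
    "XML": "xml",
    "TriG": "trig",
    "(My-)SQL dump": "sql",
}


def serializer_args(output_format):
    """Return the command-line arguments that select the given output format."""
    try:
        return ["-s", SERIALIZER_OPTIONS[output_format]]
    except KeyError:
        raise ValueError(f"unsupported output format: {output_format!r}") from None
```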

Characteristics

The dataset scales to different sizes according to a given scale factor, which is the number of products. This table gives an overview of the characteristics of BSBM datasets with different scale factors:

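| Scale factor (number of products) | 1,000 | 2,000 | 5,000 | 10,000 |
| --- | --- | --- | --- | --- |
| Number of Triples | 27,886 | 377,241 | 1,620,320 | 3,421,251 |
| Number of Producers | 22 | 42 | 106 | 206 |
| Number of Vendors | 12 | 23 | 48 | 99 |
| Number of Offers | 20,000 | 40,000 | 100,000 | |
| Number of Persons | 503 | 1,017 | 2,553 | 5,102 |
| Number of Reviews | 10,000 | 20,000 | 50,000 | 100,000 |
| Number of Rating Sites | 1 | 2 | 5 | 11 |
| Number of Product Features | 4,745 | 4,745 | 10,519 | 10,519 |
| Number of Product Types | 151 | 151 | 329 | 329 |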

Generation Rules

This part introduces the data generation rules that are used to populate the dataset according to the given scale factor.

Product

| Attribute | Data type | Range | Note |
| --- | --- | --- | --- |
| Label | String | 1-3 words | |
| Comment | String | 50-150 words | |
| productPropertyTextualX | Literal | 3-15 words | |
| productPropertyNumericX | Integer | 1-2000 | normal distribution |
| productFeature | | | Every product has about 10-20 features |
| publishDate | Date | 2000-09-20 to 2006-12-23 | |
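
The product rules above can be sketched as follows. This is an illustration under stated assumptions, not the BSBM generator's code: the placeholder vocabulary, the mean and standard deviation of the normal distribution, the clamping to the 1-2000 range, the uniform choice of dates, and all function names are hypothetical.

```python
import random
from datetime import date, timedelta

# Placeholder vocabulary (assumption); the real generator uses its own word list.
WORDS = ["alpha", "beta", "gamma", "delta", "epsilon"]


def random_words(rng, low, high):
    """Return between low and high random words joined by spaces."""
    return " ".join(rng.choice(WORDS) for _ in range(rng.randint(low, high)))


def random_date(rng, start, end):
    """Return a uniformly chosen date between start and end, inclusive."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def bounded_normal_int(rng, low, high):
    """Draw a normally distributed integer and clamp it to [low, high].

    Mean (centre of the range) and standard deviation (one sixth of the
    range) are assumptions; the rules above only say "normal distribution".
    """
    mean = (low + high) / 2
    stddev = (high - low) / 6
    return min(high, max(low, round(rng.gauss(mean, stddev))))


def generate_product(rng, nr, feature_ids):
    """Build one product record that respects the ranges in the table above."""
    return {
        "nr": nr,
        "label": random_words(rng, 1, 3),
        "comment": random_words(rng, 50, 150),
        "propertyTex1": random_words(rng, 3, 15),
        "propertyNum1": bounded_normal_int(rng, 1, 2000),
        "features": rng.sample(feature_ids, rng.randint(10, 20)),
        "publishDate": random_date(rng, date(2000, 9, 20), date(2006, 12, 23)),
    }
```

Passing a seeded `random.Random` instance as `rng` makes the generated records reproducible, for example `generate_product(random.Random(42), 1, list(range(100)))`.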

Product Type

| Attribute | Data type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| publishDate | Date | 2000-05-20 to 2000-06-23 |

Product Features

| Attribute | Data type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| publishDate | Date | 2000-05-20 to 2000-06-23 |

Producer

| Attribute | Data type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| foaf:homepage | URI | within the namespace of the producer |
| country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
| publishDate | Date | 2000-07-20 to 2005-06-23 |
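
The country distribution, which is the same for producers, vendors, and persons, amounts to a weighted random choice. A minimal sketch, in which the names `COUNTRY_WEIGHTS` and `pick_country` are hypothetical and the country codes are copied from the table as written:

```python
import random

# Percentages copied from the country row above; they add up to 100.
COUNTRY_WEIGHTS = {
    "US": 40, "UK": 10, "JP": 10, "CN": 10,
    "DE": 5, "FR": 5, "ES": 5, "RU": 5, "KR": 5, "AT": 5,
}


def pick_country(rng):
    """Draw one country code according to the percentages above."""
    codes = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    return rng.choices(codes, weights=weights, k=1)[0]
```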

Offer

| Attribute | Data type | Range |
| --- | --- | --- |
| price | US-$ | 5 to 10,000 |
| validFrom | Date | 0-180 days before the publication date |
| validTo | Date | 7-180 days after the publication date |
| deliveryDays | Integer | 1-21 |
| offerWebpage | URI | within the namespace of the producer |
| publishDate | Date | (today - 97 days) to today |
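
The date rules for offers depend on each other: the validity interval is derived from the publication date, which in turn depends on the generation date. The following sketch makes that dependency explicit. Uniform sampling within each range and the function name `generate_offer_dates` are assumptions; the rules above do not state which distribution is used.

```python
import random
from datetime import date, timedelta


def generate_offer_dates(rng, today):
    """Derive publishDate, validFrom, validTo and deliveryDays for one offer."""
    # publishDate lies between (today - 97 days) and today.
    publish_date = today - timedelta(days=rng.randint(0, 97))
    # validFrom lies 0-180 days before the publication date.
    valid_from = publish_date - timedelta(days=rng.randint(0, 180))
    # validTo lies 7-180 days after the publication date.
    valid_to = publish_date + timedelta(days=rng.randint(7, 180))
    return {
        "publishDate": publish_date,
        "validFrom": valid_from,
        "validTo": valid_to,
        "deliveryDays": rng.randint(1, 21),
    }
```

Under these rules an offer is always valid on its publication date, because `validFrom` never lies after `publishDate` and `validTo` lies at least 7 days after it.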

Vendor

| Attribute | Data type | Range |
| --- | --- | --- |
| Label | String | 1-3 words |
| Comment | String | 20-50 words |
| foaf:homepage | URI | within the namespace of the vendor |
| country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
| publishDate | Date | 2000-09-20 to 2006-12-23 |

Person

| Attribute | Data type | Range | Note |
| --- | --- | --- | --- |
| Name | String | 2-4 words | |
| mbox_sha1sum | Literal | random SHA-1 value | email address |
| country | ISO3166 | US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% | |
| publishDate | Date | 2000-09-20 to 2006-12-23 | |

Review

| Attribute | Data type | Range | Note |
| --- | --- | --- | --- |
| Title | String | 4-15 words | |
| Text | String | 50-200 words | Language: EN 50%, JA 10%, ZH 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5% |
| Review Date | Date | random date within the last year | |
| RatingX | Integer | 1 to 10 | A review may include up to 4 types of ratings. The likelihood that a review has a rating of type X is 70%. |
| publishDate | Date | from Review Date to 2008-06-20 | |
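
The RatingX rule means that each of the four rating types is included independently with a probability of 70%, so a review can carry anywhere from zero to four ratings. A minimal sketch, assuming the ratings are independent and uniformly distributed between 1 and 10 (the rules above state only the range and the likelihood); the function name `generate_ratings` is hypothetical:

```python
import random

RATING_TYPES = 4          # rating1 .. rating4, as in the Review table
RATING_PROBABILITY = 0.7  # likelihood that a rating of type X is present


def generate_ratings(rng):
    """Return rating1..rating4, using None for a rating that is absent."""
    ratings = {}
    for x in range(1, RATING_TYPES + 1):
        present = rng.random() < RATING_PROBABILITY
        ratings[f"rating{x}"] = rng.randint(1, 10) if present else None
    return ratings
```

Absent ratings are returned as `None`, which corresponds to a NULL value in the `rating1` to `rating4` columns of the relational schema.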

Relationship

| Relationship | Cardinality | Note |
| --- | --- | --- |
| Producer-Product | 1:N | One producer per product; 50 products per producer on average |
| Product-Review | 1:N | 10 reviews per product on average; one product per review, selection follows a normal distribution |
| Product-Offer | 1:N | 20 offers per product on average; one product per offer, selection follows a normal distribution |
| Person-Review | 1:N | One author per review; 20 reviews per person on average |
| RatingSite-Review | 1:N | Every review belongs to one rating site; a rating site generates 10,000 reviews on average |
| Vendor-Offer | 1:N | Every offer belongs to one vendor; 2,000 offers per vendor on average |
| Product-ProductType | N:1 | One ProductType per product (leaves only) |
| Product-ProductFeature | M:N | 10-20 ProductFeatures per product |
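
The averages in the notes above can be turned into rough size estimates for a given scale factor. The sketch below is back-of-the-envelope arithmetic, not the generator's logic, and the function name `estimate_sizes` is hypothetical.

```python
def estimate_sizes(num_products):
    """Estimate entity counts from the per-relationship averages above."""
    offers = 20 * num_products    # 20 offers per product on average
    reviews = 10 * num_products   # 10 reviews per product on average
    return {
        "offers": offers,
        "reviews": reviews,
        "producers": round(num_products / 50),          # 50 products per producer
        "vendors": round(offers / 2000),                # 2,000 offers per vendor
        "persons": round(reviews / 20),                 # 20 reviews per person
        "rating_sites": max(1, round(reviews / 10000)), # 10,000 reviews per site
    }
```

For a scale factor of 1000, the estimates for offers and reviews match the Characteristics table exactly. The other estimates are only approximate: the table reports 22 producers, 12 vendors, and 503 persons at that scale factor.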