Synthetic Dataset Generator

Synthetic Dataset Generator is part of the Sprouts Project. It takes the Amazon product datasets compiled by Julian McAuley and enriches them with customer and order data. It generates JSON files in BSON format for MongoDB.

Python dependencies

Python version is 2.7

faker - fake data generator
bson - library used for generating BSON from a Python dictionary and vice versa

Strategy

These scripts are prepared to generate synthetic data from the "Grocery and Gourmet Foods" dataset, provided by Julian McAuley, by following this strategy:

Get the grocery reviews dataset and generate BSON ObjectIds for customers, items and reviews. This creates the reviews.json file.
Get the reviews.json file and generate a customer for each unique referenced customer. The customer data is randomly generated by the faker library. This creates the customer.json file.
Get the product metadata dataset and fix it by formatting it as JSON. This creates the items_fixed.json file.
Get the items_fixed.json and reviews.json files and update each item to assign its item id. This creates the items.json file.
Get the reviews.json and items.json files and create the orders. For each review, it creates an order for the reviewed item and, with a probability applied to each related item, several orders with the related items.
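
The strategy can be sketched roughly as follows. This is a minimal illustration of steps 1, 3 and 5, not the code in Main.py or FixingJSON.py. It assumes the review records carry the reviewerID and asin fields of the Amazon datasets and that the metadata lines are Python-style literals with single quotes. The function names, the customer_id, item_id and item_ids fields, and the related_by_asin mapping are all hypothetical. new_object_id is a standard-library stand-in for the ObjectId type of the bson library, and step 5 follows only one possible reading: related items are added to the same order, whereas the original scripts may create separate orders.

```python
import ast
import binascii
import json
import os
import random


def new_object_id():
    # Stand-in for a BSON ObjectId: 12 random bytes as 24 hex characters.
    return binascii.hexlify(os.urandom(12)).decode("ascii")


def assign_ids(reviews):
    # Step 1: reuse one generated id per distinct customer and per distinct item.
    customer_ids, item_ids = {}, {}
    result = []
    for review in reviews:
        reviewer, asin = review["reviewerID"], review["asin"]
        if reviewer not in customer_ids:
            customer_ids[reviewer] = new_object_id()
        if asin not in item_ids:
            item_ids[asin] = new_object_id()
        enriched = dict(review)
        enriched.update(_id=new_object_id(),
                        customer_id=customer_ids[reviewer],
                        item_id=item_ids[asin])
        result.append(enriched)
    return result, customer_ids, item_ids


def fix_metadata_line(line):
    # Step 3: read a Python-literal record (single quotes) and emit strict JSON.
    return json.dumps(ast.literal_eval(line))


def create_orders(reviews, related_by_asin, item_ids, probability, rng):
    # Step 5 (one reading): one order per review; each related item that has
    # an assigned id is added to that order with the given probability.
    orders = []
    for review in reviews:
        ordered = [review["item_id"]]
        for related_asin in related_by_asin.get(review["asin"], []):
            if related_asin in item_ids and rng.random() < probability:
                ordered.append(item_ids[related_asin])
        orders.append({"_id": new_object_id(),
                       "customer_id": review["customer_id"],
                       "item_ids": ordered})
    return orders


# Tiny made-up input, only to show the shape of the data.
raw_reviews = [
    {"reviewerID": "A1", "asin": "B001"},
    {"reviewerID": "A1", "asin": "B002"},
    {"reviewerID": "A2", "asin": "B001"},
]
reviews, customer_ids, item_ids = assign_ids(raw_reviews)
related_by_asin = {"B001": ["B002"], "B002": []}
orders = create_orders(reviews, related_by_asin, item_ids, 0.5, random.Random(42))
print(json.dumps(orders, indent=2))
```

Passing a seeded random.Random keeps the selection of related items repeatable between runs, although the generated ids still differ each time. In a real run the ids would come from the bson library and the customer fields from faker.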

TODO

Adapt the code to process the digital music dataset
Refactor the code

Credits to the provider of the Amazon reviews and product metadata datasets

Julian McAuley

Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.

Inferring networks of substitutable and complementary products. J. McAuley, R. Pandey, J. Leskovec. Knowledge Discovery and Data Mining, 2015.

