Synthetic Dataset Generator is part of the Sprouts Project. It takes the Amazon product data datasets published by Julian McAuley and enriches them with customer and order data. It generates JSON files in BSON format for MongoDB.
Python version is 2.7
- faker - fake data generator
- bson - library used for generating BSON from Python dictionaries and vice versa
These scripts generate synthetic data from the "Grocery and Gourmet Foods" dataset provided by Julian McAuley, following this strategy:
- Read the grocery reviews dataset and generate BSON ObjectIds for customers, items and reviews. This creates the reviews.json file.
- Read reviews.json and generate a customer for each unique customer it references. The customer data is randomly generated by the faker library. This creates the customer.json file.
- Read the product metadata dataset and fix it by formatting it as JSON. This creates the items_fixed.json file.
- Read items_fixed.json and reviews.json and update each item to assign its item id. This creates the items.json file.
- Read reviews.json and items.json and create the orders. For each review, an order is created for the reviewed item, and each related item is added to it with a given probability. Several further orders are also created with the related items.
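The steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual scripts. It assumes the review records carry the dataset's `reviewerID` and `asin` fields and that each metadata line is a Python dict literal rather than strict JSON. The helper names (`new_id`, `fix_metadata_line`, `assign_ids`, `build_orders`), the output field names (`customer_id`, `item_id`, `items`) and the `probability` parameter are hypothetical. `new_id` is a stand-in for a BSON ObjectId so the sketch runs without the bson library, and only the "one order per review" part of the last step is shown.

```python
import ast
import binascii
import json
import os
import random


def new_id():
    """Return 24 hex characters (12 random bytes), a stand-in for an ObjectId."""
    return binascii.hexlify(os.urandom(12)).decode("ascii")


def fix_metadata_line(line):
    """Convert one metadata line (assumed to be a Python dict literal) to JSON."""
    return json.dumps(ast.literal_eval(line))


def assign_ids(reviews):
    """Give every review an id and reuse one id per distinct customer and item."""
    customer_ids, item_ids = {}, {}
    tagged = []
    for review in reviews:
        customer_id = customer_ids.setdefault(review["reviewerID"], new_id())
        item_id = item_ids.setdefault(review["asin"], new_id())
        tagged.append(dict(review, _id=new_id(),
                           customer_id=customer_id, item_id=item_id))
    return tagged, customer_ids, item_ids


def build_orders(reviews, related_by_asin, probability, rng=random):
    """Create one order per review: the reviewed item plus each related
    item, where a related item is included with the given probability."""
    orders = []
    for review in reviews:
        items = [review["asin"]]
        for asin in related_by_asin.get(review["asin"], []):
            if rng.random() < probability:
                items.append(asin)
        orders.append({"_id": new_id(),
                       "customer_id": review["customer_id"],
                       "items": items})
    return orders
```

A possible use: call `assign_ids` on the parsed reviews, build `related_by_asin` from the fixed metadata, then pass both to `build_orders` with a probability such as 0.3. Passing a seeded `random.Random` instance as `rng` makes the generated orders reproducible.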
- Adapt the code to process the digital music dataset
- Refactor the code
Julian McAuley
Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
Inferring networks of substitutable and complementary products. J. McAuley, R. Pandey, J. Leskovec. Knowledge Discovery and Data Mining, 2015.