This repository contains the code and data to reproduce the experiments from the paper "Analysis and Performance Evaluation of Machine Learning Techniques for Product Matching." The study investigates various machine learning techniques applied to product matching through a systematic literature review and experiments using datasets from the WDC Product Data Corpus and Magellan Data Repository.
The repository offers implementations of methods from six key studies reviewed in the paper, including fine-tuning pre-trained language models and additional optimizations.
- Code: Implementations of machine learning techniques from the reviewed studies.
- Data: Links to the datasets used in the experiments, including subsets from the WDC Product Data Corpus and Magellan Data Repository.
- Experiments: Scripts and configurations to replicate the experiments and results presented in the study.
The evaluated methods are implemented based on the reviewed studies:
- Deep Entity Matching with Pre-Trained Language Models (2020)
- Intermediate Training of BERT for Product Matching (2020)
- Dual-Objective Fine-Tuning of BERT for Entity Matching (2021)
- Multilingual Transformers for Product Matching – Experiments and a New Benchmark in Polish (2022)
- Supervised Contrastive Learning for Product Matching (2022)
- Entity Resolution with Hierarchical Graph Attention Networks (2022)
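Several of the studies above (e.g., Deep Entity Matching with Pre-Trained Language Models) cast product matching as sequence-pair classification: each record is serialized into a token sequence, and a language model is fine-tuned to classify a pair of serialized records as match or non-match. The sketch below illustrates only the serialization step in plain Python; the helper names and example records are illustrative, not taken from this repository or its datasets.

```python
def serialize(record: dict) -> str:
    """Flatten a product record into a [COL]/[VAL] token sequence,
    the kind of input format used by entity matchers built on
    pre-trained language models."""
    return " ".join(f"[COL] {attr} [VAL] {value}" for attr, value in record.items())

def serialize_pair(left: dict, right: dict) -> str:
    """Join two serialized records with [SEP]; a language model is
    then fine-tuned to classify the pair as match / non-match."""
    return f"{serialize(left)} [SEP] {serialize(right)}"

# Illustrative records (not from the actual experiment datasets)
a = {"title": "Logitech MX Master 3 Mouse", "brand": "Logitech"}
b = {"title": "MX Master 3 Wireless Mouse", "brand": "Logitech"}
print(serialize_pair(a, b))
```

The serialized pair is what would be tokenized and passed to the fine-tuned model; the actual implementations in each subdirectory handle tokenization, training, and evaluation.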
To clone this repository, run:
```
git clone https://github.com/edfvalim/ml-product-matching
```
To run the experiments, navigate to the subdirectory for the implementation of interest. Detailed instructions for installing dependencies, downloading datasets, and executing the scripts are provided in each subdirectory's README file.
The code in this repository is licensed under the MIT License. However, some subdirectories contain code distributed under other licenses, such as the BSD License and the Apache License 2.0. Please refer to the license files in those subdirectories for details.
This repository includes code and data from various studies. We acknowledge the original authors for their contributions and licenses.