GitHub - luck-seu/GER-LLM: the code of 'GER-LLM: Efficient and Effective Geospatial Entity Resolution with Large Language Model'

🗺️ GER-LLM: Efficient and Effective Geospatial Entity Resolution with Large Language Model

GER-LLM is a framework for Geospatial Entity Resolution (GER) built on Large Language Models (LLMs). GER is the task of deciding whether different textual descriptions or records refer to the same real-world geographical entity. To tackle it, GER-LLM leverages the semantic understanding and reasoning capabilities of LLMs. The framework first generates high-quality candidate pairs from geospatial datasets. It then employs an LLM-based featurizer to extract semantic, spatial, and contextual cues from entity descriptions. Finally, a lightweight matching model determines the final correspondence. Extensive experiments on benchmark datasets demonstrate that GER-LLM achieves state-of-the-art performance in both efficiency and effectiveness. This repository hosts the official source code for GER-LLM.

Overview

The overall framework of GER-LLM is illustrated in figure/framework.png.

The GER-LLM pipeline executes in three primary stages:

1. Perform AOI-aware (area of interest) spatial blocking to generate candidate pairs.
2. Use group-wise matching to jointly assess candidate groups with an LLM.
3. Apply a graph-based mechanism to resolve conflicts and ensure global consistency.
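
To make the third stage more concrete, here is a minimal sketch of one possible graph-based conflict resolution. It assumes a one-to-one constraint (each entity in one source matches at most one entity in the other) and that every predicted match carries a confidence score. The function name `resolve_conflicts`, the tuple layout, and the greedy strategy are all illustrative assumptions; this is not the logic of Conflict_Resolution.py.

```python
from typing import List, Tuple

# (left_id, right_id, confidence); a hypothetical format for predicted matches
Match = Tuple[str, str, float]


def resolve_conflicts(matches: List[Match]) -> List[Tuple[str, str]]:
    """Greedy one-to-one selection on the bipartite match graph.

    Edges are visited from highest to lowest confidence; an edge is kept
    only if neither of its endpoints has already been assigned.
    """
    used_left = set()
    used_right = set()
    kept = []
    for left, right, _score in sorted(matches, key=lambda m: -m[2]):
        if left in used_left or right in used_right:
            continue  # conflicts with a stronger match that was already kept
        used_left.add(left)
        used_right.add(right)
        kept.append((left, right))
    return kept


# Made-up IDs and scores, for illustration only
predicted = [
    ("dp_1", "mt_7", 0.95),
    ("dp_1", "mt_8", 0.60),
    ("dp_2", "mt_7", 0.70),
    ("dp_2", "mt_9", 0.65),
]
resolved = resolve_conflicts(predicted)
print(resolved)
```

In this toy input, `dp_1` and `mt_7` each appear in two predicted matches; the greedy pass keeps the strongest edge for each and drops the rest. A greedy pass is simple but not guaranteed to maximize total confidence.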

📋 Requirements

python==3.8.18
aiohttp==3.10.11
hdbscan==0.8.39
numpy==1.24.4
openai==1.93.0
pandas==2.0.3
python-dotenv==1.1.1
python_Levenshtein==0.27.1
PyYAML==6.0.2
scikit_learn==1.3.2
scipy==1.9.3
torch==2.1.1
transformers==4.46.3
wget==3.2
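
Because the pins above are exact, a quick check of the active environment can save a failed run later. The following is a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `parse_pins` and `check_pins` are hypothetical and not part of this repository.

```python
from importlib import metadata  # standard library since Python 3.8


def parse_pins(text: str) -> dict:
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins: dict) -> dict:
    """Return {name: (pinned, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, pinned in pins.items():
        if name == "python":  # interpreter version, not a pip distribution
            continue
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# A few of the pins listed above; extend as needed
PINS = parse_pins("""
python==3.8.18
numpy==1.24.4
torch==2.1.1
""")

for name, (pinned, installed) in check_pins(PINS).items():
    print(f"{name}: pinned {pinned}, installed {installed or 'missing'}")
```

An empty output means every listed package matches its pin.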

🚀 Quick Start

Code Structure

+---GER-LLM
|   +---Blocking
|   |   +---outputs
|   |   |       hz_candidate_pairs.pkl
|   |   |       nj_candidate_pairs.pkl
|   |   |       pit_candidate_pairs.pkl
|   |   +---processed_data
|   |   |   +---hz
|   |   |   |       hz_poi_id2se.pkl
|   |   |   +---nj
|   |   |   |       nj_poi_id2se.pkl
|   |   |   \---pit
|   |   |           pit_poi_id2se.pkl
|   |   \---src
|   |       +---AOI_classification
|   |       |       config.py
|   |       |       functions.py
|   |       |       model_functions.py
|   |       |       models.py
|   |       |       refinement.py
|   |       |       saved_models
|   |       |       train_AOI_Classifier.py
|   |       |       train.py
|   |       |   main_blocking.py
|   |       \---tools
|   |               dbscan.py
|   |               Quadtree.py
|   |               SE.py
|   |               utils.py
|   +---data
|   |   +---hz
|   |   |   |   aoi_107.csv
|   |   |   |   dp_poi_2959.csv
|   |   |   |   gd_poi_1982.csv
|   |   |   |   set_ground_truth_808.pkl
|   |   |   \---hz
|   |   |           test.csv
|   |   |           train.csv
|   |   |           valid.csv
|   |   +---nj
|   |   |   |   aoi_180.csv
|   |   |   |   dp_poi_12176.csv
|   |   |   |   mt_poi_828.csv
|   |   |   |   set_ground_truth_411.pkl
|   |   |   \---nj
|   |   |           test.csv
|   |   |           train.csv
|   |   |           valid.csv
|   |   \---pit
|   |       |   aoi_181.csv
|   |       |   fsq_poi_2474.csv
|   |       |   osm_poi_2383.csv
|   |       |   set_ground_truth_1237.pkl
|   |       \---pit
|   |               test.csv
|   |               train.csv
|   |               valid.csv
|   +---figure
|   |       framework.png
|   |       logo.png
|   +---Matching
|   |   +---batches_data
|   |   |       hz_batches.pkl
|   |   |       nj_batches.pkl
|   |   |       pit_batches.pkl
|   |   +---outputs
|   |   \---src
|   |       |   Batch_Prompting.py
|   |       |   Conflict_Resolution.py
|   |       |   Feature_Extractor.py
|   |       |   Interaction_with_LLM.py
|   |       |   main_matching.py
|   |       |   Pair_Batching.py
|   |       |   Pair_Clustering.py
|   |       |   Performance_Measure.py
|   |       \---tools
|   |               model.py
|   |               utils.py
|   |
|   |   README.md
|   |   requirements.txt

Reproducing the Main Results

To reproduce the main results for the GER-LLM pipeline, please follow the steps below. The process is divided into two main stages: (1) Generating candidate pairs via spatial blocking and (2) Performing entity matching with the LLM.

Step 1: AOI-aware Spatial Blocking

This stage processes the raw POI data to generate high-quality candidate pairs for matching. It involves classifying AOIs first, then running the blocking algorithm.

Run AOI Classification. This step trains a model to understand the functional areas of interest for the given city data.
- Navigate to the AOI classification directory:
```
cd Blocking/src/AOI_classification
```
- Execute the training script. The following command trains a model for the Nanjing (nj) dataset:
```
python train_AOI_Classifier.py \
  --city nj \
  --fe bert \
  --lr 3e-5 \
  --alpha 2.0 \
  --beta 1.0 \
  --n_epochs 10 \
  --batch_size 32 \
  --max_len 128 \
  --device cuda \
  --save_model
```
- The trained models will be saved in the Blocking/src/AOI_classification/saved_models/ directory.

Generate Candidate Pairs. Using the classified AOIs, this step runs the quadtree splitting algorithm to produce the final candidate pairs file.
- Return to the repository root, then navigate to the main blocking directory:
```
cd Blocking/src
```
- Run the main blocking script:
```
python main_blocking.py --city nj
```
- The generated candidate pairs (e.g., nj_candidate_pairs.pkl) will be saved in the Blocking/outputs/ directory. This file is required for the next stage.
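
To illustrate the general idea behind quadtree-based blocking, the sketch below recursively splits a bounding box into quadrants until each cell holds only a few points, then pairs POIs from the two sources that fall into the same leaf cell. It is a generic, simplified illustration: the record layout, the function names, and the `capacity` parameter are assumptions, the AOI-aware part is not modeled, and it does not reproduce Quadtree.py or main_blocking.py.

```python
from itertools import product
from typing import List, Tuple

# (source, poi_id, x, y); a hypothetical record layout
Point = Tuple[str, str, float, float]
Bounds = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def quadtree_blocks(points: List[Point], bounds: Bounds,
                    capacity: int = 4, max_depth: int = 12) -> List[List[Point]]:
    """Split `bounds` into quadrants until a cell holds at most `capacity` points."""
    if len(points) <= capacity or max_depth == 0:
        return [points] if points else []
    x0, y0, x1, y1 = bounds
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    cells = {
        (0, 0): (x0, y0, xm, ym), (0, 1): (x0, ym, xm, y1),
        (1, 0): (xm, y0, x1, ym), (1, 1): (xm, ym, x1, y1),
    }
    quadrants = {key: [] for key in cells}
    for p in points:
        quadrants[(int(p[2] >= xm), int(p[3] >= ym))].append(p)
    blocks = []
    for key, members in quadrants.items():
        blocks.extend(quadtree_blocks(members, cells[key], capacity, max_depth - 1))
    return blocks


def candidate_pairs(blocks, left_source: str, right_source: str) -> set:
    """Pair every left-source POI with every right-source POI in the same leaf cell."""
    pairs = set()
    for block in blocks:
        left = [p[1] for p in block if p[0] == left_source]
        right = [p[1] for p in block if p[0] == right_source]
        pairs.update(product(left, right))
    return pairs


# Made-up coordinates in a unit square, for illustration only
pois = [
    ("dp", "a", 0.10, 0.10), ("mt", "b", 0.12, 0.11),
    ("dp", "c", 0.90, 0.90), ("mt", "d", 0.88, 0.91),
    ("dp", "e", 0.50, 0.10),
]
blocks = quadtree_blocks(pois, (0.0, 0.0, 1.0, 1.0), capacity=2)
pairs = candidate_pairs(blocks, "dp", "mt")
print(sorted(pairs))
```

One known limitation of this plain version is that two nearby points on opposite sides of a cell border end up in different blocks and are never paired.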

Step 2: Group-wise Matching with LLM

This is the final stage where the LLM assesses the candidate pairs generated in Step 1 to produce the final entity resolution results.
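
As a rough illustration of group-wise prompting, the sketch below packs several candidate pairs into one prompt and maps a line-based answer back to the individual pairs. The prompt wording, the `name` and `address` fields, and both function names are hypothetical; the actual prompts live in Batch_Prompting.py and may look quite different. The call to the model itself is deliberately omitted.

```python
from typing import Dict, List, Optional, Tuple

# Two entity records per pair; the field names are assumptions
Pair = Tuple[Dict[str, str], Dict[str, str]]


def build_group_prompt(pairs: List[Pair]) -> str:
    """Render all pairs of one group into a single numbered prompt."""
    lines = [
        "Decide for each numbered pair whether both records describe the same place.",
        "Answer with one line per pair in the form '<number>: yes' or '<number>: no'.",
        "",
    ]
    for i, (a, b) in enumerate(pairs, start=1):
        lines.append(f"{i}. A: {a['name']} | {a['address']}  B: {b['name']} | {b['address']}")
    return "\n".join(lines)


def parse_group_answer(answer: str, n_pairs: int) -> List[Optional[bool]]:
    """Map '<number>: yes/no' lines back to pair order; None marks a missing answer."""
    decisions: List[Optional[bool]] = [None] * n_pairs
    for line in answer.splitlines():
        number, sep, verdict = line.partition(":")
        number = number.strip()
        if not sep or not number.isdecimal():
            continue  # not an answer line
        idx = int(number) - 1
        verdict = verdict.strip().lower()
        if 0 <= idx < n_pairs and verdict in ("yes", "no"):
            decisions[idx] = verdict == "yes"
    return decisions


# Made-up records, for illustration only
group = [
    ({"name": "Blue Cafe", "address": "1 Main St"}, {"name": "Blue Café", "address": "1 Main Street"}),
    ({"name": "Blue Cafe", "address": "1 Main St"}, {"name": "Red Diner", "address": "9 Oak Ave"}),
]
prompt = build_group_prompt(group)
print(prompt)
print(parse_group_answer("1: yes\n2: no", len(group)))
```

Returning `None` for unanswered pairs keeps a malformed reply from being silently counted as a non-match.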

Run the Matching Pipeline.
- Return to the repository root, then navigate to the matching directory:
```
cd Matching/src
```
- Execute the main matching script. This command runs the entire pipeline including feature extraction, clustering, group-wise prompting, and conflict resolution:
```
python main_matching.py \
  --city nj \
  --feature_strategy PROP_BASED \
  --clustering_method hdbscan \
  --batch_strategy diverse \
  --llm DeepSeek-V3
```
Check the Results. The final matching results and logs will be saved in the Matching/outputs/ directory. Please ensure this folder exists before running to avoid a 'file not found' error.
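
One way to make sure the output folder exists is a short standard-library helper run from the repository root before starting the pipeline. The helper name `ensure_output_dir` is hypothetical; only the Matching/outputs path comes from this README.

```python
from pathlib import Path


def ensure_output_dir(repo_root: str = ".") -> Path:
    """Create Matching/outputs under repo_root if it is missing and return its path."""
    out_dir = Path(repo_root) / "Matching" / "outputs"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir


if __name__ == "__main__":
    print(f"Results will be written to {ensure_output_dir('.')}")
```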

💾 Dataset

Geospatial entities in Nanjing:
- dp_poi_12176.csv: entities collected from Dianping.
- mt_poi_828.csv: entities collected from Meituan.
- aoi_180.csv: AOIs extracted from the entities above.
- set_ground_truth_411.pkl: the ground truth.

Geospatial entities in Hangzhou:
- gd_poi_1982.csv: entities collected from Amap.
- dp_poi_2959.csv: entities collected from Dianping.
- aoi_107.csv: AOIs extracted from the entities above.
- set_ground_truth_808.pkl: the ground truth.

Geospatial entities in Pittsburgh:
- osm_poi_2383.csv: entities collected from OpenStreetMap.
- fsq_poi_2474.csv: entities collected from Foursquare.
- aoi_181.csv: AOIs extracted from the entities above.
- set_ground_truth_1237.pkl: the ground truth.
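
For a quick sanity check of a downloaded copy, the sketch below counts the rows of each CSV and compares them with the number at the end of the file name. That the numeric suffix is a record count, and that each CSV has a single header row, are assumptions inferred from the naming pattern and are not stated by the authors. The helper names are hypothetical. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import csv
import pickle
import re
from pathlib import Path


def expected_count(path: Path):
    """Return the trailing integer in a name such as 'dp_poi_12176.csv', or None."""
    match = re.search(r"_(\d+)$", path.stem)
    return int(match.group(1)) if match else None


def count_csv_rows(path: Path) -> int:
    """Count data rows, assuming the first row is a header (an assumption)."""
    with path.open(newline="", encoding="utf-8") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)


def load_ground_truth(path: Path):
    """Unpickle a ground-truth file; the object type is whatever the authors stored."""
    with path.open("rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    city_dir = Path("data/nj")  # run from the repository root
    for csv_path in sorted(city_dir.glob("*.csv")):
        print(csv_path.name, "rows:", count_csv_rows(csv_path),
              "name suggests:", expected_count(csv_path))
    truth = load_ground_truth(city_dir / "set_ground_truth_411.pkl")
    print("ground truth object:", type(truth).__name__)
```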

