🗺️ GER-LLM is a novel framework for Geospatial Entity Resolution using Large Language Models. The core challenge in Geospatial Entity Resolution (GER) is to accurately identify whether different textual descriptions or records refer to the same real-world geographical entity. To tackle this, GER-LLM leverages the powerful semantic understanding and reasoning capabilities of LLMs. Our framework first generates high-quality candidate pairs from geospatial datasets. Then, it employs a sophisticated LLM-based featurizer to extract rich semantic, spatial, and contextual cues from entity descriptions. Finally, a lightweight yet effective matching model determines the final correspondence. Extensive experiments on benchmark datasets demonstrate that GER-LLM achieves state-of-the-art performance in both efficiency and effectiveness. This repository hosts the official source code for GER-LLM.
The overall framework of GER-LLM is illustrated below:
The GER-LLM pipeline executes in three primary stages:
- Perform AOI-aware spatial blocking to generate candidate pairs.
- Use group-wise matching to jointly assess candidate groups with an LLM.
- Apply a graph-based mechanism to resolve conflicts and ensure global consistency.
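To make the third stage more concrete, here is a minimal sketch of one possible graph-based conflict resolution rule: treat predicted matches as weighted edges of a bipartite graph and greedily keep the strongest edge for each entity. The one-to-one constraint, the `resolve_conflicts` name, and the `(left_id, right_id, score)` input format are assumptions made for illustration; they are not taken from `Conflict_Resolution.py`, which may work differently.

```python
from typing import Iterable, List, Tuple

# Hypothetical edge format: (id in source A, id in source B, match score).
Match = Tuple[str, str, float]


def resolve_conflicts(matches: Iterable[Match]) -> List[Match]:
    """Keep the highest-scoring edges so that each entity is matched at most once."""
    used_left, used_right = set(), set()
    kept = []
    # Visit edges from most to least confident; ids break ties deterministically.
    for left, right, score in sorted(matches, key=lambda m: (-m[2], m[0], m[1])):
        if left in used_left or right in used_right:
            continue  # conflicts with a stronger edge that was already kept
        used_left.add(left)
        used_right.add(right)
        kept.append((left, right, score))
    return kept


if __name__ == "__main__":
    predicted = [
        ("a1", "b1", 0.9),
        ("a1", "b2", 0.6),
        ("a2", "b1", 0.7),
        ("a2", "b3", 0.5),
    ]
    print(resolve_conflicts(predicted))
```

In this toy input, `a1` and `b1` each appear in two predicted matches; the greedy pass keeps `a1`-`b1` and then `a2`-`b3`, dropping the two edges that would reuse an already matched entity.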
- python==3.8.18
- aiohttp==3.10.11
- hdbscan==0.8.39
- numpy==1.24.4
- openai==1.93.0
- pandas==2.0.3
- python-dotenv==1.1.1
- python_Levenshtein==0.27.1
- PyYAML==6.0.2
- scikit_learn==1.3.2
- scipy==1.9.3
- torch==2.1.1
- transformers==4.46.3
- wget==3.2
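To check whether an existing environment matches the pinned versions above, a short script along these lines may help. It is a sketch only: `check_pins` is a hypothetical helper, not part of the repository, and it relies on `importlib.metadata` from the standard library (available from Python 3.8). The `python` entry is compared against the running interpreter because it is not a pip package.

```python
import sys
from importlib import metadata


def check_pins(pins):
    """Return {name: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        if name == "python":
            # The interpreter version is not a pip distribution.
            found = ".".join(str(part) for part in sys.version_info[:3])
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None  # the package is not installed
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # A subset of the pins listed above; extend as needed.
    pins = {
        "python": "3.8.18",
        "numpy": "1.24.4",
        "pandas": "2.0.3",
        "torch": "2.1.1",
        "transformers": "4.46.3",
    }
    for name, (expected, found) in check_pins(pins).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty output means every checked pin matches the active environment.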
+---GER-LLM
| +---Blocking
| | +---outputs
| | | hz_candidate_pairs.pkl
| | | nj_candidate_pairs.pkl
| | | pit_candidate_pairs.pkl
| | +---processed_data
| | | +---hz
| | | | hz_poi_id2se.pkl
| | | +---nj
| | | | nj_poi_id2se.pkl
| | | \---pit
| | | pit_poi_id2se.pkl
| | \---src
| | +---AOI_classification
| | | config.py
| | | functions.py
| | | model_functions.py
| | | models.py
| | | refinement.py
| | | saved_models
| | | train_AOI_Classifier.py
| | | train.py
| | | main_blocking.py
| | \---tools
| | dbscan.py
| | Quadtree.py
| | SE.py
| | utils.py
| +---data
| | +---hz
| | | | aoi_107.csv
| | | | dp_poi_2959.csv
| | | | gd_poi_1982.csv
| | | | set_ground_truth_808.pkl
| | | \---hz
| | | test.csv
| | | train.csv
| | | valid.csv
| | +---nj
| | | | aoi_180.csv
| | | | dp_poi_12176.csv
| | | | mt_poi_828.csv
| | | | set_ground_truth_411.pkl
| | | \---nj
| | | test.csv
| | | train.csv
| | | valid.csv
| | \---pit
| | | aoi_181.csv
| | | fsq_poi_2474.csv
| | | osm_poi_2383.csv
| | | set_ground_truth_1237.pkl
| | \---pit
| | test.csv
| | train.csv
| | valid.csv
| +---figure
| | framework.png
| | logo.png
| +---Matching
| | +---batches_data
| | | hz_batches.pkl
| | | nj_batches.pkl
| | | pit_batches.pkl
| | +---outputs
| | \---src
| | | Batch_Prompting.py
| | | Conflict_Resolution.py
| | | Feature_Extractor.py
| | | Interaction_with_LLM.py
| | | main_matching.py
| | | Pair_Batching.py
| | | Pair_Clustering.py
| | | Performance_Measure.py
| | \---tools
| | model.py
| | utils.py
| |
| | README.md
| | requirements.txt
To reproduce the main results for the GER-LLM pipeline, follow the steps below. The process is divided into two main stages: (1) generating candidate pairs via spatial blocking and (2) performing entity matching with the LLM.
Step 1 (spatial blocking) processes the raw POI data to generate high-quality candidate pairs for matching. It first classifies the AOIs, then runs the blocking algorithm.
- Run AOI Classification. This step trains a model to understand the functional areas of interest for the given city data.
  - From the repository root, navigate to the AOI classification directory:
      cd Blocking/src/AOI_classification
  - Execute the training script. The following command trains a model for the Nanjing (nj) dataset:
      python train_AOI_Classifier.py \
        --city nj \
        --fe bert \
        --lr 3e-5 \
        --alpha 2.0 \
        --beta 1.0 \
        --n_epochs 10 \
        --batch_size 32 \
        --max_len 128 \
        --device cuda \
        --save_model
  - The trained models will be saved in the Blocking/src/AOI_classification/saved_models/ directory.
- Generate Candidate Pairs. Using the classified AOIs, this step runs the quadtree splitting algorithm to produce the final candidate pairs file.
  - From the repository root, navigate to the main blocking directory:
      cd Blocking/src
  - Run the main blocking script:
      python main_blocking.py --city nj
  - The generated candidate pairs (e.g., nj_candidate_pairs.pkl) will be saved in the Blocking/outputs/ directory. This file is required for the next stage.
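For readers unfamiliar with quadtree-based blocking, the sketch below shows the general idea: split the map recursively until each cell holds only a few entities, then pair entities from different sources that share a cell. It is a generic illustration, not the code in `Quadtree.py` or `main_blocking.py`, and it ignores the AOI information the real pipeline uses. The point format, the `capacity` and `max_depth` parameters, and both function names are assumptions.

```python
from itertools import product

# Hypothetical point format: (source, entity id, x, y).


def quadtree_leaves(points, bounds, capacity=4, max_depth=12):
    """Split `bounds` into quadrants until a cell holds at most `capacity` points."""
    if len(points) <= capacity or max_depth == 0:
        return [points] if points else []
    x0, y0, x1, y1 = bounds
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    # Keys are (is_right_half, is_top_half).
    quadrants = {(False, False): [], (False, True): [], (True, False): [], (True, True): []}
    for p in points:
        quadrants[(p[2] >= xm, p[3] >= ym)].append(p)
    leaves = []
    for (right, top), members in quadrants.items():
        cell = (xm if right else x0, ym if top else y0, x1 if right else xm, y1 if top else ym)
        leaves.extend(quadtree_leaves(members, cell, capacity, max_depth - 1))
    return leaves


def candidate_pairs(points, bounds, capacity=4):
    """Pair entities from different sources that fall into the same leaf cell."""
    pairs = set()
    for leaf in quadtree_leaves(points, bounds, capacity):
        for a, b in product(leaf, leaf):
            if a[0] < b[0]:  # cross-source only, one orientation per pair
                pairs.add((a[1], b[1]))
    return pairs


if __name__ == "__main__":
    pts = [
        ("A", "a1", 0.10, 0.10), ("B", "b1", 0.12, 0.11),
        ("A", "a2", 0.90, 0.90), ("B", "b2", 0.88, 0.91),
        ("A", "a3", 0.50, 0.10),
    ]
    print(sorted(candidate_pairs(pts, (0.0, 0.0, 1.0, 1.0), capacity=2)))
```

The `max_depth` guard stops the recursion when many entities share identical coordinates. A plain quadtree like this one can separate two nearby entities that sit on opposite sides of a cell border, which is one reason a production blocker needs more than this sketch.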
Step 2 (entity matching) is the final stage: the LLM assesses the candidate pairs generated in Step 1 to produce the final entity resolution results.
- Run the Matching Pipeline.
  - From the repository root, navigate to the matching directory:
      cd Matching/src
  - Execute the main matching script. This command runs the entire pipeline, including feature extraction, clustering, group-wise prompting, and conflict resolution:
      python main_matching.py \
        --city nj \
        --feature_strategy PROP_BASED \
        --clustering_method hdbscan \
        --batch_strategy diverse \
        --llm DeepSeek-V3
- Check the Results. The final matching results and logs will be saved in the Matching/outputs/ directory. Make sure this folder exists before running to avoid a 'file not found' error.
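Since the output folder must exist before the matching script runs, one way to create it safely is the small helper below. `ensure_output_dir` is a hypothetical name, not part of the repository, and the default `repo_root="."` assumes the script is run from the repository root.

```python
from pathlib import Path


def ensure_output_dir(repo_root="."):
    """Create Matching/outputs under `repo_root` if it is missing and return its path."""
    out_dir = Path(repo_root) / "Matching" / "outputs"
    # parents=True creates missing parent folders; exist_ok=True makes reruns harmless.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


if __name__ == "__main__":
    print(ensure_output_dir())
```

Calling it more than once is safe, so it can run at the start of every experiment.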
- Geospatial entities in Nanjing:
- Geospatial entities in Hangzhou:
- Geospatial entities in Pittsburgh:
  - osm_poi_2383.csv: entities collected from OpenStreetMap.
  - fsq_poi_2474.csv: entities collected from Foursquare.
  - aoi_181.csv: AOIs extracted from the entities above.
  - set_ground_truth_1237.pkl: the ground truth.
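To get a first look at the dataset files, a sketch like the following prints the CSV column names and the size of the ground-truth object. The column names and the exact structure of the pickle are not documented here, so the script only inspects them rather than assuming a schema. The `data/pit` path assumes the script runs from the repository root, UTF-8 encoding is an assumption, and both helper names are hypothetical.

```python
import csv
import pickle
from pathlib import Path


def csv_columns(path):
    """Return the header row of a CSV file without reading the rest of it."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def load_pickle(path):
    """Unpickle a file. Only use this on files you trust: pickle can execute code."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    data_dir = Path("data/pit")  # assumed location relative to the repository root
    if data_dir.is_dir():
        for name in ("osm_poi_2383.csv", "fsq_poi_2474.csv", "aoi_181.csv"):
            print(name, csv_columns(data_dir / name))
        truth = load_pickle(data_dir / "set_ground_truth_1237.pkl")
        size = len(truth) if hasattr(truth, "__len__") else "unknown"
        print(type(truth).__name__, size)
    else:
        print(f"{data_dir} not found; run this from the repository root.")
```

If a pickle was written from pandas or another library, that library must be importable in the environment for `pickle.load` to succeed.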

