Fr4nz83/MAT-Dataset

Human Mobility Datasets Enriched With Contextual and Social Dimensions

This repository contains the code and documentation of the enrichment workflow used to generate the two semantically enriched trajectory datasets presented in the resource paper Human Mobility Datasets Enriched With Contextual and Social Dimensions by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

In the following, we first provide a brief description of the various Jupyter notebooks that implement our workflow (STEP 1). We suggest executing the notebooks in the order in which they appear. Once all of them have been executed according to the instructions, the workflow continues with the MAT-Builder system (STEP 2), the tool that actually enriches the trajectories semantically and creates the two final Paris and New York City datasets, both in tabular and RDF-based knowledge graph representations. The datasets can be found in our Zenodo repository.

A pre-print version of the paper Human Mobility Datasets Enriched With Contextual and Social Dimensions is available on arXiv. Stay tuned for future updates!

STEP 1: Execution of Jupyter Notebooks and Python scripts

1 - OSM NYC GPX traces downloader.ipynb: this notebook implements a multi-threaded downloader that slices a given bounding box over New York City into manageable tiles and fetches GPX trackpoints via the OSM API, complete with retry logic to handle transient failures. The resulting GPX files are organized by tile for processing by notebooks 2.1.1 and 2.1.2.
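
The tiling-plus-retry pattern can be sketched as follows. This is not the repository's actual code: the grid granularity, retry counts, and function names are illustrative assumptions, and the actual OSM API call is left abstract.

```python
import time

def tile_bbox(south, west, north, east, n_rows, n_cols):
    """Split (south, west, north, east) into n_rows x n_cols sub-boxes."""
    dlat = (north - south) / n_rows
    dlon = (east - west) / n_cols
    tiles = []
    for i in range(n_rows):
        for j in range(n_cols):
            tiles.append((south + i * dlat, west + j * dlon,
                          south + (i + 1) * dlat, west + (j + 1) * dlon))
    return tiles

def fetch_with_retry(fetch, tile, retries=3, backoff=2.0):
    """Call fetch(tile), retrying on transient failures with linear backoff."""
    for attempt in range(retries):
        try:
            return fetch(tile)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```

Each tile can then be handed to a thread pool, with `fetch` wrapping the OSM GPX trackpoints endpoint.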

2.1.1 - Multiple GPX to GeoPandas trajectory processing (NY case).ipynb: this notebook reads the GPX files produced by notebook 1, extracts both track metadata and track points, and converts them into GeoPandas dataframes. It applies data cleaning steps: fixing nonstandard tags, parsing timestamps, merging metadata, and removing duplicates. Finally, it concatenates the per-tile results into a unified Parquet dataset with consistent categorical user IDs.

2.1.2 - Concatenate dataframes multiple bounding boxes (NY case).ipynb: once all per-tile Parquet files have been generated by notebook 2.1.1, this notebook loads each of them into GeoPandas, concatenates them into a single “mega” dataframe, converts timestamps to the America/New_York timezone, and de-identifies users. It ultimately writes out a merged nyc_merged.parquet.
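
The core of the merge step can be sketched as below, assuming pandas dataframes with a tz-aware UTC 'timestamp' column and a 'user' column; the column names are assumptions, not the notebook's actual schema.

```python
import pandas as pd

def merge_tiles(frames, tz="America/New_York"):
    """Concatenate per-tile dataframes, localize timestamps, de-identify users."""
    df = pd.concat(frames, ignore_index=True)
    df["timestamp"] = df["timestamp"].dt.tz_convert(tz)
    # De-identify: replace each original user with an opaque sequential code.
    df["user"] = df["user"].astype("category").cat.codes
    return df
```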

2.2 - Single GPX to GeoPandas trajectory processing (Paris case).ipynb: this notebook preprocesses a single pre-existing GPX file containing trajectories moving around Paris. It is assumed that this GPX file has been generated with the JOSM tool. The notebook reads track metadata and points from the file in streamed chunks, to handle large file sizes without exhausting memory. It filters and renames columns, merges metadata links into user IDs, and cleans timestamp formats before converting UTC times to Europe/Paris. The final GeoPandas dataframe is saved in the Parquet format.

4 - OSM raw trajectory preprocessing.ipynb: starting from the preprocessed trajectories produced by notebook 2.1.2 or 2.2 (hence either NYC or Paris), this notebook computes per-user summaries (total observations, time spans, and sampling rates). It then filters out trajectories that fall below duration or frequency thresholds. The preprocessed trajectories are saved to a new Parquet file.
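
The per-user filtering logic can be sketched as follows, assuming a pandas dataframe with 'user' and tz-aware 'timestamp' columns. The threshold values and column names are illustrative assumptions, not the ones used in the notebook.

```python
import pandas as pd

def filter_users(df, min_hours=1.0, max_mean_gap_s=600):
    """Keep users whose data spans at least min_hours with a mean
    sampling interval of at most max_mean_gap_s seconds."""
    g = df.groupby("user")["timestamp"]
    stats = pd.DataFrame({
        "n_obs": g.count(),
        "span_h": (g.max() - g.min()).dt.total_seconds() / 3600.0,
    })
    # Mean gap between consecutive samples, guarding against single-point users.
    stats["mean_gap_s"] = stats["span_h"] * 3600.0 / (stats["n_obs"] - 1).clip(lower=1)
    keep = stats[(stats["span_h"] >= min_hours) & (stats["mean_gap_s"] <= max_mean_gap_s)].index
    return df[df["user"].isin(keep)]
```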

5 - Ensure MAT-Builder compatibility.ipynb: to prepare trajectories for the MAT-Builder pipeline, this notebook reads the Parquet file output by notebook 4, assigns a unique identifier to each user and trajectory, splits the geometry column into separate latitude and longitude fields, and drops the original geometry to match MAT-Builder’s expected schema. The notebook ultimately writes out a new Parquet file, containing a dataframe with the raw preprocessed trajectories, ready for ingestion by the MAT-Builder system.
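
The schema adaptation can be sketched as below, assuming a dataframe with 'user' and 'traj' columns and a 'geometry' column whose values expose .x/.y accessors (as shapely Points do). All column names here are assumptions, not MAT-Builder's documented schema.

```python
import pandas as pd

def to_matbuilder_schema(df):
    """Assign numeric user/trajectory IDs, split geometry into lon/lat columns."""
    out = df.copy()
    out["uid"] = out["user"].astype("category").cat.codes
    # One numeric ID per (user, trajectory) pair.
    out["tid"] = pd.factorize(out["user"].astype(str) + "_" + out["traj"].astype(str))[0]
    out["lon"] = out["geometry"].apply(lambda p: p.x)
    out["lat"] = out["geometry"].apply(lambda p: p.y)
    return out.drop(columns=["geometry"])
```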

6 - Generate dataset POI from OpenStreetMap.ipynb: this notebook uses the OSMnx Python library; it defines a function to download Points of Interest by tag within a specified bounding box, then standardizes the resulting dataframe by renaming fields, selecting essential columns, and filtering out entries without names. It fetches multiple POI categories such as amenities, shops, tourism sites, historic landmarks, and leisure spots, and saves the compiled dataset as a Parquet file, ready for ingestion by the MAT-Builder system.
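
A sketch of the two halves of this notebook follows: the download relies on OSMnx (whose function names have changed across versions; `features_from_bbox` is the OSMnx 2.x name), while the standardization is plain pandas. The column choices and renamings below are assumptions for illustration.

```python
def download_pois(bbox, tags):
    """Fetch OSM features for the given bounding box and tag filter."""
    import osmnx as ox  # optional dependency, imported lazily
    return ox.features.features_from_bbox(bbox, tags)

def standardize_pois(df):
    """Rename fields, keep essential columns, drop entries without a name."""
    out = df.rename(columns={"name": "poi_name"})
    keep = [c for c in ("poi_name", "amenity", "shop", "tourism", "geometry") if c in out.columns]
    out = out[keep]
    return out[out["poi_name"].notna()]
```

Calling `download_pois` once per category (amenities, shops, tourism, and so on) and concatenating the standardized results reproduces the shape of the compiled dataset.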

7 - Meteostat daily data downloader.ipynb: this notebook automates the retrieval of historical daily weather records for Paris and New York City from Meteostat’s bulk CSV endpoints, filtering for data post-1990 and selecting key variables like average temperature and precipitation. It handles missing values and then classifies each day’s weather based on precipitation thresholds. The cleaned, labeled weather dataset is finally exported to Parquet, ready for ingestion by the MAT-Builder system.
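
The labeling logic can be sketched as below, assuming a pandas dataframe with a daily precipitation column 'prcp' in millimeters, as in Meteostat's bulk CSV schema. The threshold values and label names are illustrative assumptions, not the ones used in the notebook.

```python
import pandas as pd

def classify_weather(df, light_mm=1.0, heavy_mm=10.0):
    """Label each day by its precipitation level."""
    out = df.copy()
    out["prcp"] = out["prcp"].fillna(0.0)  # treat missing precipitation as dry
    def label(p):
        if p < light_mm:
            return "clear"
        if p < heavy_mm:
            return "rainy"
        return "heavy_rain"
    out["weather"] = out["prcp"].apply(label)
    return out
```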

8 - Generate Posts.py: a standalone Python script that uses Meta’s meta-llama/Llama-3.3-70B-Instruct large language model to generate synthetic, realistic social media posts based on enriched stop segments that include their closest POI information. To run it, you need preprocessed trajectory data (from notebook 5) and the corresponding POI dataset (from notebook 6). We then need to anticipate some of the semantic enrichment steps by feeding both to the MAT‑Builder system: the system must first preprocess and compress the trajectories, then segment them into stops and moves, and finally augment each stop with the POIs located within 50 meters of its centroid. This ultimately generates a file named enriched_occasional.parquet, which is the input to 8 - Generate Posts.py. The script constructs positive and negative social media post prompts for each stop using the POI name and category, generates posts in batches with retry logic for short outputs, cleans the results, merges them with the original stop metadata, and exports them to CSV, Parquet, and Excel formats.
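
The prompt-construction step can be sketched as follows. The templates and field names here are hypothetical and do not reproduce the script's actual prompts; the model call itself is omitted.

```python
def build_post_prompts(poi_name, category):
    """Build a positive and a negative post prompt for one enriched stop."""
    base = (f"You are a social media user who just stopped at {poi_name}, "
            f"a {category}. Write one short, realistic post about the visit.")
    positive = base + " The experience was enjoyable; keep an upbeat tone."
    negative = base + " The experience was disappointing; keep a critical tone."
    return positive, negative
```

Each pair of prompts would then be sent to the model in batches, with generations below a minimum length retried.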

9 - Prepare social media dataset for MAT-Builder.ipynb: In this final notebook, the social media metadata returned by 8 - Generate Posts.py is further processed to generate the final synthetic social media post dataset, stored in a Parquet file ready for ingestion by the MAT-Builder system.

STEP 2: semantic enrichment with the MAT-Builder system

Once all the notebooks belonging to STEP 1 have been executed, we have all the ingredients needed to generate the two datasets of semantically enriched trajectories with the MAT-Builder system. Details on how to install and execute MAT-Builder are provided in its GitHub repository. The operations we conducted with MAT-Builder are as follows:

  • Preprocessing step: this step takes as input the raw GPS trajectory datasets generated by notebook 5 in STEP 1. In this step, we:

    1. filter out trajectories with fewer than 2 samples;
    2. filter out noisy samples inducing velocities above 300 km/h;
    3. compress the trajectories so that all points within a radius of 20 meters from a given initial point are collapsed into a single point with the median coordinates of those points and the timestamp of the initial point.

    At the end of this step, MAT-Builder generates a Parquet file named traj_cleaned.parquet, representing the datasets of raw GPS trajectories that have been further preprocessed by MAT-Builder. For more information on the content of this file, please refer to our Zenodo repository.
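
The compression rule described above can be sketched as a runnable function: scan each trajectory in time order and collapse every run of points lying within `radius_m` of the run's first point into a single point with the median coordinates and the first point's timestamp. This is an illustrative reimplementation, not MAT-Builder's code, and it uses a flat-earth distance approximation for brevity.

```python
import math
import statistics

def compress(points, radius_m=20.0):
    """points: list of (lat, lon, t) tuples sorted by t."""
    def dist_m(a, b):
        klat = 111_320.0                            # meters per degree of latitude
        klon = klat * math.cos(math.radians(a[0]))  # degree of longitude shrinks with latitude
        return math.hypot((a[0] - b[0]) * klat, (a[1] - b[1]) * klon)

    out, i = [], 0
    while i < len(points):
        anchor, j = points[i], i
        while j < len(points) and dist_m(anchor, points[j]) <= radius_m:
            j += 1
        run = points[i:j]
        out.append((statistics.median(p[0] for p in run),
                    statistics.median(p[1] for p in run),
                    anchor[2]))                     # keep the initial point's time
        i = j
    return out
```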

  • Segmentation step: this step takes as input a dataset of raw GPS trajectories -- in our case, we provide the ones obtained from MAT-Builder's preprocessing step -- and partitions the trajectories into stop and move segments, where a stop segment represents a moving object staying at some place for some time, while a move segment represents a moving object transitioning from one stop to another. The parameters we used for segmentation are:

    1. the minimum duration of a stop is set to 10 minutes;
    2. the maximum spatial radius of a stop is set to 200 meters.

    At the end of this step, MAT-Builder generates two Parquet files named stops.parquet and moves.parquet, representing the stop and move segments that have been detected by the system. For more information on the content of these two files, please refer to our Zenodo repository.

  • Enrichment step: finally, the enrichment module enriches the segmented trajectories with all the semantic dimensions it supports: regularity (the distinction between occasional and systematic stops), moves (with transportation means estimation), weather, and social media. To this end, the enrichment step takes as input various datasets:

    1. a Parquet file containing a dataset of raw GPS trajectories, e.g., traj_cleaned.parquet;
    2. a Parquet file containing a dataset of stop segments detected from the trajectories, e.g., stops.parquet;
    3. a Parquet file containing a dataset of move segments detected from the trajectories, e.g., moves.parquet;
    4. a Parquet file containing a dataset of POIs, e.g., pois.parquet;
    5. a Parquet file containing a dataset of weather conditions, e.g., weather_conditions.parquet;
    6. a Parquet file containing a dataset of social media posts, e.g., social_paris.parquet.

    The two parameters regulating systematic and occasional stop detection are those of DBSCAN. The first, corresponding to DBSCAN's epsilon, determines the distance below which two stops are considered neighbours; we empirically set it to 50 meters. The second, corresponding to DBSCAN's minPts parameter, determines the number of neighbouring stops a stop must have to be considered a core point; we empirically set it to 5. Occasional and systematic stops are both augmented with the POIs found within 50 meters of their centroids.
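
A simplified, runnable sketch of the regularity rule follows: a stop counts as systematic when it satisfies DBSCAN's core-point condition, i.e., it has at least min_pts neighbouring stops within eps meters. This is only an approximation of the full DBSCAN used by MAT-Builder (border-point handling is omitted), and stop centroids are assumed to already be in metric coordinates.

```python
import numpy as np

def label_stops(xy, eps=50.0, min_pts=5):
    """xy: (n, 2) array of stop centroids in meters. Returns an array of
    'systematic'/'occasional' labels, one per stop."""
    xy = np.asarray(xy, dtype=float)
    # Pairwise Euclidean distances between stop centroids.
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    n_neighbours = (d <= eps).sum(axis=1) - 1  # exclude the stop itself
    return np.where(n_neighbours >= min_pts, "systematic", "occasional")
```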

Ultimately, the enrichment step outputs a dataset of semantically enriched trajectories in two different formats: tabular and RDF-based knowledge graph. More information on these two representations is provided in our Zenodo repository and, for what specifically concerns the RDF-based knowledge graph representation, in the IEEE Access MAT-Builder paper.

Examples of querying our RDF-based knowledge graphs datasets

In the SPARQL folder we provide:

  • a few examples of SPARQL queries that can be used to query our RDF-based knowledge graph datasets. The examples have been tested with GraphDB, a popular triplestore; other triplestores are expected to work as well.
  • the files containing the customized STEPv2 ontology, which defines the internal structure of our knowledge graphs; these might be interesting for readers who want to better understand how we structured information within the KGs. The files can be opened with the open-source ontology editor Protégé. For more details on the ontology we used, please also have a look at our IEEE Access MAT-Builder paper and MAT-Builder's GitHub repository.

Cite us

Please cite our arXiv preprint if you have found our contributions useful or have used them in your work.

@misc{pugliese2025humanmobilitydatasetsenriched,
      title={Human Mobility Datasets Enriched With Contextual and Social Dimensions}, 
      author={Chiara Pugliese and Francesco Lettich and Guido Rocchietti and Chiara Renso and Fabio Pinelli},
      year={2025},
      eprint={2510.02333},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.02333}, 
}
