This repository holds scripts, maps, machine learning models, and other data visualizations related to an ongoing project at the New Jersey Institute of Technology (NJIT) to model multiple aspects of the spotted lanternfly (Lycorma delicatula) infestation in the contiguous United States. It began as a single machine learning model for predicting the year-by-year natural propagation of the species across counties and states and has since grown to hold the whole project: research deliverables, documentation of the models, and proposals and literature review documents by the NJIT undergraduate student Erica Keklak.
This original research began in the summer of 2025 as the project formally named "Predictive model of the spread of the spotted lanternfly in the continental United States using machine learning." The fellowship, and therefore updates to the repository, conclude on July 24, 2025, but the repository will remain available indefinitely. This phase of research involved processing historical data on the number of reported observations of the spotted lanternfly and using classification-based machine learning techniques to predict the propagation of the invasion in certain areas of the continental United States. It builds on prior work by other research groups to document and predict the spread of spotted lanternflies, using multiple categories of machine learning models to develop predictions. While the full write-up of this project and the deliverables for the fellowship should be available in this repository, the research has not yet been published in any journal.
You can simply fork the repository and pick out the components you'd like to build upon; nothing in the Deliverables subdirectory contains code. DataRaw will only contain a limited amount of the source data manipulated into the training data found in DataModified alongside intermediate data that has been processed but is not training data. Export-restricted raw data is only available to peer reviewers and U.S. nationals and is not available on this repository.
Navigating this repository (see next section) will let you look at the raw and modified data used in model development, the main public-facing content and easy-to-digest text that adds context to the contributors' research in invasive species propagation, and the scripts used to train the models and predict future invasive establishments in the contiguous 48 United States. We do not have a web or mobile app showing the models (a website may be coming soon), but running the scripts in your own environment will allow you to reproduce the results of this research.
The main contributors are not professional developers, so there are no guarantees that the code and data will make immediate sense to someone who uses high-level programming languages to process data or develop machine learning models. If you try to replicate the results and the programs fail to read the data, change the data paths in the scripts to match where the data is stored on your own device. If you cannot find the path to the data you need, the documentation in the Deliverables folder should help you. If this doesn't work, .
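As a minimal sketch of the path fix described above, the snippet below keeps the repository location in one variable so that only one line needs to change per machine. `REPO_ROOT`, `data_path`, `require`, and the file name `training_data.csv` are hypothetical names used for illustration; the actual scripts may organize their paths differently.

```python
from pathlib import Path

# Hypothetical location of your local clone; change this one line to match your device.
REPO_ROOT = Path.home() / "slf-invasion-modeling"


def data_path(*parts, root=REPO_ROOT):
    """Build a path inside the repository from folder and file names."""
    return Path(root).joinpath(*parts)


def require(path):
    """Return the path if it exists; otherwise fail with a readable message."""
    if not path.exists():
        raise FileNotFoundError(
            f"Expected data at {path}; update REPO_ROOT to match your clone."
        )
    return path


# Hypothetical training file inside the DataModified folder.
training_file = data_path("DataModified", "training_data.csv")
print(training_file)
```

Calling `require(training_file)` before reading the file turns a confusing read error into a message that names the exact path being tried.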
Note: this repository is going to undergo major changes to the structure of its subdirectories soon! Navigating the repository might be difficult after the changes are made, but there will be an explainer file uploaded to the repository following the changes.
The original slf-invasion-modeling repository is structured differently from the standard research directory because it is not self-hosted:
- DataModified: data that was processed in a script, exported in a Python-compatible file format, and then stored here
- DataRaw: files and subfolders of raw data collected from free, public, and/or open sources that were modified, sent to the DataModified folder, and used to create models and data visualizations; data dictionaries for some raw data are included. Climatological and traffic data are not available here.
  - Ecological: observational (sightings) and abundance data provided mainly by (with data limited to 2014 to 2024, inclusive) but also by the United States Department of Agriculture
  - Geometry: map boundary, area, and locational data provided by a variety of sources; some geometry files are not available.
  - 4SLFObservations: some observational data of L. delicatula that were extracted shortly before the first models were fitted; these will be merged into the Ecological subdirectory soon
- Deliverables: contains copies of deliverables submitted to the URI team as well as major "non-programmed" parts of the research process, including notes. There is also a document detailing a template data management plan; this research does not follow the template but is modeled after it, and the template is provided for future reference.
  - 2025 Proposal Era: deliverables submitted before acceptance into the 2025 URI program. It contains a
  - 2026 Propsal Era: deliverables submitted before any decisions were made about 2026 research programs that are calling for proposals.
  - During Research: deliverables the contributors had to submit for program requirements following entry into the program. Many are visually interesting, and
- Scripts: contains programs that this project used in the development of its results
  - Old: files that were used in the project but are not necessary for final model and visualization development; some larger files had to be omitted
  - Final: all files necessary for model and visualization development except for data; if this folder is empty for some reason, see the contents in Old
- Visualizations: contains images showing the results of this research project, mostly as static images
  - MapScreenshots: static images of maps; to be uploaded
  - MapInteractive: interactive versions of maps, if possible; to be uploaded
  - NonMaps: any data visualizations that are not displayed geospatially, as in points or marks on a map of the lower 48 states; to be uploaded
- .gitignore: the default template containing when cloning the repository; this one uses the template GitHub provided for the language
- LICENSE: details of the license used to protect this research while giving the general public the ability to use this repository in almost any way they'd like
- README.md: this file
The primary investigator ran training, validation, and test data in Jupyter Notebook using Python 3, with some early data processing in R using a fork of included as a small section in this repository. 2025-era versions of NumPy, pandas, and GeoPandas were necessary for data processing and computations. Matplotlib, Seaborn, and Shapely were helpful in visualizing data points and trends. Model development beyond data cleaning and pre-processing required scikit-learn (sklearn) for decision tree modeling, Keras for neural networks, and TensorFlow for understanding model development fundamentals. While MaxEnt was not formally used for this project, the primary investigator ran some data through it.
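Before running the scripts, it can help to confirm which of the Python packages named above are installed. The following is a rough sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are assumptions made for illustration and are not part of the project's scripts.

```python
from importlib import metadata

# Distribution names as published on PyPI, mirroring the packages named above.
PACKAGES = [
    "numpy", "pandas", "geopandas",
    "matplotlib", "seaborn", "shapely",
    "scikit-learn", "keras", "tensorflow",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14}{version or 'not installed'}")
```

Any package reported as not installed would need to be added to your environment before the corresponding scripts can run.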
Most residents of the United States understand that they depend on plant products to survive, and the accidental introduction of spotted lanternflies to the United States has demonstrated both existing and potential harm to the production of plant-based food, wood, and paper products and to the perceived beauty and health of ornamental plants. If the invasion continues, spotted lanternflies may well become permanently established unless they evolve into a new species or become locally extinct, and a permanent pest puts additional strain on agriculture industry workers' profits and on the availability of locally produced plant-based resources. Residents should know that a lack of preventative and active measures against the spotted lanternfly can worsen the problem and possibly contribute to plant-based resource scarcity at times like these, when certain resources are already scarce.
If contributors write reports based on the model results for each county and state and send them to members of the general public, public awareness of the spotted lanternfly problem in the United States will increase, and those with decision-making authority (legislators, agricultural managers, plant conservation organizations, pest control services, and more) will be able to make informed business and lawmaking decisions, since the spotted lanternfly affects many aspects of the economy. People outside of these groups still benefit from "spotted lanternfly awareness," which gives them the knowledge and permission to remove spotted lanternflies from their property and to take action to prevent spotted lanternflies from establishing themselves there.
Hosting a GitHub repository is not mandatory for participation in this year's Undergraduate Research and Innovation (URI) program, but it is extremely helpful for publishing data and results when programs like these only publish abstracts (to be published in the ) and presentations. This project is not likely to make it to an actual journal unless the momentum continues and its participants get the opportunity to continue their research.
The main contributors to this repository understand that adding a would increase the ease of use for other people who want to run and use the models created in this project, but they are currently deliberating on whether or not it is necessary for this project or if it would provide a sufficient benefit to people who view or fork this repository. They are looking into whether or not they would have to change the license of the repository to reflect the intended uses of each model so that they can use this models feature.
No, GitHub Copilot was not used to script any of the programs, gather or modify data, make repository suggestions, or otherwise suggest or change aspects of the original repository. The main contributors to the project are unsure whether GitHub Copilot can improve the model-building experience for this specific application of machine learning, although those who fork the repository are welcome to use GitHub Copilot to "help" them on this and other projects.
, the environment where Copilot is used, was not involved in model building.
The main contributors to this project have yet to review the data protection and other policies required of Hugging Face users, but they understand the potential benefits of making the modified datasets and models available to Hugging Face users, who tend to be more skilled at machine learning usage and development. There is no guarantee that the main contributors will upload the contents of this repository to Hugging Face, but if the repository license permits it, anyone may do so.
While some other URI participants proposed making an app as part of their project, this project was accepted without a proposal for an app. The value this project brings does not fit the scope of a web or mobile app, especially since several research projects (both academic- and government-led) by other experts are already working to increase our knowledge of L. delicatula and of how the problems the species poses can be resolved. Since robust invasive species reporting apps already exist, there is no need for this project to make one when its focus is on machine learning model development.
If you would like to help out as a citizen scientist by gathering data on the spotted lanternfly, and you have timestamped photographs that you or a friend took of any trace of a spotted lanternfly (eggs, nymphs, adults, molts, honeydew, etc.), you may upload them as observations to iNaturalist (not sponsored). There are several mobile apps by the iNaturalist curators and developers, and they have a
as well as other repositories for their
and
.
Classification modeling is the process of using existing data, which may or may not contain a target class for each data point, to determine which classes apply to similar data when it is processed through the model. Most classification modeling assigns one class per data point, e.g., a person's marital status as "Never Married," "Currently Married," or "Formerly Married." This contrasts with assigning multiple classes per data point, like tags (for an image file: "Contains locational metadata," "Contains at least one black/#000000 pixel," "Has an author," etc.). Classification matters to most people because applying labels to things is helpful, especially when the right label is not easy to determine. All the machine learning models in this project use data manipulated to contain up to five target classes at a time, and each existing data point and prediction has only one class.
"General spread risk modeling" is the main type of model used in this project: each class relates to the expected change in the number of L. delicatula present in an area from the current year to the next, regardless of specific host plants. "Purpose-specific risk modeling" comprises three types of classification models with different goals, each aligned with the risk that the expected change in L. delicatula abundance from the current year to the next poses of damage and fungal infection to host plants found in each area; purpose-specific risk modeling uses almost all the same features and data as general spread risk modeling but has additional features relating to the types and abundance of local host plants. To keep the process simple, the original contributors tagged each known or expected host plant as "able to provide food to humans" (the category is broad to encompass fruit, vegetable, spice, syrup, and other food production), "able to provide wood to humans" (this category is also broad so it can encompass all forms of timber and paper production), and "ornamental" (including landscaping and bonsai plants). The risk class for a county or state is based not on the risk of invasion to all possible species of that type but only on the species that are found in that area. Not all host plants known from literature and observation fit any of these tags, but some fit more than one tag and are considered for more than one purpose-specific risk model. This keeps the classification method used in this project as fair and as easy as possible for a non-academic person to understand.
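The two ideas above can be sketched in a few lines of Python: a single class chosen from five for the expected year-over-year change, and host plants that carry zero, one, or several purpose tags. This is an illustration only. The class names, the thresholds, the placeholder plants `host_a` to `host_d`, and the function names are all assumptions, not the project's actual classes or tagging.

```python
PURPOSES = ("food", "wood", "ornamental")

# Placeholder host plants: a plant may carry several tags or none at all.
HOST_TAGS = {
    "host_a": {"food"},
    "host_b": {"food", "wood"},
    "host_c": {"wood", "ornamental"},
    "host_d": set(),
}


def spread_risk_class(current_count, expected_next_count):
    """Assign exactly one of five hypothetical classes to the expected change."""
    change = expected_next_count - current_count
    if change < 0:
        return "decrease"
    if change == 0:
        return "no change"
    if change <= 10:  # hypothetical threshold
        return "small increase"
    if change <= 100:  # hypothetical threshold
        return "moderate increase"
    return "large increase"


def hosts_for_purpose(local_hosts, purpose, host_tags=HOST_TAGS):
    """Keep only the hosts found in the area that carry the given purpose tag."""
    return sorted(h for h in local_hosts if purpose in host_tags.get(h, set()))


# Hosts found in one hypothetical county; host_a does not grow there.
county_hosts = ["host_b", "host_c", "host_d"]
for purpose in PURPOSES:
    print(purpose, hosts_for_purpose(county_hosts, purpose))

print(spread_risk_class(12, 40))
```

In this sketch `host_b` counts toward both the food and wood models, `host_d` counts toward none, and `host_a` is ignored because it is not found in the county, which mirrors the rule that a risk class depends only on the species present in the area.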
The main citation list/references file is in . The proposal and active citation list is available for review in the Deliverables folder and may span several files. Some files have annotations about the reasons why sources were used and whether or not they have been cited at certain stages of the project's development.
If you would like to contribute to the development of this project, you can:
- report bugs and security issues to any of the listed contributors, or try to resolve them yourself: (1) fork this repository, (2) clone the fork, (3) make the changes that you would like to see in the next release, (4) commit and push, and (5) open a pull request and wait for it to be merged. Severe security issues should be reported ASAP.
- contribute features to the Python and Jupyter Notebook scripts using the same methods as with bug/security reports
- make forks that expand the purposes of this project to fit other locations where L. delicatula may be found next, use the scripts as a template for your own local invasive species (you may need data beyond that which is in this project), etc.
If you are a subject matter expert (SME) in spotted lanternfly biology and behavior and are not yet affiliated with (not sponsored), feel free to add your publication to the
. This repository's contributors read and cite these publications often!
Funding for this project was provided by the NJIT Provost Undergraduate Research and Innovation (URI) program and the . We are very thankful that they were able to provide us with the opportunity to perform compensated research for a good cause. Further funding for 2026 is pending, and we are thankful for any and all future support.
For inquiries related to the repository that don't need an issue or pull request, please contact Erica Keklak via .