Skip to content

Improvements to island detection and performance of OMF import#117

Merged
yamilbknsu merged 24 commits intomainfrom
yep/lakewood_tests
Mar 19, 2026
Merged

Improvements to island detection and performance of OMF import#117
yamilbknsu merged 24 commits intomainfrom
yep/lakewood_tests

Conversation

@yamilbknsu
Copy link
Collaborator

In this PR I'm addressing the issues encountered during a national run of the OMF import pipeline. In particular, this means:

  1. Changing the predicate of the extent matching right after download from contains to intersects. This slightly reduces the accuracy of the filter but allows for much faster operation.
  2. Implementing a process to remove orphan vertices after the island detection algorithm.

For point 2, an important assumption is that island detection is performed with all the edge lists together (something we want to expand in the future #116) and that we use double adjacency for component search. These assumptions mean that any vertex adjacent to an edge marked as "island" is necessarily an orphan vertex in the clean graph and that significantly reduces the complexity of the algorithm because we can just look for all the vertices that are adjacent to island edges to determine the complete list of vertices to remove.

I'm including a small tree generated after running an import over Lakewood and bambam from the generated output as evidence this is working correctly:

(red lines are isochrones, purple lines are drive traversals and yellow/green are walk/bike traversal (not sure which is which at the moment))

c252f597-e6cc-4b64-9bea-b24f8ac5d2d2

@yamilbknsu
Copy link
Collaborator Author

I'm keeping in mind the need to handle the potential conflicts with the other PRs

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the OMF import pipeline’s runtime characteristics and graph cleanliness for large (“national run”) imports by widening the spatial extent predicate for segment filtering and introducing vertex cleanup after island-edge removal.

Changes:

  • Switched extent filtering from contains to intersects (with a bbox short-circuit) to speed up segment filtering.
  • Added a vertex-remapping step to remove vertices associated with detected island edges and to renumber remaining vertices/edges.
  • Increased use of Rayon parallel iterators across several graph/vector operations for throughput improvements.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
rust/bambam-omf/src/graph/serialize_ops.rs Adds vertex remapping and updates edge-list cleanup to remap vertex IDs and re-index edges; introduces additional parallel iteration.
rust/bambam-omf/src/graph/omf_graph.rs Integrates island detection post-processing by applying vertex remapping and cleaning edge lists; adds step-level logging.
rust/bambam-omf/src/app/network.rs Updates extent application to segments using intersects (bbox + geometry) and changes connector filtering strategy to keep only connectors referenced by retained segments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +380 to +385
.par_iter()
.enumerate()
.filter_map(|(idx, edge)| mask[idx].then_some(*edge))
.collect::<Vec<Edge>>()
// correct indices after some edges were removed
.collect::<Vec<_>>()
.into_iter()
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

par_iter().enumerate().filter_map(...).collect::<Vec<_>>() does not guarantee preserving the original order (the iterator is no longer indexed after filter_map). If order changes here, the subsequent enumerate() will assign new edge IDs in a non-deterministic order and can also desynchronize edges vs. geometries/classes/speeds/bearings/omf_segment_ids which are filtered similarly below. Consider using a stable-order approach (e.g., collect a Vec<Option<T>> in parallel and then flatten() sequentially), or keep this cleanup step sequential to preserve alignment.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot I will change this par_iter to iter. Strictly based on official rayon documentation, can we conclude this is the only iterator implemented in this pipeline that would not correctly maintain order of the original vector?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robfitzgerald bringing this to your attention. I keep finding evidence that par_iter should maintain order, but when it may have unexpected behaviors like this (when using enumerate), do you think we should just avoid it in these order critical operations?

I'm curious to see copilots research here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. this is a concern i've raised before, too. typically in these scenarios, you have to enumerate then scatter:

my_collection
  .into_iter()
  .enumerate()
  .into_par_iter()
  .map() ...

and then on the tail end, you're presented with the problem of re-sorting the result.

i read what the feedback was from copilot though. there is one par iterator that does maintain order, and the problem was filter_map, right? from the other PR:

Rayon's filter_map produces an unindexed ParallelIterator, which — per official Rayon docs — does not guarantee order preservation (only IndexedParallelIterator does)

since that could change the size of the collection, which breaks whatever mechanism they are using under-the-hood to reconstruct ordering. perhaps do the "map" with this indexed par iter, and then perform filter synchronously, after you have gathered the parallel result? filter is a pretty cheap operation; it's mostly whatever is in the .map that has the performance implications.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this makes sense. I'll go with the assumption that into_par_iter().enumerate() is a no-go. I'm not super concerned with the linear scans that we do to filter in these cases so I'll just remove the par_ to avoid headaches. I looked for other instances and we should be good since we don't use enumerate in other places.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI commented Mar 12, 2026

@yamilbknsu I've opened a new pull request, #118, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Contributor

Copilot AI commented Mar 12, 2026

@yamilbknsu I've opened a new pull request, #119, to work on those changes. Once the pull request is ready, I'll request review from you.

@yamilbknsu
Copy link
Collaborator Author

@robfitzgerald Alright, I have patched the problematic iterators to separate the filter and map steps. I have run the before and after and compared the sha sum of the output and it matches. This also produces valid bambam output after adjusting to the new aggregate opportunities output format, so it should be ready for your review!

Copy link
Collaborator

@robfitzgerald robfitzgerald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good progress. i have a concern about the final set of VertexIds stored in the Edges, could you read and reply to that? if that concern isn't correct, then i will go ahead and accept, but if the concern is valid, then let's find the minimal requirement to address the concern so we can wrap this up! thanks Yamil.

}

// Compute and apply vertex remapping
let vertex_remapping = compute_vertex_remapping(&vertices, &edge_lists, &island_edges)?;
Copy link
Collaborator

@robfitzgerald robfitzgerald Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have loaded all segments and connectors and built Edges and Vertices from them. here we drop island vertices and reassign a VertexId sequentially. this invalidates the previous list of vertex ids which was used to assign src/dst VertexIds in each Edge, but, i don't see a spot after this where those identifiers are updated accordingly.

i have been wondering, couldn't all of this island generation + pre-processing take place over the collection of SegmentSplits? with those, we haven't yet invented the VertexIds/EdgeIds/EdgeListIds. if we can remove SegmentSplits due to island detection, and then we build the final Vertices/Edges (and their associated ids), then we don't need to apply this kind of correction. edit: don't let this part of the comment send you down a rabbit hole; this could be a tech debt issue instead.

@yamilbknsu
Copy link
Collaborator Author

@robfitzgerald As discussed above and IRL, I have refactored the visit_edges function and created several issues for the tech debt milestone (they are all related):

#116: Island detection for individual edge lists
#123: Change the vertex removal if we modify the weak connectivity logic
#124: Refactor island detection to be performed at segment-connector level instead of fully qualified edge-vertex level

Also, to address the comment above, the vertex_remapping is applied to the edge lists in the clean_omf_list function. Another sign that a refactor would be beneficial.

The output of this import at a national scale is currently running on kestrel 🤞

Copy link
Collaborator

@robfitzgerald robfitzgerald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! will 🙏 for our national dataset generation result. i just went ahead and clicked the "merge" button to merge in the GTFS Flex changes, sorry for going over the line there, it was impulsive! i'm giving this 👍 so when it passes CI it's good to merge. thanks Yamil, and thanks again for adding those tech debt issues as well.

@yamilbknsu yamilbknsu merged commit cd745b1 into main Mar 19, 2026
1 check passed
@yamilbknsu
Copy link
Collaborator Author

i just went ahead and clicked the "merge"

I'll make sure to get my revenge some day 😈

@robfitzgerald robfitzgerald deleted the yep/lakewood_tests branch March 20, 2026 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants