Improvements to island detection and performance of OMF import #117

yamilbknsu merged 24 commits into main from
Conversation
…ere referenced by edges but were filtered out
I'm keeping in mind the need to handle the potential conflicts with the other PRs.
Pull request overview
This PR improves the OMF import pipeline’s runtime characteristics and graph cleanliness for large (“national run”) imports by widening the spatial extent predicate for segment filtering and introducing vertex cleanup after island-edge removal.
Changes:
- Switched extent filtering from `contains` to `intersects` (with a bbox short-circuit) to speed up segment filtering.
- Added a vertex-remapping step to remove vertices associated with detected island edges and to renumber remaining vertices/edges.
- Increased use of Rayon parallel iterators across several graph/vector operations for throughput improvements.
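The bbox short-circuit mentioned in the first change can be sketched roughly as follows. This is an illustration only, with a hypothetical `BBox` type and a closure standing in for the exact geometry test; the real implementation presumably uses the project's own geometry types:

```rust
/// Axis-aligned bounding box (hypothetical type, for illustration only).
#[derive(Clone, Copy)]
struct BBox {
    min_x: f64,
    min_y: f64,
    max_x: f64,
    max_y: f64,
}

impl BBox {
    /// Cheap rejection test: two boxes intersect iff they overlap on both axes.
    fn intersects(&self, other: &BBox) -> bool {
        self.min_x <= other.max_x
            && self.max_x >= other.min_x
            && self.min_y <= other.max_y
            && self.max_y >= other.min_y
    }
}

/// Keep a segment only if its bbox intersects the extent's bbox; the
/// (expensive) exact geometry intersection test runs only after the
/// cheap bbox test passes.
fn passes_extent_filter(
    extent_bbox: &BBox,
    segment_bbox: &BBox,
    exact_test: impl Fn() -> bool,
) -> bool {
    // bbox short-circuit: if the boxes don't overlap, the geometries can't.
    segment_bbox.intersects(extent_bbox) && exact_test()
}

fn main() {
    let extent = BBox { min_x: 0.0, min_y: 0.0, max_x: 10.0, max_y: 10.0 };
    let near = BBox { min_x: 9.0, min_y: 9.0, max_x: 12.0, max_y: 12.0 };
    let far = BBox { min_x: 20.0, min_y: 20.0, max_x: 30.0, max_y: 30.0 };
    println!("{}", passes_extent_filter(&extent, &near, || true)); // true
    println!("{}", passes_extent_filter(&extent, &far, || true)); // false: rejected by bbox alone
}
```

The short-circuit matters because the bbox overlap test is a handful of comparisons, while an exact segment/polygon intersection test walks the full geometry.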
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| rust/bambam-omf/src/graph/serialize_ops.rs | Adds vertex remapping and updates edge-list cleanup to remap vertex IDs and re-index edges; introduces additional parallel iteration. |
| rust/bambam-omf/src/graph/omf_graph.rs | Integrates island detection post-processing by applying vertex remapping and cleaning edge lists; adds step-level logging. |
| rust/bambam-omf/src/app/network.rs | Updates extent application to segments using intersects (bbox + geometry) and changes connector filtering strategy to keep only connectors referenced by retained segments. |
```rust
.par_iter()
.enumerate()
.filter_map(|(idx, edge)| mask[idx].then_some(*edge))
.collect::<Vec<Edge>>()
// correct indices after some edges were removed
.collect::<Vec<_>>()
.into_iter()
```
par_iter().enumerate().filter_map(...).collect::<Vec<_>>() does not guarantee preserving the original order (the iterator is no longer indexed after filter_map). If order changes here, the subsequent enumerate() will assign new edge IDs in a non-deterministic order and can also desynchronize edges vs. geometries/classes/speeds/bearings/omf_segment_ids which are filtered similarly below. Consider using a stable-order approach (e.g., collect a Vec<Option<T>> in parallel and then flatten() sequentially), or keep this cleanup step sequential to preserve alignment.
@copilot I will change this par_iter to iter. Strictly based on official rayon documentation, can we conclude this is the only iterator implemented in this pipeline that would not correctly maintain order of the original vector?
@robfitzgerald bringing this to your attention. I keep finding evidence that par_iter should maintain order, but when it can have unexpected behaviors like this (when combined with enumerate and filter_map), do you think we should just avoid it in these order-critical operations?
I'm curious to see copilots research here.
yes. this is a concern i've raised before, too. typically in these scenarios, you have to enumerate then scatter:
```rust
my_collection
    .into_iter()
    .enumerate()
    .into_par_iter()
    .map(...)
```

...and then on the tail end, you're presented with the problem of re-sorting the result.
i read what the feedback was from copilot though. there is one par iterator that does maintain order, and the problem was filter_map, right? from the other PR:
> Rayon's filter_map produces an unindexed ParallelIterator, which — per official Rayon docs — does not guarantee order preservation (only IndexedParallelIterator does)
since that could change the size of the collection, which breaks whatever mechanism they are using under-the-hood to reconstruct ordering. perhaps do the "map" with this indexed par iter, and then perform filter synchronously, after you have gathered the parallel result? filter is a pretty cheap operation; it's mostly whatever is in the .map that has the performance implications.
Ok this makes sense. I'll go with the assumption that into_par_iter().enumerate() is a no-go. I'm not super concerned with the linear scans that we do to filter in these cases so I'll just remove the par_ to avoid headaches. I looked for other instances and we should be good since we don't use enumerate in other places.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@yamilbknsu I've opened a new pull request, #118, to work on those changes. Once the pull request is ready, I'll request review from you.
@yamilbknsu I've opened a new pull request, #119, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: yamilbknsu <1139432+yamilbknsu@users.noreply.github.com>
Fix stale `vertex_lookup` after island detection vertex remapping
@robfitzgerald Alright, I have patched the problematic iterators to separate the filter and map steps. I have run the before and after and compared the sha sum of the output and it matches. This also produces valid bambam output after adjusting to the new aggregate opportunities output format, so it should be ready for your review!
robfitzgerald left a comment
good progress. i have a concern about the final set of VertexIds stored in the Edges, could you read and reply to that? if that concern isn't correct, then i will go ahead and accept, but if the concern is valid, then let's find the minimal requirement to address the concern so we can wrap this up! thanks Yamil.
```rust
// Compute and apply vertex remapping
let vertex_remapping = compute_vertex_remapping(&vertices, &edge_lists, &island_edges)?;
```
we have loaded all segments and connectors and built Edges and Vertices from them. here we drop island vertices and reassign a VertexId sequentially. this invalidates the previous list of vertex ids which was used to assign src/dst VertexIds in each Edge, but, i don't see a spot after this where those identifiers are updated accordingly.
i have been wondering, couldn't all of this island generation + pre-processing take place over the collection of SegmentSplits? with those, we haven't yet invented the VertexIds/EdgeIds/EdgeListIds. if we can remove SegmentSplits due to island detection, and then we build the final Vertices/Edges (and their associated ids), then we don't need to apply this kind of correction. edit: don't let this part of the comment send you down a rabbit hole; this could be a tech debt issue instead.
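The correction this concern asks about (rewriting each Edge's src/dst through the old-to-new vertex map after renumbering) could be sketched as follows. Types and names here are hypothetical simplifications; the real `compute_vertex_remapping` has a different signature:

```rust
use std::collections::HashMap;

// Hypothetical simplified types, for illustration only.
type VertexId = usize;

#[derive(Debug, PartialEq)]
struct Edge {
    src: VertexId,
    dst: VertexId,
}

/// Build an old-id -> new-id map over the vertices that survive island
/// removal, assigning new ids sequentially in surviving order.
fn remapping(surviving: &[VertexId]) -> HashMap<VertexId, VertexId> {
    surviving
        .iter()
        .enumerate()
        .map(|(new_id, old_id)| (*old_id, new_id))
        .collect()
}

/// Rewrite every edge's endpoints through the map; this is the follow-up
/// step the review comment is asking for after vertices are renumbered.
/// Panics if an edge still references a removed vertex.
fn remap_edges(edges: &mut [Edge], map: &HashMap<VertexId, VertexId>) {
    for e in edges.iter_mut() {
        e.src = map[&e.src];
        e.dst = map[&e.dst];
    }
}

fn main() {
    // vertices 0..5 originally; vertex 2 removed as an island orphan
    let map = remapping(&[0, 1, 3, 4]);
    let mut edges = vec![Edge { src: 0, dst: 3 }, Edge { src: 4, dst: 1 }];
    remap_edges(&mut edges, &map);
    println!("{:?}", edges);
}
```

Without this pass, surviving edges would keep src/dst ids from the old numbering, silently pointing at the wrong (or nonexistent) vertices.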
@robfitzgerald As discussed above and IRL, I have refactored this and filed #116: Island detection for individual edge lists. Also, to address the comment above, the output of this import at a national scale is currently running on kestrel 🤞
robfitzgerald left a comment
sounds good! will 🙏 for our national dataset generation result. i just went ahead and clicked the "merge" button to merge in the GTFS Flex changes, sorry for going over the line there, it was impulsive! i'm giving this 👍 so when it passes CI it's good to merge. thanks Yamil, and thanks again for adding those tech debt issues as well.
I'll make sure to get my revenge some day 😈
In this PR I'm addressing the issues encountered during a national run of the OMF import pipeline. In particular, this means:

1. Switching the extent filter for segments from `contains` to `intersects` (with a bbox short-circuit). This slightly reduces the accuracy of the filter but allows for much faster operation.
2. Removing the vertices associated with detected island edges and renumbering the remaining vertices/edges.

For point 2, an important assumption is that island detection is performed with all the edge lists together (something we want to expand in the future, see #116) and that we use double adjacency for component search. These assumptions mean that any vertex adjacent to an edge marked as "island" is necessarily an orphan vertex in the clean graph, which significantly reduces the complexity of the algorithm: we can just look for all the vertices that are adjacent to island edges to determine the complete list of vertices to remove.
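Under those assumptions, the vertex-removal set reduces to one pass over the island edges. A minimal sketch with hypothetical simplified types (the real code works over the project's graph structures):

```rust
use std::collections::HashSet;

// Hypothetical simplified types, for illustration only.
type VertexId = usize;
type EdgeId = usize;

struct Edge {
    src: VertexId,
    dst: VertexId,
}

/// By the double-adjacency assumption, every vertex touching an island edge
/// is an orphan in the cleaned graph, so the removal set is exactly the
/// union of endpoints of island edges -- no component traversal needed here.
fn vertices_to_remove(edges: &[Edge], island_edges: &HashSet<EdgeId>) -> HashSet<VertexId> {
    edges
        .iter()
        .enumerate()
        .filter(|(id, _)| island_edges.contains(id))
        .flat_map(|(_, e)| [e.src, e.dst])
        .collect()
}

fn main() {
    let edges = vec![
        Edge { src: 0, dst: 1 }, // main component
        Edge { src: 2, dst: 3 }, // island edge 1
        Edge { src: 3, dst: 4 }, // island edge 2
    ];
    let island: HashSet<EdgeId> = [1, 2].into_iter().collect();
    let mut removed: Vec<_> = vertices_to_remove(&edges, &island).into_iter().collect();
    removed.sort();
    println!("{:?}", removed); // [2, 3, 4]
}
```

This is linear in the number of island edges, instead of re-running a component search over the whole graph just to find orphans.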
I'm including a small tree generated after running an import over Lakewood (and bambam run from the generated output) as evidence this is working correctly:
(red lines are isochrones, purple lines are drive traversals, and yellow/green are walk/bike traversals (not sure which is which at the moment))