Implement Event Collection API using Spanner #1811
keyurva wants to merge 4 commits into datacommonsorg:master from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Event Collection API, enabling the retrieval of event data from Spanner. It achieves feature parity with the existing V2 implementation, allowing users to query events by type, location, and date, with support for additional filtering capabilities.

Highlights
Changelog
Activity
Code Review
This pull request implements the Event Collection API using Spanner, achieving feature parity with the V2 implementation. The changes are well-structured, with a clear separation of concerns for query building, execution, and data processing. The addition of comprehensive tests and golden files is commendable. I've identified a couple of areas for improvement related to deterministic behavior and code clarity, aligning with best practices for robust and maintainable code.
for k, v := range arc.Filter {
	if k == "typeOf" || k == "date" {
		continue
	}
	if len(v) != 1 {
		return nil // Invalid filter
	}
	spec, err := v2e.ParseEventCollectionFilter(k, v[0])
	if err != nil {
		return nil // Invalid filter format
	}
	res.FilterProp = spec.Prop
	res.FilterLowerLimit = spec.LowerLimit
	res.FilterUpperLimit = spec.UpperLimit
	res.FilterUnit = spec.Unit
	break // V2 supports at most one extra filter
}
The iteration over the arc.Filter map is non-deterministic because map iteration order is not guaranteed in Go. While the comment on line 436 mentions that V2 supports at most one extra filter, relying on this assumption can make the code fragile. If a request with multiple extra filters is ever processed, the behavior will be unpredictable as to which filter is applied. To make the implementation more robust and deterministic, I suggest sorting the filter keys before iterating.
// Handle standard filters (e.g. area).
filterKeys := make([]string, 0, len(arc.Filter))
for k := range arc.Filter {
	filterKeys = append(filterKeys, k)
}
sort.Strings(filterKeys)
for _, k := range filterKeys {
	if k == "typeOf" || k == "date" {
		continue
	}
	v := arc.Filter[k]
	if len(v) != 1 {
		return nil // Invalid filter
	}
	spec, err := v2e.ParseEventCollectionFilter(k, v[0])
	if err != nil {
		return nil // Invalid filter format
	}
	res.FilterProp = spec.Prop
	res.FilterLowerLimit = spec.LowerLimit
	res.FilterUpperLimit = spec.UpperLimit
	res.FilterUnit = spec.Unit
	break // V2 supports at most one extra filter
}

References
- This rule emphasizes ensuring deterministic order when the underlying system (like map iteration or SPARQL query results without ORDER BY) does not guarantee it. Sorting map keys before iteration aligns with this principle to prevent unpredictable behavior.
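To illustrate why sorting helps — a standalone sketch, not the PR's code — Go's map iteration order is unspecified and can differ between runs, while iterating over sorted keys is always stable:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns a map's keys in a deterministic (sorted) order,
// the pattern suggested above for iterating over filter maps.
func sortedKeys(m map[string][]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	filter := map[string][]string{
		"typeOf": {"FireEvent"},
		"date":   {"2020-10"},
		"area":   {"100#200#SquareKilometer"},
	}
	// A direct range over the map may yield a different order on each run;
	// sortedKeys always yields the same order.
	fmt.Println(sortedKeys(filter)) // [area date typeOf]
}
```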
func populateGeoLocation(event *pbv1.EventCollection_Event, value string) {
	// Note: The startLocation value in Spanner is usually a latLong/ DCID (e.g. latLong/577521_-958960).
	// We parse it here for performance to avoid an extra database roundtrip.
	//
	// TODO(task): Revisit this optimization if we encounter valid startLocation values
	// that are NOT latLong/ DCIDs but still need to be resolved to points, or if the
	// assumption that dcids always contain coordinates is not true.
	if strings.HasPrefix(value, "latLong/") {
		parts := strings.Split(strings.TrimPrefix(value, "latLong/"), "_")
		if len(parts) == 2 {
			lat, err1 := strconv.ParseFloat(parts[0], 64)
			lon, err2 := strconv.ParseFloat(parts[1], 64)
			if err1 == nil && err2 == nil {
				event.GeoLocations = append(event.GeoLocations, &pbv1.EventCollection_GeoLocation{
					Geo: &pbv1.EventCollection_GeoLocation_Point_{
						Point: &pbv1.EventCollection_GeoLocation_Point{
							Latitude:  proto.Float64(lat / 100000.0),
							Longitude: proto.Float64(lon / 100000.0),
						},
					},
				})
			}
		}
	}
}
This function uses several magic strings and numbers ("latLong/", "_", 100000.0). These should be defined as constants at the package level to improve readability and maintainability. This makes it clear what these values represent and makes them easier to change if needed. Since I can only suggest changes within the function, I'll add the scaling factor as a local constant.
func populateGeoLocation(event *pbv1.EventCollection_Event, value string) {
	const latLonScalingFactor = 100000.0
	// Note: The startLocation value in Spanner is usually a latLong/ DCID (e.g. latLong/577521_-958960).
	// We parse it here for performance to avoid an extra database roundtrip.
	//
	// TODO(task): Revisit this optimization if we encounter valid startLocation values
	// that are NOT latLong/ DCIDs but still need to be resolved to points, or if the
	// assumption that dcids always contain coordinates is not true.
	if strings.HasPrefix(value, "latLong/") {
		parts := strings.Split(strings.TrimPrefix(value, "latLong/"), "_")
		if len(parts) == 2 {
			lat, err1 := strconv.ParseFloat(parts[0], 64)
			lon, err2 := strconv.ParseFloat(parts[1], 64)
			if err1 == nil && err2 == nil {
				event.GeoLocations = append(event.GeoLocations, &pbv1.EventCollection_GeoLocation{
					Geo: &pbv1.EventCollection_GeoLocation_Point_{
						Point: &pbv1.EventCollection_GeoLocation_Point{
							Latitude:  proto.Float64(lat / latLonScalingFactor),
							Longitude: proto.Float64(lon / latLonScalingFactor),
						},
					},
				})
			}
		}
	}
}
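The parsing logic can also be exercised in isolation. Below is a minimal, self-contained sketch of the same DCID-to-coordinates conversion; the helper name `parseLatLong` is hypothetical and not part of the PR:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLatLong is a hypothetical standalone version of the parsing done in
// populateGeoLocation: it extracts scaled coordinates from a latLong/ DCID,
// where the raw integers encode degrees multiplied by 100000.
func parseLatLong(value string) (lat, lon float64, ok bool) {
	const latLonScalingFactor = 100000.0
	if !strings.HasPrefix(value, "latLong/") {
		return 0, 0, false
	}
	parts := strings.Split(strings.TrimPrefix(value, "latLong/"), "_")
	if len(parts) != 2 {
		return 0, 0, false
	}
	rawLat, err1 := strconv.ParseFloat(parts[0], 64)
	rawLon, err2 := strconv.ParseFloat(parts[1], 64)
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return rawLat / latLonScalingFactor, rawLon / latLonScalingFactor, true
}

func main() {
	lat, lon, ok := parseLatLong("latLong/577521_-958960")
	fmt.Println(lat, lon, ok) // 5.77521 -9.5896 true
}
```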
n-h-diaz left a comment
Thanks for getting this done so quickly! :)
generally looks good, just wanted to clarify a few comments on how dates are handled
on geoJson escaping, it looks like the current v2 does the single escape, which is how we've handled it throughout for v3, so I think it's okay to keep for now. we can change the escaping later if needed
@{force_join_order=true}
GRAPH DCGraph MATCH (event:Node)-[:Edge {predicate: 'affectedPlace', object_id: 'country/LBR'}]->(), (event)-[:Edge {predicate: 'typeOf', object_id: 'FireEvent'}]->(), (event)-[:Edge {predicate: 'startDate'}]->(dateNode:Node)
WHERE
	SUBSTR(dateNode.value, 1, 7) = '2020-10'
what happens if the event occurs over multiple months? (ie. do we want the events to be returned if requesting another month? not sure how this is handled in current v2, but we should make sure we match the current behavior)
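To make the question concrete — a sketch, not the PR's code — the `SUBSTR(dateNode.value, 1, 7) = '2020-10'` predicate is a prefix match on the event's startDate only, so an event that spans multiple months matches only the month it starts in:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesMonth mirrors the SQL predicate SUBSTR(dateNode.value, 1, 7) = month
// for YYYY-MM month strings: it is a prefix match on startDate alone, so an
// event is only matched by the month in which it starts.
func matchesMonth(startDate, month string) bool {
	return strings.HasPrefix(startDate, month)
}

func main() {
	// An event that starts on 2020-10-25 and runs into November:
	fmt.Println(matchesMonth("2020-10-25", "2020-10")) // true
	fmt.Println(matchesMonth("2020-10-25", "2020-11")) // false: not returned for a November query
}
```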
// Pattern: <-location{typeOf:EventType, date:Date, filter_prop:filter_val}
func parseEventCollection(req *pbv2.EventRequest, arcs []*v2.Arc) *pbv1.EventCollectionRequest {
	if len(arcs) != 1 {
		return nil
(optional, since not backward supported: could we add some informative error messages to help understand the various reasons a request didn't parse)
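One way to act on this suggestion — a hypothetical sketch, since changing the signature would ripple to callers — is to return an error describing why parsing failed instead of a bare `nil`:

```go
package main

import (
	"fmt"
)

// parseFilterValue is a hypothetical helper illustrating the suggestion:
// instead of silently returning nil on a bad filter, surface the reason.
func parseFilterValue(key string, vals []string) (string, error) {
	if len(vals) != 1 {
		return "", fmt.Errorf("filter %q must have exactly one value, got %d", key, len(vals))
	}
	return vals[0], nil
}

func main() {
	if _, err := parseFilterValue("area", nil); err != nil {
		// The error now explains what went wrong instead of a bare nil request.
		fmt.Println(err) // filter "area" must have exactly one value, got 0
	}
}
```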
if !strings.HasPrefix(edge.Value, "s2CellId/") {
	event.Places = append(event.Places, edge.Value)
}
case predStartDate:
just to confirm, how is event.Dates populated? should we also add observationDates? or does this just copy from startDate? (example: https://datacommons.org/browser/fireEvent/2022-03-05_0x4737f90000000000)
(we should just match whatever v2 does for this)
	},
}

var eventCollectionDcidsTestCases = []struct {
optional: since events support flexible schema, could we test a few other event types? (since we've hardcoded a few properties in the queries, want to confirm they'll work for other events too)
// TODO(task): Revisit this optimization if we encounter valid startLocation values
// that are NOT latLong/ DCIDs but still need to be resolved to points, or if the
// assumption that dcids always contain coordinates is not true.
if strings.HasPrefix(value, "latLong/") {
optional: since we're making this assumption, could we log an error if we find any examples otherwise? (so we can track and update if needed)
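A minimal way to track this — a hypothetical sketch, with an invented helper name — is to log whenever a startLocation value breaks the latLong/ assumption before skipping it:

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// checkStartLocation illustrates the reviewer's suggestion: log any
// startLocation value that does not match the latLong/ assumption, so
// unexpected formats can be tracked and the optimization revisited.
// The function name is hypothetical.
func checkStartLocation(value string) bool {
	if !strings.HasPrefix(value, "latLong/") {
		log.Printf("unexpected startLocation format (not a latLong/ DCID): %q", value)
		return false
	}
	return true
}

func main() {
	fmt.Println(checkStartLocation("latLong/577521_-958960")) // true
	fmt.Println(checkStartLocation("geoId/06"))               // false, and logged to stderr
}
```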
Known Discrepancies with V2
- propVals order: The order of keys in propVals is different from V2.
- geoJsonCoordinates escaping: In the V2 response, quotes are triple-escaped (\\\"), whereas in the Spanner response they are single-escaped (\").

V2 geoJsonCoordinates
V3 geoJsonCoordinates
@n-h-diaz - let me know if this can cause issues and I can change the V3 escaping.