Skip to content

Decoupled Tracab dat / json from meta data file types#364

Merged
probberechts merged 13 commits intoPySport:masterfrom
UnravelSports:feat/tracab_meta
May 23, 2025
Merged

Decoupled Tracab dat / json from meta data file types#364
probberechts merged 13 commits intoPySport:masterfrom
UnravelSports:feat/tracab_meta

Conversation

@UnravelSports
Copy link
Contributor

@UnravelSports UnravelSports commented Nov 20, 2024

I noticed in the Tracab data-loader we were enforcing that whenever a .dat file was provided we could only provide a .xml meta data file, and when we provide a .json tracking file this would always need to be paired with a .json meta data file.

I found a situation where this was not the case (ie. we had .dat and .json meta data), and thus the parser broke.

Concretely this means:

  • Inside tracab.load (more specifically inside identify_deserializer) we no longer use the combination of meta_data_extension and raw_data_extension to see which Deserializer we need. This is now only dependents on the raw_data_extension
  • As a result we need to communicate the meta_data_extension to the deserializer. I chose to do this by adding a new parameter to both TRACABDatDeserializer and TRACABJSONDeserializer called meta_data_extension. (Note: I tried to do this differently, but because we already open the meta data inside load and pass it directly to the deserializer as the TRACABInputs() we can't retrieve the meta_data_extension after that step.
  • Inside both deserializers we now call a function load_meta_data which takes self.meta_data_extension and inputs.meta_data and returns this ugly thing. (I can clean this up by return a dictionary with these values or something, but not sure if that's necessary.
(
    pitch_size_height,
    pitch_size_width,
    teams,
    periods,
    frame_rate,
    date,
    game_id,
) = load_meta_data(self.meta_data_extension, inputs.meta_data)
  • Side note: Doing this load_meta_data step made me realize that Enriched metadata with date, game_week and game_id #340 had an oversight and did not include UTC date and game_id in the TracabDATDeserializer. This means I had to edit line 176 in kloppy/tests/test_tracab.py to make the test reflect UTC time. (This is the test that currently fails though)
  • load_meta_data lives in a newly created helpers.py file inside kloppy/infa/serializers/tracking/tracab/ and it acts a a simple switching board to grab either load_meta_data_xml or load_meta_data_json based the meta_data_extension. (This could be done differently, but not sure if that's necessary).
  • This new helpers.py file also contains both functions for parsing the correct meta data file.
  • I have added a simple test to check that everything still works when we provide the new combinations of tracking and meta data files.
  • I moved some of the tests into a general function we can call from multiple tests called meta_tracking_assertions. I did this because we're running the same asserts 4 times on 4 different combinations of tracking and meta data.

@UnravelSports
Copy link
Contributor Author

It seems my local version and the GitHub tests don't agree on what UTC time is...

@UnravelSports
Copy link
Contributor Author

Does anyone have any idea what could be causing this timezone/conversion mismatch? When I test it locally this asserting is true:

assert date == datetime(
                2023, 12, 15, 19, 32, 20, tzinfo=timezone.utc
            )

But here the automated tests seems to be in disagreement by 1 hour and want it to be (2023, 12, 15, 20, 32, 20). The failed test can be found here.

assert datetime.datetime(2023, 12, 15, 20, 32, 20, tzinfo=datetime.timezone.utc) == datetime.datetime(2023, 12, 15, 19, 32, 20, tzinfo=datetime.timezone.utc)
E            +  where datetime.datetime(2023, 12, 15, 19, 32, 20, tzinfo=datetime.timezone.utc) = datetime(2023, 12, 15, 19, 32, 20, tzinfo=datetime.timezone.utc)
E            +    where datetime.timezone.utc = timezone.utc

@probberechts
Copy link
Contributor

I think it is this line:

parse(meta_data.match.attrib["dtDate"]).astimezone(timezone.utc)

You first parse the date using your local timezone, then the .astimezone() operation switches the datetime to a different timezone. I suppose the timezone on your machine is UTC+1, while GHA is on UTC.

I think the correct implementation is

parse(meta_data.match.attrib["dtDate"]).replace(tzinfo=timezone.utc)

@UnravelSports
Copy link
Contributor Author

Thanks, totally didn't notice that!

@UnravelSports
Copy link
Contributor Author

FWIW, this should now be fixed

@UnravelSports
Copy link
Contributor Author

I noticed there was a merge conflict, I've resolved that now (in 2 attempts 🤣).

@probberechts
Copy link
Contributor

I've refactored this but I still have to test a few things. I was also wondering whether there is documentation about these (meta)data formats and whether Tracab uses a particular name / version number / code to refer to them.

@UnravelSports
Copy link
Contributor Author

UnravelSports commented Mar 3, 2025

I see the JSON file does not have a version, but the XML seems to have a version, although I'm not sure how consist it's use is and my only reference points are an old file from 2019 with <TracabMetaData sVersion="1.0"> and a test file with <OptaMetaData sVersion="1.2"> in it.

However, I wrote that test with <OptaMetaData sVersion="1.2"> but I can't remember where I got that (partial) file from.

Perhaps in general we can start to make a point of paying more attention to version numbers (if/when they are provided). Both at PySport (when talking with providers) and in the Common Data Format we're at least trying to get vendors to use versioning.

@koenvo koenvo added this to the 3.17.0 milestone May 13, 2025
@UnravelSports
Copy link
Contributor Author

UnravelSports commented May 21, 2025

@probberechts I made a small fix inside extract_orientation for the JSON and flat XML parser, because it can happen that Phase1HomeGKLeft doesn't exist.

Potentially, we should also discuss how we deal with this for games with extra time.

@probberechts probberechts merged commit e4320ec into PySport:master May 23, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants