Conversation

@mollyxu (Contributor) commented Jan 16, 2026

Add rotation metadata support

Context: #1084

Videos recorded on mobile devices often store rotation metadata indicating how the video should be displayed. TorchCodec now extracts this metadata and applies the rotation during decoding so frames are returned in the correct orientation.

This PR adds support for extracting and applying rotation metadata from video streams. The rotation angle (in degrees, counter-clockwise) is extracted from the display matrix side data and is:

  1. Exposed via VideoStreamMetadata.rotation
  2. Automatically applied during decoding — frames are rotated so they display correctly

from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("rotated_video.mp4")
print(decoder.metadata.rotation)  # e.g., 90

The rotation property

  • None: The video has no rotation metadata (most videos)
  • float: The rotation angle in degrees (counter-clockwise) needed for correct display. Common values are 0, 90, 180, -90, or -180.
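For illustration, a minimal sketch of consuming this property (the file name is a placeholder):

from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("rotated_video.mp4")
rotation = decoder.metadata.rotation
if rotation is None:
    print("no rotation metadata")  # the common case
else:
    print(f"display rotation: {rotation} degrees")  # e.g., 90.0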

Implementation

Metadata extraction (FFMPEGCommon.cpp):

  • Added getRotationFromStream() which extracts the rotation from AV_PKT_DATA_DISPLAYMATRIX side data
  • Handles FFmpeg version differences:
    • FFmpeg ≥ 6.1: Uses av_packet_side_data_get() with codecpar->coded_side_data
    • FFmpeg 5/6: Uses av_stream_get_side_data() with deprecation warning suppression for FFmpeg 6
    • FFmpeg 4: Uses av_stream_get_side_data() with int* size parameter
  • Uses av_display_rotation_get() to compute the angle from the display matrix
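For intuition, here is a rough Python transcription of what av_display_rotation_get computes from the 9-element, 16.16 fixed-point display matrix (a sketch, not the exact FFmpeg source):

import math

def rotation_from_display_matrix(m):
    # m: 9 ints from AV_PKT_DATA_DISPLAYMATRIX, in 16.16 fixed point.
    to_float = lambda x: x / (1 << 16)
    scale_x = math.hypot(to_float(m[0]), to_float(m[3]))
    scale_y = math.hypot(to_float(m[1]), to_float(m[4]))
    if scale_x == 0.0 or scale_y == 0.0:
        return None  # degenerate matrix, no usable rotation
    angle = math.degrees(
        math.atan2(to_float(m[1]) / scale_y, to_float(m[0]) / scale_x)
    )
    return -angle  # negated so that positive means counter-clockwise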

Metadata propagation:

  • StreamMetadata struct now includes an std::optional<double> rotation field (Metadata.h)
  • SingleStreamDecoder populates this during stream scanning
  • Python VideoStreamMetadata dataclass exposes rotation: float | None (_metadata.py)
  • JSON metadata serialization includes the rotation value (custom_ops.cpp)
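For context, the Python-side change roughly amounts to the following (a simplified sketch; the other fields are elided):

from dataclasses import dataclass

@dataclass
class VideoStreamMetadata:
    # ... existing fields (width, height, duration, ...) elided ...
    # Rotation angle in degrees (counter-clockwise) needed for correct
    # display, or None if the stream carries no rotation metadata.
    rotation: float | None = None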

Test coverage

  • Added nasa_13013_rotated.mp4 test resource with 90° rotation metadata
  • Generated the video using:
ffmpeg -y -display_rotation 90 -i test/resources/nasa_13013.mp4 \
  -c copy \
  test/resources/nasa_13013_rotated.mp4
  • test_rotation_metadata (in test_metadata.py): Verifies rotation is correctly extracted for rotated videos and is None for non-rotated videos
  • test_rotation_applied_to_frames (in test_decoders.py): Performs pixel-exact comparison with ffmpeg CLI output
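A minimal sketch of what test_rotation_metadata could look like (paths assumed from the description above):

import pytest
from torchcodec.decoders import VideoDecoder

@pytest.mark.parametrize(
    ("path", "expected"),
    [
        ("test/resources/nasa_13013.mp4", None),  # no rotation metadata
        ("test/resources/nasa_13013_rotated.mp4", 90),  # 90 degrees CCW
    ],
)
def test_rotation_metadata(path, expected):
    assert VideoDecoder(path).metadata.rotation == expected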

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) · Jan 16, 2026
@mollyxu changed the title from "Add rotation metadata support" to "Rotation metadata support" · Jan 20, 2026
// Extracts the rotation angle in degrees from the stream's display matrix
// side data. The display matrix is used to specify how the video should be
// rotated for correct display.
std::optional<double> getRotationFromStream(const AVStream* avStream);
Contributor Author:

Not sure if there is a huge difference between using optional rather than 0 to represent no rotation.

Would it be useful to distinguish between a video with no rotation metadata and a video that is known not to have been rotated (i.e., a rotation of 0 degrees)?

Contributor:

I'm not sure this distinction matters to users or to FFmpeg, but optional seems like a reasonable type to use here. Correct me if I'm wrong, but the only difference is that we need to check that rotation is not None?

#elif LIBAVFORMAT_VERSION_MAJOR >= 60 // FFmpeg 6.0
// FFmpeg 6.0: Use av_stream_get_side_data (deprecated but still available)
// Suppress deprecation warning for this specific call
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
Contributor Author:

I'm not sure if there is a better way to handle FFmpeg 6.0 deprecating av_stream_get_side_data while its replacement, codecpar->coded_side_data, is not available until FFmpeg 6.1.

Contributor:

This is awful 😄. But I think you're doing the right thing!

I'd just add a comment at the top of the function, with a high-level explanation of what's going on. Something like this (please check me on the versions and the API names):

av_stream_get_side_data was deprecated in FFmpeg 6.0, but its replacement (av_packet_side_data_get + codecpar->coded_side_data) is only available from version 6.1. We need some #pragma magic to silence the deprecation warning, which our compile chain would otherwise treat as an error.

(CC @scotts, I'm sure you will enjoy this).

@mollyxu marked this pull request as ready for review · January 21, 2026 04:11
dimension_order=dimension_order,
)

frame = decoder[0]
Contributor:

Let's modify this test to check accuracy for multiple frames.

Also, the nasa_13013 video contains multiple video streams, so we could parametrize over stream_index as well.

Contributor:

Agreed on Daniel's suggestions. Kinda related, but: do we actually need to check against FFmpeg? I'm hoping we can get away with just comparing the output of NASA_VIDEO and NASA_VIDEO_ROTATED (where we manually apply the 90° rotation to one of them, in order to compare).
To be clear, I do agree that comparing against FFmpeg is the "golden standard" test... but I think we can get away with the simpler approach described above.

On parametrizing over streams: if we're going to check in JSON files for both stream 0 and stream 3, I agree, we might as well parametrize over both. But do we need to? Would a single stream be enough here?
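For concreteness, that simpler comparison could look something like this (a sketch, using raw paths instead of our test fixtures):

import torch
from torchcodec.decoders import VideoDecoder

plain = VideoDecoder("test/resources/nasa_13013.mp4")[0]  # CHW frame
rotated = VideoDecoder("test/resources/nasa_13013_rotated.mp4")[0]
# Manually apply the 90-degree CCW rotation to the plain frame and
# compare against what the decoder produced for the rotated asset.
torch.testing.assert_close(torch.rot90(plain, k=1, dims=(-2, -1)), rotated)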

# Frame is (H, W, C)
frame_h, frame_w = frame.shape[0], frame.shape[1]

# Pixel-exact comparison with ffmpeg CLI (autorotate is on by default)
Contributor:

Does autorotate refer to an ffmpeg CLI option? It's not used in the command below, so this comment is a bit confusing.

if self._dimension_order == "NCHW":
dims = (-2, -1)
else:
dims = (-3, -2)
Contributor:

Is there a reason we use negative indices here? It works, but I had to think about it for a second.

Contributor Author:

I was using negative indices to account for batched vs. unbatched calls, but that might no longer apply after moving the logic to C++.
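For reference, negative dims are what make the same call work for both layouts (a quick demonstration):

import torch

batched = torch.rand(5, 3, 270, 480)  # N, C, H, W
unbatched = torch.rand(3, 270, 480)  # C, H, W

# dims=(-2, -1) always addresses (H, W) from the end, batched or not.
print(torch.rot90(batched, k=1, dims=(-2, -1)).shape)  # (5, 3, 480, 270)
print(torch.rot90(unbatched, k=1, dims=(-2, -1)).shape)  # (3, 480, 270)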

# Precompute rotation parameters for torch.rot90
# k is the number of 90-degree counter-clockwise rotations
rotation = self.metadata.rotation
if rotation is not None and rotation != 0:
Contributor:

Checking rotation != 0 shouldn't matter here; we will set self._rotation_k to 0 either way.
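i.e., the precompute could collapse to a single expression (a hypothetical helper, assuming k counts 90-degree CCW turns as in the snippet above):

def rotation_to_k(rotation):
    # Number of 90-degree CCW turns for torch.rot90; None means no-op.
    # Maps 90 -> 1, 180 -> 2, -90 -> 3, -180 -> 2, and 0/None -> 0.
    return int(rotation / 90) % 4 if rotation is not None else 0

assert rotation_to_k(None) == 0 and rotation_to_k(-90) == 3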

// Extracted from the display matrix side data. This indicates how the video
// should be rotated for correct display. Common for videos recorded on mobile
// devices.
std::optional<double> rotation;
Contributor:

The information from this comment is also in _metadata.py; it's probably not necessary here.

AV_PKT_DATA_DISPLAYMATRIX);
if (sideData != nullptr && sideData->size >= 9 * sizeof(int32_t)) {
displayMatrix = reinterpret_cast<const int32_t*>(sideData->data);
}
Contributor:

Why do we verify here that the size of sideData is >= 9 * sizeof(int32_t)? I found one example in FFmpeg where there is only a nullptr check, but it's possible I am missing something.

Contributor Author:

Thanks for the suggestion! I think I was being overly cautious in checking that the display matrix has 9 elements.

Contributor:

Since we test that rotation is present in the metadata, it makes sense to check in a video.

test_rotation_applied_to_frames only needs a few frames, so perhaps we can reduce the size of the video file and of the associated all_frames_info JSONs?


  frame_data, *_ = core.get_frame_at_index(self._decoder, frame_index=key)
- return frame_data
+ return self._apply_rotation(frame_data)
Contributor:

A side effect of applying the rotation at the Python level is that we have to update each frame retrieval API.

I wonder if this should be handled in C++, similar to maybePermuteHWC2CHW. We would still need to add it to every SingleStreamDecoder::getFrame ... method, but it keeps the complexity in the C++ implementation.
Let me know your thoughts @mollyxu @NicolasHug

Contributor Author:

Thanks for the suggestion! I think it would make the public API cleaner to do something similar to maybePermuteHWC2CHW.

I was also considering applying the ffmpeg transpose filter as an option, to consolidate the calls in one place. The con of that approach is that the pixels would actually be moved, instead of the rotation being just a view operation.

Contributor:

That's a good call-out, I agree that we should try to implement this logic in C++. The main reason is that, while we don't want to have a public C++ decoding API, we should still aim to have a fully-featured C++ decoder. Ideally, we want the python layer to be as slim as possible (with no logic). We can't always do that, but that's what we should aim for.
It makes a lot of sense to me to treat this rotation logic in the same way we treat maybePermuteHWC2CHW.

On the implementation, fully agreed that we should prefer using torch::rot90 (which should still exist in C++) so that we can return a view instead of incurring a copy through filtergraph.


Verifies pixel-exact match with ffmpeg CLI.
"""
import subprocess
Contributor:

If we keep the test against FFmpeg, this should be a top-level import.

display. Common values are 0, 90, 180, -90, or -180 degrees. Videos recorded on
mobile devices often have rotation metadata. TorchCodec automatically
applies this rotation during decoding, so the returned frames are
in the correct orientation (float or None)."""
Contributor:

Great comment! We should indicate that we're rounding to the closest multiple of 90.

Also: The height and width that we report, are they pre-rotation or post-rotation? I think we'll want them to be post-rotation, since this is what we'll return from our decoding methods? No matter what we choose, we should make sure to document it!
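To make the height/width question concrete: a post-rotation convention would give the following invariant (hypothetical, since the PR hasn't settled the choice yet):

from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("rotated_video.mp4")
frame = decoder[0]  # CHW by default
# Holds only if metadata reports post-rotation dimensions, matching
# the shape of the decoded frames.
assert frame.shape[-2:] == (decoder.metadata.height, decoder.metadata.width)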


