Create Python API for VideoEncoder #990
```diff
@@ -1 +1,2 @@
 from ._audio_encoder import AudioEncoder  # noqa
+from ._video_encoder import VideoEncoder  # noqa
```
```diff
@@ -0,0 +1,97 @@
+from pathlib import Path
+from typing import Union
+
+import torch
+from torch import Tensor
+
+from torchcodec import _core
+
+
+class VideoEncoder:
+    """A video encoder.
+
+    Args:
+        frames (``torch.Tensor``): The frames to encode. This must be a 4D
+            tensor of shape ``(N, C, H, W)`` where N is the number of frames,
+            C is 3 channels (RGB), H is height, and W is width.
+            A 3D tensor of shape ``(C, H, W)`` is also accepted as a single RGB frame.
+            Values must be uint8 in the range ``[0, 255]``.
+        frame_rate (int): The frame rate to use when encoding the
+            **input** ``frames``.
```
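A minimal usage sketch based on the constructor shown above. The `to_file` call mirrors the encoding methods mentioned later in this thread (``to_file``, etc.); its exact signature is an assumption here, not part of the excerpt.

```python
import torch
from torchcodec.encoders import VideoEncoder

# 10 RGB frames, shape (N, C, H, W), uint8 values in [0, 255] as the docstring requires.
frames = torch.randint(0, 256, (10, 3, 240, 320), dtype=torch.uint8)

encoder = VideoEncoder(frames=frames, frame_rate=30)

# Hypothetical encoding call -- the encoding methods and their parameters
# are discussed below but are not shown in this excerpt.
encoder.to_file("output.mp4")
```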
```
sample_rate (int): The sample rate of the **input** ``samples``. The
    sample rate of the encoded output can be specified using the
    encoding methods (``to_file``, etc.).
```

torchcodec/src/torchcodec/encoders/_audio_encoder.py, lines 66 to 67 in 28b0346:

```
sample_rate (int, optional): The sample rate of the encoded output.
    By default, the sample rate of the input ``samples`` is used.
```
My original docstring suggestion was definitely wrong (thanks for catching!), but I'd still suggest clarifying that this is really the input frame rate. And, for now, it also happens to be what we use for the output frame rate:

> The frame rate of the input frames. Also defines the encoded output frame rate.
Thanks for clarifying this - I'll make a note to add frame_rate to the encoding methods (to_file, etc) next.
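For concreteness, the clarified wording might end up looking something like the fragment below. This is only a sketch of the suggestion above, not the merged docstring.

```python
class VideoEncoder:
    """A video encoder.

    Args:
        frame_rate (int): The frame rate of the input ``frames``. Also
            defines the encoded output frame rate, at least until a
            ``frame_rate`` parameter is added to the encoding methods
            (``to_file``, etc.).
    """
```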
What's the distinction between "format" and "container" here? I would just use one or the other?
The terms are used interchangeably, so I included both here to be understandable to users of either term. Let me know if that is actually more confusing than just naming one term.

I might have added the distinction after discovering that the mkv format is defined as a Matroska container. I have not encountered other formats where the two are different.
I feel like saying

> determines the video format and container.

may suggest to some readers that "format" and "container" refer to different things. Especially because a) "format" is sometimes used to refer to the codec, and b) in the methods we only have one single "format" parameter and we never mention the "container".

How about using the term "container format" everywhere in the docstrings? I think that is the actual correct technical term for that (wikipedia):

> Notable examples of container formats include [...] formats used for multimedia playback (such as Matroska, MP4, and AVI).
The parameter can still be format - I think this makes sense for consistency with the AudioEncoder, but in the docstrings of all methods we could clarify that format corresponds to the "container format"?
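As a concrete illustration of that suggestion, a method docstring could describe ``format`` as the container format. The method and its signature below are hypothetical (the encoding methods are not shown in this excerpt); only the docstring wording is the point.

```python
from torch import Tensor


class VideoEncoder:
    # Hypothetical method sketch -- illustrates the "container format" wording only.
    def to_tensor(self, format: str) -> Tensor:
        """Encode the frames into raw bytes, returned as a 1D uint8 tensor.

        Args:
            format (str): The container format of the encoded output, e.g.
                "mp4" or "mkv" (Matroska).
        """
        ...
```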
Q - Do we need to support that? I'm wondering if it makes a lot of sense to just encode a single image as a video. I suspect this was made to mimic the AudioEncoder behavior but that was a different use-case. In the AudioEncoder we want to allow for 1D audio to be supported as it's still a valid waveform. But I don't think we need to treat a single frame as a valid video.
I'm not sure what the use case is for encoding an image as a video, but since FFmpeg allows encoding an image to video, I believe we can retain this functionality for a relatively low cost.
I'm not sure either - let's not do it then :). One principle we have is that we should only implement things that we know are useful. When in doubt, it's best not to expose it; that makes our life easier as maintainers. Of course, if users start to explicitly request this feature, we may reconsider.

In the near future we will start implementing proper image encoders in torchcodec, and another guiding principle is that "there should be one -- and preferably only one -- obvious way to do it" (ref). If we were to support 3D tensors (images) in the VideoEncoder, then we'd potentially expose two ways to do the same thing.

It doesn't mean we'll prevent users from doing it if they really want to: a user can still just call unsqueeze(0) on their 3D tensor and pass it. It will work. But we don't need to explicitly support 3D tensors on our side.
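A quick sketch of that workaround; the constructor call follows the docstring shown earlier in this diff, while `to_file` and its arguments are assumptions.

```python
import torch
from torchcodec.encoders import VideoEncoder

# A single RGB frame: 3D tensor of shape (C, H, W), uint8 in [0, 255].
frame = torch.randint(0, 256, (3, 240, 320), dtype=torch.uint8)

# unsqueeze(0) turns it into a one-frame (1, C, H, W) batch, so it can be
# passed to the encoder without any special 3D-tensor support.
encoder = VideoEncoder(frames=frame.unsqueeze(0), frame_rate=1)
encoder.to_file("single_frame.mp4")  # hypothetical encoding call
```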