This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and non-diegetic audio. Comments are welcome on the spatial-media-discuss mailing list or by filing an issue on GitHub.
Spatial audio metadata is stored in a new box, SA3D, defined in this RFC. Non-diegetic audio metadata is stored in a new box, SAND, defined in this RFC. The metadata is applicable to individual tracks in an MP4 container.
Box Type: SA3D
Container: Sound Sample Description box (e.g., mp4a, lpcm, sowt, etc.)
Mandatory: No
Quantity: Zero or one
When present, provides additional information about the spatial audio content contained in this audio track.
aligned(8) class SpatialAudioBox extends Box('SA3D') {
    unsigned int(8) version;
    unsigned int(8) ambisonic_type;
    unsigned int(32) ambisonic_order;
    unsigned int(8) ambisonic_channel_ordering;
    unsigned int(8) ambisonic_normalization;
    unsigned int(32) num_channels;
    for (i = 0; i < num_channels; i++) {
        unsigned int(32) channel_map;
    }
}
- `version` is an 8-bit unsigned integer that specifies the version of this box. Must be set to `0`.
- `ambisonic_type` is an 8-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined:

  | `ambisonic_type` | Ambisonic Type Description |
  |---|---|
  | `0` | **Periphonic**: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D). |
- `ambisonic_order` is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If the `ambisonic_type` is `0` (periphonic), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of `sqrt(n) - 1`, where `n` is the number of channels in the represented ambisonic audio data. For example, a periphonic ambisonic sound field with `ambisonic_order = 1` requires `(ambisonic_order + 1)^2 = 4` ambisonic components.
- `ambisonic_channel_ordering` is an 8-bit unsigned integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined:

  | `ambisonic_channel_ordering` | Channel Ordering Description |
  |---|---|
  | `0` | **ACN**: The channel ordering used is the Ambisonic Channel Number (ACN) system. In this, given a spherical harmonic of degree `l` and order `m`, the corresponding ordering index `n` is given by `n = l * (l + 1) + m`. |
- `ambisonic_normalization` is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined:

  | `ambisonic_normalization` | Normalization Description |
  |---|---|
  | `0` | **SN3D**: The normalization used is Schmidt semi-normalization (SN3D). In this, the spherical harmonic of degree `l` and order `m` is normalized according to `sqrt((2 - δ(m)) * ((l - m)! / (l + m)!))`, where `δ(m)` is the Kronecker delta function, such that `δ(0) = 1` and `δ(m) = 0` otherwise. |
- `num_channels` is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track.
- `channel_map` is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined `ambisonic_channel_ordering`. The sequence of `channel_map` values should match the channel sequence within the given audio track.

  For example, consider a 4-channel audio track containing ambisonic components W, X, Y, Z at channel indexes 0, 1, 2, 3, respectively. For `ambisonic_channel_ordering = 0` (ACN), the ordering of components should be W, Y, Z, X, so the `channel_map` sequence should be `0`, `2`, `3`, `1`.

  As a simpler example, for a 4-channel audio track containing ambisonic components W, Y, Z, X at channel indexes 0, 1, 2, 3, respectively, the `channel_map` sequence should be specified as `0`, `1`, `2`, `3` when `ambisonic_channel_ordering = 0` (ACN).
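The field layout above can be read back with a few lines of Python. This is a minimal sketch, not a reference parser: the names `SA3D` (the dataclass) and `parse_sa3d_payload` are hypothetical, the input is assumed to be the box payload that follows the 8-byte size and type header, and the strict check that `num_channels` equals `(ambisonic_order + 1)^2` is one possible reading of the `ambisonic_order` rule.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class SA3D:
    version: int
    ambisonic_type: int
    ambisonic_order: int
    ambisonic_channel_ordering: int
    ambisonic_normalization: int
    channel_map: List[int]


def parse_sa3d_payload(payload: bytes) -> SA3D:
    """Parse the bytes that follow the 8-byte box header (size + 'SA3D')."""
    # Fixed-size fields: u8, u8, u32, u8, u8, u32 = 12 bytes, big-endian.
    fixed = struct.Struct(">BBIBBI")
    if len(payload) < fixed.size:
        raise ValueError("SA3D payload is too short")
    version, a_type, order, ordering, norm, num_channels = fixed.unpack_from(payload, 0)
    if version != 0:
        raise ValueError(f"unsupported SA3D version {version}")
    if len(payload) < fixed.size + 4 * num_channels:
        raise ValueError("SA3D payload is truncated inside channel_map")
    channel_map = list(struct.unpack_from(f">{num_channels}I", payload, fixed.size))
    # Assumed strict check: periphonic audio of order N carries (N + 1)^2 components.
    if a_type == 0 and num_channels != (order + 1) ** 2:
        raise ValueError("num_channels does not equal (ambisonic_order + 1)^2")
    return SA3D(version, a_type, order, ordering, norm, channel_map)
```

A reader that prefers to tolerate unexpected channel counts could downgrade the last check to a warning.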
Here is an example MP4 box hierarchy for a file containing the SA3D box:
- moov
  - trak
    - mdia
      - minf
        - stbl
          - stsd
            - mp4a
              - esds
              - SA3D
where the SA3D box has the following data:
| Field Name | Value |
|---|---|
| `version` | `0` |
| `ambisonic_type` | `0` |
| `ambisonic_order` | `1` |
| `ambisonic_channel_ordering` | `0` |
| `ambisonic_normalization` | `0` |
| `num_channels` | `4` |
| `channel_map` | `0` |
| `channel_map` | `2` |
| `channel_map` | `3` |
| `channel_map` | `1` |
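As an illustration, the example values can be serialized into a complete box, and the `channel_map` can be applied to a list of channel labels. This is a sketch under stated assumptions: the helper names `build_sa3d_box` and `components_in_acn_order` are hypothetical, the box uses the ordinary 32-bit size header, and `channel_map[k]` is read as the index of the track channel that carries ACN component `k`, which is the reading that reproduces the W, X, Y, Z example.

```python
import struct


def build_sa3d_box(order: int, channel_map: list) -> bytes:
    """Serialize a version-0, periphonic, ACN, SN3D SA3D box."""
    payload = struct.pack(">BBIBBI", 0, 0, order, 0, 0, len(channel_map))
    payload += struct.pack(f">{len(channel_map)}I", *channel_map)
    # Box header: 32-bit total size (header included) followed by the type.
    return struct.pack(">I4s", 8 + len(payload), b"SA3D") + payload


def components_in_acn_order(track_channels: list, channel_map: list) -> list:
    """Assumed reading: channel_map[k] is the track channel holding ACN component k."""
    return [track_channels[index] for index in channel_map]


box = build_sa3d_box(order=1, channel_map=[0, 2, 3, 1])
print(len(box), box.hex())
print(components_in_acn_order(["W", "X", "Y", "Z"], [0, 2, 3, 1]))
```

With these inputs the box is 36 bytes long (8 header bytes, 12 bytes of fixed fields, and 16 bytes of `channel_map`), and the reordered labels come out as W, Y, Z, X.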
Box Type: SAND
Container: Sound Sample Description box (e.g., mp4a, lpcm, sowt, etc.)
Mandatory: No
Quantity: Zero or one
When present, provides additional information about the non-diegetic audio content contained in this audio track. This can be used alongside SA3D in a head-tracked virtual reality experience to provide audio that should remain unchanged by listener head rotation; e.g., narration or stereo music.
aligned(8) class NonDiegeticAudioBox extends Box('SAND') {
    unsigned int(8) version;
}
- `version` is an 8-bit unsigned integer that specifies the version of this box. Must be set to `0`.
Here is an example MP4 box hierarchy for a file containing the SA3D and SAND boxes, to mix spatial audio with non-diegetic audio:
- moov
  - trak
    - mdia
      - minf
        - stbl
          - stsd
            - mp4a
              - esds
              - SA3D
  - trak
    - mdia
      - minf
        - stbl
          - stsd
            - mp4a
              - esds
              - SAND
where the SAND box has the following data:
| Field Name | Value |
|---|---|
| `version` | `0` |
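A reader can locate these boxes by walking the hierarchy shown above. The following is a rough sketch rather than a complete MP4 parser: `CHILD_OFFSET`, `iter_boxes`, and `find_spatial_boxes` are hypothetical names, only 32-bit box sizes are handled, and the skip sizes assume a standard `stsd` layout and a version-0 `mp4a` audio sample entry. Other sample entry types (e.g., `lpcm` or `sowt` with QuickTime sound description versions 1 or 2) would need different offsets.

```python
import struct

# Bytes to skip after the 8-byte box header before child boxes begin.
# Assumed values: plain containers have none, stsd has version/flags plus
# entry_count (8 bytes), and a version-0 audio sample entry such as mp4a
# has 28 bytes of fixed fields.
CHILD_OFFSET = {b"moov": 0, b"trak": 0, b"mdia": 0, b"minf": 0,
                b"stbl": 0, b"stsd": 8, b"mp4a": 28}


def iter_boxes(data: bytes, start: int, end: int):
    """Yield (type, payload_start, box_end) for each box in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > end:
            break  # 64-bit and to-end-of-file sizes are not handled here
        yield box_type, pos + 8, pos + size
        pos += size


def find_spatial_boxes(data: bytes) -> list:
    """Return [(track_index, b'SA3D' or b'SAND'), ...] in file order."""
    found = []
    track_count = 0

    def walk(start: int, end: int, track: int) -> None:
        nonlocal track_count
        for box_type, payload_start, box_end in iter_boxes(data, start, end):
            if box_type in (b"SA3D", b"SAND"):
                found.append((track, box_type))
            elif box_type in CHILD_OFFSET:
                if box_type == b"trak":
                    track = track_count
                    track_count += 1
                walk(payload_start + CHILD_OFFSET[box_type], box_end, track)

    walk(0, len(data), -1)
    return found
```

For the two-track example, such a walk would be expected to report SA3D on the first track and SAND on the second, which is the signal a player needs to rotate one track with the listener's head and leave the other untouched.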
The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the associated Legendre polynomials (without Condon-Shortley phase) as the basis functions. Thus, the spherical harmonic of degree l and order m at elevation E and azimuth A is given by:
N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)
where:
- `N(l, m)` is the spherical harmonics normalization function used.
- `P(l, m, x)` is the (unnormalized) associated Legendre polynomial, without Condon-Shortley phase, of degree `l` and order `m` evaluated at `x`.
- `T(m, x)` is `sin(-m * x)` for `m < 0` and `cos(m * x)` otherwise.
Azimuth `A`:

- `A = 0`: The source is in front of the listener.
- `A` in `(0, pi/2)`: The source is in the forward-left quadrant.
- `A` in `(pi/2, pi)`: The source is in the back-left quadrant.
- `A` in `(-pi/2, 0)`: The source is in the forward-right quadrant.
- `A` in `(-pi, -pi/2)`: The source is in the back-right quadrant.

Elevation `E`:

- `E = 0`: The source is in the horizontal plane.
- `E` in `(0, pi/2]`: The source is above the listener.
- `E` in `[-pi/2, 0)`: The source is below the listener.
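To make the definition and the angle conventions concrete, the sketch below evaluates `N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)` for every component up to a given order and stores the results at their ACN indexes. The function names `sn3d`, `legendre`, and `encode_gains` are hypothetical, SN3D is assumed as the normalization, and the Legendre values come from the standard upward recurrence with the Condon-Shortley sign omitted.

```python
import math


def sn3d(l: int, m: int) -> float:
    """SN3D normalization N(l, m) for m >= 0."""
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))


def legendre(l: int, m: int, x: float) -> float:
    """Associated Legendre function P(l, m, x), m >= 0, no Condon-Shortley phase."""
    # Start from P(m, m, x) = (2m - 1)!! * (1 - x^2)^(m / 2).
    pmm = 1.0
    for k in range(1, m + 1):
        pmm *= (2 * k - 1) * math.sqrt(max(0.0, 1.0 - x * x))
    if l == m:
        return pmm
    # P(m + 1, m, x) = x * (2m + 1) * P(m, m, x), then recurse upward in degree.
    prev, cur = pmm, x * (2 * m + 1) * pmm
    for n in range(m + 2, l + 1):
        prev, cur = cur, ((2 * n - 1) * x * cur - (n + m - 1) * prev) / (n - m)
    return cur


def encode_gains(order: int, azimuth: float, elevation: float) -> list:
    """Gains of a point source for every ACN component up to `order`."""
    gains = [0.0] * (order + 1) ** 2
    for l in range(order + 1):
        for m in range(-l, l + 1):
            trig = math.sin(-m * azimuth) if m < 0 else math.cos(m * azimuth)
            acn = l * (l + 1) + m
            gains[acn] = sn3d(l, abs(m)) * legendre(l, abs(m), math.sin(elevation)) * trig
    return gains
```

At first order this reduces to W = 1, Y = sin(A) * cos(E), Z = sin(E), X = cos(A) * cos(E), so a source straight ahead (`A = 0`, `E = 0`) excites only W and X, a source directly to the left (`A = pi/2`) excites W and Y, and a source overhead (`E = pi/2`) excites W and Z.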