perf: Trim the audio so that we only transcribe the portion within the clip.

Currently we always transcribe the entire audio, which wastes work when the clip uses only a small portion of it.

To improve performance, we need to:
- (Short term) Trim the audio data on the server, keeping only the portion that is visible within the clip
- (Long term) Trim the audio on the client, so that only the visible portion is sent to the server
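
The short-term trim could look roughly like the sketch below. It assumes the audio is already decoded into a flat sequence of mono samples and that the clip's visible range is known in seconds. The name `trim_to_clip` and its parameters are hypothetical and do not come from the existing code.

```python
from typing import Sequence


def trim_to_clip(
    samples: Sequence[float],
    sample_rate: int,
    clip_start_s: float,
    clip_end_s: float,
) -> Sequence[float]:
    """Return only the samples inside [clip_start_s, clip_end_s).

    Hypothetical helper: assumes decoded mono samples, one value per frame.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    total = len(samples)
    # Convert seconds to sample indices and clamp them to the available audio.
    start = max(0, min(total, int(clip_start_s * sample_rate)))
    end = max(start, min(total, int(clip_end_s * sample_rate)))
    return samples[start:end]
```

If the transcription returns timestamps, they would be relative to the trimmed audio, so the caller would likely need to add `clip_start_s` back to map them onto the original timeline.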