-
-
Notifications
You must be signed in to change notification settings - Fork 141
Working end-to-end transcription integrations. #1774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
17fdca1
ca0f2bb
f16f46e
e054c49
5c4a21b
f1f947c
92fccb3
24474c6
c9d5a90
64475c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,26 @@ | ||
import * as functions from "firebase-functions" | ||
import { AssemblyAI } from "assemblyai" | ||
import { db } from "../firebase" | ||
import { db, Timestamp } from "../firebase" | ||
import { sha256 } from "js-sha256" | ||
|
||
const assembly = new AssemblyAI({ | ||
apiKey: process.env.ASSEMBLY_API_KEY ? process.env.ASSEMBLY_API_KEY : "" | ||
}) | ||
|
||
export const transcription = functions.https.onRequest(async (req, res) => { | ||
if ( | ||
req.headers["X-Maple-Webhook"] && | ||
req.headers["webhook_auth_header_value"] | ||
) { | ||
if (req.headers["x-maple-webhook"]) { | ||
if (req.body.status === "completed") { | ||
const transcript = await assembly.transcripts.get(req.body.transcript_id) | ||
if (transcript && transcript.webhook_auth) { | ||
const maybeEventInDb = await db | ||
.collection("events") | ||
.where("videoAssemblyId", "==", transcript.id) | ||
.get() | ||
|
||
if (maybeEventInDb.docs.length) { | ||
const authenticatedEventsInDb = maybeEventInDb.docs.filter( | ||
async e => { | ||
const hashedToken = sha256( | ||
String(req.headers["webhook_auth_header_value"]) | ||
) | ||
const hashedToken = sha256(String(req.headers["x-maple-webhook"])) | ||
|
||
const tokenInDb = await db | ||
.collection("events") | ||
|
@@ -33,25 +29,66 @@ export const transcription = functions.https.onRequest(async (req, res) => { | |
.doc("webhookAuth") | ||
.get() | ||
const tokenInDbData = tokenInDb.data() | ||
|
||
if (tokenInDbData) { | ||
return hashedToken === tokenInDbData.videoAssemblyWebhookToken | ||
} | ||
return false | ||
} | ||
) | ||
|
||
const { id, text, audio_url, utterances, words } = transcript | ||
if (authenticatedEventsInDb) { | ||
try { | ||
await db | ||
const transcriptionInDb = await db | ||
.collection("transcriptions") | ||
.doc(transcript.id) | ||
.set({ _timestamp: new Date(), ...transcript }) | ||
.doc(id) | ||
|
||
await transcriptionInDb.set({ | ||
id, | ||
text, | ||
createdAt: Timestamp.now(), | ||
audio_url | ||
}) | ||
|
||
if (utterances) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given the expected scale of utterances (i.e. several dozen per document), I believe this is fine. Firestore does generally caution against using sequential ids because it can lead to hotspotting - but given that we'll be disabling indexes for both If this does end up causing an issue for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Come to think of it, this would have a potential problem if we ever re-transcribe the same document (e.g. with different settings) - the utterance divisions and We don't have any plans for that right now - I just want to note a caveat that we need to delete any existing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm just going to go ahead and preemptively change this to a start. If I just remove There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can test in the firebase-admin repl to be sure, but I think you might also need the more explicit method for a subcollection autogenerated id: e.g. db.collection("transcriptions").doc(`${transcript.id}`).collection("utterances").doc() There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes sense. Reading this line of my last commit back and realizing it wouldn't work. |
||
const writer = db.bulkWriter() | ||
for (let utterance of utterances) { | ||
const { speaker, confidence, start, end, text } = utterance | ||
writer.set( | ||
db.doc( | ||
`/transcriptions/${transcript.id}/utterances/${utterance.start}` | ||
), | ||
{ speaker, confidence, start, end, text } | ||
) | ||
} | ||
|
||
authenticatedEventsInDb.forEach(async d => { | ||
await d.ref.update({ | ||
["webhook_auth_header_value"]: null | ||
}) | ||
await writer.close() | ||
} | ||
|
||
if (words) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given there are tens of thousands of words per document and we're using a sequential doc id, I'm worried this could potentially be problematically slow for a webhook. e.g.:
Switching to auto-generated doc ids might help to avoid the sequential index issue (there's no documented writes/second cap on those), and we could also partition multiple words into one doc to cut down on doc count (e.g. 1000 words per doc as an in-between for 1 word per doc and all words in one doc). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That said, this is getting more complicated than it's worth for a "nice-to-have" feature. How does the following sound for a rollout plan?
At this point, we'll either have confirmation that we're fine with just
This approach should let us unblock this PR and get a better sense for how well our current approach plays with real data. Thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds very sensible to me, I was starting to think the same thing. Pulling words out in the next commit. |
||
const writer = db.bulkWriter() | ||
for (let word of words) { | ||
writer.set( | ||
db.doc( | ||
`/transcriptions/${transcript.id}/words/${word.start}` | ||
), | ||
word | ||
) | ||
} | ||
|
||
await writer.close() | ||
} | ||
|
||
const batch = db.batch() | ||
batch.set(db.collection("transcriptions").doc(transcript.id), { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that we've already saved the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RE the async filter, I haven't tested this with a mismatching webhook, but I see your point. Seems like firestore doesn't have a findunique api. Any suggestions for this? I'm worried about working off of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RE the timestamp, I think we can actually just delete that batch. I don't think we need a created at and a timestamp. My bad for leaving it in. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I wouldn't expect the Firestore API itself to help us with the async filter issue. It looks like the easiest way to do that is to first Then you can do something like: async function getValidEvent(event: Event): Promise<Event|null> {
// whatever async/await logic
}
const authenticatedEvents = (await Promise.all(events.map(getValidEvent))).filter(e => e) I also wouldn't turn down something more iterative like: const authenticatedEvents = []
docs.forEach(async doc => {
const otherDoc = await getOtherDoc(whatever)
if (isValid(doc, otherDoc)) {
authenticatedEvents.push(doc)
}
}) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the iterative approach is more readable. I'll do that. |
||
_timestamp: Timestamp.now(), | ||
...transcript | ||
}) | ||
authenticatedEventsInDb.forEach(doc => { | ||
batch.update(doc.ref, { ["x-maple-webhook"]: null }) | ||
}) | ||
console.log("transcript saved in db") | ||
await batch.commit() | ||
} catch (error) { | ||
console.log(error) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double-checking here - does this successfully filter out requests where the tokens don't match?
Now that I think about it, I would expect the
async
filter function to return a Promise that resolves to a boolean instead of a boolean - and since the Promise itself would always be truthy, this check would always return true and never filter anythingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so based on
return hashedToken === tokenInDbData.videoAssemblyWebhookToken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as we've manually tested with a mismatched token and ensured the request was rejected, I'll drop this - but I still believe the async filter may be an issue - sample code below, see also https://stackoverflow.com/a/71600833: