-
-
Notifications
You must be signed in to change notification settings - Fork 141
Working end-to-end transcription integrations. #1774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
17fdca1
ca0f2bb
f16f46e
e054c49
5c4a21b
f1f947c
92fccb3
24474c6
c9d5a90
64475c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,23 +8,19 @@ const assembly = new AssemblyAI({ | |
}) | ||
|
||
export const transcription = functions.https.onRequest(async (req, res) => { | ||
if ( | ||
req.headers["X-Maple-Webhook"] && | ||
req.headers["webhook_auth_header_value"] | ||
) { | ||
if (req.headers["x-maple-webhook"]) { | ||
if (req.body.status === "completed") { | ||
const transcript = await assembly.transcripts.get(req.body.transcript_id) | ||
if (transcript && transcript.webhook_auth) { | ||
const maybeEventInDb = await db | ||
.collection("events") | ||
.where("videoAssemblyId", "==", transcript.id) | ||
.get() | ||
|
||
if (maybeEventInDb.docs.length) { | ||
const authenticatedEventsInDb = maybeEventInDb.docs.filter( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just double-checking here - does this successfully filter out requests where the tokens don't match? Now that I think about it, I would expect the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe so based on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As long as we've manually tested with a mismatched token and ensured the request was rejected, I'll drop this - but I still believe the async filter may be an issue - sample code below, see also https://stackoverflow.com/a/71600833: const test = [1, 2, 3, 4]
// sync filter - result is [2, 4]
test.filter(e => e % 2 == 0)
// async filter = result is [1, 2, 3, 4]
test.filter(async e => e % 2 == 0) |
||
async e => { | ||
const hashedToken = sha256( | ||
String(req.headers["webhook_auth_header_value"]) | ||
) | ||
const hashedToken = sha256(String(req.headers["x-maple-webhook"])) | ||
|
||
const tokenInDb = await db | ||
.collection("events") | ||
|
@@ -33,24 +29,62 @@ export const transcription = functions.https.onRequest(async (req, res) => { | |
.doc("webhookAuth") | ||
.get() | ||
const tokenInDbData = tokenInDb.data() | ||
console.log("tokenInDbData", tokenInDbData) | ||
|
||
if (tokenInDbData) { | ||
return hashedToken === tokenInDbData.videoAssemblyWebhookToken | ||
} | ||
return false | ||
} | ||
) | ||
|
||
const { id, text, audio_url, utterances, words } = transcript | ||
if (authenticatedEventsInDb) { | ||
try { | ||
await db | ||
const transcriptionInDb = db | ||
.collection("transcriptions") | ||
.doc(transcript.id) | ||
.set({ _timestamp: new Date(), ...transcript }) | ||
|
||
authenticatedEventsInDb.forEach(async d => { | ||
await d.ref.update({ | ||
["webhook_auth_header_value"]: null | ||
transcriptionInDb.set({ | ||
id, | ||
text, | ||
timestamp: new Date(), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, come to think of it, should this be a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure thing, updating in next commit. |
||
audio_url, | ||
words | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As per our discussion the other day, we don't want There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oy of course. Fixing in the next commit. |
||
}) | ||
|
||
transcriptionInDb | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think these I don't have a strong opinion on which - firestore shouldn't need the root document to exist to create the subcollections, and worst case if one of the writes fails, we'll need to re-run the webhook anyway and that will overwrite any partially saved values. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. adding awaits in next commit |
||
.collection("timestamps") | ||
.doc("utterances") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm... Based on our discussion the other day, I was expecting two separate sub-collections ( I am onboard with storing all of the I do have two concerns here, especially re:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Happy to make a doc for each word and utterance. Is there a more clever/efficient/firestorie way that you would suggest other than iterating over the arrays I already have and making a doc for each object in that loop? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not really AFAIK - one caveat would be that we should probably use the BulkWriter (since some of these videos will have dozens of utterances and tens of thousands of words). I don't think we should need sequential ids on these subcollection docs since the If making There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I switched to one doc per utterance and word. I'm not sure if there is a more efficient way to do to all the db transactions together, so I left everything else as is. |
||
.set({ | ||
utterances: utterances?.map( | ||
({ speaker, confidence, start, end, text }) => ({ | ||
speaker, | ||
confidence, | ||
start, | ||
end, | ||
text | ||
}) | ||
) | ||
}) | ||
|
||
transcriptionInDb.collection("timestamps").doc("words").set({ | ||
words | ||
}) | ||
|
||
const batch = db.batch() | ||
|
||
batch.set(db.collection("transcriptions").doc(transcript.id), { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that we've already saved the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RE the async filter, I haven't tested this with a mismatching webhook, but I see your point. Seems like firestore doesn't have a findunique api. Any suggestions for this? I'm worried about working off of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RE the timestamp, I think we can actually just delete that batch. I don't think we need a created at and a timestamp. My bad for leaving it in. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I wouldn't expect the Firestore API itself to help us with the async filter issue. It looks like the easiest way to do that is to first Then you can do something like: async function getValidEvent(event: Event): Promise<Event|null> {
// whatever async/await logic
}
const authenticatedEvents = (await Promise.all(events.map(getValidEvent))).filter(e => e) I also wouldn't turn down something more iterative like: const authenticatedEvents = []
docs.forEach(async doc => {
const otherDoc = await getOtherDoc(whatever)
if (isValid(doc, otherDoc)) {
authenticatedEvents.push(doc)
}
}) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the iterative approach is more readable. I'll do that. |
||
_timestamp: new Date(), | ||
...transcript | ||
}) | ||
|
||
authenticatedEventsInDb.forEach(doc => { | ||
batch.update(doc.ref, { ["x-maple-webhook"]: null }) | ||
}) | ||
|
||
await batch.commit() | ||
|
||
console.log("transcript saved in db") | ||
} catch (error) { | ||
console.log(error) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change from
Hearing.check(eventData).startsAt.toDate()
? I don't thinkeventData
has aStartTime
field.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I misunderstand an earlier review comment. I'll change this in the next commit.