Working end-to-end transcription integrations. #1774
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -158,20 +158,6 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> { | |
const hearing = Hearing.check(eventData) | ||
const shouldScrape = withinCutoff(hearing.startsAt.toDate()) | ||
|
||
let payload: Hearing = { | ||
id: `hearing-${EventId}`, | ||
type: "hearing", | ||
content, | ||
...this.timestamps(content) | ||
} | ||
if (hearing) { | ||
payload = { | ||
...payload, | ||
videoURL: hearing.videoURL, | ||
videoFetchedAt: hearing.videoFetchedAt, | ||
videoAssemblyId: hearing.videoAssemblyId | ||
} | ||
} | ||
let maybeVideoURL = null | ||
let transcript = null | ||
|
||
|
@@ -191,25 +177,32 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> { | |
maybeVideoURL = firstVideoSource.src | ||
|
||
transcript = await assembly.transcripts.submit({ | ||
Review comment: Now that we've worked out the kinks here, could we refactor? I feel like there are at least two functions here that could be broken out to make this more readable:
Which would leave the new code as something like:
if (shouldScrapeVideo(EventId)) {
  const maybeVideoUrl = getHearingVideoUrl(EventId)
  if (maybeVideoUrl) {
    const transcriptId = await submitTranscription(maybeVideoUrl)
    // add video/transcription data to Event
  }
}
Reply: Sure thing, I can take a pass at this.
||
audio: | ||
// test with: "https://assemblyaiusercontent.com/playground/aKUqpEtmYmI.flac", | ||
firstVideoSource.src, | ||
webhook_url: | ||
// test with: "https://ngrokid.ngrok-free.app/demo-dtp/us-central1/transcription", | ||
process.env.NODE_ENV === "development" | ||
? "https://us-central1-digital-testimony-dev.cloudfunctions.net/transcription" | ||
: "https://us-central1-digital-testimony-prod.cloudfunctions.net/transcription", | ||
webhook_auth_header_name: "X-Maple-Webhook", | ||
webhook_auth_header_value: newToken, | ||
audio: firstVideoSource.src, | ||
auto_highlights: true, | ||
custom_topics: true, | ||
entity_detection: true, | ||
iab_categories: false, | ||
format_text: true, | ||
punctuate: true, | ||
speaker_labels: true, | ||
summarization: true, | ||
summary_model: "informative", | ||
summary_type: "bullets" | ||
webhook_auth_header_name: "x-maple-webhook", | ||
webhook_auth_header_value: newToken | ||
}) | ||
|
||
await db | ||
Review comment: Ideally, I think we shouldn't actually save the hearing event to Firestore in
Reply: What would you like me to do here differently?
Review comment: When we scrape video, we can simply conditionally add the relevant video fields to the return object of
As it stands, I think this would save the event twice - first in
Reply: OK, I think I understand now. Let me know if I got it right in my next commit.
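A minimal sketch of the "conditionally add the video fields to the return object" idea from the thread above. The `VideoFields` and `HearingSketch` types and the `buildHearing` helper are hypothetical names for illustration, and `Date` stands in for the Firestore `Timestamp` used in the diff:

```typescript
// Hypothetical, simplified shapes. The real Hearing type lives in the repo
// and stores Firestore Timestamps rather than Date.
interface VideoFields {
  videoURL: string
  videoFetchedAt: Date
  videoAssemblyId: string
}

interface HearingSketch extends Partial<VideoFields> {
  id: string
  type: "hearing"
  content: unknown
}

// Build the object the scraper returns. The video fields are spread in
// only when a video was actually scraped on this run, so the caller can
// save the event once instead of writing it from two places.
function buildHearing(
  eventId: number,
  content: unknown,
  video?: VideoFields
): HearingSketch {
  return {
    id: `hearing-${eventId}`,
    type: "hearing",
    content,
    ...(video ?? {})
  }
}
```

With this shape, the scraper itself never calls `.set(...)` on the event document. It only returns the payload and leaves the single write to whatever code already persists scraped events.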
||
.collection("events") | ||
.doc(`hearing-${String(EventId)}`) | ||
.set({ | ||
id: `hearing-${EventId}`, | ||
type: "hearing", | ||
content, | ||
...this.timestamps(content), | ||
videoURL: maybeVideoURL, | ||
videoFetchedAt: Timestamp.now(), | ||
videoAssemblyId: transcript.id | ||
}) | ||
|
||
await db | ||
.collection("events") | ||
.doc(`hearing-${String(EventId)}`) | ||
|
@@ -218,20 +211,17 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> { | |
.set({ | ||
videoAssemblyWebhookToken: sha256(newToken) | ||
}) | ||
|
||
payload = { | ||
...payload, | ||
videoURL: maybeVideoURL, | ||
videoFetchedAt: Timestamp.now(), | ||
videoAssemblyId: transcript.id | ||
} | ||
} | ||
} | ||
} | ||
} | ||
|
||
const event: Hearing = payload | ||
return event | ||
return { | ||
id: `hearing-${EventId}`, | ||
type: "hearing", | ||
content, | ||
...this.timestamps(content) | ||
} as Hearing | ||
} | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,23 +8,23 @@ const assembly = new AssemblyAI({ | |
}) | ||
|
||
export const transcription = functions.https.onRequest(async (req, res) => { | ||
if ( | ||
req.headers["X-Maple-Webhook"] && | ||
req.headers["webhook_auth_header_value"] | ||
) { | ||
console.log("req.headers", req.headers) | ||
if (req.headers["x-maple-webhook"]) { | ||
console.log("req.body.status", req.body.status) | ||
|
||
if (req.body.status === "completed") { | ||
const transcript = await assembly.transcripts.get(req.body.transcript_id) | ||
console.log("transcript.webhook_auth", transcript.webhook_auth) | ||
if (transcript && transcript.webhook_auth) { | ||
const maybeEventInDb = await db | ||
.collection("events") | ||
.where("videoAssemblyId", "==", transcript.id) | ||
.get() | ||
console.log("maybeEventInDb.docs.length", maybeEventInDb.docs.length) | ||
if (maybeEventInDb.docs.length) { | ||
const authenticatedEventsInDb = maybeEventInDb.docs.filter( | ||
Review comment: Just double-checking here - does this successfully filter out requests where the tokens don't match? Now that I think about it, I would expect the
Reply: I believe so based on
Review comment: As long as we've manually tested with a mismatched token and ensured the request was rejected, I'll drop this - but I still believe the async filter may be an issue - sample code below, see also https://stackoverflow.com/a/71600833:
const test = [1, 2, 3, 4]
// sync filter - result is [2, 4]
test.filter(e => e % 2 == 0)
// async filter - result is [1, 2, 3, 4]
test.filter(async e => e % 2 == 0)
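One possible way around the async-filter problem, sketched under the assumption that the predicate must stay asynchronous (for example because it reads a token from Firestore). `filterAsync` and `isEven` are hypothetical names, not helpers from this PR:

```typescript
// Resolve every async predicate first, then filter synchronously on the
// resulting booleans.
async function filterAsync<T>(
  items: readonly T[],
  predicate: (item: T) => Promise<boolean>
): Promise<T[]> {
  const keep = await Promise.all(items.map(item => predicate(item)))
  return items.filter((_, i) => keep[i])
}

// Stand-in for an async check such as a token lookup.
const isEven = async (n: number): Promise<boolean> => n % 2 === 0

// A Promise is always truthy, so a plain filter keeps every element.
const naive = [1, 2, 3, 4].filter(n => isEven(n))
```

In the webhook handler this would mean awaiting something like `filterAsync(maybeEventInDb.docs, ...)` and then checking that the result is non-empty, since an empty array is still truthy.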
||
async e => { | ||
const hashedToken = sha256( | ||
String(req.headers["webhook_auth_header_value"]) | ||
) | ||
const hashedToken = sha256(String(req.headers["x-maple-webhook"])) | ||
|
||
const tokenInDb = await db | ||
.collection("events") | ||
|
@@ -33,12 +33,16 @@ export const transcription = functions.https.onRequest(async (req, res) => { | |
.doc("webhookAuth") | ||
.get() | ||
const tokenInDbData = tokenInDb.data() | ||
console.log("tokenInDbData", tokenInDbData) | ||
|
||
if (tokenInDbData) { | ||
return hashedToken === tokenInDbData.videoAssemblyWebhookToken | ||
} | ||
return false | ||
} | ||
) | ||
console.log("authenticatedEventsInDb", authenticatedEventsInDb) | ||
|
||
if (authenticatedEventsInDb) { | ||
try { | ||
await db | ||
Review comment: We should start thinking about what parts of the response we actually want to save here - there are a lot of fields in the example response you posted in Slack, but the only ones that look potentially relevant are:
I'm most interested in IMO
Want to get @mvictor55's input on the desired functionality here before dropping the axe though (and @mertbagt's take on what we actually need for the front-end). If we do cut
Reply: Yes, would love to know what you think, @mvictor55 and @mertbagt. I can proceed accordingly based on what you three align on.
||
|
@@ -48,7 +52,7 @@ export const transcription = functions.https.onRequest(async (req, res) => { | |
|
||
authenticatedEventsInDb.forEach(async d => { | ||
Review comment: Just double-checking - is it possible that the Firebase function will exit before the async function in the
If so, it may be worth switching to a batched write for the writes here - something like:
const batch = db.batch()
batch.set(
  db.collection("transcriptions").doc(transcript.id),
  { _timestamp: new Date(), ...transcript }
)
authenticatedEventsInDb.forEach(doc => {
  batch.update(doc.ref, {["x-maple-webhook"]: null})
})
await batch.commit()
Reply: Smart, will do.
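To illustrate the concern raised above, here is a small self-contained sketch showing that `forEach` does not wait for an async callback, while `Promise.all` does. `FakeRef` and `makeRef` are hypothetical stand-ins for Firestore document references. The batch suggested in the comment remains the better fit when the writes should succeed or fail together; this only demonstrates the awaiting problem:

```typescript
// Hypothetical stand-in for a Firestore document reference.
interface FakeRef {
  update(data: Record<string, unknown>): Promise<void>
}

// Each fake update yields once (simulating async I/O) and then records
// its name in the shared log.
function makeRef(log: string[], name: string): FakeRef {
  return {
    update: async () => {
      await Promise.resolve()
      log.push(name)
    }
  }
}

// forEach ignores the promises returned by an async callback, so this
// function returns before any update has finished.
function clearWithForEach(refs: FakeRef[]): void {
  refs.forEach(async ref => {
    await ref.update({ "x-maple-webhook": null })
  })
}

// Promise.all resolves only after every update has completed.
async function clearWithPromiseAll(refs: FakeRef[]): Promise<void> {
  await Promise.all(refs.map(ref => ref.update({ "x-maple-webhook": null })))
}
```

In a Cloud Function, work still pending when the handler returns is not guaranteed to finish, which is why the awaited form (or the awaited `batch.commit()`) is safer.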
||
await d.ref.update({ | ||
["webhook_auth_header_value"]: null | ||
["x-maple-webhook"]: null | ||
}) | ||
}) | ||
console.log("transcript saved in db") | ||
|
Review comment: Double-checking here - how does this handle the first time a hearing is scraped? If there is no hearing in the db, eventData would be undefined - does Hearing.check blow up (and prevent us from returning the non-transcription related event data)?
Reply: I'm not sure. I think I was drawing inspiration from this prior line:
maple/functions/src/events/scrapeEvents.ts
Line 142 in a18436a
How would you want to see this done differently? I can also try some things out if nothing comes to mind.
Review comment: I think the chief value of that HearingContent.check there is about guarding against changes to the external Mass Legislature API (i.e. if the data they return deviates too much from what we expect, don't risk polluting our database - instead, throw an error to force a dev to investigate).
For our purposes here: If eventInDb.exists() is false (or equivalently AFAIK, if eventData is undefined), it should mean that we are scraping the present hearing for the first time. This means that we have definitely not scraped the hearing's video yet and it's valid to try scraping videos for it. The simplest, most robust solution is to just add this case to the logic of when to scrape video:
Currently, we scrape video only if:
videoUrl set
We should instead scrape video only if:
videoUrl set)
Reply: Updating this logic. Leaving the .check in there since that was there prior, and I think it is still needed for the first run on a hearing. Do you agree?
Reply: Wait, never mind, I conflated two different .checks. Will leave the first and remove the second.
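A rough sketch of the first-scrape case discussed in this thread. It assumes the remaining conditions are "no videoURL stored yet" and "hearing is within the cutoff". `StoredHearing`, `CUTOFF_DAYS`, and this version of `withinCutoff` are hypothetical and may not match the helpers in the repo:

```typescript
// Hypothetical, simplified view of what is stored for a hearing.
interface StoredHearing {
  startsAt: Date
  videoURL?: string | null
}

// Assumed cutoff window. The real withinCutoff helper may use another value.
const CUTOFF_DAYS = 8

function withinCutoff(date: Date, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - date.getTime()
  return ageMs >= 0 && ageMs <= CUTOFF_DAYS * 24 * 60 * 60 * 1000
}

function shouldScrapeVideo(
  stored: StoredHearing | undefined,
  now: Date = new Date()
): boolean {
  // First scrape: nothing is stored yet, so no video can have been fetched.
  if (stored === undefined) return true
  // Otherwise scrape only when no video is stored and the hearing is recent.
  return !stored.videoURL && withinCutoff(stored.startsAt, now)
}
```

Handling `undefined` before any `.check` call means a missing document never reaches the validator, so the non-transcription event data can still be returned on the first run.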