Skip to content
60 changes: 25 additions & 35 deletions functions/src/events/scrapeEvents.ts
Original file line number Diff line number Diff line change
Expand Up @@ -158,20 +158,6 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
const hearing = Hearing.check(eventData)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double-checking here - how does this handle the first time a hearing is scraped? If there is no hearing in the db, eventData would be undefined - does Hearing.check blow up (and prevent us from returning the non-transcription related event data)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I think I was drawing inspiration from this prior line:

const content = HearingContent.check(await api.getHearing(EventId))

How would you want to see this done differently?

I cal also try some things out if nothing comes to mind.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the chief value of that HearingContent.check there is about guarding against changes to the external Mass Legislature API (i.e. if the data they return deviates too much from what we expect, don't risk polluting our database - instead, throw an error to force a dev to investigate).

For our purposes here:

If eventInDb.exists() is false (or equivalently AFAIK, if eventData is undefined), it should mean that we are scraping the present hearing for the first time. This means that we have definitely not scraped the hearing's video yet and it's valid to try scraping videos for it. The simplest, most robust solution is to just add this case to the logic of when to scrape video:

Currently, we scrape video only if:

  • there is a pre-existing hearing with this EventId in our DB that does not have a videoUrl set
  • AND
  • the hearing in question is within our scraping time window

We should instead scrape video only if:

  • EITHER (we have not scraped this hearing before) OR (there is a pre-existing hearing in our DB with this EventId that does not have a videoUrl set)
  • AND
  • the hearing in question is within our scraping time window

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating this logic. Leaving the .check in there since that was there prior, and I think is still needed for the first run on a hearing. Do you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait nevermind, I conflated two different .checks. Will leave the first and remove the second.

const shouldScrape = withinCutoff(hearing.startsAt.toDate())

let payload: Hearing = {
id: `hearing-${EventId}`,
type: "hearing",
content,
...this.timestamps(content)
}
if (hearing) {
payload = {
...payload,
videoURL: hearing.videoURL,
videoFetchedAt: hearing.videoFetchedAt,
videoAssemblyId: hearing.videoAssemblyId
}
}
let maybeVideoURL = null
let transcript = null

Expand All @@ -191,25 +177,32 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
maybeVideoURL = firstVideoSource.src

transcript = await assembly.transcripts.submit({
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we've worked out the kinks here, could we refactor getEvent a bit?

I feel like there's at least two functions here that could be broken out to make this more readable :

  • shouldScrapeVideo (handles fetching any previous event + checking for existingVideo + the date cutoff logic, returns a boolean yes/no)
  • getHearingVideoUrl(does the JSDOM wrangling, returns a string | undefined with the videoUrl`)
  • And maybe submitTranscription (takes a videoUrl, sends the AssemblyAI request + saves the webhookAuth in the DB, returns transcriptId)

Which would leave the new code as something like:

if (shouldScrapeVideo(EventId) {
  const maybeVideoUrl = getHearingVideoUrl(EventId)
  if (maybeVideoUrl) {
      const transcriptId = await submitTranscription(maybeVideoUrl)
      // add video/transcription data to Event
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I can take a pass at this.

audio:
// test with: "https://assemblyaiusercontent.com/playground/aKUqpEtmYmI.flac",
firstVideoSource.src,
webhook_url:
// test with: "https://ngrokid.ngrok-free.app/demo-dtp/us-central1/transcription",
process.env.NODE_ENV === "development"
? "https://us-central1-digital-testimony-dev.cloudfunctions.net/transcription"
: "https://us-central1-digital-testimony-prod.cloudfunctions.net/transcription",
webhook_auth_header_name: "X-Maple-Webhook",
webhook_auth_header_value: newToken,
audio: firstVideoSource.src,
auto_highlights: true,
custom_topics: true,
entity_detection: true,
iab_categories: false,
format_text: true,
punctuate: true,
speaker_labels: true,
summarization: true,
summary_model: "informative",
summary_type: "bullets"
webhook_auth_header_name: "x-maple-webhook",
webhook_auth_header_value: newToken
})

await db
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, I think we shouldn't actually save the hearing event to Firestore in getEvent - the abstract class already handles that for us. The webhookAuth is in a subcollection (which wouldn't be caught by the abstract class) and is the exception to that rule - this should be fine since Firestore lets you write to subcollections of documents that don't exist yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you like me to do here differently?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we scrape video, we can simply conditionally add the relevant video fields to the return object of getEvent (those fields being videoURL, videoFetchedAt, and videoAssemblyId). So long as they're present on the returned object, the code in the abstract class will handle the work of saving the events to the DB. See the line writer.set(db.doc(/events/${event.id}), event, { merge: true }) in run() at the top of this file.

As it stands, I think this would save the event twice - first in getEvent() and then again in run().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think I understand now, let me know if I got it right in my next commit

.collection("events")
.doc(`hearing-${String(EventId)}`)
.set({
id: `hearing-${EventId}`,
type: "hearing",
content,
...this.timestamps(content),
videoURL: maybeVideoURL,
videoFetchedAt: Timestamp.now(),
videoAssemblyId: transcript.id
})

await db
.collection("events")
.doc(`hearing-${String(EventId)}`)
Expand All @@ -218,20 +211,17 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
.set({
videoAssemblyWebhookToken: sha256(newToken)
})

payload = {
...payload,
videoURL: maybeVideoURL,
videoFetchedAt: Timestamp.now(),
videoAssemblyId: transcript.id
}
}
}
}
}

const event: Hearing = payload
return event
return {
id: `hearing-${EventId}`,
type: "hearing",
content,
...this.timestamps(content)
} as Hearing
}
}

Expand Down
20 changes: 12 additions & 8 deletions functions/src/webhooks/transcription.ts
Original file line number Diff line number Diff line change
Expand Up @@ -8,23 +8,23 @@ const assembly = new AssemblyAI({
})

export const transcription = functions.https.onRequest(async (req, res) => {
if (
req.headers["X-Maple-Webhook"] &&
req.headers["webhook_auth_header_value"]
) {
console.log("req.headers", req.headers)
if (req.headers["x-maple-webhook"]) {
console.log("req.body.status", req.body.status)

if (req.body.status === "completed") {
const transcript = await assembly.transcripts.get(req.body.transcript_id)
console.log("transcript.webhook_auth", transcript.webhook_auth)
if (transcript && transcript.webhook_auth) {
const maybeEventInDb = await db
.collection("events")
.where("videoAssemblyId", "==", transcript.id)
.get()
console.log("maybeEventInDb.docs.length", maybeEventInDb.docs.length)
if (maybeEventInDb.docs.length) {
const authenticatedEventsInDb = maybeEventInDb.docs.filter(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checking here - does this successfully filter out requests where the tokens don't match?

Now that I think about it, I would expect the async filter function to return a Promise that resolves to a boolean instead of a boolean - and since the Promise itself would always be truthy, this check would always return true and never filter anything

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so based on return hashedToken === tokenInDbData.videoAssemblyWebhookToken

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we've manually tested with a mismatched token and ensured the request was rejected, I'll drop this - but I still believe the async filter may be an issue - sample code below, see also https://stackoverflow.com/a/71600833:

const test = [1, 2, 3, 4]

// sync filter - result is [2, 4]
test.filter(e => e % 2 == 0)

// async filter = result is [1, 2, 3, 4]
test.filter(async e => e % 2 == 0)

async e => {
const hashedToken = sha256(
String(req.headers["webhook_auth_header_value"])
)
const hashedToken = sha256(String(req.headers["x-maple-webhook"]))

const tokenInDb = await db
.collection("events")
Expand All @@ -33,12 +33,16 @@ export const transcription = functions.https.onRequest(async (req, res) => {
.doc("webhookAuth")
.get()
const tokenInDbData = tokenInDb.data()
console.log("tokenInDbData", tokenInDbData)

if (tokenInDbData) {
return hashedToken === tokenInDbData.videoAssemblyWebhookToken
}
return false
}
)
console.log("authenticatedEventsInDb", authenticatedEventsInDb)

if (authenticatedEventsInDb) {
try {
await db
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should start thinking about what parts of the response we actually want to save here - There are a lot of fields in the example response you posted in Slack, but the only ones that looks potentially relevant are:

  • id
  • text
  • utterances
  • words
  • audio_url (though we already have it on the hearing event, it would still be helpful here)
  • audio_duration (not sure what we'd need it for though, but it's conceivably useful)
  • confidence

I'm most interested in text/utterances/words - since those will take up >99% of the data size (and words likely takes up >90%). We should chat with Matt V and design about the desired functionality, but I don't think we need all three. I goofed around with some of the example data: it looks like utterances gives us mostly sensible, mostly accurate divisions of the content, and text is good to have as a fallback.

IMO words is likely unneccessary - I believe it would only be useful if we either:

  • Find ourselves dissatisfied with the breakpoints/speaker split inherent to utterances and want to devise our own
  • Want to visually identify specific words that Assembly flagged as low confidence

Want to get @mvictor55 's input on the desired functionality here before dropping the axe though (and @mertbagt 's take on what we actually need for the front-end). If we do cut words, we should also probably cut the words array in utterances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, would love to know what you think @mvictor55 and @mertbagt. I can proceed accordingly based on what you three align on.

Expand All @@ -48,7 +52,7 @@ export const transcription = functions.https.onRequest(async (req, res) => {

authenticatedEventsInDb.forEach(async d => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checking - is it possible that the firebase function will exit before the async function in the forEach completes?

If so, it may be worth switching to a transaction for the writes here - something like

const batch = db.batch()

batch.set(
  db.collection("transcriptions").doc(transcript.id),
  { _timestamp: new Date(), ...transcript }
)

authenticatedEventsInDb.forEach(doc => {
  batch.update(doc.ref, {["x-maple-webhook"]: null})
})

await batch.commit()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart, will do.

await d.ref.update({
["webhook_auth_header_value"]: null
["x-maple-webhook"]: null
})
})
console.log("transcript saved in db")
Expand Down