
[extensions][py][hf] 2/n ASR model parser impl#780

Merged
Ankush-lastmile merged 2 commits into main from pr780 on Jan 10, 2024

Conversation

@Ankush-lastmile (Contributor) commented Jan 5, 2024

[extensions][py][hf] 2/n ASR model parser impl

Model Parser for the Automatic Speech Recognition task on huggingface.

Decisions made while implementing:

  • Manual implementation to parse input attachments.
    • Threw exceptions on every unexpected step; not sure if this is the direction we want to go with this.
  • This diff does not implement serialize() for the model parser (will be implemented in a diff on top).

Testplan

Created an MP3 file that says "hi" and used aiconfig to run ASR on it.

(Two screenshots: "Screenshot 2024-01-09 at 7 14 47 PM" and "Screenshot 2024-01-09 at 5 54 33 PM".)

Stack created with Sapling. Best reviewed with ReviewStack.

@Ankush-lastmile changed the title from "2/n model parser impl" to "[extensions][py][hf] 2/n ASR model parser impl" on Jan 5, 2024
@Ankush-lastmile force-pushed the pr780 branch 3 times, most recently from b156c92 to aa6bf10, on January 9, 2024 17:19
@Ankush-lastmile marked this pull request as ready for review on January 9, 2024 17:28
Comment on lines +9 to +11
```python
if TYPE_CHECKING:
    from aiconfig import AIConfigRuntime
```
Contributor:
Totally separate discussion: do we have to do this?

Contributor Author (@Ankush-lastmile):

From before, this was needed. Keeping it for now since we don't have automated tests running.
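For context, the `TYPE_CHECKING` guard discussed above is the standard way to keep a type annotation without paying for (or breaking on) the import at runtime, e.g. when it would be circular. A minimal sketch of the pattern; the `run` function and its body are illustrative, not the parser's actual code:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers (mypy, pyright), never at runtime,
    # so a circular or heavyweight import is avoided.
    from aiconfig import AIConfigRuntime


def run(config: "AIConfigRuntime") -> str:
    # The annotation is a string, so AIConfigRuntime need not exist at runtime.
    return type(config).__name__
```

At runtime any object can be passed, since the annotation is never resolved unless a type checker is looking.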

@Ankush-lastmile force-pushed the pr780 branch 2 times, most recently from 27ec5ea to d14669c, on January 9, 2024 21:24
@Ankush-lastmile (Contributor Author) commented Jan 9, 2024

Added a couple of changes:

  1. Defined AttachmentInputDataWithStringValue. This will need to be incorporated into the AIConfig SDK, but for now we will define it inside this model parser and enforce it here. TODO: Task Support Multi Modal Input data types #829
     • The AIConfig for this will look slightly different; see the updated testplan.
  2. Updated the model parser to allow customizing pipeline creation. Note: for now this customizability is only possible the first time the pipeline is created in memory.
  3. Added refine completion params handling to the model parser.
  4. Added a check for device in config settings.
  5. Handled multiple outputs, as well as outputs with metadata.
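To make item 1 concrete, here is a hedged sketch of the shape AttachmentInputDataWithStringValue might take. The thread later notes the real type should be a pydantic base class; this sketch uses a stdlib dataclass purely to illustrate the `kind`/`value` structure, and the set of valid kinds is an assumption:

```python
from dataclasses import dataclass

# Assumed set of kinds; the actual SDK type may accept more.
VALID_KINDS = {"file_uri", "base64"}


@dataclass
class AttachmentInputDataWithStringValue:
    kind: str   # how `value` should be interpreted, e.g. "file_uri"
    value: str  # the URI or encoded payload itself

    def __post_init__(self) -> None:
        # Enforce the structure here, since the SDK does not yet define it.
        if self.kind not in VALID_KINDS:
            raise ValueError(f"Unexpected attachment kind: {self.kind!r}")
```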

@Ankush-lastmile (Contributor Author):
rebase, and fix `__init__.py`

```python
        f"Error: {str(e)}. Please specify a kind and value for the attachment data."
    )
    # TODO: once the previous TODO is resolved, change this line to `.kind` instead of `.data.get("kind")`. It will be a pydantic base class.
    if not attachment.data.get("kind") == "file_uri":
```
Member:

We'll need to update the client rendering and prompt schema format to support this. And we'll need to be consistent with this type in the image-to-text parser as well

Member:

Can we just have this be a string and migrate to the object with kind and value afterwards? My concerns here are:

  • client rendering is currently adding attachments as {data: , mime_type}
  • image-to-text parser is currently not doing this {kind/value} stuff and that inconsistency is going to cause a lot more work on the client

Contributor Author (@Ankush-lastmile):

Discussed offline with @rholinshead; reverted the change to the input data type. For ease of use we will have data as just a string and implement structured input later.

@Ankush-lastmile (Contributor Author):

Discussed offline with @rholinshead; reverted the change to the input data type. For ease of use we will have data as just a string and implement structured input later.

updated testplan to show as well

Comment on lines +117 to +121
```python
def _get_device(self) -> str:
    if torch.cuda.is_available():
        return "cuda"
    # The mps backend is not supported for all ASR models. Seen when spinning up
    # a default ASR pipeline, which uses facebook/wav2vec2-base-960h (55bb623)
    return "cpu"
```
@rossdanlm (Contributor) commented Jan 9, 2024:

Good comment!

```python
    return "cpu"

def get_output_text(self, response: dict[str, Any]) -> str:
    raise NotImplementedError("get_output_text is not implemented for HuggingFaceAutomaticSpeechRecognition")
```
Contributor:

Why are we not implementing?

Contributor Author (@Ankush-lastmile):

added implementation
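The merged implementation is not shown in this thread; as a hedged sketch, a typical `get_output_text` in the other parser impls reads the text back out of an `ExecuteResult`'s `data` field. The `ExecuteResult` below is a stand-in dataclass, not the actual aiconfig class:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExecuteResult:
    # Stand-in for aiconfig's ExecuteResult output type.
    output_type: str
    data: Any


def get_output_text(output: ExecuteResult) -> str:
    # Return the transcribed text if this is a normal execution result,
    # otherwise an empty string.
    if output.output_type == "execute_result" and isinstance(output.data, str):
        return output.data
    return ""
```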

Comment on lines +217 to +244
```python
supported_keys = {
    # inputs
    "return_timestamps",
    "generate_kwargs",
    "max_new_tokens",
}
```
Contributor:

There are actually way more, because we're using generate_kwargs, which takes in the generalized text generation params. Pls add this to the task for unifying this in the prompt schema, and link to the task here.
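The supported-keys filtering being reviewed here can be sketched as follows. The helper name `refine_completion_params` and the exact behavior are assumptions based on the summary above, not the parser's actual code:

```python
from typing import Any

# Keys the ASR pipeline call understands (per the snippet above); note
# generate_kwargs itself nests many more text-generation params.
SUPPORTED_KEYS = {"return_timestamps", "generate_kwargs", "max_new_tokens"}


def refine_completion_params(settings: dict[str, Any]) -> dict[str, Any]:
    # Drop unrelated settings (model name, device, etc.) before invoking
    # the pipeline, keeping only recognized completion params.
    return {k: v for k, v in settings.items() if k in SUPPORTED_KEYS}
```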

```python
output = ExecuteResult(
    **{
        "output_type": "execute_result",
        "data": result.get("text"),
```
Contributor:

So the response is always guaranteed to have the text attribute? I think it would be safer to add another line above, e.g. `text_output: str = result.get("text") if isinstance(result, dict) and "text" in result else result`, just to future-proof.

Contributor:

The problem is that without knowing what the result type is, we can't know for sure; better to be safe and ensure it can't break.
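The defensive extraction being suggested can be written as a small helper, sketched here; it accepts either a dict carrying a "text" key or a bare value, so an unexpected pipeline return type can't raise. This is an illustration of the suggestion, not the merged implementation:

```python
from typing import Any


def extract_text(result: Any) -> str:
    # Normal transformers ASR pipeline output: {"text": "..."}.
    if isinstance(result, dict) and "text" in result:
        return str(result["text"])
    # Fall back to stringifying whatever we got, rather than raising.
    return str(result)
```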

```python
device = self._get_device()
if pipeline_creation_data.get("device", None) is None:
    pipeline_creation_data["device"] = device
self.pipelines[model_name] = pipeline(task="automatic-speech-recognition", **pipeline_creation_data)
```
Contributor:

nit: no change, just annoying that they called this automatic-speech-recognition instead of audio-to-text like everything else they did smh
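The caching behavior around this snippet, where pipeline customization only takes effect on first creation, can be sketched in isolation. Names here (`get_pipeline`, `factory`) are illustrative, not the parser's actual API:

```python
from typing import Any, Callable

# Cache of pipelines keyed by model name, mirroring self.pipelines above.
_pipelines: dict[str, Any] = {}


def get_pipeline(model_name: str, factory: Callable[..., Any], **creation_kwargs: Any) -> Any:
    # creation_kwargs only take effect the first time the pipeline is built;
    # later calls return the cached instance and ignore new kwargs.
    if model_name not in _pipelines:
        _pipelines[model_name] = factory(model=model_name, **creation_kwargs)
    return _pipelines[model_name]
```

This matches the note in the change summary that customizability is only possible the first time the pipeline is created in memory.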

```python
"""


class HuggingFaceAutomaticSpeechRecognition(ParameterizedModelParser):
```
Contributor:

nit: add "Transformer" suffix at end to meet same format as the others

Contributor Author (@Ankush-lastmile):

updated

@rossdanlm (Contributor) left a comment:

pls link relevant comments to relevant issues so we can track in the future

@Ankush-lastmile (Contributor Author):

  • Added some comments.
  • Renamed HuggingFaceAutomaticSpeechRecognition to HuggingFaceAutomaticSpeechRecognitionTransformer.
  • Added get_output_text impl (copied from other model parser impls).

Ankush Pala <ankush@lastmileai.dev> added 2 commits on January 9, 2024 19:18
Setting up the asr parser class
Model Parser for the Automatic Speech Recognition task on huggingface.


Decisions made while implementing:
- manual impl to parse input attachments
  - threw exceptions on every unexpected step. Not sure if this is the direction we want to go with this.
- This diff does not implement serialize() for the model parser (will implement in a diff on top)

## Testplan

Created an mp3 file that says "hi". Used aiconfig to run asr on it.

| <img width="922" alt="Screenshot 2024-01-09 at 7 14 47 PM" src="https://github.com/lastmile-ai/aiconfig/assets/141073967/fe68751d-e20b-41d9-9da5-cc9a32859cba"> | <img width="1461" alt="Screenshot 2024-01-09 at 5 54 33 PM" src="https://github.com/lastmile-ai/aiconfig/assets/141073967/78063a3e-2b9a-4a39-80d9-ef28a7d706cf"> |
| ------------- | ------------- |
@Ankush-lastmile (Contributor Author) commented Jan 10, 2024

rebase.

Discussed with @rossdanlm offline; will fix forward any remaining issues. Tasks re: the huggingface model parsers have been opened.
