Could you help fix the backdoor vulnerability caused by a risky pre-trained model used in this repo? #58

Description

@Rockstar292

Hi, @gunyarakun, @fujimotos, I'd like to report that a potentially risky pretrained model is being used in this project, which may pose backdoor threats. Please check the following code example:

pkg/k2-asr/src/huggingface.py

    if language == "ja":
        hf_repo_id = "reazon-research/reazonspeech-k2-v2"
        epochs = 99

    try:
        basedir = hf.snapshot_download(hf_repo_id, local_files_only=True, resume_download=True)
    except hf.utils.LocalEntryNotFoundError:
        basedir = hf.snapshot_download(hf_repo_id, resume_download=True)

    sherpa_onnx.OfflineRecognizer.from_transducer(
        tokens=os.path.join(basedir, files["tokens"]),
        encoder=os.path.join(basedir, files["encoder"]),
        decoder=os.path.join(basedir, files["decoder"]),
        joiner=os.path.join(basedir, files["joiner"]),
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method="greedy_search",
        provider=device,
    )

pkg/nemo-asr/src/cli.py

    # Load audio data and model
    audio = audio_from_path(args[0])
    model = load_model()

    # Perform inference
    ret = transcribe(model, audio)

Issue Description

As shown above, in the pkg/k2-asr/src/huggingface.py file, the model "reazon-research/reazonspeech-k2-v2" is first downloaded via the snapshot_download method. The model is then loaded via the sherpa_onnx.OfflineRecognizer.from_transducer method, and finally executed in pkg/nemo-asr/src/cli.py via the transcribe method.

This model has been flagged as risky on the HuggingFace platform. Specifically, its encoder-epoch-99-avg-1.onnx and encoder-epoch-99-avg-1.int8.onnx files are marked as malicious and may carry backdoor threats. For certain inputs, the backdoor could be activated, effectively altering the model's behavior.


Related risk report: reazon-research/reazonspeech-k2-v2 risk report

Suggested Repair Methods

  1. Convert the model to the safer safetensors format and re-upload it.
  2. Visually inspect the model using an OSS tool such as Netron. If no issues are found, report the false positive to the scanning platform.
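As an interim safeguard while the report is triaged, the project could also verify the downloaded model files against known-good digests before building the recognizer. The sketch below is illustrative only: `EXPECTED_SHA256` and its digest value are hypothetical placeholders, not the real checksums of the reazonspeech-k2-v2 files, which would have to come from the model publisher or a previously verified download.

```python
import hashlib
import os

# Hypothetical known-good digests; the real values would come from the
# model publisher or a previously verified snapshot.
EXPECTED_SHA256 = {
    "encoder-epoch-99-avg-1.onnx": "0" * 64,  # placeholder digest
}

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large ONNX files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_snapshot(basedir):
    """Return the names of expected files that are missing or whose digest
    does not match the table above."""
    mismatches = []
    for name, expected in EXPECTED_SHA256.items():
        path = os.path.join(basedir, name)
        if not os.path.exists(path) or sha256_of(path) != expected:
            mismatches.append(name)
    return mismatches
```

A caller could run `verify_snapshot(basedir)` right after `snapshot_download` returns and refuse to call `from_transducer` if any mismatches are reported.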

As one of the most popular machine learning projects (297 stars), every potential risk could be propagated and amplified. Could you please address the above issues?

Thanks for your help~

Best regards,
Rockstars
