What does get_offset function do? (Insufficient info in docs) #3

@yuenherny

I ran into this assertion error in the get_offset function in JaQuAD.ipynb when using models other than the cl-tohoku ones:

assert unk_pointer is not None, \
                'Normalized context and tokens are not matched'

I know this is related to tokenization, but I still can't figure it out, even after going through the docstring:

'''The character-level start/end offsets of a token within a context.
    Algorithm:
    1. Make offsets of normalized context within the original context.
    2. Make offsets of tokens (input_ids) within the normalized context.

    Arguments:
    input_ids -- Token ids of tokenized context (by tokenizer).
    context -- String of context
    tokenizer
    norm_form

    Return:
        List[Tuple[int, int]]: Offsets of tokens within the input context.
        For each token, the offsets are presented as a tuple of (start
        position index, end position index). Both indices are inclusive.
    '''
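
For reference, this is how I currently read the two steps in the docstring. It is only a minimal sketch of my understanding, not the repository's actual implementation: the helper names `normalized_char_offsets` and `token_offsets` are hypothetical, tokens are passed as strings instead of `input_ids`, the `##` subword prefix is an assumption about WordPiece-style tokenizers, and normalizing one character at a time is a simplification that ignores combining sequences.

```python
import unicodedata


def normalized_char_offsets(context, norm_form="NFKC"):
    """Step 1 (as I understand it): normalize the context and remember,
    for every normalized character, the index of the original character
    it came from. Normalizing per character is a simplification."""
    normalized = []
    origins = []
    for idx, ch in enumerate(context):
        for out in unicodedata.normalize(norm_form, ch):
            normalized.append(out)
            origins.append(idx)
    return "".join(normalized), origins


def token_offsets(tokens, context, norm_form="NFKC"):
    """Step 2 (as I understand it): locate each token in the normalized
    context, then map its start/end back to the original context.
    Both indices of each returned tuple are inclusive."""
    normalized, origins = normalized_char_offsets(context, norm_form)
    offsets = []
    cursor = 0
    for token in tokens:
        # Assumed WordPiece-style continuation marker.
        piece = token[2:] if token.startswith("##") else token
        start = normalized.find(piece, cursor)
        if start < 0:
            # Roughly the situation the assertion seems to guard against.
            raise ValueError(f"token {token!r} not found in normalized context")
        end = start + len(piece) - 1
        offsets.append((origins[start], origins[end]))
        cursor = end + 1
    return offsets


# "㌔" is a single character that NFKC expands to "キロ",
# so the normalized and original indices no longer line up.
print(token_offsets(["5", "キロ", "走", "##る"], "５㌔走る"))
# [(0, 0), (1, 1), (2, 2), (3, 3)]
```

If this reading is right, a tokenizer that normalizes differently from `norm_form`, or that emits tokens which do not appear verbatim in the normalized context, would fail at the lookup step, which might be what I am seeing with the other models.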

What is the motivation behind this function, and in what circumstances would you need it?

Thanks.
