Is it better to use x[-1] @ wte.T? #25

@Sandy4321

Would it be better to change
return x @ wte.T # [n_seq, n_embd] -> [n_seq, n_vocab]
to
return x[-1] @ wte.T # [n_embd] -> [n_vocab]
?

Then we could use
next_id = np.argmax(logits)
directly, without first selecting the last row of logits.
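
A minimal NumPy sketch of what the proposed change computes. The names x, wte and logits follow the question; the toy sizes and random arrays are assumptions made only for illustration and are not taken from the original code.

```python
import numpy as np

# Toy sizes and random values, assumed only for this illustration.
rng = np.random.default_rng(0)
n_seq, n_embd, n_vocab = 4, 8, 16
x = rng.standard_normal((n_seq, n_embd))      # final hidden states, one row per position
wte = rng.standard_normal((n_vocab, n_embd))  # token embedding matrix

# Current form: project every position, then pick the last row afterwards.
full_logits = x @ wte.T                       # [n_seq, n_vocab]
next_id_full = int(np.argmax(full_logits[-1]))

# Proposed form: project only the last position.
last_logits = x[-1] @ wte.T                   # [n_vocab]
next_id_last = int(np.argmax(last_logits))

# Both routes should agree up to floating-point rounding.
same_values = np.allclose(full_logits[-1], last_logits)
print(full_logits.shape, last_logits.shape, same_values, next_id_full == next_id_last)
```

For greedy next-token selection the two forms pick the same token, and the proposed one skips the matrix product for the earlier positions. The full [n_seq, n_vocab] output would still be needed if anything else reads the logits at every position, for example a training loss.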
