
(Enhancement) Applying mask to attention in one operation (3.5 Hiding future words with causal attention) #279

@labdmitriy

Bug description

Hi Sebastian,

I think this is not a bug but a possible enhancement: applying the mask currently takes two steps:

  • Create a lower-triangular matrix of ones and zeros:
mask_simple = torch.tril(torch.ones(context_length, context_length))
  • Multiply the attention weights element-wise by this mask:
masked_simple = attn_weights * mask_simple

However, torch.tril can be applied directly to the attention weights to get the same result in a single operation:

torch.tril(attn_weights)
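
For reference, a minimal sketch (using random stand-in weights rather than the book's actual attention weights) confirming that both versions give the same result:

import torch

torch.manual_seed(123)

context_length = 6
# Stand-in attention weights for illustration; in the book these come from
# a softmax over the query-key scores.
attn_weights = torch.softmax(torch.rand(context_length, context_length), dim=-1)

# Two-step version: build a lower-triangular mask, then multiply.
mask_simple = torch.tril(torch.ones(context_length, context_length))
masked_simple = attn_weights * mask_simple

# One-step version: apply torch.tril directly.
masked_direct = torch.tril(attn_weights)

print(torch.allclose(masked_simple, masked_direct))  # True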

Thank you.

