
Commit 1adf4e2

Hilly12 authored and recml authors committed
Add README for models.
PiperOrigin-RevId: 745456401
1 parent bdeb41c commit 1adf4e2

File tree

2 files changed: +149 -1 lines changed

README.md (1 addition, 1 deletion)

@@ -1 +1 @@
-RecML
+test

RecML/layers/keras/README.md (148 additions, 0 deletions)

## Models

### SASRec

Uses self-attention to predict a user's next action based on their past
activities. It aims to capture long-term user interests while still making
accurate predictions from just the most recent actions, and it adaptively
chooses which past actions to focus on depending on how much history a user
has. Built entirely from efficient attention blocks, SASRec avoids the complex
structures of older RNN and CNN models, leading to faster training and better
performance on diverse datasets.

#### Architecture Overview

- **Embedding Layer** - Converts item IDs into dense vectors. Adds a learnable
  absolute positional embedding to the item embedding to incorporate sequence
  order information. Dropout is applied to the combined embedding.
- **Multi-Head Self-Attention Layer** - Computes attention scores between all
  pairs of items within the allowed sequence window. Enforces causality by
  masking out attention to future positions to prevent information leakage
  when training with a causal prediction objective.
- **Feed-Forward Network** - Applied independently to each embedding vector
  output by the attention layer. Uses two linear layers with a GeLU activation
  in between to add non-linearity.
- **Residual Connections and Pre-Layernorm** - Applied around both the
  self-attention and feed-forward network sub-layers for stable and faster
  training of deeper models. Dropout is also used within the block.
- **Prediction Head** - Decodes the sequence embeddings into logits using the
  input item embedding table and computes a causal categorical cross entropy
  loss between the inputs and the inputs shifted right.
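
For concreteness, below is a minimal sketch of a SASRec-style pre-LayerNorm
block and tied-embedding decoding, written against the public Keras 3 API. It
is illustrative only and is not the RecML implementation; the class name,
sizes, and the `sasrec_logits` helper are assumptions made for this example.

```python
# Illustrative SASRec-style block (not the RecML implementation).
import keras
from keras import layers, ops


class SASRecBlock(layers.Layer):
    """Pre-LayerNorm causal self-attention + feed-forward block."""

    def __init__(self, d_model=64, num_heads=2, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.attn_norm = layers.LayerNormalization(epsilon=1e-6)
        self.attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads, dropout=dropout
        )
        self.ffn_norm = layers.LayerNormalization(epsilon=1e-6)
        self.ffn = keras.Sequential(
            [layers.Dense(4 * d_model, activation="gelu"), layers.Dense(d_model)]
        )
        self.dropout = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Causal self-attention sub-layer with a residual connection.
        h = self.attn_norm(x)
        x = x + self.attn(h, h, use_causal_mask=True, training=training)
        # Position-wise feed-forward sub-layer with a residual connection.
        h = self.ffn_norm(x)
        return x + self.dropout(self.ffn(h), training=training)


def sasrec_logits(item_ids, item_emb, pos_emb, blocks):
    """Embeds [batch, seq] item ids and decodes logits with the tied item table."""
    positions = ops.arange(ops.shape(item_ids)[1])
    x = item_emb(item_ids) + pos_emb(positions)
    for block in blocks:
        x = block(x)
    # Decode against the input item embedding table (tied weights).
    return ops.matmul(x, ops.transpose(item_emb.embeddings))
```

The causal loss described above would then compare the logits at each position
with the next item in the sequence (the inputs shifted by one step), e.g. via
`keras.losses.sparse_categorical_crossentropy(..., from_logits=True)`.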

### BERT4Rec

Models how user preferences change based on their past actions to make
recommendations. Unlike older methods that only read the history in
chronological order, BERT4Rec uses a transformer-based approach that looks at
the user's sequence of actions in both directions. This captures context
better, since user behavior is not always strictly ordered. To learn
effectively, it is trained with a masked prediction objective: some items are
randomly masked and the model learns to predict them from the surrounding
context. BERT4Rec consistently performs better than many standard sequential
models.

#### Architecture Overview

- **Embedding Layer** - Converts item IDs into dense vectors. Adds a learnable
  absolute positional embedding to the item embedding to incorporate sequence
  order information. An optional type embedding can be added to the item
  embedding. Embedding dropout is applied to the combined embedding. Uses a
  separate embedding for masked features to prevent other item tokens from
  attending to them.

- **Multi-Head Self-Attention Layer** - Computes attention scores between all
  pairs of items within the allowed sequence window.

- **Feed-Forward Network** - Applied independently to each embedding vector
  output by the attention layer. Uses two linear layers with a GeLU activation
  in between to add non-linearity.

- **Residual Connections and Post-Layernorm** - Applied around both the
  self-attention and feed-forward network sub-layers for stable and faster
  training of deeper models. Dropout is also used within the block.

- **Masked Prediction Head** - Gathers and projects the masked sequence
  embeddings, and decodes them using the item embedding layer. Computes a
  categorical cross entropy loss between the masked item ids and the predicted
  logits for the corresponding masked item embeddings.
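
As a rough illustration, here is a sketch of a BERT4Rec-style post-LayerNorm
block and a masked prediction head in Keras 3. It is not the RecML
implementation; the padding-mask handling, the `masked_prediction_loss` helper,
and all sizes are assumptions made for this example.

```python
# Illustrative BERT4Rec-style block and masked prediction head
# (not the RecML implementation).
import keras
from keras import layers, ops


class Bert4RecBlock(layers.Layer):
    """Post-LayerNorm bidirectional self-attention + feed-forward block."""

    def __init__(self, d_model=64, num_heads=2, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads, dropout=dropout
        )
        self.attn_norm = layers.LayerNormalization(epsilon=1e-6)
        self.ffn = keras.Sequential(
            [layers.Dense(4 * d_model, activation="gelu"), layers.Dense(d_model)]
        )
        self.ffn_norm = layers.LayerNormalization(epsilon=1e-6)
        self.dropout = layers.Dropout(dropout)

    def call(self, x, padding_mask=None, training=False):
        # Bidirectional attention: no causal mask, only padding positions are hidden.
        attn_mask = None
        if padding_mask is not None:
            attn_mask = ops.logical_and(
                ops.expand_dims(padding_mask, 1), ops.expand_dims(padding_mask, 2)
            )
        h = self.attn(x, x, attention_mask=attn_mask, training=training)
        x = self.attn_norm(x + h)  # residual + post-LayerNorm
        return self.ffn_norm(x + self.dropout(self.ffn(x), training=training))


def masked_prediction_loss(seq_emb, item_emb, masked_positions, masked_ids):
    """Gathers masked positions and scores them against the item embedding table."""
    gathered = ops.take_along_axis(
        seq_emb, ops.expand_dims(masked_positions, -1), axis=1
    )
    logits = ops.matmul(gathered, ops.transpose(item_emb.embeddings))
    return ops.mean(
        keras.losses.sparse_categorical_crossentropy(
            masked_ids, logits, from_logits=True
        )
    )
```

During training, a fraction of the input item ids is replaced with a dedicated
mask-token id before embedding, and `masked_positions` / `masked_ids` record
where that masking happened.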

### HSTU

HSTU is a novel architecture designed for sequential recommendation,
particularly suited to high-cardinality, non-stationary streaming data. It
reformulates recommendation as a sequential transduction task within a
generative modeling framework ("Generative Recommenders"). HSTU aims to provide
state-of-the-art results while being highly scalable and efficient, capable of
handling models with up to trillions of parameters. It has demonstrated
significant improvements over baselines in offline benchmarks and online A/B
tests, leading to deployment on large-scale internet platforms.

#### Architecture Overview

- **Embedding Layer** - Converts various action tokens into dense vectors in
  the same space. Optionally adds a learnable absolute positional embedding to
  incorporate sequence order information. Embedding dropout is applied to the
  combined embedding.
- **Gated Pointwise Aggregated Attention** - Uses a multi-head gated pointwise
  attention mechanism with a Layernorm on the attention outputs before
  projecting them. This captures the intensity of interactions between
  actions, which is lost in softmax attention.
- **Relative Attention Bias** - Uses a T5-style relative attention bias
  computed from the positions and timestamps of the actions to improve the
  position encoding.
- **Residual Connections and Pre-Layernorm** - Applied around the pointwise
  attention blocks for stable and faster training of deeper models.
- **No Feed-Forward Network** - The feed-forward network is removed.
- **Prediction Head** - Decodes the sequence embeddings into logits using
  separately learnt weights and computes a causal categorical cross entropy
  loss between the inputs and the inputs shifted right.
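
As a simplified illustration of the gated pointwise aggregated attention
described above, here is a single-head sketch in Keras 3. It is not the RecML
implementation: the relative attention bias is omitted, the normalization by
sequence length is one reasonable choice among several, and all names and
sizes are assumptions made for this example.

```python
# Illustrative single-head HSTU-style gated pointwise attention block
# (not the RecML implementation; relative attention bias omitted).
import keras
from keras import layers, ops


class PointwiseAggregatedAttention(layers.Layer):
    """Gated pointwise attention with pre-LayerNorm and no feed-forward network."""

    def __init__(self, d_model=64, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.norm = layers.LayerNormalization(epsilon=1e-6)       # pre-LayerNorm
        self.qkvu = layers.Dense(4 * d_model)                     # fused Q, K, V, U projection
        self.attn_norm = layers.LayerNormalization(epsilon=1e-6)  # norm on attention outputs
        self.out = layers.Dense(d_model)
        self.dropout = layers.Dropout(dropout)

    def call(self, x, training=False):
        seq_len = ops.shape(x)[1]
        h = self.norm(x)
        # SiLU-activated fused projection, split into queries, keys, values, gates.
        q, k, v, u = ops.split(ops.silu(self.qkvu(h)), 4, axis=-1)
        # Pointwise (SiLU) attention scores instead of a softmax, causally masked
        # and scaled by the sequence length; this preserves interaction intensity.
        scores = ops.silu(ops.matmul(q, ops.transpose(k, axes=(0, 2, 1))))
        idx = ops.arange(seq_len)
        causal = ops.cast(
            ops.greater_equal(ops.expand_dims(idx, 1), ops.expand_dims(idx, 0)),
            scores.dtype,
        )
        scores = scores * causal / ops.cast(seq_len, scores.dtype)
        # LayerNorm the aggregated values, gate with U, project, add the residual.
        agg = self.attn_norm(ops.matmul(scores, v)) * u
        return x + self.dropout(self.out(agg), training=training)
```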

### Mamba4Rec

A linear recurrent Mamba 2 architecture that models sequences of items for
recommendations. It scales better to longer sequences than attention-based
methods because of its linear complexity, compared with the quadratic
complexity of attention. Mamba4Rec performs better than RNNs and matches the
quality of standard attention models while being more efficient at both
training and inference time.

#### Architecture Overview

- **Embedding Layer** - Converts item IDs into dense vectors. No position
  embedding is used since the recurrent nature of Mamba inherently encodes
  positional information as an inductive bias.
- **Mamba SSD** - Computes a causal interaction between different item
  embeddings in the sequence using the Mamba state space duality algorithm.
- **Feed-Forward Network** - Applied independently to each embedding vector
  output by the Mamba layer. Uses two linear layers with a GeLU activation in
  between to add non-linearity.
- **Residual Connections and Post-Layernorm** - Applied around both the Mamba
  and feed-forward network sub-layers for stable and faster training of deeper
  models. Dropout is also used within the block.
- **Prediction Head** - Decodes the sequence embeddings into logits using the
  input item embedding table and computes a causal categorical cross entropy
  loss between the inputs and the inputs shifted right.
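
The Mamba SSD itself is too involved to reproduce here, but the sketch below
shows how a Mamba4Rec-style block could wire residual connections,
post-LayerNorm, and the feed-forward network around a Mamba sequence-mixing
layer in Keras 3. The `mamba_mixer` argument is a hypothetical placeholder for
such a layer; this is not the RecML implementation.

```python
# Illustrative Mamba4Rec-style block wiring (not the RecML implementation).
# `mamba_mixer` is a hypothetical causal Mamba SSD layer supplied by the
# caller, mapping [batch, seq, d_model] -> [batch, seq, d_model].
import keras
from keras import layers


class Mamba4RecBlock(layers.Layer):
    """Residual + post-LayerNorm wrapper around a Mamba mixing layer and an FFN."""

    def __init__(self, mamba_mixer, d_model=64, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.mixer = mamba_mixer
        self.mixer_norm = layers.LayerNormalization(epsilon=1e-6)
        self.ffn = keras.Sequential(
            [layers.Dense(4 * d_model, activation="gelu"), layers.Dense(d_model)]
        )
        self.ffn_norm = layers.LayerNormalization(epsilon=1e-6)
        self.dropout = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Causal sequence mixing (Mamba SSD) with residual + post-LayerNorm.
        x = self.mixer_norm(x + self.dropout(self.mixer(x), training=training))
        # Position-wise feed-forward network with residual + post-LayerNorm.
        return self.ffn_norm(x + self.dropout(self.ffn(x), training=training))
```

No positional embedding is added before these blocks, and the prediction head
can reuse the tied item embedding decoding shown in the SASRec sketch above.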

## References

- SASRec Paper: Kang, W. C., & McAuley, J. (2018). Self-Attentive Sequential
  Recommendation. arXiv preprint arXiv:1808.09781v1.
  https://arxiv.org/abs/1808.09781
- Transformer Paper: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
  Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you
  need. Advances in Neural Information Processing Systems, 30.
- Mamba4Rec Paper: Liu, C., Lin, J., Liu, H., Wang, J., & Caverlee, J. (2024).
  Mamba4Rec: Towards Efficient Sequential Recommendation with Selective State
  Space Models. arXiv preprint arXiv:2403.03900v2.
  https://arxiv.org/abs/2403.03900
- Mamba Paper: Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling
  with Selective State Spaces. arXiv preprint arXiv:2312.00752.
  https://arxiv.org/abs/2312.00752
- BERT4Rec Paper: Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., & Jiang,
  P. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder
  Representations from Transformer. arXiv preprint arXiv:1904.06690v2.
  https://arxiv.org/abs/1904.06690
- BERT Paper: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT:
  Pre-training of Deep Bidirectional Transformers for Language Understanding.
  arXiv preprint arXiv:1810.04805.
  https://arxiv.org/abs/1810.04805
- HSTU Paper: Zhai, J., et al. (2024). Actions Speak Louder than Words:
  Trillion-Parameter Sequential Transducers for Generative Recommendations.
  arXiv preprint arXiv:2402.17152.
  https://arxiv.org/abs/2402.17152
