## Models

### SASRec

SASRec uses self-attention to predict a user's next action based on their past
activity. It aims to capture long-term user interests while still making
accurate predictions from only the most recent actions, adaptively choosing
which past actions to attend to depending on how much history a user has.
Built entirely from efficient attention blocks, SASRec avoids the more complex
structures of older RNN and CNN models, leading to faster training and better
performance on diverse datasets.

#### Architecture Overview

- **Embedding Layer** - Converts item IDs into dense vectors. Adds a learnable
  absolute positional embedding to the item embedding to incorporate sequence
  order information. Dropout is applied to the combined embedding.
- **Multi-Head Self-Attention Layer** - Computes attention scores between all
  pairs of items within the allowed sequence window. Enforces causality by
  masking out attention to future positions to prevent information leakage
  when training with a causal prediction objective.
- **Feed-Forward Network** - Applied independently to each embedding vector
  output by the attention layer. Uses two linear layers with a GeLU activation
  in between to add non-linearity.
- **Residual Connections and Pre-LayerNorm** - Applied around both the
  self-attention and feed-forward network sub-layers for stable and faster
  training of deeper models. Dropout is also used within the block.
- **Prediction Head** - Decodes the sequence embeddings into logits using the
  input item embedding table and computes a causal categorical cross-entropy
  loss between the predicted logits and the next-item targets (the inputs
  shifted by one position). A minimal sketch of the block and this loss
  follows the list.

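The following is a minimal PyTorch sketch of one SASRec-style block together
with the causal mask and a tied-embedding prediction head. The class names,
dimensions, and hyperparameters are illustrative assumptions, not the exact
implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SASRecBlock(nn.Module):
    """One pre-LayerNorm self-attention + feed-forward block (illustrative)."""

    def __init__(self, dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim), nn.Dropout(dropout))

    def forward(self, x, causal_mask):
        # Pre-LayerNorm residual attention; the mask hides future positions.
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + h
        # Pre-LayerNorm residual feed-forward network.
        return x + self.ffn(self.ffn_norm(x))


class SASRec(nn.Module):
    def __init__(self, num_items, max_len=200, dim=64, num_heads=2,
                 num_blocks=2, dropout=0.1):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.emb_dropout = nn.Dropout(dropout)
        self.blocks = nn.ModuleList(
            [SASRecBlock(dim, num_heads, dropout) for _ in range(num_blocks)])

    def forward(self, item_ids):  # item_ids: (batch, seq_len)
        seq_len = item_ids.size(1)
        pos = torch.arange(seq_len, device=item_ids.device)
        x = self.emb_dropout(self.item_emb(item_ids) + self.pos_emb(pos))
        # True above the diagonal = "do not attend to future positions".
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool,
                       device=item_ids.device), diagonal=1)
        for block in self.blocks:
            x = block(x, causal_mask)
        # Tie the decoder to the input item embedding table.
        return x @ self.item_emb.weight.T  # (batch, seq_len, num_items + 1)


def causal_loss(model, item_ids):
    """Cross entropy between the logits at position t and the item at t + 1."""
    logits = model(item_ids[:, :-1])
    targets = item_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=0)
```

Tying the output projection to the item embedding table matches the prediction
head described above; HSTU below instead uses separately learned output
weights.
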
### BERT4Rec

BERT4Rec models how user preferences change based on past actions in order to
make recommendations. Unlike older methods that only read history in
chronological order, BERT4Rec uses a transformer-based approach that looks at
the user's sequence of actions in both directions. This helps capture context
better, as user behavior isn't always strictly ordered. To learn effectively,
it is trained with a masked prediction objective: some items are randomly
masked and the model learns to predict them from the surrounding context.
BERT4Rec consistently performs better than many standard sequential models.

#### Architecture Overview

- **Embedding Layer** - Converts item IDs into dense vectors. Adds a learnable
  absolute positional embedding to the item embedding to incorporate sequence
  order information. An optional type embedding can be added to the item
  embedding. Embedding dropout is applied to the combined embedding. Masked
  positions use a separate mask embedding so that other item tokens cannot
  attend to the items being predicted.

- **Multi-Head Self-Attention Layer** - Computes attention scores between all
  pairs of items within the allowed sequence window. No causal mask is
  applied, so each position can attend to both earlier and later items in the
  sequence.

- **Feed-Forward Network** - Applied independently to each embedding vector
  output by the attention layer. Uses two linear layers with a GeLU activation
  in between to add non-linearity.

- **Residual Connections and Post-LayerNorm** - Applied around both the
  self-attention and feed-forward network sub-layers for stable and faster
  training of deeper models. Dropout is also used within the block.

- **Masked Prediction Head** - Gathers and projects the masked sequence
  embeddings, and decodes them using the item embedding layer. Computes a
  categorical cross-entropy loss between the masked item IDs and the predicted
  logits for the corresponding masked positions. A sketch of this objective
  follows the list.

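Below is a minimal PyTorch sketch of the masked prediction objective: randomly
replace items with a mask token, encode the sequence bidirectionally (no
causal mask), gather the masked positions, project them, and decode them with
the item embedding table. The encoder configuration, reserved token IDs,
masking ratio, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID = 1  # assumed reserved ID for the mask token; 0 is padding


class BERT4Rec(nn.Module):
    def __init__(self, num_items, max_len=200, dim=64, num_heads=2,
                 num_layers=2, dropout=0.1):
        super().__init__()
        # IDs 0 and 1 are reserved for padding and the mask token.
        self.item_emb = nn.Embedding(num_items + 2, dim, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.emb_dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, item_ids):  # (batch, seq_len); no causal mask is used
        pos = torch.arange(item_ids.size(1), device=item_ids.device)
        x = self.emb_dropout(self.item_emb(item_ids) + self.pos_emb(pos))
        return self.encoder(x, src_key_padding_mask=item_ids.eq(0))


def masked_loss(model, item_ids, mask_prob=0.2):
    """Randomly mask items and predict them from bidirectional context."""
    is_item = item_ids > 1
    mask = (torch.rand_like(item_ids, dtype=torch.float) < mask_prob) & is_item
    inputs = item_ids.masked_fill(mask, MASK_ID)
    hidden = model(inputs)
    # Gather only the masked positions, project them, and decode with the
    # item embedding table.
    masked_hidden = F.gelu(model.proj(hidden[mask]))   # (num_masked, dim)
    logits = masked_hidden @ model.item_emb.weight.T   # (num_masked, vocab)
    return F.cross_entropy(logits, item_ids[mask])
```

Because only the masked positions contribute to the loss, the model has to
reconstruct each hidden item from both directions of its context.
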
### HSTU

HSTU is a novel architecture designed for sequential recommendation,
particularly suited to high-cardinality, non-stationary streaming data. It
reformulates recommendation as a sequential transduction task within a
generative modeling framework, termed "Generative Recommenders". HSTU aims to
provide state-of-the-art results while being highly scalable and efficient,
capable of handling models with up to trillions of parameters. It has
demonstrated significant improvements over baselines in offline benchmarks and
online A/B tests, leading to deployment on large-scale internet platforms.

#### Architecture Overview

- **Embedding Layer** - Converts various action tokens into dense vectors in
  the same space. Optionally adds a learnable absolute positional embedding to
  incorporate sequence order information. Embedding dropout is applied to the
  combined embedding.
- **Gated Pointwise Aggregated Attention** - Uses a multi-head gated pointwise
  attention mechanism with a LayerNorm on the attention outputs before
  projecting them. This preserves the intensity of interactions between
  actions, which is lost in softmax attention. A simplified sketch follows
  this list.
- **Relative Attention Bias** - Uses a T5-style relative attention bias,
  computed from the positions and timestamps of the actions, to improve the
  position encoding.
- **Residual Connections and Pre-LayerNorm** - Applied around the pointwise
  attention blocks for stable and faster training of deeper models.
- **No Feed-Forward Network** - The separate feed-forward network found in
  standard transformer blocks is removed.
- **Prediction Head** - Decodes the sequence embeddings into logits using
  separately learned weights and computes a causal categorical cross-entropy
  loss between the predicted logits and the next-item targets (the inputs
  shifted by one position).

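A simplified PyTorch sketch of a gated pointwise aggregated attention block is
shown below: an elementwise SiLU replaces softmax over the attention scores,
the aggregated values are passed through a LayerNorm and multiplied by a
learned gate before the output projection, and a causal mask zeroes out future
positions. The head handling, scaling, bias placement, and all names are
illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedPointwiseAttention(nn.Module):
    """Simplified HSTU-style gated pointwise attention block (illustrative)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # One projection produces queries, keys, values, and an output gate.
        self.in_proj = nn.Linear(dim, 4 * dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)
        self.pre_norm = nn.LayerNorm(dim)

    def forward(self, x, rel_bias=None):  # x: (batch, seq_len, dim)
        batch, seq_len, dim = x.shape
        h = self.pre_norm(x)  # pre-LayerNorm residual block
        q, k, v, gate = F.silu(self.in_proj(h)).chunk(4, dim=-1)

        def split(t):  # -> (batch, num_heads, seq_len, head_dim)
            return t.view(batch, seq_len, self.num_heads,
                          self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if rel_bias is not None:  # optional T5-style relative attention bias
            logits = logits + rel_bias
        # Pointwise non-linearity instead of softmax: the magnitude of each
        # interaction is kept rather than normalized away.
        scores = F.silu(logits)
        # Causal mask: contributions from future positions are zeroed out.
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=x.device))
        scores = scores.masked_fill(~causal, 0.0)
        attn = (scores @ v).transpose(1, 2).reshape(batch, seq_len, dim)
        # LayerNorm the aggregated values, gate them, then project back.
        return x + self.out_proj(self.attn_norm(attn) * gate)
```
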
### Mamba4Rec

Mamba4Rec uses a linear recurrent Mamba 2 architecture to model sequences of
items for recommendations. It scales better to long sequences than
attention-based methods because its complexity is linear in sequence length
rather than quadratic. Mamba4Rec performs better than RNNs and matches the
quality of standard attention models while being more efficient at both
training and inference time.

#### Architecture Overview

- **Embedding Layer** - Converts item IDs into dense vectors. No position
  embedding is used, since the recurrent nature of Mamba inherently encodes
  positional information as an inductive bias.
- **Mamba SSD** - Computes a causal interaction between the item embeddings in
  the sequence using the Mamba state space duality (SSD) algorithm. A
  simplified recurrence illustrating the idea follows this list.
- **Feed-Forward Network** - Applied independently to each embedding vector
  output by the Mamba layer. Uses two linear layers with a GeLU activation in
  between to add non-linearity.
- **Residual Connections and Post-LayerNorm** - Applied around both the Mamba
  and feed-forward network sub-layers for stable and faster training of deeper
  models. Dropout is also used within the block.
- **Prediction Head** - Decodes the sequence embeddings into logits using the
  input item embedding table and computes a causal categorical cross-entropy
  loss between the predicted logits and the next-item targets (the inputs
  shifted by one position).

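The snippet below illustrates the core idea behind the Mamba layer with a
deliberately simplified sequential selective scan: each position updates a
hidden state using input-dependent decay and input gates, giving causal,
linear-time mixing over the sequence. Real Mamba/SSD implementations use a
hardware-efficient chunked formulation and a richer state parameterization;
the class and its parameterization here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedSelectiveScan(nn.Module):
    """Causal linear recurrence h_t = a_t * h_{t-1} + b_t * x_t (illustrative).

    Cost is O(seq_len) per channel, versus O(seq_len^2) for self-attention.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Input-dependent ("selective") decay and input gates.
        self.to_decay = nn.Linear(dim, dim)
        self.to_gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        decay = torch.sigmoid(self.to_decay(x))   # a_t in (0, 1)
        gate = F.silu(self.to_gate(x))            # b_t
        state = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            # Each step only sees the past, so the recurrence is causal.
            state = decay[:, t] * state + gate[:, t] * x[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


layer = SimplifiedSelectiveScan(dim=64)
item_embeddings = torch.randn(8, 100, 64)   # (batch, seq_len, dim)
print(layer(item_embeddings).shape)         # torch.Size([8, 100, 64])
```

In practice an optimized SSD kernel (for example from the mamba_ssm package)
would replace this Python loop; the sketch only shows why the computation is
causal and linear in sequence length.
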
## References

- SASRec Paper: Kang, W. C., & McAuley, J. (2018). Self-Attentive Sequential
  Recommendation. arXiv preprint arXiv:1808.09781v1.
  https://arxiv.org/abs/1808.09781
- Transformer Paper: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
  Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You
  Need. Advances in Neural Information Processing Systems, 30.
- Mamba4Rec Paper: Liu, C., Lin, J., Liu, H., Wang, J., & Caverlee, J. (2024).
  Mamba4Rec: Towards Efficient Sequential Recommendation with Selective State
  Space Models. arXiv preprint arXiv:2403.03900v2.
  https://arxiv.org/abs/2403.03900
- Mamba Paper: Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling
  with Selective State Spaces. arXiv preprint arXiv:2312.00752.
  https://arxiv.org/abs/2312.00752
- BERT4Rec Paper: Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., & Jiang,
  P. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder
  Representations from Transformer. arXiv preprint arXiv:1904.06690v2.
  https://arxiv.org/abs/1904.06690
- BERT Paper: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT:
  Pre-training of Deep Bidirectional Transformers for Language Understanding.
  arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805
- HSTU Paper: Zhai, J., et al. (2024). Actions Speak Louder than Words:
  Trillion-Parameter Sequential Transducers for Generative Recommendations.
  arXiv preprint arXiv:2402.17152. https://arxiv.org/abs/2402.17152