# Recommendation Pipeline

A recommendation pipeline, designed to suggest items of potential
interest to users based on their requests, is an integral part of any
recommender system. Specifically, a user seeking recommendations submits
a request that includes their user ID and the current context features,
such as recently browsed items and browsing duration, to the inference
service. The recommendation pipeline uses these user features and those
of potential items as input for computation. It then derives a score for
each candidate item, selects the highest-scoring items (ranging from
dozens to hundreds) to form the recommendation result, and delivers this
result back to the user.

Given that a recommender system generally contains billions of potential
items, using just a single model to compute the score of each item
necessitates a trade-off between model accuracy and speed. In other
words, opting for a simpler model may boost speed but potentially result
in recommendations that fail to pique the user's interest due to
diminished accuracy. On the other hand, using a more complex model may
provide more accurate results but deter users due to longer waiting
times.

:label:`recommender pipeline`

To mitigate this, contemporary recommender systems typically deploy
multiple recommendation models as part of a pipeline, as illustrated in
Figure :numref:`recommender pipeline`. The pipeline begins with the
retrieval stage, employing fast, simple models to filter the entire pool
of candidate items, identifying thousands to tens of thousands of items
that the user may find appealing. Following this, in the ranking stage,
slower, more complex models score and order the retrieved items. The
top-scoring items (the exact number may vary depending on the specific
service scenario), numbering in the dozens or hundreds, are returned as
the final recommendation. If the ranking models are too intricate and
cannot process all retrieved items within the given time frame, the
ranking stage may be further divided into three sub-stages: pre-ranking,
ranking, and re-ranking.
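
The overall control flow of such a pipeline is straightforward. The
minimal Python sketch below is only an illustration of this structure;
the stage functions (`retrieve`, `pre_rank`, `rank`, `re_rank`) and the
candidate counts are hypothetical placeholders rather than part of any
particular system.

```python
def recommend(user_id, context, retrieve, pre_rank, rank, re_rank,
              n_retrieve=10_000, n_pre_rank=1_000, n_final=100):
    """Minimal sketch of a multi-stage recommendation pipeline."""
    # Retrieval: fast, simple models scan the full item pool.
    candidates = retrieve(user_id, context, k=n_retrieve)
    # Pre-ranking: a lightweight model trims the candidate set further.
    candidates = pre_rank(user_id, context, candidates, k=n_pre_rank)
    # Ranking: a slower, more accurate model scores every remaining candidate.
    scored = rank(user_id, context, candidates)
    # Re-ranking: business rules adjust the ordered list before it is returned.
    return re_rank(user_id, context, scored)[:n_final]
```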

## Retrieval Stage

The retrieval stage is the initial phase of the recommendation process.
The model takes user features as input and performs a rough filter of
all candidate items to identify those the user might be interested in.
These selected items form the output. The main goal of the retrieval
stage is to reduce the pool of candidate items, thereby lightening the
computational load on the ranking model in the subsequent stage.

### Two-Tower Model

To illustrate the retrieval process, let's consider the two-tower model
as an example, as shown in Figure :numref:`two tower model`. The
two-tower model contains two multilayer perceptrons (MLPs) which encode
user features and item features, referred to as the user tower[^1] and
item tower, respectively.

Continuous features can be input directly into the MLPs, while discrete
features must first be mapped into a dense vector using embedding tables
before being fed into the MLPs. The user tower and item tower process
these features to generate user vectors and item vectors, respectively,
each representing a unique user or item. The two-tower model employs a
scoring function to evaluate the similarity between user vectors and
item vectors.

:label:`two tower model`
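
A minimal PyTorch sketch of this architecture is given below. It is only
meant to make the structure concrete: the layer sizes, the use of one
embedding table per discrete feature, and the dot product as the scoring
function are illustrative assumptions, not the exact configuration of
the original model.

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """Embeds discrete features, appends continuous ones, and applies an MLP."""

    def __init__(self, vocab_sizes, num_continuous, embed_dim=16, out_dim=32):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(v, embed_dim) for v in vocab_sizes])
        in_dim = embed_dim * len(vocab_sizes) + num_continuous
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, discrete, continuous):
        # discrete: (batch, num_discrete) integer IDs; continuous: (batch, num_continuous)
        embedded = [emb(discrete[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.mlp(torch.cat(embedded + [continuous], dim=1))

class TwoTowerModel(nn.Module):
    def __init__(self, user_vocab_sizes, item_vocab_sizes,
                 num_user_cont, num_item_cont):
        super().__init__()
        self.user_tower = Tower(user_vocab_sizes, num_user_cont)
        self.item_tower = Tower(item_vocab_sizes, num_item_cont)

    def forward(self, user_discrete, user_cont, item_discrete, item_cont):
        user_vec = self.user_tower(user_discrete, user_cont)
        item_vec = self.item_tower(item_discrete, item_cont)
        # Dot product as the scoring function for user-item similarity.
        return (user_vec * item_vec).sum(dim=1)
```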

### Training

During training, the model input consists of the user's feedback data on
historical recommendation results, represented by the tuple \<user,
item, label\>. The label denotes whether the user has clicked the item,
with 1 and 0 typically representing a click and non-click, respectively.
The two-tower model uses positive samples (i.e., samples where the label
is 1) for training. Negative samples are obtained by an in-batch
sampler, which draws them from the other items in the same batch and
corrects for the resulting sampling bias. The details of the algorithm,
while not the focus here, can be found in the original paper.

The model's output consists of the click probabilities for different
items. During training, a suitable loss function is chosen to ensure
that the predicted results for positive samples are as close to 1 as
possible, and as close to 0 as possible for negative samples.
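
One common way to train with in-batch negatives is a sampled softmax
over the batch with a correction for how often each item is sampled, as
sketched below. The temperature value and the `item_log_prob` estimate
are illustrative assumptions; the exact correction scheme is described
in the original paper.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_vecs, item_vecs, item_log_prob, temperature=0.05):
    """Softmax loss over in-batch negatives with sampling-bias correction.

    user_vecs, item_vecs: (batch, dim) vectors of the positive <user, item> pairs.
    item_log_prob: (batch,) estimated log probability of each item appearing in a
                   batch, used to down-weight frequently sampled (popular) items.
    """
    # Similarity of every user in the batch with every item in the batch.
    logits = user_vecs @ item_vecs.t() / temperature
    # Bias correction: subtract each candidate item's log sampling probability.
    logits = logits - item_log_prob.unsqueeze(0)
    # The i-th item in the batch is the positive example for the i-th user.
    labels = torch.arange(user_vecs.size(0), device=user_vecs.device)
    return F.cross_entropy(logits, labels)
```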

### Inference

Before inference, item vectors for all items are computed and saved
using the trained model. Given that item features are relatively stable,
this step can reduce computational overhead during inference and speed
up the process. User features, which are related to user behavior, are
processed when user requests arrive: the user tower processes the
current user features and generates the user vector. The same scoring
function used during training is then used to measure similarity,
enabling a similarity search with the user vector across all candidate
item vectors. The most similar items are output as the retrieval result.
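
In production, the precomputed item vectors are typically loaded into an
approximate nearest-neighbor index; the brute-force NumPy version below
is only a minimal sketch of the same idea, with illustrative variable
names.

```python
import numpy as np

def retrieve_top_k(user_vec, item_vecs, k=1000):
    """Brute-force dot-product retrieval over precomputed item vectors.

    user_vec: (dim,) vector produced online by the user tower.
    item_vecs: (num_items, dim) matrix computed offline by the item tower.
    """
    scores = item_vecs @ user_vec               # similarity of the user to every item
    top_k = np.argpartition(-scores, k)[:k]     # indices of the k highest scores (unsorted)
    return top_k[np.argsort(-scores[top_k])]    # sorted by descending similarity
```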

### Evaluation Metrics

A common evaluation metric for retrieval models is the recall when the
top $k$ items are retrieved (Recall@k). This metric quantifies the
ability of a model to include the items the user is actually interested
in among the $k$ items it retrieves.

The mathematical definition of Recall@k is expressed as follows:

$$\text{Recall@k} = \frac{\text{TP}}{\min(\text{TP} + \text{FN}, k)}$$

In this equation, "True Positive" (TP) refers to the count of items
correctly identified by the model as relevant (i.e., with a true label
of 1) among the $k$ items retrieved. "False Negative" (FN) denotes the
count of relevant items (again, with a true label of 1) that the model
failed to include among the $k$ retrieved items.

Thus, Recall@k measures the model's ability to correctly identify and
retrieve positive samples. Note that if the total number of positive
samples exceeds $k$, the maximum possible count of correctly retrieved
items is $k$, because the model retrieves only $k$ items. Consequently,
the denominator is defined as the lesser of two quantities: the sum of
true positives and false negatives, or $k$.
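
The definition translates directly into code. The short function below
assumes the retrieved items and the set of relevant (positive) items are
given as IDs.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Recall@k = TP / min(TP + FN, k), following the definition above."""
    top_k = list(retrieved_ids)[:k]
    tp = sum(1 for item in top_k if item in relevant_ids)  # hits among the top k
    denom = min(len(relevant_ids), k)                      # TP + FN is the number of relevant items
    return tp / denom if denom > 0 else 0.0

# Example: 2 of the 3 relevant items appear among the 4 retrieved items.
print(recall_at_k([7, 3, 9, 1], {3, 1, 5}, k=4))  # 2 / 3
```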

## Ranking Stage

During the ranking stage, the model appraises the items gathered in the
retrieval stage, evaluating each one individually based on the user
features and item features. Each item's score indicates the probability
that the user is interested in that item, and the highest-scoring items
are then suggested to the user.

If the number of candidate items evaluated by the recommendation model
keeps increasing, or if the recommendation logic and rules become more
complex, the entire ranking stage can be divided into three sub-stages:
pre-ranking, ranking, and re-ranking.

### Pre-ranking

Acting as an intermediary between the retrieval and ranking stages, the
pre-ranking stage serves as an additional layer of filtering. This
becomes particularly useful when there's a large influx of candidate
items from the retrieval stage, or when multi-channel retrieval methods
are used to boost the diversity of the retrieval results. If every
retrieved item were fed directly into the ranking model, the subsequent
process could become overly lengthy due to the sheer volume of items.
Thus, introducing a pre-ranking stage to the recommendation pipeline
reduces the number of items proceeding to the ranking stage, enhancing
overall system efficiency.

### Ranking

Ranking, the second sub-stage, is pivotal in the pipeline. In this
phase, it's essential that the model precisely represents the user's
preferences across varying items. When referring to the "ranking model"
in subsequent sections, we are specifically addressing the model used
during this ranking sub-stage.

### Re-ranking

In the final re-ranking stage, the preliminary outcomes derived from the
ranking stage are further refined according to specific business logic
and rules. The goal of this stage is to improve the holistic quality of
the recommendation service, shifting the focus from the click-through
rate (CTR) of a single item to the broader user experience. For
instance, the applied business logic might include efforts to increase
the visibility of new items, filter out previously purchased items or
watched videos, and create rules to diversify the order and variety of
recommended items, thereby decreasing the frequency of similar item
recommendations.
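
Rules of this kind usually amount to straightforward list manipulation.
The sketch below illustrates two of the rules mentioned above, filtering
out previously purchased items and breaking up long runs of items from
the same category; the field names are hypothetical, and a real system
would typically demote rather than drop the items it skips.

```python
def re_rank(ranked_items, purchased_ids, max_same_category_run=2):
    """Apply simple business rules to a list already sorted by ranking score.

    ranked_items: list of dicts with hypothetical fields 'id' and 'category'.
    purchased_ids: set of item IDs the user has already purchased.
    """
    result, run_category, run_length = [], None, 0
    for item in ranked_items:
        if item["id"] in purchased_ids:       # drop previously purchased items
            continue
        if item["category"] == run_category:
            if run_length >= max_same_category_run:
                continue                      # avoid long runs of similar items
            run_length += 1
        else:
            run_category, run_length = item["category"], 1
        result.append(item)
    return result
```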

## Ranking with Deep Learning

The ranking stage in a recommender system has benefited greatly from the
use of deep learning models. A representative example is the Deep
Learning Recommendation Model (DLRM). As depicted in Figure
:numref:`dlrm model`, a DLRM consists of embedding tables, two MLPs (a
bottom MLP and a top MLP), and an interaction layer.[^2]

:label:`dlrm model`

Similar to the two-tower model, the DLRM first uses embedding tables to
transform discrete features into dense embedding vectors. The model then
combines all continuous features into a single vector, which is fed into
the bottom MLP, producing an output vector with the same dimension as
the embedding vectors. Both this output vector and all the embedding
vectors are then forwarded to the interaction layer for further
processing.

As illustrated in Figure :numref:`interaction`, the interaction layer
performs dot product operations on all features (encompassing all
embedding vectors and the processed continuous features) to obtain
second-order interactions. Because the dot product is symmetric, the
resulting interaction matrix is symmetric as well: the diagonal holds
each feature's self-interaction result, and in the off-diagonal part
every distinct pair of features appears twice (e.g., for features $p$
and $q$, both $\langle p, q \rangle$ and $\langle q, p \rangle$ are
computed). Therefore, only the lower triangular part of the result
matrix is retained and flattened. This flattened interaction result is
concatenated with the output of the bottom MLP, and the combined result
is used as the input to the top MLP. After further processing by the top
MLP, the final output score reflects the probability of the user
clicking on the item.

:label:`interaction`
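
The interaction layer can be written compactly with batched matrix
multiplication. The PyTorch sketch below follows the structure just
described; it is a simplified illustration rather than the reference
DLRM implementation, and it excludes the diagonal self-interaction
terms, which is a configurable detail.

```python
import torch

def dot_interaction(bottom_mlp_out, embeddings):
    """Pairwise dot products between all feature vectors, keeping each pair once.

    bottom_mlp_out: (batch, dim) output of the bottom MLP (processed continuous features).
    embeddings:     (batch, num_sparse, dim) embedding vectors of the discrete features.
    """
    # Stack all features into a single (batch, num_features, dim) tensor.
    features = torch.cat([bottom_mlp_out.unsqueeze(1), embeddings], dim=1)
    # (batch, num_features, num_features) matrix of all pairwise dot products.
    interactions = torch.bmm(features, features.transpose(1, 2))
    # Keep only the strictly lower triangular part: each distinct pair once.
    num_features = features.size(1)
    rows, cols = torch.tril_indices(num_features, num_features, offset=-1)
    flat = interactions[:, rows, cols]            # (batch, num_pairs)
    # Concatenate with the bottom MLP output to form the input of the top MLP.
    return torch.cat([bottom_mlp_out, flat], dim=1)
```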

### Training Process

The DLRM bases its training on \<user, item, label\> tuples. It takes
user and item features as input and models the interactions among these
features to predict the likelihood of the user clicking the item. For
positive samples, the model aims to bring this probability as close to 1
as possible, while for negative samples, the goal is to bring it as
close to 0 as possible.

The ranking process can be considered a binary classification problem:
a (user, item) pair is classified either as click (label 1) or no click
(label 0). Therefore, the method used to evaluate a ranking model is
analogous to that employed for assessing a binary classification model.
However, it's crucial to consider that recommender system datasets tend
to be extremely imbalanced, meaning the proportion of positive samples
is drastically different from that of negative samples. To minimize the
influence of this data imbalance on the metrics, we use the Area Under
the Curve (AUC) and the F1 score to evaluate ranking models.

The AUC is the area under the Receiver Operating Characteristic (ROC)
curve, which plots the True Positive Rate (TPR) on the y-axis against
the False Positive Rate (FPR) on the x-axis as the classification
threshold varies. An appropriate classification threshold can be chosen
by examining the ROC curve. If the predicted probability exceeds the
classification threshold, the prediction result is 1 (click); otherwise,
it is 0 (no click). From the prediction results, recall and precision
can be computed, which in turn allows the F1 score to be calculated
using :eqref:`equ:f1`.

$$F1 = 2 \times \frac{\text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$$
:eqlabel:`equ:f1`
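
With scikit-learn, both metrics can be computed directly from the true
labels and the predicted probabilities. The values and the 0.5 threshold
below are purely illustrative; in practice the threshold would be chosen
from the ROC curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])                           # click labels
y_prob = np.array([0.10, 0.55, 0.80, 0.20, 0.60, 0.30, 0.05, 0.45])   # predicted probabilities

auc = roc_auc_score(y_true, y_prob)        # threshold-free measure of ranking quality
y_pred = (y_prob >= 0.5).astype(int)       # illustrative classification threshold
f1 = f1_score(y_true, y_pred)              # 2 * recall * precision / (recall + precision)
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```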

### Inference Process

During the inference stage, the features of the retrieved items, along
with the corresponding user features, are combined and fed into the
DLRM. The model then predicts a score for each item, and the items with
the highest probabilities are selected for output.
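
At serving time, this amounts to scoring a batch of user-item feature
combinations and keeping the best ones. The sketch below assumes a
trained `dlrm` callable that maps feature tensors to click
probabilities; the tensor shapes are illustrative.

```python
import torch

def rank_items(dlrm, user_features, item_features, top_n=100):
    """Score every retrieved item for one user and return the top_n item indices.

    user_features: (1, user_dim) tensor for the current user.
    item_features: (num_items, item_dim) tensor for the retrieved items.
    """
    with torch.no_grad():
        # Pair the single user with every retrieved item.
        users = user_features.expand(item_features.size(0), -1)
        scores = dlrm(users, item_features)   # (num_items,) predicted click probabilities
    return torch.topk(scores, k=min(top_n, scores.numel())).indices
```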

[^1]: In the original paper, the user tower also uses the features of
    videos watched by users as seed features.

[^2]: DLRM is designed for structural customization. This section will
    illustrate an example using the standard code implementation of
    DLRM.