Implicit feedback represents observable user actions rather than direct preference statements, so it is inherently ambiguous as a signal of true user preference. To address this issue, this study reinterprets the ambiguity of implicit feedback signals as a problem of epistemic uncertainty regarding user preferences and proposes a latent factor model that incorporates this uncertainty within a Bayesian framework.
Specifically, the behavioral vector of a user, which is learned from implicit feedback, is restructured within the embedding space using attention mechanisms applied to the user’s interaction history, forming an implicit preference representation. Similarly, item feature vectors are reinterpreted in the context of the target user’s history, resulting in personalized item representations.
This study replaces the deterministic attention scores with stochastic attention weights treated as random variables whose distributions are modeled using a Bayesian approach. Through this design, the proposed model effectively captures the uncertainty stemming from implicit feedback within the vector representations of users and items.
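To make this concrete, the sketch below shows one way stochastic attention weights can be realized: the log-mean of each score comes from a scaled dot product, the log-standard deviation is a learnable parameter, a log-normal sample is drawn by reparameterization, and the sampled scores are normalized onto the simplex. This is a minimal illustration under those assumptions; the class name, the parameterization of the log-std, and the omission of any prior/KL term are not taken from the actual `bam` implementation.

```python
import torch
import torch.nn as nn


class BayesianAttention(nn.Module):
    """Single-head attention whose unnormalized scores are random variables.

    Hypothetical sketch: scores are log-normal, with the log-mean given by a
    scaled dot product and a shared learnable log-standard deviation.
    """

    def __init__(self, dim: int, init_log_std: float = -2.0):
        super().__init__()
        self.scale = dim ** -0.5
        self.log_std = nn.Parameter(torch.tensor(init_log_std))

    def forward(self, q, k, v, sample: bool = True):
        # q: (B, D) query, k/v: (B, L, D) keys/values (padding mask omitted)
        mu = torch.einsum("bd,bld->bl", q, k) * self.scale   # log-mean of each score
        std = self.log_std.exp()
        if sample:
            eps = torch.randn_like(mu)
            scores = torch.exp(mu + std * eps)                # reparameterized log-normal sample
        else:
            scores = torch.exp(mu + 0.5 * std ** 2)           # closed-form expectation of a log-normal
        weights = scores / scores.sum(dim=-1, keepdim=True)   # projection onto the simplex
        return torch.einsum("bl,bld->bd", weights, v)         # context vector
```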
the original BACF model uses a single item embedding matrix and applies separate linear transformations depending on the role (target or history). however, we observed that using role-specific embedding matrices yields better performance than sharing a single embedding matrix, so we corrected our model to use separate embedding tables for the two roles (see the sketch below).
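a minimal sketch of the difference, assuming PyTorch-style embedding tables; the class and argument names are illustrative, not the project's actual code:

```python
import torch.nn as nn


class ItemEmbeddings(nn.Module):
    """Item embeddings for the two roles: target item (q) and history item (h)."""

    def __init__(self, num_items: int, dim: int, role_specific: bool = True):
        super().__init__()
        if role_specific:
            # corrected model: independent embedding tables per role
            self.target = nn.Embedding(num_items, dim)
            self.history = nn.Embedding(num_items, dim)
        else:
            # original BACF-style: one shared table, followed by
            # role-specific linear transformations
            shared = nn.Embedding(num_items, dim)
            self.target = nn.Sequential(shared, nn.Linear(dim, dim, bias=False))
            self.history = nn.Sequential(shared, nn.Linear(dim, dim, bias=False))

    def forward(self, target_ids, history_ids):
        return self.target(target_ids), self.history(history_ids)
```

with `role_specific=True`, the $q$ and $h$ embeddings in the notation below get independent parameters; with `False`, they share one table and differ only through the role-specific linear maps.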
- $u=0,1,2,\cdots,M-1$ : target user
- $i=0,1,2,\cdots,N-1$ : target item
- $j \in R_{u}^{+} \setminus \{i\}$ : history items of the target user (target item $i$ is excluded)

- $p \in \mathbb{R}^{M \times K}$ : user id embedding (we define it as the global behavior representation)
- $q \in \mathbb{R}^{N \times K}$ : target item id embedding (we define it as the global behavior representation)
- $h \in \mathbb{R}^{N \times K}$ : history item id embedding
- $c_{u} \in \mathbb{R}^{M \times K}$ : user context vector (we define it as the conditional preference representation)
- $c_{i} \in \mathbb{R}^{N \times K}$ : item context vector (we define it as the conditional preference representation)
- $z_{u} \in \mathbb{R}^{M \times K}$ : user refined representation
- $z_{i} \in \mathbb{R}^{N \times K}$ : item refined representation
- $z_{u,i}$ : $(u,i)$ pair predictive vector
- $x_{u,i}$ : $(u,i)$ pair interaction logit
- $y_{u,i}, \hat{y}_{u,i}$ : $(u,i)$ pair observed interaction and predicted interaction probability

- $\mathrm{MLP}(\cdot)$ : multi-layer perceptron
- $\mathrm{bam}(q,k,v)$ : bayesian attention module (single head only)
- $\mathrm{layernorm}(\cdot)$ : layer normalization
- $\odot$ : element-wise product
- $\oplus$ : vector concatenation
- $\mathrm{ReLU}$ : activation function, ReLU
- $\sigma$ : activation function, sigmoid
- $W$ : linear transformation matrix
- $h$ : linear transformation vector
- $b$ : bias term
- user global behavior:
- user conditional preference:
- user refined representation:
- item global behavior:
- item conditional preference:
- item refined representation:
- concatenation (agg) & mlp (matching):
- logit:
- prediction:
- apply `bce` to pointwise nll:
- apply `bpr` to pairwise nll:
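for reference, the standard forms of these two objectives in the notation above (regularization terms and batch averaging used in the actual training code are omitted):

$$
\mathcal{L}_{\mathrm{pointwise}} = -\sum_{(u,i)} \Big[\, y_{u,i}\log\hat{y}_{u,i} + (1-y_{u,i})\log\big(1-\hat{y}_{u,i}\big) \Big]
$$

$$
\mathcal{L}_{\mathrm{pairwise}} = -\sum_{(u,i^{+},i^{-})} \log \sigma\big(x_{u,i^{+}} - x_{u,i^{-}}\big)
$$

the following PyTorch-style sketch shows one plausible end-to-end reading of the pipeline above (reusing the `BayesianAttention` sketch from earlier); the residual-plus-layernorm refinement, the exact query/key choices passed to `bam`, and the mlp shape are assumptions rather than the model as implemented:

```python
import torch
import torch.nn as nn


class BACFSketch(nn.Module):
    """Hypothetical end-to-end reading of the pipeline above (pointwise head)."""

    def __init__(self, num_users: int, num_items: int, dim: int):
        super().__init__()
        self.p = nn.Embedding(num_users, dim)   # user global behavior
        self.q = nn.Embedding(num_items, dim)   # target item global behavior
        self.h = nn.Embedding(num_items, dim)   # history item embedding
        self.bam_u = BayesianAttention(dim)     # sketch class from earlier
        self.bam_i = BayesianAttention(dim)
        self.norm_u = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, u, i, hist):
        p_u, q_i, h_j = self.p(u), self.q(i), self.h(hist)
        # conditional preference: attention over the user's interaction history
        c_u = self.bam_u(p_u, h_j, h_j)
        c_i = self.bam_i(q_i, h_j, h_j)
        # refined representations (residual + layernorm is an assumed choice)
        z_u = self.norm_u(p_u + c_u)
        z_i = self.norm_i(q_i + c_i)
        # concatenation (agg) and mlp (matching) -> logit -> probability
        x_ui = self.mlp(torch.cat([z_u, z_i], dim=-1)).squeeze(-1)
        return torch.sigmoid(x_ui)
```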
we use `prod` and `concat` as attention score functions, as proposed in "NAIS: Neural Attentive Item Similarity Model for Recommendation" (He et al., 2018).
`concat` function:
`prod` function:
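as defined in NAIS (written here in generic query/key form; the $h$, $W$, and $b$ on the right-hand side are the linear transformation parameters from the notation list, not the history embedding):

$$
f_{\mathrm{concat}}(\mathbf{q}, \mathbf{k}) = h^{\top}\,\mathrm{ReLU}\big(W\,(\mathbf{q} \oplus \mathbf{k}) + b\big)
$$

$$
f_{\mathrm{prod}}(\mathbf{q}, \mathbf{k}) = h^{\top}\,\mathrm{ReLU}\big(W\,(\mathbf{q} \odot \mathbf{k}) + b\big)
$$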
if the number of keys is too large, the attention weights become flat. so, in the simplex projection, we introduced a smoothing factor.
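for reference, NAIS addresses the same flattening problem by raising the softmax denominator to a power $\beta \in (0, 1]$; the smoothing factor here presumably plays the analogous role in the simplex projection of the (sampled) scores:

$$
a_{j} = \frac{\exp\big(f_{j}\big)}{\Big[\sum_{j' \in R_{u}^{+}} \exp\big(f_{j'}\big)\Big]^{\beta}}
$$

where $f_{j}$ denotes the attention score of history item $j$; $\beta = 1$ recovers the ordinary softmax, and smaller $\beta$ keeps the weights from flattening out when the history is long.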
- movielens latest small (2018) (link)
  - interaction density is relatively high
  - the reliability of individual observations is relatively low
- last.fm 2k (2011) (link)
  - interaction density is relatively low
  - the reliability of individual observations is relatively high
- amazon luxury beauty small 5-core (2018) (link)
  - interaction density is relatively low
  - the reliability of individual observations is relatively low
  - user history length is relatively short
- amazon digital music small 5-core (2014) (link)
  - interaction density is relatively low
  - the reliability of individual observations is relatively low
  - user history length is relatively long
we divided each dataset into an 8:1:1 ratio for the train (trn), validation (val), and test (tst) sets. the negative sampling ratio for trn/val is 1:4 (pointwise) and 1:1 (pairwise); for tst it is 1:99.
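a minimal sketch of the pointwise negative sampling under these ratios, assuming uniform sampling over non-interacted items; the function and argument names are illustrative:

```python
import random


def sample_negatives(pos_items, all_items, ratio=4):
    """Sample `ratio` non-interacted items per observed positive (uniformly at random)."""
    pos = set(pos_items)
    negatives = []
    for _ in pos_items:
        for _ in range(ratio):
            j = random.choice(all_items)
            while j in pos:                 # resample until a true negative is drawn
                j = random.choice(all_items)
            negatives.append(j)
    return negatives
```

the same routine with `ratio=1` covers the pairwise setting (one sampled negative per positive), and `ratio=99` reproduces the 1:99 evaluation protocol at test time.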
additionally, a leave-one-out dataset was created to determine the early-stopping epoch using a ranking metric (ndcg). validation loss was not used as the criterion because of the observed discrepancy between the ranking metric and the loss function.
initially, model performance for early stopping was evaluated on the leave-one-out dataset every five epochs to reduce computational cost. later, the evaluation procedure was refined by replacing repeated sampling and averaging with the expected attention scores, enabling performance validation at every epoch.
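this works because the mean of a log-normal variable has a closed form: if an unnormalized attention score is modeled as $s \sim \mathrm{LogNormal}(\mu, \sigma^{2})$, then

$$
\mathbb{E}[s] = \exp\!\Big(\mu + \tfrac{\sigma^{2}}{2}\Big),
$$

so the expected scores can be fed to the simplex projection directly instead of averaging over repeated samples.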
the maximum length of a user’s interaction history is about 2,000 items, and the top 10% of users have histories of around 400 items. to improve computational efficiency, each history was truncated to at most 400 items, keeping the items with the highest TF-IDF scores.
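a minimal sketch of this truncation step, assuming a simple item-level idf over user histories and per-user term frequencies; the helper names and the exact tf-idf variant are illustrative, not the project's actual preprocessing code:

```python
import math
from collections import Counter


def truncate_histories(histories, max_len=400):
    """histories: dict mapping user id -> list of interacted item ids."""
    num_users = len(histories)
    # document frequency: in how many user histories each item appears
    df = Counter(item for items in histories.values() for item in set(items))
    idf = {item: math.log(num_users / freq) for item, freq in df.items()}

    truncated = {}
    for user, items in histories.items():
        tf = Counter(items)
        # keep the max_len items with the highest tf * idf score
        ranked = sorted(set(items), key=lambda it: tf[it] * idf[it], reverse=True)
        truncated[user] = ranked[:max_len]
    return truncated
```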
experimental results were obtained using only the log-normal distribution as the probability distribution for the attention scores, with the standard deviation
- pointwise learning (attention score function `concat` is applied)
- pointwise learning (attention score function `prod` is applied)
- pairwise learning (attention score function `concat` is applied)
- pairwise learning (attention score function `prod` is applied)

