The 4th Place Approach to the 2019 ACM Recsys Challenge by Team RosettaAI

Kung-Hsiang, Huang (Steeve)
Rosetta.ai Taiwan Blog
7 min read · Jul 10, 2019


[Header image from https://tech.trivago.com/2019/03/11/recsys-challenge-2019/]

The three-month recommender system competition, the 2019 ACM RecSys Challenge, has finally come to an end. We, RosettaAI, finished 4th among 1,564 teams on the private leaderboard, clinching the final spot in the prize zone! Big thanks to Professor Lin and our awesome teammates from National Taiwan University and the University of Texas at Dallas. Without such great teamwork, this achievement would have been impossible.

Abstract

This year's ACM RecSys Challenge, hosted by trivago.com, challenged data scientists to build a session-based, context-aware recommender system to solve a ranking problem. Given a sequence of user interactions, the goal is to rank a list of hotels such that the higher a hotel appears in the list, the more likely it is to be clicked out, as shown in the following figures.

[Figures: an example session and its click-out impression list, from https://tech.trivago.com/2019/03/11/recsys-challenge-2019/]

Our approach involves three different models: Neural Networks, LightGBM, and XGBoost. I was in charge of the entire data processing procedure, the design and implementation of the Neural Networks, and the majority of the feature engineering. This blog post summarizes our approach: I'll briefly cover the datasets, the loss function, the Neural Network architecture, and the feature engineering.

Datasets

Trivago provided two sets of data: the user-item interactions and the item metadata. The latter will not be discussed in this post, as it contains relatively little information. Below is an example of a sequence of interactions within a session.

Each row represents a user action, the type of which is recorded in the action_type column. Its corresponding column, reference, tells us which object the action operated on. For instance, in the first row, search for destination is matched with Barcelona, Spain, meaning the user entered Barcelona, Spain in the destination box. An integer value in the reference column represents a hotel ID. The most important action is clickout item, which indicates that the user clicked a link leading to an external site. Only in rows with this action are the impressions and prices columns non-null. These two columns store the list of hotels displayed to the user and the price of each hotel, respectively. In these clickout item rows, the reference column can be regarded as the target. The goal is to sort the impression list such that the target is positioned at the front of the list.

The evaluation metric for this challenge is Mean Reciprocal Rank (MRR):

MRR = (1/|Q|) · Σᵢ 1/rankᵢ, where rankᵢ is the position of the clicked-out item in the i-th sorted impression list (formula image originally from https://www.oreilly.com/library/view/c-machine-learning/9781788996402/assets/b7a61c4a-53bb-43ae-8140-4f81106476ac.png)

Basically, this metric rewards algorithms that rank the target higher than the other candidates. Trivago has provided an example to illustrate the metric in case the definition isn't clear:

[Worked MRR example from the organizers: https://recsys.trivago.cloud/challenge/dataset/]

Loss Function

At first glance, I thought about using a ranking loss, which I believe most teams opted for, such as Bayesian Personalized Ranking (BPR) or LambdaRank. However, I eventually decided to treat it as a binary classification problem, breaking each item in the impression list down into an individual sample and defining the loss as Binary Cross-Entropy (BCE); a sketch of this framing follows the list below. I did so for two reasons:

  1. I was inspired by this paper published by YouTube back in 2017. It appeared to me that the use case of their second-stage predictions resembles the context of this challenge.
  2. The data reminded me of a previous competition I took part in, Avito Demand Prediction, held on Kaggle by Avito (we finished 5 places away from the gold medal zone). I remember many of the top scorers using BCE loss even though it was not a binary classification problem. It seems that as long as the target can be framed as a probability, BCE is a robust loss function for the model to learn from.
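To make this framing concrete, here is a minimal sketch (with illustrative column names, not our exact pipeline) of how each click-out row can be exploded so that every hotel in the impression list becomes one binary sample:

```python
import pandas as pd

# Each click-out row is exploded so that every hotel in the impression list
# becomes one sample, labeled 1 only for the hotel that was actually clicked.
# Column names are illustrative; multi-column explode needs pandas >= 1.3.
clickouts = pd.DataFrame({
    "session_id": ["s1", "s2"],
    "reference": ["803", "112"],                 # clicked hotel ID (the target)
    "impressions": ["901|803|455", "112|774"],   # hotels shown to the user
    "prices": ["120|95|210", "60|80"],
})

samples = clickouts.assign(
    item_id=clickouts["impressions"].str.split("|"),
    price=clickouts["prices"].str.split("|"),
).explode(["item_id", "price"])

samples["label"] = (samples["item_id"] == samples["reference"]).astype(int)
print(samples[["session_id", "item_id", "price", "label"]])
```

At inference time, the items in each impression are simply sorted by the predicted click probability, so the binary classifier doubles as the ranker.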

Neural Network Architecture

My Neural Network is based on the previously mentioned YouTube paper and DeepFM.

The Proposed Network Architecture

The features can be divided into three parts: continuous features, categorical features, and sequential categorical features.

Continuous Feature

Processing continuous features is simple. The only thing to be aware of is that continuous features should be properly scaled so that the Neural Nets can learn better. In our implementation, the MinMaxScaler from Sklearn and the RankGaussScaler from this amazing Kaggle kernel worked the best.
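As a rough illustration (my own re-implementation of the idea, not the kernel's exact code), the rank-gauss transform ranks each value, maps the ranks into (-1, 1), and applies the inverse error function so the result is roughly Gaussian:

```python
import numpy as np
from scipy.special import erfinv
from sklearn.preprocessing import MinMaxScaler

def rank_gauss(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Rank-gauss scaling: rank -> uniform in (-1, 1) -> inverse error function."""
    ranks = x.argsort().argsort().astype(np.float64)    # 0 .. n-1
    uniform = ranks / (len(ranks) - 1) * 2 - 1          # map to [-1, 1]
    uniform = np.clip(uniform, -1 + eps, 1 - eps)       # keep erfinv finite
    return erfinv(uniform)                              # roughly standard normal

prices = np.array([95.0, 120.0, 210.0, 60.0, 80.0])
minmax_scaled = MinMaxScaler().fit_transform(prices.reshape(-1, 1))
gauss_scaled = rank_gauss(prices)
```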

Categorical Feature

Categorical features were encoded with my self-implemented encoder rather than Sklearn's LabelEncoder. The latter sorts the values while fitting the data, so it takes a huge amount of time because the data is very large and some of the features are strings. In addition, my encoder gives me direct access to the forward and backward encoding-decoding mappings for every feature. After being encoded, these categorical features are passed through embedding layers to obtain higher-dimensional vectors that carry more information than the original values.
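The encoder below is a hypothetical sketch that captures the two properties mentioned above: it assigns IDs in order of first appearance (no sorting) and keeps both the forward and backward mappings:

```python
class SimpleLabelEncoder:
    """Hypothetical sketch of an order-free label encoder that keeps both
    the forward (value -> id) and backward (id -> value) mappings.
    Unlike sklearn's LabelEncoder, it does not sort the unique values."""

    def __init__(self):
        self.value_to_id = {}
        self.id_to_value = {}

    def fit_transform(self, values):
        encoded = []
        for v in values:
            if v not in self.value_to_id:
                idx = len(self.value_to_id)
                self.value_to_id[v] = idx
                self.id_to_value[idx] = v
            encoded.append(self.value_to_id[v])
        return encoded

    def inverse_transform(self, ids):
        return [self.id_to_value[i] for i in ids]

enc = SimpleLabelEncoder()
codes = enc.fit_transform(["mobile", "desktop", "mobile", "tablet"])
print(codes)                          # [0, 1, 0, 2]
print(enc.inverse_transform(codes))   # ['mobile', 'desktop', 'mobile', 'tablet']
```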

Sequential Categorical Feature

Similarly, sequential categorical features are encoded and embedded with the same method. One thing to note is that features of the same type share the same embedding weights. For instance, in the above figure, the blue, red, and green bars below are derived from the same item embedding matrix. A bi-directional GRU is applied to two of these features to model the temporal dynamics within each sequence. The GRU outputs are then aggregated with max pooling so that they become vectors of the same dimension as the other categorical features.
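A simplified PyTorch sketch of this part of the network (illustrative dimensions, not our exact implementation) could look like this:

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Illustrative sketch: a single item-embedding table shared by the
    sequential item features, followed by a bi-directional GRU and
    max pooling over the time dimension."""

    def __init__(self, n_items: int, emb_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, emb_dim, padding_idx=0)  # shared weights
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, item_seq: torch.Tensor) -> torch.Tensor:
        # item_seq: (batch, seq_len) of encoded item IDs
        emb = self.item_emb(item_seq)     # (batch, seq_len, emb_dim)
        out, _ = self.gru(emb)            # (batch, seq_len, 2 * hidden)
        return out.max(dim=1).values      # max pooling -> (batch, 2 * hidden)

encoder = SequenceEncoder(n_items=1000)
pooled = encoder(torch.randint(0, 1000, (8, 12)))   # -> shape (8, 128)
```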

Feature Interaction

As pointed out in the DeepFM paper, feature interaction allows the model to discover the implicit meaning behind user behaviors. Consider an app store recommender system: when the app category feature interacts with the timestamp feature, the model may gain a better understanding of the user's intention. For instance, during meal times, it may be a good idea to recommend food delivery apps. Hence, the importance of an effective feature interaction mechanism cannot be overstated.

Typically, feature interaction refers to the dot products between the latent vectors of each pair of features. Based on this terrific XNN repository by ChenglongChen, it turns out that we only need five lines of code to compute the aggregation of these dot products.

The idea is based on the following equations:
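I believe the equations in question are the standard Factorization Machine pairwise-interaction identity, which rewrites the sum of all pairwise dot products of the field embeddings as:

```latex
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle
  = \frac{1}{2}\left[\Big\lVert \sum_{i=1}^{n} v_i \Big\rVert^{2}
  - \sum_{i=1}^{n} \lVert v_i \rVert^{2}\right]
```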

Essentially, to compute the sum of the pairwise dot products of the embeddings, we subtract the sum of the squares of these vectors from the square of their sum and divide the result by 2. In this way, the second-order feature interactions can be computed efficiently.
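A minimal PyTorch sketch of that computation (my own illustration of the trick, modeled on the idea rather than copied from the XNN repository):

```python
import torch

def second_order_interaction(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (batch, n_fields, emb_dim) stacked field embeddings.
    Returns the sum of all pairwise dot products per sample, computed as
    0.5 * ((square of sum) - (sum of squares)), summed over the embedding dims."""
    sum_then_square = embeddings.sum(dim=1).pow(2)   # (batch, emb_dim)
    square_then_sum = embeddings.pow(2).sum(dim=1)   # (batch, emb_dim)
    return 0.5 * (sum_then_square - square_then_sum).sum(dim=1)  # (batch,)

x = torch.randn(4, 10, 16)   # 4 samples, 10 fields, 16-dim embeddings
print(second_order_interaction(x).shape)   # torch.Size([4])
```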

Feature Engineering

In the latter part of this competition, we spent the majority of our time on feature engineering for LightGBM and XGBoost. In the end, we came up with more than 60 features. Some of the more interesting ones are described below:

Predicted Next Interaction Position

This feature is based on the positions, in the current impression list, of the items from the last step and the second-to-last step. It is computed as follows:

predicted_pos = last_pos + (last_pos - second_last_pos)

The term in parentheses on the right-hand side is the distance, within the impression list, between the position of the second-to-last step and that of the last step. We then add that distance to the last position. For example, if the second-to-last item sat at position 4 and the last item at position 7, the predicted position is 7 + (7 - 4) = 10. We can think of it as extrapolating the user's last motion, so it can be viewed as the predicted position of the next interaction.

Time Difference

We took the first, second, and third derivatives of the timestamp between steps. Specifically, the first derivative is the difference between the timestamps of two consecutive rows, the second derivative is the difference between consecutive first derivatives, and the third derivative is computed in the same manner from the second.
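In pandas-style pseudocode (illustrative column names), the three derivatives are just repeated per-session diffs:

```python
import pandas as pd

# Illustrative sketch: timestamps of consecutive steps within a session.
df = pd.DataFrame({
    "session_id": ["s1"] * 5,
    "timestamp":  [100, 104, 112, 113, 130],
})

df["dt_1"] = df.groupby("session_id")["timestamp"].diff()   # first derivative
df["dt_2"] = df.groupby("session_id")["dt_1"].diff()        # second derivative
df["dt_3"] = df.groupby("session_id")["dt_2"].diff()        # third derivative
```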

Time Elapse

By aggregating the time differences between steps, we compute the maximum, average, and sum of the amount of time the user spends on each item in the impressions.
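A self-contained sketch of those aggregates (again with illustrative column names), where dt is the per-step time difference and item_id is the item the user interacted with at that step:

```python
import pandas as pd

steps = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s1"],
    "item_id":    ["803", "455", "803", "455"],
    "dt":         [4.0, 8.0, 1.0, 17.0],       # time spent on each step
})

# Max / average / sum of the time the user spent on each item in the session.
elapsed = (
    steps.groupby(["session_id", "item_id"])["dt"]
         .agg(["max", "mean", "sum"])
         .reset_index()
)
print(elapsed)
```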

Conclusion

Overall, we had a lot of fun in this challenge. Although the competition was ridiculously fierce, especially in the last few weeks, the results turned out pretty well. Sincere appreciation again to all our great teammates and everyone who gave us advice along the way. Thank you all for making this possible. The GitHub repository for our solution can be found here. Should you have any questions or comments, please leave them below. If you are looking for product trials or collaboration opportunities, please contact me directly at steeve@rosetta.ai.
