Graduate School of Information Science and Technology, Hokkaido University

Online Learning

Online learning is a form of learning in which a learner repeatedly plays rounds, each consisting of a prediction (or a selection) followed by the receipt of a reward that depends on that prediction, and tries to maximize the cumulative reward. As an example, consider the following game. There are two stocks, A and B. A player invests one million yen in stocks A and B for one month and tries to maximize the total market value of the stocks one month later. The market values of the stocks change every day, so the total market value of the player's holdings also changes every day. Every morning, the player is allowed to change the proportion between the two stocks by buying and selling, without changing the total market value; buying and selling stocks are assumed to be free of charge. In this example, the player repeatedly predicts the next morning's market values of the two stocks, adjusts the proportion between them, and then learns the actual values the next morning. As in this example, in online learning a learner predicts an unknown future from past information and repeatedly makes selections so as to maximize cumulative profit. In this stock portfolio problem the player can also learn the outcome of every proportion he/she did not choose, but there is also a setting in which the player observes only the outcome of the choice actually made; a problem in that setting is called a bandit problem. We study bandit problems, which are applicable to recommendation and ad delivery.
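The following is a minimal Python sketch of such a round-by-round protocol for the two-stock game. The exponentiated-gradient style update, the learning rate, and the toy price data are illustrative assumptions, not the method studied in the laboratory.

# A minimal sketch of the two-stock rebalancing game described above,
# using an exponentiated-gradient style update.  The learning rate and
# the toy price data are illustrative assumptions.
import numpy as np

def eg_portfolio(price_relatives, eta=0.05):
    """Run the online rebalancing protocol.

    price_relatives: array of shape (T, n); entry (t, i) is the ratio of
    stock i's price on morning t+1 to its price on morning t.
    Returns the final wealth multiplier starting from 1 unit of money.
    """
    T, n = price_relatives.shape
    weights = np.full(n, 1.0 / n)   # current proportion over the n stocks
    wealth = 1.0
    for t in range(T):
        x = price_relatives[t]
        gain = weights @ x          # one-day growth of the portfolio
        wealth *= gain              # reward of this round
        # shift weight toward the stocks that did well today
        weights = weights * np.exp(eta * x / gain)
        weights /= weights.sum()    # renormalize to keep a valid proportion
    return wealth

# Toy example with two stocks (A and B) over 30 mornings.
rng = np.random.default_rng(0)
relatives = rng.uniform(0.98, 1.03, size=(30, 2))
print(eg_portfolio(relatives))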

Bandit Problem

Suppose you have lent part of your blog page to an internet ad agency as an ad space. For each view of your blog page, the agency chooses one of 100 ads to display in that space. A reader of your blog who is interested in the ad can click it to see its details. The agency is paid by advertisers according to the number of clicks. The agency therefore wants to select ads so as to maximize the payment from advertisers, that is, to maximize the number of clicks, but to do so it must estimate the click rates of the ads. This is a bandit problem because no information can be obtained about the click rates of ads that are not selected. In order to maximize the final cumulative number of clicks in a bandit problem like this,

[Exploration] choose an ad that has been selected only a few times, to improve the accuracy of its estimated click rate

and

[Exploitation] choose the ad with the highest estimated click rate

must be balanced. Strategies that deal with this exploration-exploitation tradeoff have been studied theoretically; one classical example is sketched below.
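The following Python sketch illustrates one classical way to balance exploration and exploitation, the UCB1 index policy, which adds an exploration bonus to each ad's empirical click rate. The Bernoulli click simulation and the true click rates are illustrative assumptions.

# A minimal sketch of the UCB1 index policy for ad selection.
# The click simulation and the true click rates are assumptions.
import math
import random

def ucb1(true_click_rates, n_rounds):
    n_ads = len(true_click_rates)
    counts = [0] * n_ads     # how many times each ad has been shown
    clicks = [0.0] * n_ads   # total clicks observed for each ad
    total_clicks = 0
    for t in range(1, n_rounds + 1):
        if t <= n_ads:
            ad = t - 1       # show every ad once first
        else:
            # pick the ad with the largest upper confidence bound:
            # empirical click rate + exploration bonus
            ad = max(range(n_ads),
                     key=lambda i: clicks[i] / counts[i]
                                   + math.sqrt(2 * math.log(t) / counts[i]))
        click = 1 if random.random() < true_click_rates[ad] else 0
        counts[ad] += 1
        clicks[ad] += click
        total_clicks += click
    return total_clicks

random.seed(0)
print(ucb1([0.02, 0.05, 0.01, 0.08], n_rounds=10000))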

In this laboratory, we study combinatorial bandit problems, in which a combination of choices, rather than a single choice, is selected in each round. Strategies for combinatorial bandit problems are considered useful because, in practical ad delivery and recommendation, multiple ads or multiple items are generally displayed at the same time.
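As an illustration of the combinatorial setting, the following sketch selects k ads per round by their UCB indices, in the spirit of a combinatorial semi-bandit strategy such as CUCB; it is not a description of the laboratory's own algorithm, and all parameters and the simulation are assumptions.

# A minimal sketch of a combinatorial (semi-)bandit round: the k ads
# with the largest UCB indices are displayed, and a click is observed
# for each displayed ad.  All parameters are illustrative assumptions.
import math
import random

def cucb_top_k(true_click_rates, k, n_rounds):
    n = len(true_click_rates)
    counts = [0] * n
    clicks = [0.0] * n
    total = 0
    for t in range(1, n_rounds + 1):
        # UCB index for each ad; unseen ads get an infinite index so that
        # every ad is explored at least once
        def index(i):
            if counts[i] == 0:
                return float("inf")
            return clicks[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
        chosen = sorted(range(n), key=index, reverse=True)[:k]
        for i in chosen:   # semi-bandit feedback: one observation per shown ad
            c = 1 if random.random() < true_click_rates[i] else 0
            counts[i] += 1
            clicks[i] += c
            total += c
    return total

random.seed(0)
print(cucb_top_k([0.02, 0.05, 0.01, 0.08, 0.03], k=2, n_rounds=5000))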

[Figure: bandit_ad_en.jpg]

Recommendation

On the information page of a product at an online shopping site such as Amazon, you may often find items you like in a frame entitled “Customers Who Bought This Item Also Bought.” This is a kind of recommendation called collaborative filtering, in which products bought by other users whose purchase histories are similar to yours are recommended to you. Research on collaborative filtering has been active since the mid-1990s, along with the spread of the Internet and the WWW. Since the mid-2000s, matrix factorization has been widely used to predict users' ratings; it can be viewed as a model that predicts a user's rating of an item he/she has not yet rated as the inner product of low-dimensional latent vectors representing the user's and the item's features.
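The following Python sketch shows the idea of matrix factorization for rating prediction: latent vectors are fitted to observed ratings by stochastic gradient descent, and unobserved ratings are predicted by inner products. The dimension, learning rate, regularization constant, and toy data are illustrative assumptions.

# A minimal sketch of rating prediction by matrix factorization: each
# user u and item i get low-dimensional latent vectors, and the rating
# is predicted as their inner product.  All constants and the toy data
# are illustrative assumptions.
import numpy as np

def matrix_factorization(ratings, n_users, n_items, dim=2,
                         lr=0.01, reg=0.1, epochs=200):
    """ratings: list of (user, item, rating) triples for observed entries."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, dim))  # user latent vectors
    Q = rng.normal(scale=0.1, size=(n_items, dim))  # item latent vectors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]        # prediction error on this rating
            pu = P[u].copy()
            # stochastic gradient step with L2 regularization
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Toy example: 3 users, 4 items, a few observed ratings on a 1-5 scale.
observed = [(0, 0, 5), (0, 1, 4), (1, 1, 5), (1, 2, 2), (2, 0, 1), (2, 3, 4)]
P, Q = matrix_factorization(observed, n_users=3, n_items=4)
print(P @ Q.T)   # predicted ratings, including the unrated entries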

We deal with collaborative-filtering recommendation in the framework of a bandit problem, and study recommendation methods based on matrix factorization that maximize user satisfaction.

[Figure: bandit_e.jpg]
