Decomposing the Human Palate with Matrix Factorization

February 6, 2015

This still frame from the visualization tool below shows the breakdown of Naan bread into its constituent tastes.


Summary: My classmates and I applied the matrix factorization technique for recommender systems to a data set of 120K user profiles scraped from the recipe-sharing web site AllRecipes.com and extracted 40 latent “tastes” that combine to make up all sorts of human foods.

If you understood that sentence, feel free to skip ahead to Part II of this post, Matrix Factorization and Food, where you can view the results and play around with some interactive visualizations.

A Gentle Introduction to Matrix Factorization

First off, I’m sorry. I know that the term “matrix factorization” is liable to trigger a life-threatening reaction in persons allergic to linear algebra. If reading the title of this blog post is causing you to go into anaphylactic shock, please know that I meant no harm.

The truth is that I didn’t pay any attention in linear algebra class either. As far as I can tell, “orthogonal” is just a fancy synonym for “unrelated” that I use when I want to sound smart, an “eigenvector” is probably something horrible that Wernher von Braun dreamt up before he switched employers in 1945, and a “product space” is the kind of thing that a “growth hacker” gets excited about.

The good news is that understanding matrix factorization does not require any linear algebra beyond the definition of matrix multiplication.

But before I get into the details, I want to motivate the algorithm by pointing out an application to the most heavily studied problem in computer science: how to get people to buy more things.

Recommender Systems

Suppose that I order the City Chic First Years Jet Stroller on Amazon.com. Amazon would like to sell me more items. Which items should they recommend?

One idea is to recommend items that are intrinsically similar to the City Chic First Years Jet Stroller. These might be items that also have “stroller” in the name, or items that are made by the same manufacturer, or items whose descriptions share words in common with the description of the stroller. This is called content-based recommendation.

A more interesting idea is to treat the items as black boxes and somehow learn correlations between the items from the ratings history of all users. This is called collaborative filtering. For example, I would bet that people who buy baby strollers are also likely to buy the Essential Mozart CD collection. The association between these two items could never be detected by content-based methods alone because a Mozart collection has zero properties in common with a baby stroller. An intelligent yet uninformed alien visiting Planet Earth for a day would never think, “Ah, yes, of course people with no previous discernible interest in classical music will suddenly snatch up the collected works of Wolfgang Amadeus once they are also in the market for a baby stroller.” Fortunately, thanks to collaborative filtering, Amazon can pick up on the relationship between these two very distinct items and make the right recommendation. It’s a win-win situation: Jeff Bezos will become richer and the baby won’t get left behind.

One of the most fascinating techniques for collaborative filtering is matrix factorization. It is most easily understood as a form of dimensionality reduction.

Dimensionality Reduction

My recollection of one such quiz.

As an insufferable ninth grader, I took a bunch of Facebook quizzes during the 2008 U.S. presidential election cycle. These quizzes would ask a zillion questions about my personal political opinions, and then they would show me a Cartesian plane with two dimensions—social views and economic views—with a big dot somewhere that was supposed to represent me. There were also dots for the candidates, Barack Obama and John McCain; dots for insightful thinkers like Karl Marx, William F. Buckley, and Abraham Lincoln; and a dot for Ayn Rand.

These people had all apparently taken the quiz, and, incredibly enough, the sum total of their lives could be reduced to just two numbers: their position on the “social” axis, and their position on the “economic” axis.

This magic trick is called dimensionality reduction.

What a feat! Political platforms probably have hundreds of dimensions, like taxation, immigration, energy, national security, climate change, and whether or not the candidate believes it is acceptable to drive to Canada with an Irish Setter tied to the roof of his car. But the ideology quiz reduced every politician to just two dimensions: “social” and “economic.” Working in this simplified two-dimensional space made it a lot easier to identify the politicians whose views most closely matched my own.

The matrix factorization technique for recommender systems involves a similar kind of dimensionality reduction, except that the algorithm learns the dimensions on its own.

In other words, matrix factorization is the real deal. It’s the stuff that separates true data scientists from charlatans — the data alchemists, data phrenologists, and data astrologers of the world.

A plausible item vector for the item "Essential Mozart." Factor interpretations are on the left; factor weights are in the boxes.

Suppose that Amazon could somehow reduce the dimensionality of every item in its inventory down to, say, 50 factors. Both items and users could then be represented as weight vectors of length 50. Ideally, the factors would have interpretable meanings. So, to continue the Mozart example from above, maybe factor 32 represents “neurotic upper-middle-class American parent.” In that case, the 32nd weight of each user’s vector would quantify the degree to which that user is a neurotic upper-middle-class American parent. Likewise, the 32nd weight of each item’s vector would quantify the degree to which that item appeals to neurotic upper-middle-class American parents. (See an example item vector on the right.)

To predict any user’s rating for any item, we can just take the dot product between the user’s vector and the item’s vector.
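
For the concretely minded, here is a minimal sketch of that prediction step in Python. The 50-dimensional vectors are random placeholders, not anything Amazon actually computes:

```python
import numpy as np

# Hypothetical 50-factor vectors for one user and one item (made-up values).
user_vec = np.random.rand(50)   # the user's affinity for each factor
item_vec = np.random.rand(50)   # the item's loading on each factor

# The predicted rating is just the dot product of the two vectors.
predicted_rating = user_vec @ item_vec
print(predicted_rating)
```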

I still have not explained how Amazon can use its user/item ratings data to learn a 50-dimensional vector representation of each item and of each user. Intuitively, learning these vectors is an optimization problem where the goal is to maximize the agreement between the user/item ratings that these vectors predict, and the user/item ratings that have been observed. More on this later.

Nor have I explained how Amazon can specify that factor 32 should be the “neurotic upper-middle-class American parent” factor. Actually, it can’t. The intuitive meaning of each factor is derived post hoc from vector representations of items and users, rather than the other way around. After the item and user vector representations have been generated, data scientists can look at the results, observe that the items with the highest weights for that factor are baby strollers and Mozart CDs, and conclude that this must be a “neurotic upper-middle-class American parent” factor.

This style of reasoning is called “extracting actionable insights.” It is the cornerstone of the data-scientific method that sets modern data science apart from earlier, barbaric, pseudo-data sciences.

The algorithm that learns these 50-dimensional representations does discover the factors, but it doesn’t name them. This algorithm is called matrix factorization. Why? Because it turns out that this task—learning low-dimensional vector representations for each item and for each user that are consistent with observed user ratings—can be viewed from another angle as a matrix completion problem.

The Matrix Completion View

A realistic ratings matrix. An empty square means that the user hasn't yet rated the item.

Imagine that Amazon has a big, incomplete USERS x ITEMS matrix. The value in the (USER, ITEM) square is the rating (say, from 1 to 5) that USER gave to ITEM. If USER hasn’t yet rated ITEM, then the (USER, ITEM) square is empty. Indeed, most squares are empty. Amazon needs to somehow complete the USERS x ITEMS matrix, so that it can make good recommendations, so that it can bring in revenue, because, you know, the Washington Post isn’t going to buy itself.

The matrix factorization approach is to approximate the completed USERS x ITEMS matrix as the product of a USERS x FACTORS matrix and a FACTORS x ITEMS matrix. The number of factors is fixed in advance at, say, 50. The values in these two individual matrices are carefully chosen such that the product of the two matrices (that is, the completed USERS x ITEMS matrix, or the predicted user/item ratings) is mostly consistent with the incomplete USERS x ITEMS matrix (that is, the observed user/item ratings). Each row in the USERS x FACTORS matrix is a 1 x 50 user vector that represents a single user’s preferences for each of the 50 factors. Each column in the FACTORS x ITEMS matrix is a 50 x 1 item vector that represents the breakdown of a single item into each of the 50 factors. When these two matrices are multiplied together, the value at the (USER, ITEM) location in the completed matrix is, by definition, the dot product of USER’s user vector and ITEM’s item vector, which, as we have already seen, is a measure of USER’s preference for ITEM!
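
As a sanity check on the shapes involved, here is a toy sketch in Python. The ratings matrix, the 2% fill rate, and the random factor matrices are all invented; the point is only that the product of the two small matrices fills in every square, and that the fit is judged on the squares we actually observed:

```python
import numpy as np

n_users, n_items, n_factors = 1000, 500, 50

# Toy incomplete ratings matrix: NaN marks squares the user hasn't rated yet.
ratings = np.full((n_users, n_items), np.nan)
observed = np.random.rand(n_users, n_items) < 0.02           # ~2% of squares filled
ratings[observed] = np.random.randint(1, 6, observed.sum())  # ratings from 1 to 5

# The two factor matrices we want to learn (random starting point here).
user_factors = np.random.rand(n_users, n_factors)   # USERS x FACTORS
item_factors = np.random.rand(n_factors, n_items)   # FACTORS x ITEMS

# Their product is a *complete* USERS x ITEMS matrix of predicted ratings.
predicted = user_factors @ item_factors

# The quality of the fit is measured only on the observed squares.
error = np.mean((ratings - predicted)[observed] ** 2)
```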

As discussed in the preceding section, factorizing the incomplete USERS x ITEMS matrix is an optimization problem where the goal is to minimize some metric of difference between the incomplete USERS x ITEMS matrix (the observed ratings) and the completed USERS x ITEMS matrix (the predicted ratings). We will also add a non-negativity constraint: every value in the multiplicand and multiplier matrices should be greater than or equal to zero. This constraint makes the factors more interpretable.

There are a number of algorithms in the literature for non-negative matrix factorization. Below, we use the iterative method described here. The algorithm uses "multiplicative update rules" to minimize a metric called the KL divergence, which measures the difference between the observed matrix and its reconstruction.
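
If, as I believe, the linked method is the classic multiplicative-update scheme of Lee and Seung, one iteration looks roughly like the following numpy sketch. Here V is the non-negative matrix being factorized, W and H are the two factors, and the epsilon is just there to avoid dividing by zero:

```python
import numpy as np

def nmf_kl_step(V, W, H, eps=1e-9):
    """One multiplicative update for the KL-divergence NMF objective.

    V : (m, n) non-negative data matrix (float)
    W : (m, k) non-negative factor matrix, e.g. users x factors (float)
    H : (k, n) non-negative factor matrix, e.g. factors x items (float)
    """
    # Update H while holding W fixed.
    WH = W @ H + eps
    H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)

    # Update W while holding the new H fixed.
    WH = W @ H + eps
    W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```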

Anyway, enough about Amazon. The rest of this post will focus on an application of matrix factorization to a domain—possibly the only domain—that is more central to my life than the online purchase of consumer products.

Matrix Factorization and Food

I took an introductory machine learning course from David Blei in Spring 2014. For the final project, a few classmates and I decided to build a recommender system for foods. Our recommender system itself turned out to be pretty mediocre, but some genuinely interesting factors fell out of the matrix factorization.

We scraped our data set from AllRecipes.com, a popular recipe-sharing web site. With tens of thousands of recipes and over a hundred thousand users, AllRecipes is a gold mine for food data. Each AllRecipes user has a recipe box where he or she can list his or her favorite recipes. It was these recipe boxes that we scraped. After throwing away all recipes that occurred in fewer than 300 recipe boxes, we were left with 9,546 recipes and 115,308 users. All users collectively had 9.6 million recipes in their recipe boxes.

We experimented with two different approaches to the problem of recipe recommendation: a content-based method using only information about the ingredients in each recipe, and a collaborative filtering method using only data about user-recipe preferences.

The content-based method was straightforward: represent each recipe as a vector of all ingredients in the “kitchen,” where the value at each element is the mass fraction that the corresponding ingredient takes up in the recipe. To measure the similarity between two recipes, take the cosine similarity between their vectors. This approach worked fine, but I’ll ignore it for the remainder of this post.
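
Here is a minimal sketch of that similarity computation, with an invented three-ingredient "kitchen" and made-up mass fractions:

```python
import numpy as np

# Each recipe is a vector over the whole "kitchen" of ingredients, holding the
# mass fraction that each ingredient contributes to the recipe.
kitchen = ["flour", "sugar", "tomato"]        # invented, tiny kitchen
pancakes = np.array([0.70, 0.30, 0.00])       # invented mass fractions
marinara = np.array([0.00, 0.05, 0.95])

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(pancakes, marinara))  # low: the recipes share little
```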

The collaborative filtering method was far more interesting because it didn’t know anything at all other than which users liked which recipes. We represented the user-recipe preference data as a 115,308 x 9,546 user-by-recipe matrix with a 1 in every square where the corresponding user had the corresponding recipe in his/her recipe box, and a 0 everywhere else. Then we factorized the user-by-recipe matrix, using 40 factors, into a 115,308 x 40 user-by-factor matrix and a 40 x 9,546 factor-by-recipe matrix.

We used the MATLAB NMF toolbox to perform non-negative matrix factorization. The NMF algorithm took 8 hours to complete 2,000 iterations on a machine with a 2.2GHz processor and 128 GB RAM.
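
We ran the factorization in MATLAB, but if you wanted to reproduce the idea today, an equivalent sketch in Python with scikit-learn (its multiplicative-update solver with the KL loss) might look like this. The input file and variable names are placeholders, not our actual pipeline:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# (user_index, recipe_index) pairs scraped from recipe boxes; hypothetical file.
pairs = np.loadtxt("recipe_box_pairs.csv", delimiter=",", dtype=int)
n_users, n_recipes = 115308, 9546

# Binary user-by-recipe matrix: 1 if the user saved the recipe, 0 otherwise.
X = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
               shape=(n_users, n_recipes))

# 40-factor non-negative factorization with multiplicative updates and KL loss,
# roughly matching the MATLAB toolbox setup described above.
model = NMF(n_components=40, solver="mu", beta_loss="kullback-leibler",
            max_iter=2000, init="random", random_state=0)
user_by_factor = model.fit_transform(X)     # 115,308 x 40
factor_by_recipe = model.components_        # 40 x 9,546
```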

Sure enough, after eight hours of compute time, the factors produced by non-negative matrix factorization really did seem to correspond to human tastes!

To clarify, each row in the user-by-factor matrix is a 1 x 40 vector that represents a single user’s preference for each of the 40 factors. And each column in the factor-by-recipe matrix is a 40 x 1 vector that represents the decomposition of a single recipe into its constituent factors.

I will explore three different ways to play around with the data generated by the factorization:

  1. First, we can ask questions like: "What does factor #3 really represent?" The easiest way to find out is to look at the dishes with the highest weights in factor #3. The top ones are Quinoa and Black Beans, Baked Kale Chips, and Quinoa Side Dish. In this example, we could conclude that factor #3 represents vegan food.
  2. Second, we can ask questions like: "Which factors does the dish Butter Chickpea Curry consist of, and in what proportions?" Answering this question is as simple as looking up the column for Butter Chickpea Curry in the factor-by-recipe matrix produced by NMF. Use the Taste Breakdown visualization tool below to view the factor breakdown of any recipe. Apparently, Butter Chickpea Curry is 62% vegetarian, 34% Indian food, and 4% random noise.
  3. Third, we can ask questions like "What food would we get if we smashed together the 'vegetarian' factor and the 'cookies/brownies' factor?" To find out, we can sort all recipes in decreasing order of cosine similarity to a made-up recipe vector with `0.5` in the 'vegetarian' position and `0.5` in the 'cookies/brownies' position. The Mix 'n Match visualization tool below allows you to make such queries (and all three query types are sketched in code right after this list). The answer, in this instance, is apparently "Vegan Brownies."
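
Here is roughly what those three queries look like in code, continuing from the factorization sketch above. The `recipe_names` list and the two factor indices in the last query are hypothetical stand-ins:

```python
import numpy as np

# Continues from the factorization sketch above:
#   factor_by_recipe : 40 x 9,546 matrix
#   recipe_names     : hypothetical list mapping column index -> recipe title

# 1. What does factor #3 represent? Look at its most heavily weighted recipes.
top = np.argsort(factor_by_recipe[3])[::-1][:5]
print([recipe_names[i] for i in top])

# 2. Which factors make up a given recipe, and in what proportions?
col = factor_by_recipe[:, recipe_names.index("Butter Chickpea Curry")]
breakdown = col / col.sum()               # normalize to proportions

# 3. Mix 'n match: which recipes best match a made-up taste profile?
VEGETARIAN, COOKIES = 7, 12               # hypothetical indices for those tastes
query = np.zeros(40)
query[VEGETARIAN], query[COOKIES] = 0.5, 0.5
cols = factor_by_recipe / np.linalg.norm(factor_by_recipe, axis=0)
scores = (query / np.linalg.norm(query)) @ cols
print([recipe_names[i] for i in np.argsort(scores)[::-1][:5]])
```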

The Tastes

These are the 40 tastes that came out of the matrix factorization. We describe each taste by printing the five exemplar recipes with the highest weights.

A few observations:

Taste Breakdown

This interactive tool gives the taste breakdown of 9,000 or so recipes from AllRecipes.com.

To pull up the taste breakdown for any food, just start typing the name of the food into the search bar, and then click on one of the recipes that will pop up in the autocomplete box.

Clicking on any of the example recipes that appear will pull up their taste breakdowns.

To share the link to a taste breakdown that catches your interest, right click the permalink icon and select “Copy Link Address” in your browser.

Here are some intuitions that were borne out by the matrix factorization:

Mix ‘n Match

This interactive tool lets you mix and match tastes! Select two or three tastes from the two-column list below, and see which recipes most closely match the combined taste profile.

If you’re out of ideas, try mixing alcoholic beverages with sugar cookies, slow cooker with barbecue, or biscotti with breakfast foods.

Click on some tastes to begin!


Data

Code and data are available on GitHub.

Hacker News

Read the discussion on Hacker News.

Thanks for Reading

This post describes a project carried out for David Blei’s course “COS 424: Interacting with Data” in Spring 2014 at Princeton University by Rob Sami (now Google), Aaron Schild (now Berkeley), Spencer Tank (now Palantir), and me.

Thanks to Charles Guo, Charlie Marsh, Lucas Mayer, Shubhro Saha, and Jacob Simon for reading drafts of this post.