Kaggle meetup #3 instacart 2nd place solution


Instacart 2nd place solution for kaggle meetup in Tokyo

Published in: Data & Analytics

  1. 1. 2nd Place Solution Instacart Market Basket Analysis
  2. 2. Agenda • My Background • Problem Overview • Main Approach • Feature Engineering • Feature Importance • Important Findings • F1 maximization
  3. 3. My Background • Bachelor of Economics • Programmer in the financial industry • Consultant in the financial industry • 2nd Place at KDDCUP2015 • Data Scientist at Yahoo! JAPAN
  4. 4. Problem Overview • In this competition, we have to predict reorders. • So it is a little different from general recommendation. • I mean,
  5. 5. Problem Overview • How hot(user)? *prior is regarded as train
  6. 6. Problem Overview • How hot (item)? *Clipped at 500
  7. 7. Problem Overview • Evaluation metric is mean F1 score • Precision and Recall
  8. 8. Problem Overview • Links between the files
  9. 9. Main Approach • We are given orders.csv
  10. 10. Main Approach • We are given orders.csv
  11. 11. Main Approach • We are given order_products.csv
  12. 12. Main Approach • Reorder Prediction user_id product_id label
  13. 13. Main Approach • None Prediction user_id label
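
The two table layouts above (user_id, product_id, label for the reorder model and user_id, label for the None model) can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the variable names, and the choice of every previously bought product as a candidate are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy data: (user_id, product_id) pairs seen in prior orders.
prior = [(1, "banana"), (1, "milk"), (1, "banana"), (2, "eggs"), (2, "milk")]
# Hypothetical ground truth: products each user reordered in the latest order.
train_reordered = {1: {"banana"}, 2: set()}

# Assumption: candidates are all products the user bought before.
candidates = defaultdict(set)
for user_id, product_id in prior:
    candidates[user_id].add(product_id)

# Reorder model: one row per (user, previously bought product).
reorder_rows = [
    (user_id, product_id, int(product_id in train_reordered[user_id]))
    for user_id in sorted(candidates)
    for product_id in sorted(candidates[user_id])
]

# None model: one row per user, label 1 when nothing was reordered.
none_rows = [
    (user_id, int(not train_reordered[user_id]))
    for user_id in sorted(candidates)
]
```
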
  14. 14. Main Approach
  15. 15. Main Approach
  16. 16. Feature Engineering • I made 4 types of features 1. User • What this user is like 2. Item • What this item is like 3. User x Item • How the user feels about the item 4. Datetime • What this day and hour are like *For the None model, which has one row per user, I can’t use the above features directly, except the user and datetime ones. So I convert the others to per-user stats (min, mean, max, sum, std…).
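
The conversion to stats for the None model might look like the sketch below. It uses the total_buy feature described on slide 25 and the "feature-stat" naming seen in the slides (for example total_buy-max). The toy values and the use of the population standard deviation are assumptions; the slides only say "std".

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical user x item feature: (user_id, product_id) -> total_buy.
total_buy = {(1, "banana"): 3, (1, "milk"): 1, (2, "eggs"): 2}

# Group the item-level values by user.
by_user = defaultdict(list)
for (user_id, _product_id), value in total_buy.items():
    by_user[user_id].append(value)

# Collapse each user's values into one row of summary statistics.
user_stats = {
    user_id: {
        "total_buy-min": min(values),
        "total_buy-mean": mean(values),
        "total_buy-max": max(values),
        "total_buy-sum": sum(values),
        "total_buy-std": pstdev(values),  # assumption: population std
    }
    for user_id, values in by_user.items()
}
```
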
  17. 17. Feature Importance for reorder
  18. 18. Feature Importance for None
  19. 19. Important Findings for reorder - 1 • user_id: 54035
  20. 20. Important Findings for reorder - 2 • days_last_order-max is the difference between days_since_last_order_this_item and useritem_order_days_max • days_since_last_order_this_item is a user x item feature: how many days have passed since the user last ordered this item • useritem_order_days_max is also a user x item feature: the maximum span (in days) between the user’s orders of this item • For more detail, see the next page
  21. 21. Important Findings for reorder - 2 • Look at index 0: the user bought this item 14 days ago, and the max span is 30 days • So I think this feature indicates whether or not the user has become bored with the item
  22. 22. Important Findings for reorder - 3 • We already know fruits are reordered more frequently than vegetables (3 Million Instacart Orders, Open Sourced) • I wanted to know how often • So I made an item_10to1_ratio feature, defined as the reorder ratio after an item is ordered vs. not ordered. • See the next page for more details
  23. 23. Important Findings for reorder - 3 • Let’s say userA bought itemA at order_number 1 and 4 • And userB bought itemA at order_number 1 and 3 • item_10to1_ratio is 0.5
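
The slides do not spell out the formula for item_10to1_ratio. One reading that reproduces the 0.5 in this example is: among all cases where a user bought the item in one order and skipped it in the next (pattern 1, 0), the share of cases where the user bought it again in the order after that. The sketch below implements that reading only; the function body is an assumption, and so is the assumption that userA's history ends at order 4 and userB's at order 3.

```python
def item_10to1_ratio(histories):
    """Share of (bought, skipped) pairs that are followed by a purchase.

    histories: one 0/1 list per user, in order_number order,
    where 1 means the item was in that order.
    """
    opportunities = 0
    returns = 0
    for history in histories:
        for i in range(len(history) - 2):
            if history[i] == 1 and history[i + 1] == 0:
                opportunities += 1
                returns += history[i + 2]
    return returns / opportunities if opportunities else float("nan")

user_a = [1, 0, 0, 1]  # bought itemA at order_number 1 and 4
user_b = [1, 0, 1]     # bought itemA at order_number 1 and 3
ratio = item_10to1_ratio([user_a, user_b])
```

With these two histories, userA contributes one (1, 0) pair followed by a skip and userB contributes one followed by a purchase, so the ratio is 1 / 2.
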
  24. 24. Important Findings for None - 1 • Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart that Item B falls into • Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all items • So this feature essentially captures the average position of an item in a user’s cart, and we can see that users who don’t buy many items all at once are more likely to be None
  25. 25. Important Findings for None - 2 • total_buy is the total number of times a user has ordered an item • If userA bought itemA 3 times in the past, this would be 3 • So total_buy-max is the max of the above feature per user • We can see that it predicts whether or not a user will make a reorder
  26. 26. Important Findings for None - 3 • t-1_is_None(User A) is a binary feature that says whether or not the user’s previous order was None. • If the previous order is None, then the next order will also be None with 30% probability.
  27. 27. F1 maximization • In this competition, the evaluation metric was an F1 score, which is a way of capturing both precision and recall in a single metric. • Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No) numbers. • However, in order to perform this conversion, we need to know a threshold. At first, I used grid search to find a universal threshold of 0.2. But I saw comments on the Kaggle discussion boards that said different orders should have different thresholds. • To understand why, let’s look at an example.
  28. 28. F1 maximization
  29. 29. F1 maximization • In the first example, the best threshold is between 0.3 and 0.9 • In the second example, the best threshold is lower than 0.2 • As shown, each order should have its own threshold • But with the above calculation, we first have to prepare every pattern of probabilities • So I needed to come up with another calculation • See the next page
  30. 30. F1 maximization • Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities. • For example, the simulated labels might look like this. • I then calculate the expected F1 score for each set of labels, starting from the highest probability items, and then adding items (e.g., [A], then [A, B], then [A, B, C], etc) until the F1 score peaks and then decreases. • We don’t need to evaluate every pattern like A, B, AB… • Because if we select Item B, we should select Item A as well
  31. 31. F1 maximization • F1score_mean( , [A]) -> 0.809747641431 • F1score_mean( , [A,B]) -> 0.709004233757
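
The simulation described on slides 30 and 31 can be sketched as below. This is an illustration, not the author's code: the helper names other than F1score_mean's lowercase counterpart, the random seed, and the choice to score an order with no true items as F1 = 0 (None handling is left out) are all assumptions.

```python
import random

def f1(truth, pred):
    """F1 between two sets of items; 0.0 when they share nothing."""
    hits = len(truth & pred)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)

def simulate_labels(probs, n_sims=9999, seed=0):
    """Draw n_sims possible baskets, each item kept with its probability."""
    rng = random.Random(seed)
    return [
        {item for item, p in probs.items() if rng.random() < p}
        for _ in range(n_sims)
    ]

def f1score_mean(labels, pred):
    """Average F1 of one prediction against all simulated baskets."""
    pred = set(pred)
    return sum(f1(truth, pred) for truth in labels) / len(labels)

def best_top_k(probs, labels):
    """Add items from most to least likely until mean F1 stops rising."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    best_items, best_score = [], 0.0
    for k in range(1, len(ranked) + 1):
        score = f1score_mean(labels, ranked[:k])
        if score <= best_score:
            break
        best_items, best_score = ranked[:k], score
    return best_items, best_score

probs = {"A": 0.9, "B": 0.3}
labels = simulate_labels(probs)
score_a = f1score_mean(labels, ["A"])
score_ab = f1score_mean(labels, ["A", "B"])
chosen, chosen_score = best_top_k(probs, labels)
```

For these two probabilities the exact expectations are 0.63 + 0.27 × 2/3 = 0.81 for [A] and 0.63 × 2/3 + 0.03 × 2/3 + 0.27 = 0.71 for [A, B], which agrees with the simulated values on slide 31 up to sampling noise, so only Item A is selected.
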
  32. 32. F1 maximization - Predicting None • One way to think about None is as the probability (1 - Item A) * (1 - Item B) * … • But another method is to try to predict None as a special case. • By using our None model and treating None as just another item, we can boost the F1 score from 0.400 to 0.407.
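
The first way of thinking about None, the product (1 - Item A) * (1 - Item B) * …, can be written as a short sketch. It assumes the item probabilities are independent, and the function name is hypothetical. The second method on the slide replaces this product with the output of the separate None model.

```python
import math

def p_none_independent(probs):
    """Probability that no item is reordered, assuming independence."""
    return math.prod(1 - p for p in probs)

# Item A at 0.9 and Item B at 0.3, as in the earlier example.
p_none = p_none_independent([0.9, 0.3])  # (1 - 0.9) * (1 - 0.3)
```
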
  33. 33. Appendix
  34. 34. Appendix
  35. 35. Appendix
  36. 36. 1 month to go…
  37. 37. 7 days to go…
  38. 38. 2 days to go…
  39. 39. (´-`).。oO(
  40. 40. 1 hour to go…
  41. 41. 30 minutes to go…
  42. 42. Did I do it?!
  43. 43. Did I do it?! (I didn’t)
  44. 44. 20 minutes to go…
  45. 45. EOP