Understanding Random Forests: From Theory to Practice
Slides of my PhD defense, held on October 9.

Understanding Random Forests: From Theory to Practice Presentation Transcript

  • 1. Understanding Random Forests From Theory to Practice Gilles Louppe Université de Liège, Belgium October 9, 2014 1 / 39
  • 2. Motivation 2 / 39
  • 3. Objective From a set of measurements, learn a model to predict and understand a phenomenon. 3 / 39
  • 4. Running example From physicochemical properties (alcohol, acidity, sulphates, ...), learn a model to predict wine taste preferences (from 0 to 10). P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, 2009. 4 / 39
  • 5. Outline 1 Motivation 2 Growing decision trees and random forests Review of state-of-the-art, minor contributions 3 Interpreting random forests Major contributions (Theory) 4 Implementing and accelerating random forests Major contributions (Practice) 5 Conclusions 5 / 39
  • 6. Supervised learning
    • The inputs are random variables X = X_1, ..., X_p ;
    • The output is a random variable Y.
    • Data comes as a finite learning set L = {(x_i, y_i) | i = 0, ..., N − 1}, where x_i ∈ X = X_1 × ... × X_p and y_i ∈ Y are randomly drawn from P_{X,Y}.
      E.g., (x_i, y_i) = ((color = red, alcohol = 12, ...), score = 6)
    • The goal is to find a model ϕ_L : X → Y minimizing Err(ϕ_L) = E_{X,Y}{L(Y, ϕ_L(X))}.
    6 / 39
  • 7. Performance evaluation
    Classification
    • Symbolic output (e.g., Y = {yes, no})
    • Zero-one loss L(Y, ϕ_L(X)) = 1(Y ≠ ϕ_L(X))
    Regression
    • Numerical output (e.g., Y = ℝ)
    • Squared error loss L(Y, ϕ_L(X)) = (Y − ϕ_L(X))²
    7 / 39
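    As a concrete illustration (not part of the slides), both losses in a few lines of NumPy; the function names are illustrative:

        import numpy as np

        def zero_one_loss(y_true, y_pred):
            # Average indicator of misclassification, 1(Y != phi(X)).
            return np.mean(y_true != y_pred)

        def squared_error_loss(y_true, y_pred):
            # Average squared deviation, (Y - phi(X))^2.
            return np.mean((y_true - y_pred) ** 2)

        print(zero_one_loss(np.array(["yes", "no", "no"]), np.array(["yes", "yes", "no"])))  # 0.333...
        print(squared_error_loss(np.array([6.0, 5.0]), np.array([5.5, 5.0])))                # 0.125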
  • 8. Divide and conquer [Scatter plot of the learning sample in the (X_1, X_2) plane.] 8 / 39
  • 9. Divide and conquer [Same plot, partitioned by a first split at X_1 = 0.7.] 8 / 39
  • 10. Divide and conquer [Same plot, further partitioned by a second split at X_2 = 0.5.] 8 / 39
  • 11. Decision trees
    [Figure: the partitioned (X_1, X_2) plane and the corresponding tree, with root split X_1 ≤ 0.7 at t_1, second split X_2 ≤ 0.5 at t_2, and leaf nodes t_3, t_4, t_5 outputting p(Y = c | X = x).]
    t ∈ ϕ : nodes of the tree ϕ
    X_t : split variable at t
    v_t ∈ ℝ : split threshold at t
    ϕ(x) = arg max_{c ∈ Y} p(Y = c | X = x)
    9 / 39
  • 12. Learning from data (CART)
    function BuildDecisionTree(L)
        Create node t from the learning sample L_t = L
        if the stopping criterion is met for t then
            y_t = some constant value
        else
            Find the split on L_t that maximizes the impurity decrease:
                s* = arg max_{s ∈ Q} Δi(s, t)
            Partition L_t into L_tL ∪ L_tR according to s*
            t_L = BuildDecisionTree(L_tL)
            t_R = BuildDecisionTree(L_tR)
        end if
        return t
    end function
    10 / 39
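    A minimal Python sketch of this recursion for regression, using variance as the impurity measure and an exhaustive search over observed values; names and stopping rules are illustrative, not the scikit-learn implementation:

        import numpy as np

        def build_tree(X, y, min_samples=5):
            # Stopping criterion: too few samples or a pure node -> leaf with constant value.
            if len(y) < min_samples or np.all(y == y[0]):
                return {"leaf": True, "value": float(np.mean(y))}
            best, best_gain = None, 0.0
            parent_impurity = np.var(y)
            # Find the split s* maximizing the impurity decrease Delta i(s, t).
            for j in range(X.shape[1]):
                for v in np.unique(X[:, j])[:-1]:
                    left = X[:, j] <= v
                    right = ~left
                    gain = parent_impurity - (left.mean() * np.var(y[left])
                                              + right.mean() * np.var(y[right]))
                    if gain > best_gain:
                        best, best_gain = (j, v), gain
            if best is None:
                return {"leaf": True, "value": float(np.mean(y))}
            j, v = best
            left = X[:, j] <= v
            # Partition L_t and recurse on the two children.
            return {"leaf": False, "var": j, "threshold": float(v),
                    "left": build_tree(X[left], y[left], min_samples),
                    "right": build_tree(X[~left], y[~left], min_samples)}

        def predict_one(tree, x):
            while not tree["leaf"]:
                tree = tree["left"] if x[tree["var"]] <= tree["threshold"] else tree["right"]
            return tree["value"]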
  • 13. Back to our example
    [Figure: regression tree learned on the Wine data, splitting on alcohol (<= 10.625, <= 11.741) and volatile acidity (<= 0.237, <= 0.442), with leaf predictions ranging from y = 5.382 to y = 6.516.]
    11 / 39
  • 14. Bias-variance decomposition
    Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is
        E_L{Err(ϕ_L(x))} = noise(x) + bias²(x) + var(x),
    where
        noise(x) = Err(ϕ_B(x)),
        bias²(x) = (ϕ_B(x) − E_L{ϕ_L(x)})²,
        var(x)   = E_L{(E_L{ϕ_L(x)} − ϕ_L(x))²}.
    [Figure: distributions of the Bayes model ϕ_B(x) and of the predictions ϕ_L(x), illustrating the noise, bias² and variance terms.]
    12 / 39
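    A simulation sketch of this decomposition at a single point x, on a synthetic problem where the Bayes model is known by construction (all names and constants below are illustrative):

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.RandomState(0)
        f = lambda x: np.sin(3 * x)            # Bayes model phi_B, known here by construction
        x0 = np.array([[0.5]])                 # point X = x at which the error is decomposed
        preds = []
        for _ in range(200):                   # 200 independent learning sets L
            X = rng.uniform(0, 1, (100, 1))
            y = f(X.ravel()) + rng.normal(0, 0.3, 100)
            preds.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])
        preds = np.array(preds)
        bias2 = (f(x0.ravel())[0] - preds.mean()) ** 2
        variance = preds.var()
        print(bias2, variance)                 # fully grown trees: low bias, comparatively high variance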
  • 15. Diagnosing the generalization error of a decision tree • (Residual error : Lowest achievable error, independent of ϕL.) • Bias : Decision trees usually have low bias. • Variance : They often suffer from high variance. • Solution : Combine the predictions of several randomized trees into a single model. 13 / 39
  • 16. Random forests
    [Figure: an input x is passed through M randomized trees ϕ_1, ..., ϕ_M; their individual predictions p_{ϕ_m}(Y = c | X = x) are aggregated into the ensemble prediction p_ψ(Y = c | X = x).]
    Randomization
    • Bootstrap samples + random selection of K ≤ p split variables → Random Forests
    • Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
    14 / 39
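    In scikit-learn terms (a minimal sketch; the hyper-parameter values are illustrative), the two randomization schemes correspond to:

        from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

        # Bootstrap samples + random subset of K features considered at each split:
        rf = RandomForestRegressor(n_estimators=50, max_features="sqrt")
        # Random subset of K features at each split + random thresholds (no bootstrap by default):
        et = ExtraTreesRegressor(n_estimators=50, max_features="sqrt")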
  • 17. Bias-variance decomposition (cont.)
    Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} at X = x of an ensemble of M randomized models ϕ_{L,θ_m} is
        E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} = noise(x) + bias²(x) + var(x),
    where
        noise(x) = Err(ϕ_B(x)),
        bias²(x) = (ϕ_B(x) − E_{L,θ}{ϕ_{L,θ}(x)})²,
        var(x)   = ρ(x)σ²_{L,θ}(x) + ((1 − ρ(x))/M) σ²_{L,θ}(x),
    and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.
    15 / 39
  • 18. Diagnosing the generalization error of random forests
    • Bias : identical to the bias of a single randomized tree.
    • Variance : var(x) = ρ(x)σ²_{L,θ}(x) + ((1 − ρ(x))/M) σ²_{L,θ}(x).
      As M → ∞, var(x) → ρ(x)σ²_{L,θ}(x).
      The stronger the randomization, ρ(x) → 0 and var(x) → 0.
      The weaker the randomization, ρ(x) → 1 and var(x) → σ²_{L,θ}(x).
    Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model. The crux of the problem is to find the right trade-off.
    16 / 39
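    A quick numerical illustration (not on the original slide): with σ²_{L,θ}(x) = 1, ρ(x) = 0.3 and M = 50 trees, var(x) = 0.3 · 1 + (0.7/50) · 1 = 0.314, already close to the M → ∞ limit ρ(x)σ²_{L,θ}(x) = 0.3.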
  • 19. Back to our example
    Method          Trees   MSE
    CART            1       1.055
    Random Forest   50      0.517
    Extra-Trees     50      0.507
    Combining several randomized trees indeed works better !
    17 / 39
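    A sketch of how such a comparison could be reproduced with scikit-learn; the file path is an assumed local copy of the UCI wine quality data (the slides use the combined red and white samples, so exact figures will differ):

        import pandas as pd
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

        data = pd.read_csv("winequality-red.csv", sep=";")   # hypothetical local copy
        X, y = data.drop(columns="quality"), data["quality"]

        for name, model in [("CART", DecisionTreeRegressor()),
                            ("Random Forest", RandomForestRegressor(n_estimators=50)),
                            ("Extra-Trees", ExtraTreesRegressor(n_estimators=50))]:
            mse = -cross_val_score(model, X, y, cv=10,
                                   scoring="neg_mean_squared_error").mean()
            print(f"{name}: {mse:.3f}")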
  • 20. Outline 1 Motivation 2 Growing decision trees and random forests 3 Interpreting random forests 4 Implementing and accelerating random forests 5 Conclusions 18 / 39
  • 21. Variable importances
    • Interpretability can be recovered through variable importances.
    • Two main importance measures :
      The mean decrease of impurity (MDI) : summing total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984) ;
      The mean decrease of accuracy (MDA) : measuring the accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).
    • We focus here on MDI because :
      It is faster to compute ;
      It does not require bootstrap sampling ;
      In practice, it correlates well with the MDA measure.
    19 / 39
  • 22. Mean decrease of impurity
    [Figure: an ensemble of trees ϕ_1, ϕ_2, ..., ϕ_M.]
    Importance of variable X_j for an ensemble of M trees ϕ_m is :
        Imp(X_j) = (1/M) Σ_{m=1}^{M} Σ_{t ∈ ϕ_m} 1(j_t = j) p(t) Δi(t),
    where j_t denotes the variable used at node t, p(t) = N_t/N and Δi(t) is the impurity reduction at node t :
        Δi(t) = i(t) − (N_{t_L}/N_t) i(t_L) − (N_{t_R}/N_t) i(t_R)
    20 / 39
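    In scikit-learn, feature_importances_ implements this MDI measure, normalized to sum to one. A minimal sketch, reusing the X, y DataFrame from the earlier wine example:

        from sklearn.ensemble import RandomForestRegressor

        forest = RandomForestRegressor(n_estimators=1000).fit(X, y)
        # Print variables from most to least important (normalized MDI scores).
        for name, imp in sorted(zip(X.columns, forest.feature_importances_),
                                key=lambda t: t[1], reverse=True):
            print(f"{name:20s} {imp:.3f}")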
  • 23. Back to our example
    MDI scores as computed from a forest of 1000 fully developed trees on the Wine dataset (Random Forest, default parameters).
    [Bar chart of MDI scores (0.00 to 0.30), ordered from least to most important: color, fixed acidity, citric acid, density, chlorides, pH, residual sugar, total sulfur dioxide, sulphates, free sulfur dioxide, volatile acidity, alcohol.]
    21 / 39
  • 24. What does it mean ?
    • MDI works well, but it is not well understood theoretically ;
    • We would like to better characterize it and derive its main properties from this characterization.
    • Working assumptions :
      All variables are discrete ;
      Multi-way splits à la C4.5 (i.e., one branch per value) ;
      Shannon entropy as impurity measure : i(t) = − Σ_c (N_{t,c}/N_t) log (N_{t,c}/N_t) ;
      Totally randomized trees (RF with K = 1) ;
      Asymptotic conditions : N → ∞, M → ∞.
    22 / 39
  • 25. Result 1 : Three-level decomposition (Louppe et al., 2013)
    Theorem. Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.
    i) Decomposition in terms of the MDI importance of each input variable :
        I(X_1, ..., X_p; Y) = Σ_{j=1}^{p} Imp(X_j),
      where I(X_1, ..., X_p; Y) is the information jointly provided by all input variables about the output.
    ii) Decomposition along the degrees k of interaction with the other variables, and iii) along all interaction terms B of a given degree k :
        Imp(X_j) = Σ_{k=0}^{p−1} (1/C_p^k) (1/(p−k)) Σ_{B ∈ P_k(V^{−j})} I(X_j; Y | B)
    E.g. : p = 3,
        Imp(X_1) = (1/3) I(X_1; Y) + (1/6) (I(X_1; Y | X_2) + I(X_1; Y | X_3)) + (1/3) I(X_1; Y | X_2, X_3)
    23 / 39
  • 26. Illustration : 7-segment display (Breiman et al., 1984)
    y  x1 x2 x3 x4 x5 x6 x7
    0  1  1  1  0  1  1  1
    1  0  0  1  0  0  1  0
    2  1  0  1  1  1  0  1
    3  1  0  1  1  0  1  1
    4  0  1  1  1  0  1  0
    5  1  1  0  1  0  1  1
    6  1  1  0  1  1  1  1
    7  1  0  1  0  0  1  0
    8  1  1  1  1  1  1  1
    9  1  1  1  1  0  1  1
    24 / 39
  • 27. Illustration : 7-segment display (Breiman et al., 1984)
        Imp(X_j) = Σ_{k=0}^{p−1} (1/C_p^k) (1/(p−k)) Σ_{B ∈ P_k(V^{−j})} I(X_j; Y | B)
    Var    Imp
    X1     0.412
    X2     0.581
    X3     0.531
    X4     0.542
    X5     0.656
    X6     0.225
    X7     0.372
    Total  3.321
    [Stacked bar chart decomposing each importance along the interaction degrees k = 0, ..., 6.]
    24 / 39
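    An empirical sketch of this example: since the ten digits form the whole population, totally randomized trees can be approximated in scikit-learn with Extra-Trees and K = 1 (the reported importances are normalized to sum to one, so only the ranking is directly comparable to the table above):

        import numpy as np
        from sklearn.ensemble import ExtraTreesClassifier

        # 7-segment patterns for digits 0-9 (columns x1..x7, copied from the table above).
        X = np.array([[1,1,1,0,1,1,1],
                      [0,0,1,0,0,1,0],
                      [1,0,1,1,1,0,1],
                      [1,0,1,1,0,1,1],
                      [0,1,1,1,0,1,0],
                      [1,1,0,1,0,1,1],
                      [1,1,0,1,1,1,1],
                      [1,0,1,0,0,1,0],
                      [1,1,1,1,1,1,1],
                      [1,1,1,1,0,1,1]])
        y = np.arange(10)

        # K = 1: one variable drawn at random per split, approximating totally randomized trees.
        trees = ExtraTreesClassifier(n_estimators=5000, max_features=1,
                                     criterion="entropy", random_state=0).fit(X, y)
        print(trees.feature_importances_)   # normalized MDI scores; ranking should roughly match the table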
  • 28. Result 2 : Irrelevant variables (Louppe et al., 2013) Theorem. Variable importances depend only on the relevant variables. Theorem. A variable Xj is irrelevant if and only if Imp(Xj ) = 0. ⇒ The importance of a relevant variable is insensitive to the addition or the removal of irrelevant variables. Definition (Kohavi & John, 1997). A variable X is irrelevant (to Y with respect to V ) if, for all B ⊆ V , I(X; Y |B) = 0. A variable is relevant if it is not irrelevant. 25 / 39
  • 29. Relaxing assumptions
    When trees are not totally random...
    • There can be relevant variables with zero importances (due to masking effects).
    • The importance of relevant variables can be influenced by the number of irrelevant variables.
    When the learning set is finite...
    • Importances are biased towards variables of high cardinality.
    • This effect can be minimized by collecting impurity terms measured from large enough samples only.
    When splits are not multi-way...
    • i(t) does not actually measure the mutual information.
    26 / 39
  • 30. Back to our example
    MDI scores as computed from a forest of 1000 fixed-depth trees on the Wine dataset (Extra-Trees, K = 1, max depth = 5).
    [Bar chart of MDI scores (0.00 to 0.30), ordered from least to most important: pH, residual sugar, fixed acidity, sulphates, free sulfur dioxide, citric acid, chlorides, total sulfur dioxide, density, color, volatile acidity, alcohol.]
    Taking into account (some of) the biases results in quite a different story !
    27 / 39
  • 31. Outline 1 Motivation 2 Growing decision trees and random forests 3 Interpreting random forests 4 Implementing and accelerating random forests 5 Conclusions 28 / 39
  • 32. Implementation (Buitinck et al., 2013)
    Scikit-Learn
    • Open source machine learning library for Python
    • Classical and well-established algorithms
    • Emphasis on code quality and usability
    A long team effort
    [Bar chart: time for building a Random Forest, relative to version 0.10, across scikit-learn versions 0.10 to 0.15, decreasing from 1, 0.99 and 0.98 down to 0.33, 0.11 and finally 0.04.]
    29 / 39
  • 33. Implementation overview
    • Modular implementation, designed with a strict separation of concerns :
      Builders : for building and connecting nodes into a tree ;
      Splitters : for finding a split ;
      Criteria : for evaluating the goodness of a split ;
      Tree : dedicated data structure.
    • Efficient algorithmic formulation [See Louppe, 2014] :
      Dedicated sorting procedure ;
      Efficient evaluation of consecutive splits.
    • Close-to-the-metal, carefully coded implementation :
      2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests.
    # But we kept it stupid simple for users!
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    30 / 39
  • 34. A winning strategy
    Scikit-Learn implementation proves to be one of the fastest among all libraries and programming languages.
    [Bar chart of fit times (s):]
    Scikit-Learn-RF     203.01     (Python, Cython)
    Scikit-Learn-ETs    211.53     (Python, Cython)
    OpenCV-RF           4464.65    (C++)
    OpenCV-ETs          3342.83    (C++)
    OK3-RF              1518.14    (C)
    OK3-ETs             1711.94    (C)
    Weka-RF             1027.91    (Java)
    R-RF (randomForest) 13427.06   (R, Fortran)
    Orange-RF           10941.72   (Python)
    31 / 39
  • 35. Computational complexity (Louppe, 2014)
    Average time complexity
    CART            Θ(pN log² N)
    Random Forest   Θ(MKÑ log² Ñ)
    Extra-Trees     Θ(MKN log N)
    • N : number of samples in L
    • p : number of input variables
    • K : the number of variables randomly drawn at each node
    • Ñ = 0.632N.
    32 / 39
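    A rough timing sketch (illustrative sizes; absolute times depend on the machine and the library version) showing how fit time grows with N:

        import time
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.RandomState(0)
        for N in [1000, 2000, 4000, 8000]:
            X = rng.rand(N, 10)
            y = rng.rand(N)
            t0 = time.time()
            RandomForestRegressor(n_estimators=10, n_jobs=1).fit(X, y)
            print(N, round(time.time() - t0, 2), "s")   # grows slightly faster than linearly in N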
  • 36. Improving scalability through randomization
    Motivation
    • Randomization and averaging improve accuracy by reducing variance.
    • As a nice side-effect, the resulting algorithms are fast and embarrassingly parallel.
    • Why not purposely exploit randomization to make the algorithm even more scalable (and at least as accurate) ?
    Problem
    • Assume a supervised learning problem of N_s samples defined over N_f features, and T computing nodes, each with a memory capacity limited to M_max, with M_max ≪ N_s × N_f.
    • How to best exploit the memory constraint to obtain the most accurate model, as quickly as possible ?
    33 / 39
  • 37. A straightforward solution : Random Patches (Louppe et al., 2012)
    [Figure: the data matrix (X, Y) with a random patch of rows and columns highlighted.]
    1. Draw a subsample r of p_s N_s random examples, with p_f N_f random features.
    2. Build a base estimator on r.
    3. Repeat 1-2 for a number T of estimators.
    4. Aggregate the predictions by voting.
    p_s and p_f are two meta-parameters that should be selected
    • such that p_s N_s × p_f N_f ≤ M_max ;
    • to optimize accuracy.
    34 / 39
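    In scikit-learn, Random Patches can be sketched with a bagging ensemble that subsamples both rows and columns; max_samples and max_features play the roles of p_s and p_f (the values below are illustrative, and X, y are assumed to be defined as before). Passing an extra-tree as the base estimator instead of the default decision tree gives the RP-ET variant (the keyword is estimator or base_estimator depending on the version):

        from sklearn.ensemble import BaggingRegressor

        # Each of the T estimators is trained on a random patch of
        # p_s * N_s examples and p_f * N_f features.
        rp_dt = BaggingRegressor(n_estimators=100,
                                 max_samples=0.25, max_features=0.25,
                                 bootstrap=False, bootstrap_features=False)
        rp_dt.fit(X, y)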
  • 38. Impact of memory constraint
    [Figure: accuracy versus memory constraint (0.1 to 0.5) for RP-ET, RP-DT, ET and RF; accuracy ranges from about 0.80 to 0.94.]
    35 / 39
  • 39. Lessons learned from subsampling • Training each estimator on the whole data is (often) useless. The size of the random patches can be reduced without (significant) loss in accuracy. • As a result, both memory consumption and training time can be reduced, at low cost. • With strong memory constraints, RP can exploit data better than the other methods. • Sampling features is critical to improve accuracy. Sampling the examples only is often ineffective. 36 / 39
  • 40. Outline 1 Motivation 2 Growing decision trees and random forests 3 Interpreting random forests 4 Implementing and accelerating random forests 5 Conclusions 37 / 39
  • 41. Opening the black box • Random forests constitute one of the most robust and effective machine learning algorithms for many problems. • While simple in design and easy to use, random forests nevertheless remain hard to analyze theoretically, non-trivial to interpret and difficult to implement properly. • Through an in-depth re-assessment of the method, this dissertation has proposed original contributions on these issues. 38 / 39
  • 42. Future work Variable importances • Theoretical characterization of variable importances in a finite setting. • (Re-analysis of) empirical studies based on variable importances, in light of the results and conclusions of the thesis. • Study of variable importances in boosting. Subsampling • Finer study of subsampling statistical mechanisms. • Smart sampling. 39 / 39
  • 43. Questions ? 40 / 39
  • 44. Backup slides 41 / 39
  • 45. Condorcet’s jury theorem Consider a group of M voters. If each voter has an independent probability p > 1/2 of voting for the correct decision, then adding more voters increases the probability of the majority decision to be correct. When M → ∞, the probability that the decision taken by the group is correct approaches 1. 42 / 39
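    A small Monte-Carlo sketch of the theorem (the voter accuracy p = 0.55 and the number of simulated elections are arbitrary choices):

        import numpy as np

        rng = np.random.RandomState(0)
        p = 0.55                                   # each voter is right with probability p > 1/2
        for M in [1, 11, 101, 1001]:
            votes = rng.rand(20000, M) < p         # 20000 simulated elections with M voters
            majority_correct = (votes.sum(axis=1) > M / 2).mean()
            print(M, round(majority_correct, 3))   # increases towards 1 as M grows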
  • 46. Interpretation of ρ(x) (Louppe, 2014)
    Theorem.
        ρ(x) = V_L{E_{θ|L}{ϕ_{L,θ}(x)}} / (V_L{E_{θ|L}{ϕ_{L,θ}(x)}} + E_L{V_{θ|L}{ϕ_{L,θ}(x)}})
    In other words, it is the ratio between
    • the variance due to the learning set and
    • the total variance, accounting for random effects due to both the learning set and the random perturbations.
    ρ(x) → 1 when variance is mostly due to the learning set ;
    ρ(x) → 0 when variance is mostly due to the random perturbations ;
    ρ(x) ≥ 0.
    43 / 39