Boosting is already “explainable AI”

A critique of Microsoft Research’s EBM

The quest for more “explainable AI” has become one of today’s main topics among Machine Learning practitioners and end users, governments included.

Recently, Microsoft Research made an interesting proposal: a unified framework for explainability with a special focus on boosting. The proposed tool, named Explainable Boosting Machine (EBM), is a mixed bag of good and not-so-good ideas.

Why is boosting important?

Humble and less hyped than other algorithms, boosting is as relevant today as ever. Described by Trevor Hastie et al. as “one of the most powerful learning ideas introduced in the last twenty years”, boosting works incredibly well. From a purely pragmatic point of view, it is today the most competitive algorithm for structured data (unless non-explicit feature interactions dominate the problem, making neural networks a better option).

But what is boosting really doing? When decision trees are boosted with depth 1 (“stumps”), boosting fits a generalized additive model, or GAM. In practice, though, boosting outperforms GAMs in the majority of cases (unless the ground truth lacks interactions and is well described by smooth functions). So, boosting > GAMs. But why? Because with optimal depth it does:

  • sparse feature selection
  • interaction detection and selection

Which is the closest thing to magic I can think of. Explainable magic. At least that is what boosting tries to do. As always, the resulting feature importances have to be confirmed with surefire tools like permutation importance, but let’s leave that aside for now.

Given a proper validation setup, you can trust boosting to optimize the choice of task-relevant features with remarkable resistance to noise. Almost “out of the box”.
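The permutation importance check mentioned above can be sketched in a few lines. This is only a minimal, library-free illustration: the `predict` function is a hypothetical stand-in for any fitted model, the toy data are made up, and in a real setup you would compute the scores on held-out data rather than on the training set.

```python
import random

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when one column is shuffled. X is a list of rows."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data (assumption): y depends on feature 0 only, feature 1 is pure noise.
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [3.0 * row[0] for row in X]

def predict(row):
    # Hypothetical stand-in for a fitted model's prediction function.
    return 3.0 * row[0]

importances = permutation_importance(predict, X, y)
print(importances)
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on degrades the error. That is the confirmation step: importances reported by the booster itself should agree with this ranking.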

What is MR’s Explainable Boosting?

It starts from the capability of boosting with depth 1 to fit a GAM, but not in the usual way: the model is trained sequentially in a round-robin fashion, one feature at a time. They use a very small learning rate (small enough to make the feature order irrelevant) and a huge number of rounds. The result: a pure GAM. There are no interactions, so they get rid of the trees and keep the vectorized per-feature predictions, which you can plot as an interpretable Partial Dependence Plot.
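To make the round-robin idea concrete, here is a rough sketch of cyclic boosting with stumps, written from the description above and not taken from the EBM code. Each stump looks at a single feature, so the final prediction has the GAM form intercept + f_1(x_1) + ... + f_d(x_d). All names, the toy data, the learning rate and the number of rounds are illustrative assumptions.

```python
import random

def fit_stump(x, residuals, thresholds):
    """Best single-split stump on one feature (least squares)."""
    best = None
    for t in thresholds:
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        # Minimizing squared error is equivalent to maximizing this gain.
        gain = len(left) * lm ** 2 + len(right) * rm ** 2
        if best is None or gain > best[0]:
            best = (gain, t, lm, rm)
    return best[1:]

def fit_round_robin(X, y, n_rounds=300, lr=0.05, n_bins=10):
    n, d = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(d)]
    cuts = []
    for col in cols:  # candidate thresholds taken at quantiles
        xs = sorted(col)
        cuts.append([xs[i * n // n_bins] for i in range(1, n_bins)])
    intercept = sum(y) / n
    residuals = [v - intercept for v in y]
    stumps = [[] for _ in range(d)]  # one list of stumps per feature
    for _ in range(n_rounds):
        for j in range(d):  # round robin: visit one feature at a time
            t, lm, rm = fit_stump(cols[j], residuals, cuts[j])
            stumps[j].append((t, lr * lm, lr * rm))
            residuals = [r - lr * (lm if v <= t else rm)
                         for v, r in zip(cols[j], residuals)]
    return intercept, stumps

def shape(stumps_j, v):
    """Contribution f_j(v) of one feature: the curve you would plot."""
    return sum(l if v <= t else r for t, l, r in stumps_j)

def predict(model, row):
    intercept, stumps = model
    return intercept + sum(shape(s, v) for s, v in zip(stumps, row))

# Toy additive target (assumption): no interaction between the two features.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [row[0] ** 2 + 2.0 * row[1] for row in X]

model = fit_round_robin(X, y)
mean_y = sum(y) / len(y)
variance = sum((v - mean_y) ** 2 for v in y) / len(y)
train_mse = sum((predict(model, r) - t) ** 2 for r, t in zip(X, y)) / len(y)
print(train_mse, variance)
```

Once training is done the stumps themselves are not needed: evaluating `shape` on a grid of values gives one lookup table per feature, which is the per-feature curve the post refers to.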

To account for the main interactions they use a clever, “cheap” heuristic (FAST) applied to the residuals of the GAM. This allows all pairwise interactions to be ranked by strength. The top n interactions, with n specified by the user, can then be included, as detailed here.

According to MR, the algorithm seems to work as well as boosting on low-ish dimensionality datasets, and it is argued that interpretability is enhanced by the additive fitting of features and main pairwise interactions.
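The ranking step can be illustrated with a brute-force stand-in. This is not the FAST algorithm itself, only a simplified sketch in its spirit: for every pair of features it tries a few cuts on each, fits a piecewise-constant model on the four resulting quadrants of the residuals, and scores the pair by how much squared residual that model removes. The function names, the number of cuts and the toy residuals are assumptions for illustration.

```python
import random
from itertools import combinations

def quadrant_gain(xi, xj, residuals, ci, cj):
    """Squared residual removed by a 2x2 piecewise-constant model."""
    sums, counts = [0.0] * 4, [0] * 4
    for a, b, r in zip(xi, xj, residuals):
        q = 2 * (a > ci) + (b > cj)  # quadrant index 0..3
        sums[q] += r
        counts[q] += 1
    return sum(s * s / c for s, c in zip(sums, counts) if c > 0)

def rank_pairs(X, residuals, n_cuts=3):
    """Rank all feature pairs by their best quadrant gain, strongest first."""
    n, d = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(d)]
    cuts = []
    for col in cols:  # candidate cuts taken at quantiles
        xs = sorted(col)
        cuts.append([xs[k * n // (n_cuts + 1)] for k in range(1, n_cuts + 1)])
    scores = {}
    for i, j in combinations(range(d), 2):
        scores[(i, j)] = max(
            quadrant_gain(cols[i], cols[j], residuals, ci, cj)
            for ci in cuts[i] for cj in cuts[j]
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy residuals (assumption): what an additive model would leave behind when
# the only missing structure is an interaction between features 0 and 1.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(300)]
residuals = [row[0] * row[1] for row in X]

ranking = rank_pairs(X, residuals)
top_n = ranking[:1]  # keep only the strongest pair(s)
print(ranking)
```

The brute-force loop rescans the data for every pair and every cut, so it only makes sense for a handful of features. It is meant to show what is being ranked, not how to do it cheaply.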

Is this a viable idea?

The positive:

  • Boosting deserves to be recognized as one of the most powerful tools compatible with explainability nowadays.
  • The boosting-GAM link is important to understand, and it really helps to think of boosting as additive modelling that accounts for interactions.
  • The round-robin way of fitting GAMs with boosting is interesting and potentially useful in itself.

The problems:

  • Boosting’s capacity for sparse feature selection is priceless. Even if it is not enough by itself (permutation importance assessment is unavoidable in all real-life noisy setups), it is something you cannot give up when dealing with most real-life medium-dimensional data. Interpreting hundreds of features modelled in a round-robin way doesn’t sound like the best idea.
  • How many interactions are relevant? With regular boosting this is reduced to a hyperparameter that can be explored efficiently: depth. I have very rarely found the optimal boosting depth to be 2. In my experience boosting is not that picky about the exact depth, as long as it is above a minimum that is almost always bigger than 2. Expect a limit of pairwise interactions to be suboptimal. And how many of those top interactions should you keep? Again, boosting will optimize that decision for you, and no foolproof heuristic can guarantee that you will manually choose the best number.

In conclusion, the points above very likely make “regular” boosting a better choice than the EBM, especially considering that regular boosting, properly used, already takes you very close to model explainability.

The future will tell. If the algorithm is adopted and used more successfully than boosting on demanding real-life datasets, it will mean either that the points above were not that relevant or that they were addressed in later versions (the current release is an alpha release).

This is already a long post, so I will leave it here for now. I hope it was of some interest. There is more to say about “Machine Learning Explainability”, and I will try to find the time to share some more ideas about it in the future.

Architect & Data Scientist