Boruta (algorithm)

Boruta is a feature-selection algorithm in the field of machine learning. As presented in the original paper, its aim is to find all relevant features, in contrast to methods that seek a minimal-optimal feature set. Boruta is not a stand-alone algorithm; it is implemented as a wrapper around the random-forest classification algorithm. It works iteratively: in each iteration it removes the features that, according to a statistical test, are less relevant than what the authors define as a random probe.

A fundamental component of Boruta is the use of shadow attributes. Shadow attributes are pseudo-features added to the information system; each is produced by copying an existing feature from the original data set and shuffling its values across the samples (data points). After generating the shadow attributes, the procedure builds a random forest and compares the Z-scores obtained by the original features with the Z-scores obtained by the shadow attributes. This comparison is the basis on which Boruta decides whether a feature is important.
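The construction of shadow attributes can be illustrated with a minimal sketch. It is not taken from the original paper or from any particular implementation; the function name `add_shadow_attributes` and the list-of-rows data layout are assumptions made for illustration only.

```python
import random

def add_shadow_attributes(rows, rng):
    """Return a copy of `rows` extended with one shuffled copy of every column.

    Hypothetical helper: `rows` is a list of equal-length lists, one per sample.
    """
    n_features = len(rows[0])
    # Transpose so that each feature's values can be shuffled independently.
    columns = [[row[j] for row in rows] for j in range(n_features)]
    shadows = []
    for column in columns:
        shadow = column[:]       # copy the original feature
        rng.shuffle(shadow)      # permute its values across samples
        shadows.append(shadow)
    # Append the shadow values to each sample; originals stay untouched.
    return [list(row) + [s[i] for s in shadows] for i, row in enumerate(rows)]

rows = [[1, 10.0], [2, 20.0], [3, 30.0], [4, 40.0]]
extended = add_shadow_attributes(rows, random.Random(0))
```

Each shadow column keeps the same set of values as the feature it was copied from, but shuffling breaks any association between those values and the response, which is what lets the shadow attributes serve as a random baseline.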

High-level pseudo-code:

1.  Copy all variables (features) to create shadow attributes, extending the information system
2.  Shuffle the values in each shadow attribute to remove any correlation with the response
3.  Run random-forest on the extended system (original features plus shadow attributes), gather Z-scores
4.  Find MZSA (the maximum Z-score among shadow attributes)
5.  Assign each original feature a hit if its Z-score > MZSA
6.  For each feature of undetermined importance, perform a two-sided test of equality against MZSA
7.  If Z-score < MZSA significantly, drop feature as unimportant
8.  If Z-score > MZSA significantly, keep feature as important
9.  Remove all shadow attributes
10. Repeat from step 1 until importance is determined for all features or the maximum number of random-forest runs has been reached
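
The hit counting and the two-sided test in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that the test is a binomial test on the accumulated hit counts with success probability 0.5, that the significance level `alpha` is 0.01, and it omits any multiple-testing correction. The names `update_hits`, `binom_tail` and `decide` are hypothetical.

```python
from math import comb

def update_hits(hits, original_z, shadow_z):
    """Add one hit to every original feature whose Z-score beats the maximum shadow Z-score."""
    threshold = max(shadow_z)
    return [h + int(z > threshold) for h, z in zip(hits, original_z)]

def binom_tail(hits, runs, upper):
    """P(X >= hits) if `upper`, else P(X <= hits), for X ~ Binomial(runs, 0.5)."""
    ks = range(hits, runs + 1) if upper else range(0, hits + 1)
    return sum(comb(runs, k) for k in ks) / 2 ** runs

def decide(hits, runs, alpha=0.01):
    """Classify a feature from its hit count after `runs` random-forest runs.

    Assumption: alpha is split evenly between the two tails.
    """
    if binom_tail(hits, runs, upper=True) < alpha / 2:
        return "important"       # significantly more hits than chance
    if binom_tail(hits, runs, upper=False) < alpha / 2:
        return "unimportant"     # significantly fewer hits than chance
    return "undetermined"        # keep testing in further runs
```

In a full loop, `update_hits` would be called once per random-forest run with freshly generated shadow attributes, and `decide` would be applied to every feature that is still undetermined.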