The "Black Box Paradox" in Big Data Analytics and Data-Driven Modeling
Some predictive models are analytical and based on first principles, while others are solely data-driven. Analytical models are often based on a human’s understanding of nature, while data-driven models attempt to model nature using data alone. Some data-driven models, such as linear regression, are transparent and interpretable, while other “black-box” models are not transparent at all and can be difficult to interpret.
Disciplines such as physics, chemistry, engineering, and mathematics typically rely on analytical, numerical, and statistical models to explain results, understand intrinsic relationships, and make predictions (interpolations and extrapolations).
Other disciplines, such as Big Data Analytics, deal with huge volumes of numerical and categorical data that are often noisy and incomplete. Supervised machine-learning (ML) models are data-driven and are often the preferred choice for predictive models in these disciplines. In many practical settings, the learned model improves as the amount of training data grows. Some ML models produce “black-box” predictions, especially when a large amount of training data is used to learn complicated non-linear relationships. Such “black-box” models are often the best in terms of prediction accuracy, but this accuracy often comes at the cost of interpretability.
A “classical” study involving analytical or statistical modeling will likely start with a general model description that states the main assumptions used in the model, the limits of its applicability, and cautionary notes for future users. A well-written study description will address the shortcuts used and the uncertainties in the inputs, and will try to provide an accurate estimate of the prediction error. Finally, the study will present its results and conclusions.
This classical approach allows the reader to understand the fine details of the model and all of its elements: data, algorithms, logic, etc. All of this is extremely valuable for understanding why the results are what they are. In fact, this is how most readers and users perform a ‘sanity check’ of the model and its results before they adopt it: they look at those elements of the model and check whether the model is self-consistent and whether its statements make physical, mathematical, and general sense.
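To make the idea of a ‘sanity check’ concrete, here is a minimal sketch of a transparent model: an ordinary least-squares line fitted to a handful of made-up points. The function name `fit_line` and the data are illustrative assumptions, not taken from any particular study. The point is only that every fitted quantity (here, a slope and an intercept) can be read off and compared with what one expects.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements constructed to follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Every prediction can be traced back to the two fitted numbers.
prediction = slope * 10.0 + intercept
print(f"slope={slope}, intercept={intercept}, prediction at x=10: {prediction}")
```

Because the data were constructed to follow y = 2x + 1, a reviewer can confirm that the fitted slope and intercept match that relationship and can trace any prediction back to those two numbers.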
Let’s now consider a different approach to modeling that is frequently used in Big Data Analytics: relying on so-called “black box” (BB) algorithms from the field of machine learning. One of the best-known examples of such an algorithm, the Random Forest algorithm, was introduced by Leo Breiman in 2001 and is used to address a broad spectrum of problems and practical applications.
In the case of a BB algorithm, the model is trained using a training subset of the total available dataset. In a random forest, this training set is randomly sampled with replacement to create several different sample training sets. A separate decision tree is then trained to perform regression or classification on each sample training set, resulting in several fitted trees (fitting on resampled sets and then combining the results is called bootstrap aggregation, or bagging).
Only when one combines all of the trees together (see figure), or collects one “consensus answer” from the entire forest, is the algorithm’s job truly done. This model can make accurate predictions when applied correctly, but the large number of trees obscures the explanation of why a prediction was made.
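The sampling, fitting, and voting steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than Breiman’s full algorithm: each “tree” is a one-split decision stump on a single feature, there is no random feature selection, and the toy dataset and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one 'sample training set')."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a one-split tree (a 'stump') on (x, label) rows.

    Tries every observed x as a threshold and keeps the split with the
    fewest training errors. A stand-in for a full decision tree.
    """
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]  # (threshold, left_label, right_label)


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample of the training rows."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote: the 'consensus answer' of the whole forest."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: label is 1 when x >= 10, else 0.
rows = [(x, int(x >= 10)) for x in range(20)]
forest = fit_forest(rows)

print(predict_forest(forest, 3), predict_forest(forest, 16))
print(sorted({stump[0] for stump in forest}))  # the thresholds the stumps chose
```

Each stump on its own is easy to read (one threshold and two labels), but the final answer is a vote over 25 stumps fitted to 25 different resamples, which hints at why explaining an individual prediction becomes harder as the forest and its trees grow.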
Similar to traditional models, this “black box” model allows for accuracy testing. The “test” subset contains data not used for training or validation and has known, or labeled, answers. The test set is used to confirm the overall predictive power of the trained and validated model.
Therefore, in many ways, the “black box” model is no different from the classical models. Ultimately, it takes a known input, runs it through the “trained model”, and compares the answer to the known answers in order to estimate the model’s accuracy.
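The accuracy check described above can be sketched as follows. The split fraction, the toy data, and the helper names are assumptions made for illustration; a separate validation subset is omitted for brevity, and the simple threshold rule stands in for whatever trained model, transparent or opaque, is being evaluated.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out the last test_fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def accuracy(predict, labeled_rows):
    """Fraction of labeled rows for which predict(x) matches the known answer."""
    hits = sum(predict(x) == y for x, y in labeled_rows)
    return hits / len(labeled_rows)


def fit_threshold_model(train_rows):
    """Hypothetical stand-in for any trained model.

    Classifies x as 1 when it lies above the midpoint between the two
    class means seen in the training rows.
    """
    zeros = [x for x, y in train_rows if y == 0]
    ones = [x for x, y in train_rows if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)


# Hypothetical labeled dataset: label is 1 when x >= 50, else 0.
rows = [(x, int(x >= 50)) for x in range(100)]

train, test = train_test_split(rows)
model = fit_threshold_model(train)     # the model sees the training rows only
test_accuracy = accuracy(model, test)  # scored on rows it has never seen

print(f"held-out accuracy: {test_accuracy:.2f}")
```

Note that `accuracy` only needs a callable that maps an input to an answer. It never looks inside the model, which is why this kind of testing works for black boxes and classical models alike.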
The difference lies in what can be inspected. With the classical model, one can essentially reproduce its decision-making process step by step and test every step, because the model is “transparent” and all of its steps are “visible” to the user.
While BB models can address some classes of problems better than their classical counterparts, what really happens inside the black box stays inside the black box.
For the end-user, “black box” models trade transparency for the answer.
This is perfectly acceptable for some users, who want an accurate answer much more than an understanding of all the reasons and relationships leading to the answer. For other users, however, it might be completely unacceptable.
Which brings us to the main theme of this article: what does this lack of transparency mean in practice?
This is the right time to introduce the “BB Paradox” – an interesting (psychological) phenomenon that we observed in practical settings where model interpretability was not absolutely required. The “paradox” could be formulated as follows:
Less transparent models are generally accepted by the engineering and development community faster and with less resistance or questioning than their more “transparent” counterparts.
In other words, people seem to trust the models they don’t completely understand over the models they can understand in fine detail.
In practice, reviewing and deciding on a traditional, classical model follows approximately this sequence:
1. Model assumptions are presented and reviewed
2. Model machinery (equations, heuristics, approximations) is scrutinized
3. Data used in the model are reviewed
4. General results and conclusions are reviewed
5. Sub-conclusions are analyzed and checked for sanity and consistency
6. Model and its results are either accepted or rejected
The difference between the above process flow for a classical model and that for a BB model is that, for the BB model, steps 2) and 5) are either skipped completely or reduced to a superficial discussion along the lines of “how the forest finds the answer better than any one decision tree”, “the performance metrics confirm the efficacy of the model”, or “the model was validated with empirical data”.
The main conclusions from the above discussion are:
• The “Black Box Paradox” that we observe in modeling changes the way models are scrutinized (less) and accepted (more easily) by the development community
• Therefore, model developers might be tempted to gravitate towards those less transparent or completely opaque models as a way to achieve quick results
• While this looks like a bonus for the model developers, they have to realize that the responsibility for developing and testing an accurate model still rests with them and not with the end users
• Users of such BB models have to keep improving their knowledge of the tools they use so that they can better validate and question those models and tools
In closing, the above “BB Paradox” prompts us to think about the possible near future when most analytics will be outsourced to fully-automated semi-intelligent systems that will resemble Black Box algorithms with their decision-making logic being mostly opaque to us. Would this represent the phase when we (humans) will stop questioning automated decisions completely and simply accept every recommendation made to us by the machines? Some of this seems to be already happening today.