You’re walking into work after getting your usual morning drink. Everyone is working like always, amid the normal office noise you’ve become used to. You check your Key Performance Indicators (KPIs) to make sure everything is running smoothly, and you notice that although your system is processing information as usual, it’s classifying most of the data into one category in particular. Confused and uneasy, you rush to review some of the records. So, what happens now?
That depends on whether you used an opaque model or a transparent one. If you used a transparent model, the data really will belong to that category (or fall within that range, if you’re using numeric data); otherwise, the input parameters would have let you know that something important in your system was changing before you arrived at the office, such as the new data having extremes that weren’t in the training set. At worst, you’ll discover that your model didn’t account for some sub-process you didn’t previously know existed. At least then you’ll gain a better understanding of the actual system and may be able to leverage that knowledge for other projects.
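One cheap safeguard hinted at above is monitoring whether new inputs stay inside the ranges seen during training. Here is a minimal sketch of such a range check; the data, the function name, and the `tolerance` parameter are all hypothetical:

```python
import numpy as np

def outside_training_range(train, new, tolerance=0.0):
    """Flag each feature of `new` whose values fall outside the
    min/max observed in `train` (a simple range-based drift check)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    below = new < (lo - tolerance)
    above = new > (hi + tolerance)
    # One flag per feature: True if any incoming row drifted out of range.
    return np.any(below | above, axis=0)

# Hypothetical data: two features; the second feature drifts far
# above anything seen during training.
train = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])
new = np.array([[2.5, 11.0], [1.5, 25.0]])
print(outside_training_range(train, new))  # → [False  True]
```

A check like this won’t tell you *why* the inputs changed, but it alerts you before the model’s output quietly stops meaning what you think it means.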
If you used an opaque model, you may see that the data isn’t in that class (or range) and still not know what went wrong. After many hours of playing with the data and your model, at best you might learn that “some” parameter caused the output to become erroneous once your business got the particular combination of inputs it currently has. The remedy now is simply to retrain the model on the new data (once you hand-label at least a few thousand records), because opaque models just fit themselves to the data and do not validate any specific hypothesis or structure. Does that sound like a good explanation of what went wrong and why, or even of how to fix it? Does this experience sound like something that would affirm confidence in your techniques, or in your understanding of the business and the mathematics?
Transparent models give you the power to understand exactly what is going on with your system from top to bottom; however, they require more research and an advanced understanding of mathematics (particularly statistics) and of the business itself. This is required because transparent models are based on specific mathematical patterns, which must be matched to the corresponding business process. On the other hand, opaque models are much more flexible and, in some cases, more accurate (knowing advanced mathematics can help, but only so much). Opaque models are also good for automating tasks that already have many labeled records, and in that sense they are built from the bottom up.
Opaque models aren’t centered on a particular pattern or test the way transparent models are. Opaque models get their power from having so many tunable parameters that the model behaves like clay, molding its structure to any dataset. In doing so, nothing can be said with any certainty about the final form it takes, or even about which patterns are statistically significant in the domain. This is why opaque models tend to be accurate but also impractical.
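To make the contrast concrete, here is a minimal sketch of what a transparent model buys you: an ordinary least squares fit whose two coefficients map directly onto business quantities you can sanity-check against domain knowledge. The delivery-time data is entirely made up for illustration:

```python
import numpy as np

# Hypothetical process: delivery time = 2 days base + 0.5 days per 100 km,
# plus a little noise.
rng = np.random.default_rng(0)
distance = rng.uniform(0, 1000, size=200)                     # km
delivery = 2.0 + 0.005 * distance + rng.normal(0, 0.1, 200)   # days

# Transparent model: fit a straight line. The intercept and slope are
# directly interpretable as "base handling time" and "travel rate",
# so a domain expert can confirm or reject them at a glance.
slope, intercept = np.polyfit(distance, delivery, 1)
print(f"base time ≈ {intercept:.2f} days, ≈ {slope * 100:.2f} days per 100 km")
```

An opaque model (say, a large ensemble or a neural network) might predict the same delivery times just as well, but it would hand you thousands of weights instead of two numbers a supplier manager can verify.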
When doing quantitative analysis and data mining for business needs, people often fail to recognize the strengths and weaknesses of opaque and transparent models. Depending on your use case, these differences may not matter. For example, if you just want to identify which grocery bills are outliers with regard to the number of loaves of bread bought, most methods will get you the needed result. But if you’re trying to analyze data to learn how flexible a business process is, or how long it will take a supplier to get their product to you, then you need to determine which model is more effective and appropriate: in one situation the model is meant to find and explain a real-life process, and in the other it is just meant to be accurate in its predictions.
The capability you want will usually require a combination of transparent and opaque models. The important part is to understand the system being created. That is why, when you do use opaque models, you should place them strategically to mitigate the possible risks should the system malfunction without warning.