1 Introduction
Although the market for enterprise AI has rapidly expanded to include both in-house and AI-on-demand solutions, robust monitoring processes are often left out. The development of machine learning (ML) models and their application to business scenarios takes the spotlight in research and is the focus of organizations wishing to build or procure an ML model. Any organization deploying an ML model hopes that the investment will deliver valid and impactful outputs for an extended period of time, but rarely has empirical proof of this that can be presented in a business user-friendly form. This is due partly to inherent difficulties in working with machine learning models, but also because solutions are only beginning to be delivered to enterprises deploying ML.
Implementing an AI project requires attention to the ethical and interpersonal implications of operating an ML model, such as mitigating biased model behavior, explaining model decisions, and validating the model. Bias is a constant risk in the interpretation of any data and the training of any model, and its real-world impact is greater than for non-automated data analysis because ML models not only rely on potentially biased input data to deliver their payload – often a decision or prediction – but also dynamically adapt to changing inputs, sometimes with unexpected and unethical results.
A persistent weakness of ML models is the lack of explainability for individual decisions. These models are often described as “black boxes” that provide neither the operator nor the end user with justification for a single decision or outcome from a learning model. Explainability is not only a necessity in mission-critical situations – it is a form of accountability, which is required in every situation, however mundane. For this reason, explainability is hugely helpful in giving individuals the ability to understand and, if necessary, contest a faulty decision. It has been repeatedly demonstrated that many open-source training data sets contain underlying sexist, racist, and other problematic trends that are being trained into ML models. Spot checking with the help of explanations for individual decisions can help identify the problematic learned assumptions that affect a single decision. While this is an important safeguard for end users and those affected by individual decisions, bias trends still need to be traced across many decisions. Fairness monitors for detecting bias in algorithms are emerging as a viable method to debug models, preventing harm to end users and protecting an organization’s reputation.
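To make the idea of tracing bias trends across many decisions concrete, the sketch below computes a disparate impact ratio over a hypothetical decision log. The column names, groups, and the 0.8 “four-fifths rule” threshold are illustrative assumptions, not features of any particular product.

```python
import pandas as pd

# Hypothetical log of model decisions; column names are illustrative assumptions.
decisions = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "M", "F", "M", "F"],
    "approved": [0,    1,   1,   1,   1,   0,   1,   0],
})

def disparate_impact(df, protected_col, unprivileged, privileged, outcome_col):
    """Ratio of favorable-outcome rates: unprivileged group vs. privileged group."""
    rate_unpriv = df.loc[df[protected_col] == unprivileged, outcome_col].mean()
    rate_priv = df.loc[df[protected_col] == privileged, outcome_col].mean()
    return rate_unpriv / rate_priv

ratio = disparate_impact(decisions, "gender", "F", "M", "approved")
print(f"Disparate impact ratio: {ratio:.2f}")

# The "four-fifths rule" is a common heuristic: a ratio below 0.8 flags the
# model for closer review; it is not a definitive finding of bias.
if ratio < 0.8:
    print("Potential bias detected; review the model and its training data.")
```

A monitor of this kind would typically run over a sliding window of production decisions and raise an alert whenever the ratio falls below the chosen threshold.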
Explainable AI has made significant progress in the last two years. Although most research comes from the academic sector, private firms are adopting these techniques and building them into enterprise AI management platforms. The burden of ensuring explainability still lies primarily with the organization deploying the model, but some degree of interpretability must be a standard feature of any ML deployment.
ML models are increasingly developed by teams separate from, or external to, the team that oversees the runtime. This makes it very difficult to gain insight into the developers’ assumptions and sources of unconscious bias, or to achieve full auditability of the development phase of an ML model, including knowing who the human participants, trainers, and operators were, which training data sets were used, and all inputs and outputs.
Organizations often face their own very stringent risk management departments that may prevent an AI project from going into production or operation because there is not enough assurance that the model accomplishes the stated business goals with an appropriate level of accuracy. Model validation that can be applied to a prepared model both before and during production is therefore an important feature for managing ML models in an enterprise. Model drift is another key topic: a shift in the distribution of input data, or in the relationship between inputs and outcomes, that can lead the model to make inaccurate predictions. Proactive drift management gives an organization advance notice that its model is becoming inaccurate and provides indicators of when a model’s accuracy decreases.
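As a hedged illustration of proactive drift detection (not Watson OpenScale’s own implementation), the following sketch computes a population stability index (PSI) comparing a training-time feature distribution with a recent production window. The simulated data and the 0.2 alert threshold are assumptions chosen for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (training) sample and a recent production sample."""
    # Bin edges are taken from the baseline distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)
    expected_pct = expected_counts / expected_counts.sum()
    actual_pct = actual_counts / actual_counts.sum()
    # Avoid division by zero / log(0) with a small epsilon.
    eps = 1e-6
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Hypothetical feature values: training baseline vs. a recent production window.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
production = rng.normal(loc=0.4, scale=1.1, size=5000)   # distribution has shifted

psi = population_stability_index(baseline, production)
print(f"PSI: {psi:.3f}")

# A common rule of thumb: PSI above 0.2 indicates significant drift worth investigating.
if psi > 0.2:
    print("Significant input drift detected; re-validate model accuracy.")
```

In practice such a check would run per feature on a schedule, and sustained high PSI values would trigger re-validation or retraining of the model.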
Key challenges that drive the need for structured AI monitoring and management solutions are:
- The difficulty of providing explanations in an easy-to-understand form
- The pervasiveness of unconscious bias in input data and trained models
- The need to measure and ensure model accuracy over time
- The increasing distance between model designers and operators
IBM’s Watson OpenScale is a standalone product that provides enterprises with a comprehensive set of management and monitoring tools for AI projects. These features cover the major concerns of bias and explainability, and offer compelling additions such as measuring model drift and contextualizing outputs with business goals.