Event Recording

Dr. Danish Rafique: Path to AI Production - A Strategy for Value Creation

Uploaded: 2020-12-03T12:00:00+01:00
Duration: 17 min 20 s

Dr. Danish Rafique

Head of Digitalization & Data Analysis

Bayer AG

Posted on Dec 03, 2020

Congratulations. Your AI business case is crisp; you already have a data strategy in place; your proof-of-concept looks and feels great; you have the right talent to build the AI product or service which will push your organisation directly into the digital age. Sound familiar? It is at this stage that most organisations give up on their AI initiatives due to a lack of value creation. Why is that, one might ask? The business case was already locked, among other aspects, so where is the problem? One word: production. AI products and services are notoriously different from other software when it comes to production, and traditional workflows no longer work. This is what we are going to talk about. I will share my vision, blueprint and personal stories across the telecom, manufacturing and automotive industries, including both corporate and start-up experiences. I will tell you how to go from a shiny proof-of-concept to AI production systems, what challenges we faced, and the best practices to avoid the pitfalls.
Key takeaways:
1) Vision and strategy to create value out of AI - The last mile
2) How to go from a shiny proof-of-concept to AI production systems
3) AI production challenges and pitfalls
4) Multi-industry use cases across corporate and start-up ecosystems

Transcript

So let's start with the contents of the talk. First, as I mentioned, I'll quickly introduce some applications and the drivers behind AI, then discuss the AI fairy tale: what we think happens and what the reality actually is. Then I'll discuss the production challenges, give you a 101 on that, and eventually touch on how such production challenges can be tackled operationally, that is, which approaches we can take to address them.

And then I will end with two use cases, specifically from the telecommunications industry and the automotive industry, and hopefully you will find some of those experiences useful. So with that, here are some, let's say, flagship applications of ML systems or AI. I sometimes use the two terms, ML and AI, interchangeably, the reason being that here I'm talking about whole systems as opposed to an algorithm. And as I said, the flagship applications are voice recognition, as an example, image recognition and so on.

And the key take-home point here is that most of these systems have achieved human parity or have actually beaten humans on those tasks. That means the systems are really, really powerful. Also on the academic side, you can see there is a lot of research happening in this area of AI. On the bottom left, you can see we are generating, on average, about a hundred articles or publications per day in different fields of AI: computer vision, robotics and so on.

And likewise on the industrial side, you can see there is a lot of movement in mergers and acquisitions of different companies, startups, and so on. So that scene is also really, really active. And this is all motivation. It all sort of clarifies that this is not just hype, so to say, where people are putting in money and so on, but that a lot of real work is going on behind it. And that's why we really need to talk about end-to-end systems as opposed to the development of a single algorithm.

Very quickly, there are two key drivers for why AI systems have picked up the kind of activity shown on the previous slide. The first is the amount of data which is available today through IoT devices and connected systems overall. The second is, of course, the processing power. These two things combined have led to this surge in AI applications. But of course we need to be very aware of the pitfalls as well, which could have not only, let's say, performance implications, but really moral and ethical implications as well, which is a big topic these days in the context of these systems.

I listed two here. All of us probably remember the Cambridge Analytica story. Then of course there is misuse of applications, for example mapping applications where drivers alert each other to the presence of a police car, which could be used by a criminal to game the system. These are the things which we really need to take care of when we're talking about this end-to-end system design. And here is the AI fairy tale, as I mentioned on the contents slide.

So if you look at the top: we have data, we do some magic, we get some value. People go one step deeper into that, and that's on the bottom left. You identify a business problem, you gather data, you build a machine learning model, you tune the model and you present the results. And if you follow these steps, you basically get to numbers like what you see on the bottom right, I don't know, 80, 84% improvements in the education sector and so on.

So huge, huge improvements are promised based on this sort of strategy, but of course the reality is different. If you really do an analysis of what's going on on the industrial side, barring a few players, here are the numbers, and these numbers tell a completely different story. Starting from the left side, these are the companies obtaining measurable value from data, and this is really just 12%.

And the companies who have actually deployed AI systems to production come out to be about 15%. That shows you that at some point in this cycle there is a gap, and we really need to look into that. Otherwise we have a huge business value challenge. Eventually, from a business perspective, if you have shiny prototypes and so on, you can show something great, but unless there is value, and in this case it's clear that this value is not coming through to production, I would say this will fall off, and we need to cater for that.

This is just showing that visually. On the left-hand side we see the launch, and on the right-hand side we see that operating is the hard part. This is not to claim that launching is easy, so to say. As I mentioned, it's here just to give you a comparative analysis: launching is what we have done for the last several years. We have been there. Now we really need to talk about the operation of these systems.

What is this talk not about? This talk is not going to give you a strategic blueprint, so to say, of how AI systems are good or not. I'm not going to talk about the lack-of-talent problem, and I'm not going to look at the fact that we have a lot of silos. We are beyond that.

What I just said is basically depicted in the workflow on the right-hand side: business, to models, to presentation, as we also discussed a couple of slides back. Rather, what we are going to look at, starting from the bottom right, is insight generation, the model life cycle, and the production pipeline, and how that plays together with, let's say, the conventional development areas. So let's start with the production challenges. What are these challenges? What are we talking about here?

So there are probably gazillions of things; I mention the umbrella items in this area. Compliance and legal is a big topic, for example: not only your data privacy and so on, but the overall system. Is it really a validated system? How do you make sure that the old and the new do not clash? Reproducibility is a big topic: can you actually make sure that whatever you're doing can be done again? Then allocation of resources: a lot of times that's really a scarce commodity.

So compute and storage, for example, in your own systems or through the cloud, if that's permissible: is that something which aligns with your AI production systems? Which vendors are you going to choose? Are you going to lock in to those vendors? Do you want to stay open source? How much open source? And so on. Edge to cloud: do you want to have completely local solutions, completely cloud, or a hybrid? Also a very big topic: what is going to be the latency of this system? A lot of times you will have a perfect algorithm, a perfect system, so to say, but you just don't have enough bandwidth in your production systems to allow for the communication that is happening: data download and upload, model and insight export, and so on.

And of course, finally, the interfaces: are you going to build new interfaces to these AI systems?

Are you going to extend your current stack? To what extent are these the best interfaces or not? These are some of the challenges, admittedly the major ones, which you come across, and then you have to dig into the details of how to cater for them. Then the pitfalls: when you have deployed a machine learning model into these systems, the model goes stale. How do you cater for the model not being useful anymore? It has drifted; there are performance shifts in the system. How do you cater for that? At which frequency are you going to update the system?

So for example, is it going to be daily? Is it going to be hourly? Is it going to be weekly?

What is the update frequency of these systems? When you have self-learning or transfer learning deployed, is this something which you want to leave uncontrolled? What level of human intervention is allowed? And so on. Then there are some best practices to address some of these. Look for a unified architecture; look for an overall box which can make sure that things are orchestrated in a proper way. Don't build individual software components and put them together. Make sure you are monitoring important KPIs. That's really important.
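The advice to monitor important KPIs, so that a stale model is noticed, can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the class name `KpiMonitor`, the choice of rolling accuracy as the KPI, and the baseline, tolerance and window values are all assumptions made for the example.

```python
from collections import deque


class KpiMonitor:
    """Tracks a rolling accuracy KPI and flags a drop below a release-time baseline."""

    def __init__(self, baseline: float, tolerance: float, window: int) -> None:
        self.baseline = baseline      # accuracy measured at release time (assumed known)
        self.tolerance = tolerance    # allowed absolute drop before raising a flag
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong

    def record(self, prediction, actual) -> None:
        self.outcomes.append(1 if prediction == actual else 0)

    def rolling_accuracy(self) -> float:
        if not self.outcomes:
            return self.baseline  # no evidence yet, assume release-time quality
        return sum(self.outcomes) / len(self.outcomes)

    def is_stale(self) -> bool:
        # Only judge once the window is full, to avoid noisy early alarms.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rolling_accuracy() < self.baseline - self.tolerance


# Hypothetical stream of (prediction, ground truth) pairs arriving from production.
monitor = KpiMonitor(baseline=0.90, tolerance=0.05, window=4)
for prediction, actual in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    monitor.record(prediction, actual)

print(monitor.rolling_accuracy())
print(monitor.is_stale())
```

How often such a check runs (hourly, daily, weekly) and whether a flag triggers retraining automatically or only alerts a human are exactly the open questions raised above; the sketch leaves both to the caller.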

Also, more often than not, looking at the data is better than looking at the algorithm output, so look at your data first to see what's happening. Think about the explainability of your models: simple models are usually easier to explain than complicated ones. Always have rule-based contingencies, so you're not always dependent on AI systems, so to say, from a machine learning perspective, but can also rely on certain traditional systems. In the end, if you do that, you eventually come to, of course, real value.
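A rule-based contingency of the kind recommended here might look like the sketch below. It is only one possible reading of the advice: the function names, the threshold of 100.0 and the confidence cut-off are hypothetical, and `toy_model` is a stand-in rather than a trained model.

```python
from typing import Callable, Tuple


def predict_with_fallback(
    model: Callable[[float], Tuple[str, float]],
    rule: Callable[[float], str],
    reading: float,
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Return (decision, source); use the rule when the model fails or is unsure."""
    try:
        label, confidence = model(reading)
    except Exception:
        # The model is unavailable or broken: fall back to the traditional rule.
        return rule(reading), "rule"
    if confidence < min_confidence:
        return rule(reading), "rule"
    return label, "model"


def threshold_rule(reading: float) -> str:
    # Hypothetical hand-written rule of the kind used before the ML system existed.
    return "alarm" if reading > 100.0 else "ok"


def toy_model(reading: float) -> Tuple[str, float]:
    # Stand-in for a trained model: it is confident only far away from 100.
    confidence = min(1.0, abs(reading - 100.0) / 50.0)
    return ("alarm" if reading > 100.0 else "ok"), confidence


def broken_model(reading: float) -> Tuple[str, float]:
    # Simulates a model service that is down.
    raise RuntimeError("model service unavailable")


print(predict_with_fallback(toy_model, threshold_rule, 160.0))
print(predict_with_fallback(toy_model, threshold_rule, 105.0))
print(predict_with_fallback(broken_model, threshold_rule, 20.0))
```

Returning the source alongside the decision makes it possible to count how often the rule takes over, which is itself a KPI worth monitoring.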

I'm not going to read the whole slide, but some values which I think are industry-agnostic are: you have better uptime of your resources, you can save on CapEx and OpEx, time to market can be improved, and you get higher revenue. And most importantly, which I think is really, really critical, you also get a market differentiator. So you also need to think about that. I'll just use a minute to explain how that would look practically. On the left-hand side, we have traditional software development: requirements, implementation, verification, release, and iteration.

AI is of course different. The first thing we need to think about is to separate out the training and the serving pipeline.

So here you have requirements, data sources, model design, and model release, and you need to iterate on that. And then you have to have a separate branch for the serving.

And here, basically, what you do is take the trained models and bundle them, and then you have two options: either you serve them in a batch way, so in some sort of chunks, or you make the model available for streaming inferences. Those are the two fundamental branches, which then also need to talk to the traditional software branch, of course, and there's going to be a feedback mechanism between them as well. And that's how the overall production life cycle in the new world looks. How can you do that, with all of that translated into applications?
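The split just described, a training branch that releases a model and a serving branch that offers it in batch or streaming mode, can be sketched as follows. This is a toy illustration under stated assumptions: the function names are invented for the example, and the "model" is a one-parameter least-squares fit rather than anything from the talk.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Model = Callable[[float], float]


def train(samples: List[Tuple[float, float]]) -> Model:
    """Training branch stand-in: fit y = slope * x by least squares through the origin."""
    slope = sum(x * y for x, y in samples) / sum(x * x for x, _ in samples)
    return lambda x: slope * x


def serve_batch(model: Model, inputs: List[float]) -> List[float]:
    """Batch serving: score a whole chunk of inputs in one call."""
    return [model(x) for x in inputs]


def serve_stream(model: Model, inputs: Iterable[float]) -> Iterator[float]:
    """Stream serving: yield one inference per incoming event, as it arrives."""
    for x in inputs:
        yield model(x)


# The training branch releases a model; both serving branches consume the same one.
model = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
batch_result = serve_batch(model, [1.0, 2.0])
stream_result = list(serve_stream(model, iter([4.0])))

print(batch_result)
print(stream_result)
```

Because both serving functions take the released model as an argument, the training branch can iterate and release a new model without the serving code changing.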

You can do it ad hoc. You can build individual pipelines for the different components I just explained, such as training, using certain programming languages and certain integration pipelines. That's very fast; however, it's impossible to monitor, scale and maintain. So you need to say: well, I can do a quick prototype, but this will not be a scalable system. How to solve that? Look at the left side: you can create an expanding data science toolbox.

So you let people, your developers, data scientists, ML engineers, data engineers, use whatever they like under a certain framework. As one example, treat the model as data, meaning standardize your interfaces. So whatever language or system is being used, the models come in a particular format, which can then be used in the serving pipeline. Similarly, on the serving side, the software toolbox is typically static. Semi-static, I would say.
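One way to picture "model as data" is a small, agreed artifact format that the training side writes and the serving side reads. The sketch below is an assumption-laden illustration: the JSON schema, its field names and the linear model are made up for the example. In practice an established exchange format such as ONNX or PMML can play this role.

```python
import json
from typing import Dict, List


def export_model(weights: List[float], bias: float, feature_names: List[str]) -> str:
    """Training side: write the model as plain data in an agreed (hypothetical) schema."""
    return json.dumps({
        "schema_version": 1,
        "kind": "linear",
        "features": feature_names,
        "weights": weights,
        "bias": bias,
    })


def load_and_score(artifact: str, row: Dict[str, float]) -> float:
    """Serving side: depends only on the schema, not on how the model was trained."""
    spec = json.loads(artifact)
    if spec["schema_version"] != 1 or spec["kind"] != "linear":
        raise ValueError("unsupported model artifact")
    return spec["bias"] + sum(
        weight * row[name] for weight, name in zip(spec["weights"], spec["features"])
    )


# Whatever tool produced the weights, serving only ever sees the artifact.
artifact = export_model([2.0, -1.0], 0.5, ["load", "latency"])
score = load_and_score(artifact, {"load": 3.0, "latency": 4.0})
print(score)
```

The version and kind checks are what keep the serving toolbox semi-static: a new model family means adding one more supported `kind`, not rewriting the serving pipeline.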

And here, in order to control this environment, which is being fed by a very dynamic environment of training pipelines, updates and so on, what you can do is try to come up with a global unified framework, so you can orchestrate all the resources, the models and everything. I'm going to skip this slide in the interest of time, just so that I can show you two applications. The first one is from the automotive industry.

Typically, when we have autonomous vehicles, they have all the sensors so that they can drive independently, in an automated way, by sensing the environment around them. What they also need is certain mapping functionality, which should serve as a backup, so that if a certain sensor fails, we can augment the sensor data through this mapping functionality. And in order to create that map, of course you can do it by hand; traditionally it has been done by hand, but this is really, really complicated.

Some of the problems are: it's slow; it's inaccurate, not really centimeter-level accurate, so to say, across the whole globe; it's not scalable, as we cannot do it again and again; and it's extremely expensive. The goal: come up with an automated way of creating these maps. So that's what we did. We looked at aerial imagery of the world and created those maps by doing semantic segmentation on those pictures and then stitching them together. Here I just give you some lessons learned from that. Starting from the top left, we divided it into four sprints, for example.

And each of them is a three-week sprint. The first sprint: look at the data sources and define the success metrics. Sprint two: create lane-marking models and individual deep learning models for different things. This is also something to remember: practically speaking, you usually cannot build one model for everything; you typically have multiple models for multiple things. Sprint three: make sure that you can cater for different kinds of inputs.

Meaning, for example, if an image has a different resolution, will the model still work in that particular system? And also give semantics to your data: if there is a lane detected, how can the system know that it's a lane? And finally, work on the entire pipeline automation. Lessons learned: first, data gathering. Of course you can find some open-source data, but that's at low resolution; high-resolution data is really expensive. So data gathering is a critical piece here, especially in terms of investment as well.

Then you have the pipeline topic: pipeline automation is not as easy as you might think. And finally, the system integration, because such a map is not a product by itself; rather, it's a service which has to be integrated into a car vendor's systems, for example. And you need to have the philosophy I explained on the previous slide in terms of a global unified architecture. The second application, which I'll introduce in two minutes, is in the telecommunications area. Here, typically, when you have a network, you need to plan resources.

For example, if you're in area A and somebody's in area B, and both of you want to watch something or communicate or whatever, the availability has to be really, really high. What that means is that the network providers need to plan these networks so that the bandwidth and the resources are always available. Typically this has been done using predefined rules, very cumbersome engineering models, some manual fitting and so on.

Again, all the problems which you saw on the previous slide apply: even though it's a very different industry and a very different use case, the problems are the same: slow, unscalable, expensive and so on. So rather, what you can do is, again, automate part of that through an AI system. In this case, again, there are four sprints, starting from the top left: what are the data sources, that is, the network topology, how the networks will look, and what are the success metrics? Then look at the different features. If you look at all the network features, you have very, very high dimensionality, so that's not a good idea.

So reduce your features. Sprint three: build the deep learning model with data augmentation. As opposed to the, let's say, typical belief that telecommunications has a lot of data, it's actually very difficult to get that data from the right stakeholders, because there are many, many different stakeholders who sit on this data and it's not that easy to get access. So you will always need to have some kind of data augmentation strategy. And finally, use different models.

You don't always have to use the most complicated models, using ensemble models and so on; put in something which can be integrated into your production system. That's the most important thing. Lessons learned: keep it simple. Simple models are already improving the utilization rate by 10% in this example. So earlier, let's say, we were utilizing 70% of the network; with those models we are at 77%, for example.

Monitor the adoption rates: how much are these models used? Do these, let's say, machine learning models scale? If they're not getting used anymore because the designer sees that, okay, well, I can do better than that manually, then nobody's going to use them. So monitor those things. And of course, educate and explain: have models which are explainable and show clear KPIs. To conclude: add end-to-end process thinking to your business unit, so that use cases are more successful. AI is definitely sexy, and it is a huge hype.

Everybody is doing AI. Production is not, but that's where the real value is.

POCs, your proofs of concept, will not scale if they lack a unified architecture. And finally, resource patterns: this is the slide I skipped for the sake of time, but you also need to monitor the resource patterns, in terms of how much compute and storage training takes, exploration takes, and serving takes, and try to manage your orchestration pipeline based on that.

Okay. So with that, I'm going to thank you, and I'm happy to take any questions.
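As a closing illustration of the resource-patterns point, the sketch below aggregates compute and storage usage per pipeline stage. The stage names follow the talk (training, exploration, serving), but the log format, the function names and all the numbers are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (pipeline stage, CPU-hours, storage in GB). The values are invented.
usage_log: List[Tuple[str, float, float]] = [
    ("training", 40.0, 200.0),
    ("exploration", 10.0, 50.0),
    ("serving", 25.0, 20.0),
    ("training", 60.0, 100.0),
]


def summarize(log: List[Tuple[str, float, float]]) -> Dict[str, Dict[str, float]]:
    """Sum compute and storage per stage."""
    totals = defaultdict(lambda: {"cpu_hours": 0.0, "storage_gb": 0.0})
    for stage, cpu_hours, storage_gb in log:
        totals[stage]["cpu_hours"] += cpu_hours
        totals[stage]["storage_gb"] += storage_gb
    return dict(totals)


def compute_share(totals: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Fraction of total compute consumed by each stage."""
    grand_total = sum(stage_totals["cpu_hours"] for stage_totals in totals.values())
    return {stage: t["cpu_hours"] / grand_total for stage, t in totals.items()}


totals = summarize(usage_log)
shares = compute_share(totals)
heaviest_stage = max(shares, key=shares.get)
print(totals)
print(heaviest_stage)
```

A summary like this is the kind of input an orchestration layer could use to decide where to allocate scarce compute and storage.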
