Good morning everyone, and welcome to my talk. Now I have the slides, but Gerald, we don't see you right now. Originally it was planned to be on stage together with Gerald, my customer and long-standing colleague from BMW.
However, as some of you know, there was heavy rainfall in Bavaria, which resulted in floods, which resulted in cancelled trains and all that pain. So Gerald is enjoying the beach in LA, as you have seen, and giving his part of the talk from there. Today we are talking about multi-region and multi-hyperscaler setups: the crucial role of identity providers in modern IT architectures and how to ensure their availability. A super long title.
What's in it for you? For those who attended last year: I spoke with Stefanos, the product owner for access management at BMW, about migrating their IDP from an on-prem installation to the cloud and making it really good.
Some of you may remember that last year Azure had a couple of hiccups at the beginning of the year; unfortunately one of them was on the day of our go-live. And for a go-live in a remote setup, you need at least proper communication within the team: MS Teams working, Outlook working, in case anything goes wrong.
But even if everything goes well, you would like to have a confirmation and at least be in touch with your colleagues, without relying on WhatsApp, iMessage, or a completely separate infrastructure. So we started thinking: how can we make this more resilient? How can we increase availability? Starting from there, I'd like to share some background.
So what was the motivation? Every large enterprise, or at least most of them, follows a cloud strategy.
A lot of IT is put into the cloud, into the hyperscalers: AWS, Azure, Google Cloud Platform, you name it. International enterprises normally need a multi-region concept. Europe is not enough, so you have at least Europe and the US; most companies also operate in Asia, so you have something in Asia; and especially in automotive, where companies have a huge market share in China, you have to have a solution in China as well.
You need to be compliant with the Chinese cybersecurity law and to keep latencies at a workable level; you have to support on-demand scalability for certain events and an increased resilience, so that no hiccups affect, on the one hand, your customer experience, when the services relate to the customers, to the drivers of the cars, where you want to deliver "sheer driving pleasure" as BMW says; but also your employees, whether they work as white-collar workers in an office.
But even when you are a blue-collar worker in a factory: even if your IDP is not integrated into an OT setup, at least your computers are. And when you steer your machine from your desktop computer or your notebook, you suddenly have dependencies on the production facilities. What numbers are we talking about? To give you an impression of the sheer size of the BMW setup:
2,300 applications, 1.1 million identities. These are not only employees of BMW, as you can see, but a large partner and supplier ecosystem that is integrated, and 26 million authentications a day, which is a bit more than one million authentications an hour. BMW is a global company, so when Munich is sleeping, someone is working in the US, and the business goes on.
What did we do? BMW has, of course, had an access management infrastructure in the past.
We introduced the ForgeRock stack on-prem a long time ago, and as most of you know, ForgeRock and Ping have merged, so I show both logos for historic reasons. It's the ForgeRock part of the Ping business that is used there. What was our approach to bring it to the cloud? It was clear we wanted a modern platform, and BMW wanted the best-of-breed product, which back at that time was ForgeRock.
And it was very important for us, when going to the cloud and supporting various hyperscalers, to support modern paradigms as well: infrastructure as code, configuration as code, with the ultimate goal of one hundred percent automation, so that we can really act quickly.
How did we do that?
We are using our Service Layers platform, which provides a good basis for this. For those who want to learn more: my team and I are available at the iC Consult booth, and we can dive deeper later on, so reach out. Basically, it brings everything you need, build and CI/CD, so that we enable that one hundred percent automation and a setup can be stood up very quickly, which is a huge advantage when you are talking about multi-cloud setups. An important thing about how this is done cloud-wise: the common denominator is a managed Kubernetes.
So we are not using the capabilities of the hyperscalers in depth, but relying on Kubernetes to have a certain decoupling from advanced features that are specific to AWS, Microsoft Azure, or GCP.
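To make that decoupling a bit more concrete, here is a minimal sketch, not the actual Service Layers tooling, of how the same rollout logic can target managed Kubernetes clusters on different hyperscalers just by switching kubeconfig contexts; the context names, namespace, and container name are hypothetical.

```python
# Sketch only: the same Deployment rollout against EKS, AKS and GKE clusters,
# using the official Kubernetes Python client. Names below are illustrative.
from kubernetes import client, config

# One kubeconfig context per hyperscaler-managed cluster (assumed names).
CONTEXTS = ["webeam-aws-emea", "webeam-azure-emea", "webeam-gcp-emea"]
NAMESPACE = "webeam"

def rollout_image(context: str, image: str) -> None:
    """Patch the IDP core Deployment to a new image in one cluster."""
    api_client = config.new_client_from_config(context=context)
    apps = client.AppsV1Api(api_client)
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": "am-core", "image": image}  # container name is illustrative
    ]}}}}
    apps.patch_namespaced_deployment(
        name="webeam-core", namespace=NAMESPACE, body=patch
    )

if __name__ == "__main__":
    for ctx in CONTEXTS:
        # Identical call for every cloud: Kubernetes is the common denominator.
        rollout_image(ctx, "registry.example.com/webeam/am-core:1.2.3")
```

Because only the context changes, the cloud-specific parts stay at the edges of the platform rather than in the deployment logic itself.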
The global target architecture: I'm not telling you anything new here; basically, instances have to be distributed globally for the many reasons I mentioned before: latency, data protection, cybersecurity laws, and of course increased availability. But then: "service unavailable".
I'm not stopping my talk there, but when you look back at the news around the Microsoft outages, Microsoft has been affected, and this is not just to bash Microsoft. The same happened to AWS, the same happened to GCP, and the same will happen to all of the cloud providers: larger outages, smaller outages, depending on which capabilities you are using. At the end of the day, you have to ensure increased resilience, and you will feel an immediate impact when your IDP is not available.
The IDP is not just any application like a CRM or an ERP, where an outage affects business processes in a serious but more gradual way.
When you can't log in, when you can't do any authentication, when you can't use your refresh tokens to be authenticated again, you are basically locked out of doing your business immediately. The end user feels an immediate impact, and he will create load at first- and second-level support.
"Hey, I can't log in, can you check my password?" Unfortunately, it's not the password, and there is only so much the end user can debug. And you immediately have operational disruption. So even though your production line might still be able to produce cars, you as a blue-collar worker might not be able to adjust the steering mechanisms of your machines or switch the programs from type A to type B, or whatever is important. And then we have follow-up consequences, so-called ripple effects: reduced productivity and a potential revenue loss.
It's very simple: when you produce something and then don't produce for ten minutes, an hour, or two hours, you can't make it up by producing faster later on. Like hotels in the pandemic: if you don't sell a hotel room tonight, you can't sell it twice later. Most of the time you can't produce things faster afterwards; it's not software that you can just copy. You might also open up potential security vulnerabilities, because you have to use workarounds.
That break-glass mechanism to reach a system is good for an emergency, but it normally bypasses some security preconditions and protections, and this might immediately lead to non-compliance issues. And with that outline of why we started this journey, I would like to hand over to Gerald, who will share a few more thoughts about the role of IT behind it.
So Gerald, over to you.
Yeah, Heiko, thank you very much. I will share my screen, so let's first check if this works. I hope you can see the full screen. And you can hear me, right?
Yes.
Okay, then. Unfortunately I cannot be present in person today, as Heiko mentioned. If I were, I would have kicked it off with a question to all of you: who owns five cars?
I can't do that as I'm not there, but if I had asked this question to my little son, he's six years old, he would have looked at me: five cars? No, I have 150 cars. And that's true: he has this big bag of small Matchbox cars with roundabout 150 cars inside.
So why did I pick these two numbers, five and 150? Five sounds like quite a lot.
If our system, the access management system, which is called WebEAM, Web Enterprise Access Management, inside BMW, is down for one minute, BMW loses five cars. That means five cars are not produced. If it is down for 30 minutes, which is roughly the timeframe we have for this presentation, that is 150 cars which are not produced.
150 cars which are not produced means a loss in revenue of 10 million euros and a loss in profit of roundabout a quarter of a million euros. So this is a real impact for BMW, and that's why resilience is very, very crucial for our product especially.
So let's take a look at the reaction to losing cars.
If my son loses five cars, he goes, damn it, but not a big deal, right? If BMW loses five cars, the BMW managers go, yeah, damn it, but not a big deal. So the reactions are quite similar. Compared to losing 150 cars:
My son would react by whining, crying, kicking me, making me responsible for losing his cars and so on. BMW managers would go, damn it, how can we avoid losing 150 cars in the future? So the reactions are quite different, and to be honest, as a BMW employee I'm very glad that the reaction is different.
So wait, let me go to the point. Doesn't it work? Sorry.
Ah, next slide. So we identified, and this began back in 2016 and the following years, some challenges and pains at BMW for our software product, for the access management component. And the first and most important pain was reliability and resilience.
The second one was performance and latency. At that point in time, we saw that performance was still quite sufficient, but latency was an issue depending on where you were located globally.
For example, from South Africa we had some issues, from Asia we had some issues, because we were hosted centrally here in Europe. And we identified over time that with the change of the application architecture, the new, modern application architectures require and request more performance and lower latencies in order to implement their use cases.
So performance and latency was an issue as well.
For future readiness, we wanted to be able to use modern authentication technologies, we wanted to be able to increase our security, and of course we wanted a software product which fits our expectations concerning deployment architecture, especially to bring it into the cloud in the future. And the last topic is time to market. We wanted to get rid of the whole BMW infrastructure we had to take care of, in order to be able to focus on the feature development of our product.
So what did we do? Sorry, I forgot that: the early conclusion was that we need a highly resilient authentication-as-a-service solution, but it still needs to be very customizable for our needs. So how did we achieve that, or how did we solve the problems, or how are we about to solve them? I'd like to show you the four steps which we took, or which we are about to take, in order to solve that. The first step, which took us four years, is the migration to a state-of-the-art software product.
We migrated, beginning in 2017, to ForgeRock AM, and with that we set the base for all the follow-up steps which we wanted to perform. So why did we go for ForgeRock? Our existing product was not future-proof, it was kind of outdated. ForgeRock offered us most of the features we needed out of the box and was still customizable so that we could implement the remaining parts. It was cloud-ready and, very important for us, highly automatable, so we could use all the APIs they offered.
So how did our infrastructure look when we changed the software product? We still installed ForgeRock AM on Linux servers in our corporate network.
This instance was used by employees from the corporate network directly and from the internet through our DMZ using a reverse proxy. So why didn't we put it into the cloud directly? Because the BMW cloud strategy was not yet at the point where we were convinced that we could bring our critical infrastructure to the cloud. That's why we put it on traditional servers in the first step. How did this impact our four challenges? For reliability and resilience, there was no impact, because it's just a change of the product, nothing else.
But we saw better performance with the new product; latency, of course, was not impacted either, because there was still only one deployment region. Future readiness: this was a big impact because, as mentioned, it was the basis for all of our next steps, which you will see. And also time to market. Why? Because we used the automation options of ForgeRock and created a self-service, so that each and every application was able to onboard automatically on their own, and no human resources of our team were necessary to do so. That was a really big step for us.
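Just to illustrate what such API-driven self-service onboarding can look like, here is a minimal sketch using the standard OIDC dynamic client registration flow, which ForgeRock AM can expose; the endpoint URL, initial access token, and client metadata are placeholders, not BMW's actual self-service.

```python
# Sketch: onboarding an application as an OAuth2/OIDC client via the standard
# dynamic client registration endpoint. URL and token are placeholders only.
import requests

REGISTRATION_ENDPOINT = "https://idp.example.com/am/oauth2/register"  # assumed
INITIAL_ACCESS_TOKEN = "<token-issued-to-the-self-service>"           # assumed

def register_client(app_name: str, redirect_uris: list[str]) -> dict:
    """Register a new client and return the issued client_id/client_secret."""
    metadata = {
        "client_name": app_name,
        "redirect_uris": redirect_uris,
        "grant_types": ["authorization_code", "refresh_token"],
        "response_types": ["code"],
        "token_endpoint_auth_method": "client_secret_basic",
    }
    resp = requests.post(
        REGISTRATION_ENDPOINT,
        json=metadata,
        headers={"Authorization": f"Bearer {INITIAL_ACCESS_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # contains client_id and client_secret on success

if __name__ == "__main__":
    creds = register_client("demo-app", ["https://demo-app.example.com/callback"])
    print(creds["client_id"])
```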
Why did it take four years? It didn't take us four years to set up the environment, but all the applications which needed to migrate to the new infrastructure had dependencies and priorities. These were roundabout 1,500 applications at that point in time, so the whole migration process took about three years. The next step was the migration to the cloud.
That's where we decided to go with Service Layers, because Service Layers offered us the flexibility we needed, and Service Layers is able to offer the service on different cloud providers, which we will see later on. So why did we move to the cloud? As mentioned, we wanted to get rid of the problems we had with our BMW infrastructure, because that infrastructure was not highly automated at BMW.
There were many manual processes for changes, and of course we wanted to free our resources in order to be able to focus on feature development. That's the reason why we went to the cloud and why we chose Service Layers for it. This is how our infrastructure looked after that move: now we are hosted on AWS, on the Service Layers platform, and we still have the WebEAM core, which is ForgeRock AM, plus public and private ingress. Please be aware this is a very high-level architecture; I'm focusing on the relevant parts.
There are many, many other components around that, but for simplicity reasons I'm focusing on this. So, the move to the cloud.
How did we do that?
Well, first of all, how did that impact our four targets? This was a big impact for reliability and resilience, because we got rid of the BMW infrastructure, and we saw that many, many issues in the past were rooted in the BMW infrastructure. So that was a big impact on reliability.
We also, again, saw better performance for some use cases, mainly internet use cases. Future readiness: again a big impact, because it enabled us to proceed with our strategic approach and the next steps. And time to market: as mentioned, we freed our resources and were able to focus on the development of functions.
How did we perform the migration this time? As mentioned, when we changed the software product, all the applications had to migrate one by one from the old platform to the new platform.
When we went to the cloud, which was about one year later, this approach was not accepted by our management. So we needed to find a solution where we could migrate the applications to our new environment without them having to make any changes. It was kind of a big-bang approach.
In detail, we changed our DNS entries so that they pointed not to our on-prem environment but to the new one. The big chunk of time this took was all the preparation, the communication, and the testing in the test and integration environments, in order to be sure that the big bang would work out in the end.
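Purely as an illustration of the DNS side of such a big-bang cutover, and assuming for the sake of the example a Route 53-hosted zone (the actual BMW DNS setup is not described here), the switch could look roughly like this; the zone ID and hostnames are invented.

```python
# Sketch: flipping a CNAME from the on-prem entry point to the cloud ingress
# in one change set. Zone ID and hostnames are placeholders for illustration.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"               # assumed
RECORD_NAME = "login.example.com."               # assumed IDP entry point
NEW_TARGET = "webeam-ingress.aws.example.com."   # assumed cloud ingress

def cutover() -> str:
    """UPSERT the record so all traffic goes to the new environment."""
    response = route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Big-bang cutover of the IDP entry point to the cloud",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,  # a short TTL keeps a rollback fast
                    "ResourceRecords": [{"Value": NEW_TARGET}],
                },
            }],
        },
    )
    return response["ChangeInfo"]["Status"]  # e.g. PENDING, then INSYNC

if __name__ == "__main__":
    print(cutover())
```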
But then we come to step number three, which we are about to finish currently.
So that's still work in progress, and that's our multi-region setup. What are we doing here? We are implementing WebEAM in several regions: AWS EMEA is currently in place, China is about to go live, and the US is planned to go live at the end of this year.
However, the requirement is still that we want global single sign-on. We don't want to build silos. So if you log in to an EMEA application, for example, and afterwards use a US application, you should not be forced to log in twice.
That's the big challenge here. On the other hand, other goals are keeping the regions as independent as possible, so that if one region is down, the other regions are still up and running.
And also, if one of the regions has to leave this global single sign-on for political reasons, regulatory reasons, or whatnot, this must be very, very easy to implement. So we identified that all of the regions must be able to process sessions and tokens from all other regions locally, without doing a callback.
That's our approach in order to meet the requirements as well as possible, and that's what we are currently working on.
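To make the "no callback" requirement a bit more tangible, here is a hedged sketch of how one region could validate signed tokens issued by another region using only locally cached signing keys; the issuer URLs, JWKS paths, and audience are assumptions for illustration, not the actual WebEAM design.

```python
# Sketch: each region keeps a local (replicated/cached) copy of every other
# region's JWKS, so token validation needs no cross-region call per request.
import json
import requests
import jwt
from jwt.algorithms import RSAAlgorithm

ISSUER_BASES = [  # illustrative issuer URLs, one per region
    "https://login-emea.example.com/am/oauth2",
    "https://login-us.example.com/am/oauth2",
    "https://login-cn.example.com/am/oauth2",
]

# Loaded once at startup (or pushed by replication), not on each request.
KEYS: dict[tuple[str, str], object] = {}
for issuer in ISSUER_BASES:
    jwks = requests.get(issuer + "/connect/jwk_uri", timeout=10).json()
    for jwk in jwks["keys"]:
        KEYS[(issuer, jwk["kid"])] = RSAAlgorithm.from_jwk(json.dumps(jwk))

def validate_locally(token: str) -> dict:
    """Verify a token from any region using only cached keys."""
    issuer = jwt.decode(token, options={"verify_signature": False})["iss"]
    kid = jwt.get_unverified_header(token)["kid"]
    key = KEYS[(issuer, kid)]  # no network call at this point
    return jwt.decode(token, key=key, algorithms=["RS256"],
                      audience="webeam", issuer=issuer)
```

Because the keys are already local, a token minted in EMEA can be accepted in the US region even if EMEA is temporarily unreachable.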
Again, the impact on our goals: a big impact concerning reliability and resilience, because if one region is down, it is now less harmful to BMW, since we still have two other regions that can be used.
For latency, of course, it's a big impact, because that was the main goal of step number three, the multi-region setup: reducing latency. Future readiness: maybe not a big impact for us in access management, but for the applications using WebEAM, because with lower latency they can implement other use cases. And concerning time to market:
Well, there is no impact here, obviously. Which brings me to the fourth and last step, which is not only a multi-region but also a multi-cloud-provider setup. That's what we are currently working on; it is not yet live. Our idea is to have dedicated failover environments in each and every global region: one for the US, one for EMEA, one for China. As Heiko mentioned, there have been some outages, including bigger ones, at cloud providers.
And there was a point in time where I had a talk with one of my managers, and he asked me: what happens if your cloud provider is down and with that WebEAM is down? I just replied: well, I guess we have more severe issues at BMW than just WebEAM being down. And he replied: yes, but we have a multi-cloud strategy. We are not only using AWS, we are also using Azure and other clouds, and we also have applications running on-prem. So they would in theory still be up and running, but they could not be accessed, because WebEAM would be down.
And my reply was: hmm, let me think about it. We already had the idea to create a failover region, but that was the point in time where the idea came up to put the failover region on a different cloud provider.
That's what we are currently working on. We are currently discussing how the failover will work: whether it will be somehow automated or whether we will do it manually. But in the end, if we manage to get that implemented, then in my opinion we have reached a maturity level which is very, very high. Even inside BMW, I'm not aware of any other applications or infrastructures doing what we are doing here.
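Whether that failover ends up automated or manual, the trigger logic is conceptually simple; as a hedged sketch, a health probe could mark the primary region unhealthy after a few consecutive failures and then either flip DNS or page the on-call team. The endpoint, interval, and threshold below are invented for illustration.

```python
# Sketch: probe the primary IDP and decide when to fail over to the standby
# on another cloud provider. URL, interval and thresholds are illustrative.
import time
import requests

PRIMARY_HEALTH_URL = "https://login.example.com/health/ready"  # assumed
FAILURE_THRESHOLD = 3        # consecutive failures before failing over
PROBE_INTERVAL_SECONDS = 20

def primary_is_healthy() -> bool:
    try:
        return requests.get(PRIMARY_HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def trigger_failover() -> None:
    # Placeholder: flip DNS to the standby provider, or page the on-call team
    # if the final decision is to keep the switch manual.
    print("Primary region unhealthy - initiating failover to standby provider")

if __name__ == "__main__":
    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            trigger_failover()
            break
        time.sleep(PROBE_INTERVAL_SECONDS)
```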
So if we manage to get that done, then, let's say, at least most of the pains we identified are solved as well as possible. And that's our goal for this year and for next year. I think that was my last slide, and with that I'd like to hand over to Heiko again.
So thank you, thank you, Gerald. Do you mind showing the last slide?
Ah yeah.
So we are now on the next slide: lessons learned. One minute left.
Ending with lessons learned: we certainly learned and experienced a lot over the long journey since 2016. Originally we had planned to do this part in dialogue; I'm just grabbing your part as well, Gerald.
I think a couple of things are very important. Automation is key, nothing new there: if you want setups at this scale, you have to automate. Communication is key, and so is top-management buy-in. Such a long journey, from 2016 to 2025 and into the future when the vision will be fully implemented, is quite a journey. It's a decade, so it takes time, and you need the awareness and commitment of your management team to do such things; basically, they have to consider it important.
If they don't, if they say "if AWS is down, if Microsoft is down, I have other problems, I'm fine with that," then it's part of risk management: you can assign it a value, you can work around it. If you want to do such big things, you need management buy-in and proper communication. When you have 2,300 integrated applications, you most likely have 2,300 application owners, probably a bit fewer. But it's not only the application owner, it's the team around the application. So you affect a lot of people who are your sparring partners, who have to play along with you, and you have to consider their needs.
So you can't do these big changes without ensuring end-to-end testing, without having their availability end to end, in case they have to adapt things, in case they have to remove dirty hacks from the past, in case they have to remove some shadow IT they have built up. And on the negative side:
Expect the unexpected: on the journey you will see a lot of things you have not planned for, and for sure application resilience training, we had a lot of learnings there as well.
And even then, we were surprised on a go-live day that a cloud provider within the communication stack, Exchange and Teams, can have hiccups and basically stop your go-live, when everyone is super excited to execute it. For a project team it's always a huge emotional thing as well: you are happy to make it, and then shortly before the peak you have to push it out by another two weeks or so. And if you can go to the next slide, Gerald, please. Building on our success: regular and clear communication.
And I would just point out the last point before stopping: identify the single points of failure. We have now addressed the hyperscaler setup, but there might be other single points of failure as well. So always reflect on your IT architecture, on your setup: what happens if a certain thing has an outage, and what will the consequences be for your overall setup? Perhaps I have a minute for some questions from the audience, I don't know.
Well, thank you very much, Heiko, and Gerald as well. That was a really thought-provoking presentation.
Unfortunately we don't have any time left for questions now, but of course you can always talk to iC Consult later, and I guess they will be ready to answer all of your questions. So thank you very much again.
Thank you. Thank you very much. So.