For those of you who are regular attendees of our EIC event, our next speaker will be no stranger to you. She brings a wealth of security and identity knowledge to the cybersecurity resilience topic. So please welcome EnBW Chief Architect, Eleni Richter. Thank you. The speaker before me was talking about the resilience of architecture, and today I will give you some patterns and ideas for building resilience into your architecture. It's just a selection of practical patterns. We don't have much time, but I'll stick around.
So if you have questions, just ask them after my talk. First of all, who am I? My name is Eleni Richter. I have a degree in engineering, specifically in industrial engineering (Wirtschaftsingenieurwesen) from the University of Karlsruhe, and I have been at EnBW for more than 20 years in several positions. I have worked as an IT security manager, IT consultant and IT system designer on several projects, been a project manager, and above all an architect. Right now I'm the Chief Architect of Identity and Organizational Data Management at EnBW, responsible for the design of the architecture.
I'm also a lecturer at the Lucerne University of Applied Sciences and Arts in IAM and cybersecurity. So I'm an engineer and an architect, and I like to draw, so I brought a sketchbook with me. Inside this sketchbook, we will take a look at some practical patterns for cyber resilience, which mainly address large and complex cyber systems, often spanning legacy to cloud systems, as the speaker before me already explained. They are difficult to maintain and operate, including all questions of cybersecurity.
Investing in cyber resilience not only helps in cybersecurity and in those really critical incidents or crises; it also helps in building, maintaining and operating these systems, so the daily baseline work gets easier. We will explore some basic patterns for building robust and resilient cyber systems, and we will reflect on them with some practical use cases: on-premises and cloud scenarios, and of course my favorite, complex identity management environments. Important: all models are based on well-known practice and general knowledge, and all examples are fictional.
We'll start with one thing. What is resilience?
Well, a better question is: what kind of resilience are we talking about? The term is used in various fields, in social science, IT, psychology and engineering, and there are narrower and wider definitions of it. You will find it used for simple strength, for elasticity, for robustness, but it can go all the way up to capabilities of self-restoring and even self-improving. It's also very common that a definition from one field gets used in another field.
My definitions of the terms stay within the reduced scope of cyber systems; I'm not going into the psychology and so on. The definition I will use is: the ability of a system to react to stress and restore its function.
Okay, the first question now is: what is a system? A system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole. That could be algorithms, applications, combinations of software and hardware; it could be a whole data center or a cloud service, and it may include associated people, processes and organizations. Then something unwanted occurs: stress, any action placing a strain on the system above a significant level, which will have a negative impact on the system.
It might not only be a ransomware attack; it can just be a short outage. You should distinguish the sources of stress: there is extrinsic stress, which comes from outside, and intrinsic stress, which happens inside the system. The reason for stress can be unexpected, unintended or incorrect behavior, like an error or malfunction, and it can of course also be malicious interaction. But remember, we look at all kinds of stress, not only at malicious interaction and not only at cybersecurity stress. It can be just operational stress because something is going wrong.
What is not meant by resilience in my definition, in my scope? Elasticity, of course. Elasticity is a simple form of resilience, very common in engineering. In IT it means a pattern to provide resources on demand, mainly used in cloud technology; it's also called scaling on demand. So if you've got a web shop in the cloud and Christmas is coming with a lot of people buying things, you will scale up the web shop, and in January or February you'll scale it down on demand. That's elasticity. Then high availability: you shouldn't mix up high availability with resilience.
High availability means the concept of reducing expected or unexpected downtimes, mainly used in building and operating cluster systems, the classical HA cluster. But it's a very narrow sense. In the field of software engineering, you will find the term robustness: the ability to tolerate perturbations that might affect the system's functions.
In IT, it's the ability of an IT system to cope with errors, mainly used in the field of programming, like 'on error goto' or try/catch, or more advanced concepts such as unit testing and TDD. Similar to robustness in IT is the term fault tolerance. These concepts are specific to a system type or an environment: you don't have HA clusters in code, and you don't have try/catch blocks in HA clusters, of course. They address a well-defined sort of stress with a well-defined reaction to it. When placed in the right position, they can contribute to resilience.
But resilience has a much larger scope and is a more general concept. It is suitable for complex distributed systems and for unforeseen stress, and in general it includes complex adaptive reactions. Is resilience a solution for everything? Let's take a deeper look at suitable and unsuitable problems, the ones you can address with resilience and the ones you cannot. Unsuitable: if you have a small or trivial system, or if you have a closed system, which you can find, for example, in an OT environment, a really closed system.
I mean truly closed, not the ones that should be closed but are still open. Also unsuitable: a completely fixed system with absolutely no ability to change its behavior. Suitable problems to address with resilience are complex open systems, which have interactions from the outside, where the number of elements is above trivial, and maybe it's a distributed system: an on-premises data center, a private cloud, several public clouds or something like that. Or it could just be your software, which runs in several parts.
Also suitable: a frequently changing system, or one with some adaptive parts inside. A suitable system may be composed of unsuitable systems, plus some kind of orchestration. So you can have some of these, and if you put them together, you could get one of those. First, let me introduce a topology model for system design, which I find rather useful. We have three levels: the system itself, which is divided into subsystems, or services as I call them, and the system of systems, which I would call the compound system.
Let's take a look at the system design. A subsystem or service is made up of components: could be a GUI, some logic, a database or whatever you're using. A system is composed of several services or subsystems. And a compound system is made up of several systems. The complexity is increasing, of course: up here we've got many parts, down here it might be much simpler. Complexity increases with the number of elements and the amount of interaction on each level. So far so good.
The principle: reduce dependencies to a minimum and avoid hard dependencies. So we look at each level. Stress on the subsystem level should not be able to cause severe damage to the overall function or integrity. Blocking situations above the service level should be avoided: inside, maybe something fails and something blocks, but it shouldn't block the whole system, and of course it shouldn't influence the behavior of the compound system. Patterns for robustness on this level may include clustering for high availability, and transactions; these are the classical patterns.
Resilience policies you can now formulate for each level. On the subsystem or service level: components belong to one service only, no sharing between services; transactions within the service may be okay. On the system level: no service-to-service transactions, use non-blocking interaction instead. And on the top level it could be: each system should work independently. The typical pattern is loose coupling. You can of course add more to your resilience policies; it's just an example or a model.
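To make the topology and the first policy a bit more concrete, here is a minimal Python sketch. It is purely illustrative and not from the talk's slides: the class names (Component, Service, System, CompoundSystem) and the check for shared components are my own assumptions about how one might model the three levels and verify the "no sharing between services" rule.

```python
# Illustrative model of the three-level topology; all names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str                        # e.g. GUI, business logic, database


@dataclass
class Service:
    name: str
    components: list = field(default_factory=list)


@dataclass
class System:
    name: str
    services: list = field(default_factory=list)


@dataclass
class CompoundSystem:
    name: str
    systems: list = field(default_factory=list)

    def shared_component_violations(self):
        """Report components owned by more than one service
        (policy: components belong to one service only)."""
        owners, violations = {}, []
        for system in self.systems:
            for service in system.services:
                for component in service.components:
                    if component.name in owners:
                        violations.append(
                            f"'{component.name}' is shared by "
                            f"'{owners[component.name]}' and '{service.name}'"
                        )
                    else:
                        owners[component.name] = service.name
        return violations


# Fictional usage: one system with two services and their components.
crm = System("CRM", services=[
    Service("frontend", [Component("GUI")]),
    Service("backend", [Component("logic"), Component("database")]),
])
landscape = CompoundSystem("landscape", systems=[crm])
print(landscape.shared_component_violations())   # [] -> policy holds
```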
Now I'll give an example of how to achieve these policies, using what I call a lookup pattern. We've got a use case: one service, provisioning data from a source S to multiple targets T1 up to TN. The data is important for the service and is processed further. The topology could look like this: services belong to systems, and some systems work together in a compound system, or SoS, which is the abbreviation for system of systems.
An anti-pattern, which we usually see, is the invasive low-level multi-part transaction: no matter which system a target belongs to, the source writes directly into the underlying data-bearing component. The benefit is the same data in all parts of the compound. But the negative part: if the transaction partly fails, the provisioning fails in total. If you've got a complex compound system of many parts, it's quite likely that something will fail, and then you will have no provisioning at all.
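As a hedged illustration of why this hurts, here is a small Python sketch of the anti-pattern, assuming hypothetical write_directly and rollback methods on each target's data-bearing component; one unhealthy target takes down the provisioning for all of them.

```python
# Illustrative sketch of the anti-pattern only; write_directly() and
# rollback() are assumed placeholders, not a real API.
def provision_all_or_nothing(record, targets):
    """Source writes directly into every target inside one big transaction."""
    written = []
    try:
        for target in targets:
            target.write_directly(record)   # invasive low-level write
            written.append(target)
    except Exception:
        # If any single target fails, everything is rolled back:
        # even healthy targets end up with no provisioning at all.
        for target in written:
            target.rollback(record)
        raise
```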
A typical reaction is to go for centralization. That's somewhat like the IDP situation.
We've got many IDPs and they don't work together, so we centralize them into one IDP; that's a good example. Everyone pulls data from the data-bearing component of S. That is non-blocking, of course. But we might temporarily have different data in target and source, and we have a single point of failure. It also depends on a trigger to pull the data, on time or on event, and an unexpected data-out-of-sync situation causes stress. The trigger concept is quite fragile.
Okay, but what now? There has to be something better.
Yes, we'll just go for my lookup pattern. You can build up a sender and subscriber mesh. You create a subscription service A; a service may be a sender or a subscriber of messages. On a data change, S sends a message to A: "I have new data." A sends a message to all subscribers: "new data available at S." The message contains only a reference to the object, not the actual data. To avoid congestion collapse, use adaptive buffering on all sides, similar to Nagle's algorithm from 1984. You can read it up; it's a very old algorithm, used in TCP stacks.
Buffer incoming and outgoing messages, combine messages for the same object into one, and make the timeframes adaptive to the traffic situation; then you won't have a traffic jam. So when the trigger is received and the target system has the time and is ready to pull the data, it will get the data. The buffering parameters can be selected individually per service. The result: data is pulled when further processing is possible, the message contains only a reference to the data, messages need no ordering, and even if a message is delayed, the pulled data is always correct and up to date.
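Here is a minimal Python sketch of how such a lookup pattern could look. It is my own simplified reading of what is described here, not an actual implementation: the SubscriptionService only fans out object references, each target coalesces repeated notifications for the same object in a buffer, and pulls the current data from the source when it has capacity. The buffering below uses a fixed time window; a real setup would make it adaptive to the traffic situation, in the spirit of Nagle's algorithm.

```python
# Sketch of the lookup pattern: reference-only messages, coalescing buffers,
# and pull-based updates. All names here are hypothetical.
import time
from collections import defaultdict


class SubscriptionService:
    """Service 'A': receives change notifications and fans them out."""

    def __init__(self):
        self.subscribers = defaultdict(list)     # source name -> list of targets

    def subscribe(self, source_name, target):
        self.subscribers[source_name].append(target)

    def notify(self, source_name, object_ref):
        # The message carries only a reference to the object, never the data.
        for target in self.subscribers[source_name]:
            target.receive(source_name, object_ref)


class Target:
    """A target that buffers notifications and pulls data when it is ready."""

    def __init__(self, name, fetch_current, buffer_seconds=5.0):
        self.name = name
        self.fetch_current = fetch_current       # callback to pull from the source
        self.buffer_seconds = buffer_seconds     # would be adaptive in practice
        self.pending = set()                     # coalesces repeated references
        self.last_flush = time.monotonic()

    def receive(self, source_name, object_ref):
        # Non-blocking for the sender: we only remember that something changed.
        self.pending.add((source_name, object_ref))

    def process_when_ready(self):
        # Called whenever the target has capacity for further processing.
        if not self.pending or time.monotonic() - self.last_flush < self.buffer_seconds:
            return
        for source_name, object_ref in sorted(self.pending):
            data = self.fetch_current(source_name, object_ref)   # always current
            print(f"{self.name}: updated {object_ref} from {source_name}: {data}")
        self.pending.clear()
        self.last_flush = time.monotonic()
```

The crucial property is that ordering and delivery guarantees stop mattering: because a message is only a hint, a delayed or duplicated notification still leads to a pull of correct, up-to-date data.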
That is quite different from a message bus. A message bus is based on transactions. You can use a message bus to build something like this, but then you have to throw away about 80 to 90% of its abilities, because you're just using it like email. You can extend this pattern to a complete orchestration of a complex compound system. But what happens if a message gets lost? Then something won't get an update.
Well, that's absolutely no problem if you put a fallback mechanism in place: scheduled triggers inside each target that execute a timestamp-based synchronization from the last known good update. Then you're quite bulletproof.
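As a rough sketch of that fallback, assuming hypothetical list_changed_since, fetch_current and apply_update functions, a scheduled reconciliation per target could look like this; it repairs any lost notification on the next run.

```python
# Hedged sketch of a timestamp-based fallback synchronization; the source and
# target interfaces (list_changed_since, fetch_current, apply_update) are
# assumed placeholders, not a real API.
from datetime import datetime, timezone


def reconcile(source, target, last_known_good: datetime) -> datetime:
    """Run on a schedule inside each target (for example nightly)."""
    run_started = datetime.now(timezone.utc)
    for object_ref in source.list_changed_since(last_known_good):
        data = source.fetch_current(object_ref)     # pull, never push
        target.apply_update(object_ref, data)
    # Advance the watermark only after a complete, successful pass, so a
    # failed run is simply retried from the same last known good point.
    return run_started
```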
Now let's take a look at errors and error handling. The assumption: consider errors as regular. They will occur; it's not only that a ransomware attack will occur somewhere, errors will simply occur, and errors are a regular stress situation on the system. You should divide them into two kinds. Non-serious errors have no serious consequences for your system, or they're just temporary; you should accept them. Serious errors have serious or irreversible consequences; you should avoid and reduce them. The right resilience patterns in your system reduce the effects of errors. For example, a complex, inhomogeneous, multi-part system, distributed and partly out of your control. It could be a complex IAM environment: multiple companies with distributed data sources and various targets, and somewhere an IDP and a complete IAM system in the middle. It could be the typical provider situation.
You've got customers that are out of your control, and you've got things you have to provide for those customers. Or you've got a multi-cloud, multi-part business application and combinations of on-premises, legacy and cloud systems. The right resilience patterns allow you to build complex distributed systems which don't get into blocking situations. So, a four-step guide to resilience.
First, system design. Design a system topology with subsystems, systems and a compound system. Identify your system borders. Assign a dedicated purpose and non-overlapping function to each part; that will put some structure into your landscape. Second, rules for system interaction: design a rule set for interaction on each level of your topology and for the overall orchestration. Enforce loose coupling and avoid dependencies as far as possible. Step number three, design extra logic for resilience: add the missing parts, maybe some adaptive functions or algorithms.
And number four, enforce resilience. Make sure that nothing is added which will break the rules above.
And do not forget: resilience is not easy to build and it is hard to maintain. Prefabricated IT tools may not fit in, or will pull you in the wrong direction, toward hard dependencies. The implementation will need extra work and use unusual patterns; developers may find it hard to understand. And operating such a system means you are operating a complex adaptive system, which is very unusual; administrators and ops teams have to get used to it. Resilience is also a matter of IT and business: business services have to be ready for resilience.
If you have business services which insist on the patterns of transactions, high availability and overall integrity, always the same data everywhere, you will find it really hard to establish resilience. So, if you did it all right, there's one question left: "We do not have high availability, do we?"
Yeah, but it doesn't matter. That's it. A responsible business person should be happy with that, but it might be quite difficult to explain.
So, a final word and conclusion. Resilience is more than just high availability; it is a concept to cope with complexity and complex situations. Resilience is an essential part of good architecture. The benefit for the business is that it enables system designs in complex distributed environments. It makes it easier to go into the cloud and into distributed scenarios without the risk propagating to the top level.
And, well, feel free to use it. The model I introduced for system design might give you an idea of how to reduce complexity.
And, well, everything is fictional; you will need some adaptation to your company's situation, of course. And that's it. Are there any questions? [Applause] So, Eleni has covered a huge amount of territory there, lots of patterns and scenarios. Is there anyone who would like to ask a follow-up question?
I mean, this is your opportunity. We have a few minutes.
Yes, please. Thanks a lot. One question. What you described to us is very complex. Yes, we need it, OK. And considering your background as security manager, project manager and designer, do you maybe have a couple of points you can share with us on how such a team in the company should work together?
I mean, who takes on the most burden? The PM, the architect or the designer?
I mean, how should these people work together? OK, I like to work in mixed teams, and we established working in mixed teams.
So we had security, of course, we had architecture, but we also had the dev and ops teams involved, and the business. You have to put them all together to make a design like that, and you've got to have someone with a strong architectural vision of how to build it. Because it somehow doesn't match perfectly with the things the tools usually support, or with what companies usually tend to sell to you: they're all on the transactional, high-availability side and they want to control dependencies. This pattern just says: don't try to manage dependencies, just avoid dependencies, live independently.
And then you will be much happier with your construction, your IT. I love that approach: don't try to manage them, just avoid them. I think that's brilliant. Any follow-up question there? Are you happy with that answer? Is there anyone else in the room? While they're thinking, we still have a few minutes in hand. I just wanted to ask you, in your experience, what are the most commonly overlooked aspects of cyber resilience in complex identity management systems? What are the things that people tend not to address and then go "oh dear" when things go wrong?
Yeah, I think it comes down to one basic question. You add some new component, let's say, or a new system, a new function, whatever, and you ask yourself: OK, how will we deploy it, maybe how do we configure it, how do we buy it, who will be the provider or is it our own programming, whatever it is, how do we run it.
Maybe we even think about the risks: OK, how do we achieve high availability or disaster recovery or whatever. But the one basic question we're usually missing is: what are we going to do if it's just not there? And that's the question which addresses resilience. We don't know what will happen; maybe it's a failure, maybe it had a hiccup, maybe it got hit by a ransomware attack or whatever. But we've got to ask ourselves: if it's not there, what are we going to do? And that's the thing we should address with resilience patterns.
And it doesn't matter why it isn't there. It's just not there. It's not working, it's not functioning, it's not reachable or whatever.
OK, thank you very much. A round of applause for Eleni Richter. Thank you.