Hello and welcome everybody.
Okay, thank you. So I'm happy to be here and I'm really glad to see you all and hope you're having some fun with cyber resilience. How is this one working?
Yeah, very good. First, who am I? Maybe some of you already know me. My name is Rista. I got a degree in engineering at the University of Kawa.
As you can see, that's a typical engineer, as you can imagine. For more than 20 years I have been in different positions at EnBW, which is a large German energy provider. I've done security consulting, security management, system design, architecture, project management and so on. Right now I'm chief architect of the identity and access management system of EnBW, and I am also a lecturer in identity and access management and cybersecurity architecture at the University of Luan.
So, since I like to draw, which I've got in common with a lot of engineers and architects, I brought a sketchbook, and we'll take a look at some sketches on practical cybersecurity architecture, in this case on cyber resilience. It's a very large field, so I could only pick out some patterns and give you some ideas you may want to work on.
Large and complex cyber systems, often spanning from legacy up to cloud, are difficult to maintain and operate, including all questions of cybersecurity.
Investing in cyber resilience will help you not only to improve cybersecurity but also to build, maintain, and operate those systems. We'll explore some basic patterns and some practical use cases. Here we go. Important note: our models are based on well-known practice and general knowledge. All examples are fictional. So what is resilience?
Yeah, I just had a look at it, and the really better question is which kind of resilience I'm going to talk about. There are various fields, social science, IT, psychology, engineering, especially building engineering, where the term resilience is used, and the definitions are quite different. There are narrower and there are wider definitions of the term. One of the narrow ones I found is simply strength, or some elasticity or robustness.
The wider definitions contain something like the capability of a system for self-restoring, self-healing or self-improving. What is also common, and really funny: the definition of the term from one field is used in a completely different field. So you take the definition of resilience from psychology and put it onto a cyber system. That is common practice. I was astonished. So in my talk, all my definitions are within the reduced scope of cyber systems.
I won't go into social systems and I won't go into psychology, so you don't have to worry. So my definition is: the ability of a system to react to stress and restore its function or functions. So what is a system? A system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole. Well, that's quite abstract. In practice it may be any algorithm, any application, any combination of software, hardware and cloud services, and it may include associated people, processes and organization.
And stress: any action placing a strain on the system, which means something unwanted occurs and puts some stress on your system, and above a significant level of stress it will have a negative impact on the system. It doesn't necessarily have to be cybersecurity stress. This can be any kind of stress. I mean, it can be stress from good traffic where just the quantity is too much. The source of stress can be exterior, from outside, or it can be created within the system.
Maybe something has a malfunction in here. And the reason for stress can be unexpected, unintended or incorrect, like an error or malfunction, or it can be a malicious interaction. Okay, now we've got all the terms and we can go into the deep dive. What is not meant by resilience in my scope? Not meant is simple elasticity, which is a simple form of resilience. Common engineering meaning in IT:
It's usually a pattern to provide resources on demand, mainly used in cloud computing.
That means if you run a web server farm and you are Amazon and it's sometime before Christmas, you want to add some servers so you can cope with the storm of your customers, and after Christmas you will downsize the farm again. That is called elasticity, and sometimes it is called resilience, but that is not really resilience, it's just scaling on demand. What is also not resilience on its own is high availability. That means a concept to reduce expected or unexpected downtime, mainly used in building and operating cluster systems. HA clusters, that should be well known to us.
And simple robustness, which is a term often used in IT development. In general, it means the ability to tolerate perturbations that might affect the system's functional body. In IT, it means the ability of a system to cope with errors, mainly used in the field of programming. So if you get an error, go to try/catch, or do unit tests in test-driven development, and things like that. Fault tolerance is also a similar definition to robustness.
These concepts are specific to a system type or environment, a well-defined sort of stress or a well-defined selection of stress, and a well-defined reaction to it. When placed in the right position, they can contribute to resilience, but they are not the main aspect of resilience.
So resilience as a more general concept is suitable for complex distributed systems and unforeseen stress, and in general it includes complex adaptive reactions.
Okay, so when is it suitable to talk about resilience and when is it an unsuitable concept? Is it the solution for everything? Let's take a look at what is suitable and what is unsuitable. Unsuitable is if your system, or the problem you're looking at, or the solution you're looking for is of very small, trivial size; or it's a very closed system, you can't get in, you can't get out and nothing changes inside; or it's a completely fixed system and you have absolutely no ability to change its behavior.
Suitable problems or systems you can address with resilience are complex open systems, so they have some interactions here; systems with a number of elements above trivial; distributed systems, which are in several places; frequently changing systems; and maybe you have some adapting parts. So if one thing is broken down here, the arrow, the path, goes another way.
A suitable system may be composed of unsuitable systems, including some kind of orchestration.
Okay, I told you it would be hard stuff. So we'll go into a model for topology. For system design we define some levels. We've got our system level, and the subsystem or service level, as I call it. And up there we've got what I would call a compound system. The term system of systems is also used, but it's usually used at a really large scale, when you have many countries, half a continent or whatever. I won't use system of systems, so I take it one size smaller: it's a compound system. In system design, one service or one subsystem is made out of components.
Maybe you have something which is a graphical interface, some logic, and something to store some data, and altogether it makes up a service. Some services working together produce a system. A system is composed of several services and they have some interaction. And a compound system again is made up of several systems. Okay?
Now, the complexity down here is low, up here it's rising, and up here it's at maximum because you have a really large landscape of systems. Complexity increases with the number of elements and the amount of interaction on each level. So reducing dependencies to a minimum and avoiding hard dependencies will help us reduce complexity.
So how are we going to reduce dependencies and avoid hard dependencies? We'll put up some rules. Stress on a subsystem level should not be able to cause severe damage to the overall function or integrity.
Blocking situations above service level should be avoided. And patterns for robustness may include clustering for high availability, and they may include transactions down here at the subsystem level. From there we'll derive some resilience policies, which we will apply to our system topology here. Down on the subsystem level, we say components belong to one service only. No sharing between services. Transactions within the service, here in the inner part, may be okay, but they're not allowed outside.
So that means this database, which is inside this very small service, should not be used by any other system or service. On the system level: no service-to-service transactions; use non-blocking interaction instead.
So these arrows in here between should not contain any transactions, and I do not only mean transactions in a database sense, but also transactions in an application sense. Because if one thing is blocking, it will block the whole system.
And on the system design level and in the compound system, each system should work independently. Typical patterns here are loose coupling.
Okay, now we have some resilience policies and we'll take a look at how it can work. I just took out one pattern which we have implemented and used heavily and often, and which was quite good. The use case is: we have some data source, and it has to provision data to multiple targets. Here are our targets. That's a common situation in identity and access management, I mean we have it really often. The data is important for the services and the data will be processed further.
Okay, the data is the blue dot here, and the topology looks somehow like this: our source system is somewhere in here and the target systems are spread out somewhere here. The services belong to systems, and systems work together in some compound system, or system of systems if you prefer.
Okay, now we first go into an anti-pattern, which is a pattern you shouldn't use. We could say: okay, no matter which system a target belongs to, the source will write directly into the underlying data-bearing component. So the source will write directly into the data store of target 1 to target N. And it has some advantages: the same data in all parts of the compound. But if the transaction partly fails, the provisioning will fail in total, because we put this in a kind of transaction. It's kind of invasive and low level.
And in this case we designed it as a multi-party transaction, so everyone really has the same data, which is what we want to have. But to fail is a likely situation in a complex system compound, and that means the provisioning will get stuck. Okay? The reaction is that we go for centralization. That is just turning it around, and we say everyone pulls data from the data-bearing component at S. So we've got a central data storage and we just pull the data. That is non-blocking because everyone pulls the data when it's ready to pull the data.
You may have to live with temporarily different data in the targets, the source is a single point of failure, and it depends on a trigger to pull the data, maybe something time-based or event-based.
But unexpected out-of-sync data situations cause stress, and the trigger concept is probably fragile. There should be something better. Now we go on with our example: we'll build a subscription service here. We create a subscription service. The other services may be senders of or subscribers to messages.
So S will be a sender and the others will be subscribers to data changes. S sends a message to A, the subscription service: "I have new data." A sends a message to all subscribers, which are all the targets T1 to Tn: "New data available at S." The message contains only a reference to the object, not the actual data.
To avoid congestion collapse, use adaptive buffering on all sides, similar to Nagle 84: buffer incoming and outgoing messages, combine messages for the same object into one, and make time frames adaptive to the traffic situation. Nagle 84 is an algorithm used in networking, from the year '84 last century.
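The buffering idea just described can be sketched roughly as follows. This is a minimal illustration and not the implementation from the talk: the class name `CoalescingBuffer`, the window limits and the doubling and halving rule are all assumptions made for the example. It only shows the two ingredients named above, merging messages for the same object and adapting the time frame to the traffic.

```python
from dataclasses import dataclass, field


@dataclass
class CoalescingBuffer:
    """Buffers change notifications and merges those for the same object."""
    min_window: float = 0.1   # seconds; hypothetical limits, configurable per service
    max_window: float = 5.0
    window: float = 0.1       # current time frame between two flushes
    pending: dict = field(default_factory=dict)  # object reference -> merged message count

    def add(self, object_ref: str) -> None:
        # Several messages about the same object collapse into one entry.
        self.pending[object_ref] = self.pending.get(object_ref, 0) + 1

    def flush(self) -> list:
        """Return one reference per changed object and adapt the window."""
        merged = sum(self.pending.values())
        refs = list(self.pending)
        self.pending.clear()
        if merged > len(refs):
            # Duplicates were merged: traffic is heavy, so wait longer next time.
            self.window = min(self.window * 2, self.max_window)
        else:
            # Nothing was merged: traffic is light, so react faster.
            self.window = max(self.window / 2, self.min_window)
        return refs
```

A caller would invoke `flush()` once per `window` seconds and forward only the returned references, so a burst of updates to one object leaves the buffer as a single message.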
Well, what happens next is that those targets get the messages and, when they're ready to pull the data, they'll pull the data from the source system. The parameters of buffering are individual to each service and should be configurable. And as a result, the data is pulled in.
Further processing is possible. Since the messages contain only a reference to the data but not the actual data, messages do not need to have an order. Messages do not need to be delivered. And even if a message has a massive delay, it's no problem, because the moment the data is pulled, it will be fresh data.
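To make the point about ordering and delay concrete, here is a small sketch under stated assumptions. The classes `Source` and `Target` and their methods are hypothetical names for this example only. The notification carries just a reference, and the target pulls the current value whenever it decides to process.

```python
class Source:
    """Holds the leading data; targets pull from it when they are ready."""

    def __init__(self):
        self.objects = {}

    def update(self, ref, value):
        self.objects[ref] = value

    def get(self, ref):
        return self.objects[ref]


class Target:
    """Receives references only and decides itself when to pull."""

    def __init__(self, source):
        self.source = source
        self.inbox = set()   # unordered on purpose: references, no payload
        self.local = {}

    def notify(self, ref):
        # Non-blocking: just remember that something changed.
        self.inbox.add(ref)

    def process(self):
        # Pull at a time of the target's choosing; the pull returns current data.
        while self.inbox:
            ref = self.inbox.pop()
            self.local[ref] = self.source.get(ref)
```

If two notifications for the same object arrive late or in the wrong order, the target still ends up with the latest value, because nothing stale travels inside the message.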
The pattern can be extended to a complete orchestration of a complex compound system, as I put it in here. But what happens if a message gets lost? How do we get this one moving if the message drops?
Well, no problem. You've got to implement a fallback mechanism: schedule triggers inside each target to execute and do a timestamp-based synchronization from the last known good update, and then you'll have your fix.
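A timestamp-based fallback of this kind might look like the sketch below. The function name `sync_since` and the record layout (a mapping from reference to a value and a change timestamp) are assumptions for illustration; a real source would expose some query for "everything changed since T".

```python
def sync_since(source_records, local_store, last_good):
    """Pull everything changed after last_good and return the new watermark.

    source_records maps a reference to (value, changed_at); this layout is
    hypothetical and stands in for whatever the source system offers.
    """
    newest = last_good
    for ref, (value, changed_at) in source_records.items():
        if changed_at > last_good:
            local_store[ref] = value           # repair what a lost message missed
            newest = max(newest, changed_at)   # remember the last known good update
    return newest
```

Each target would run this from its own scheduled trigger and keep the returned watermark for the next run, so a dropped notification only delays an update until the next scheduled synchronization.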
So, errors, and an enabling assumption: consider errors as regular. Errors are a regular stress situation on the system. There are non-serious errors, which means no serious consequences, or maybe they're just temporary. Those you might want to accept. And there are serious errors. These you want to avoid or reduce.
That means serious or irreversible consequences. The right resilience patterns help you reduce the effects of errors. That means you can transform these serious errors up here into tolerable situations.
Example: a complex inhomogeneous multi-party system, a distributed system partly out of your control. Like a complex IAM environment: multiple companies with distributed data sources, various data targets, and something in the middle. The identity and access management system is under your control. Everything else is out of your control and out of discussion; you can't get control of it. Or the typical provider situation: you have your customers, which are completely out of your control, and you have your backend system, which should be safe.
And you have this in-between system where you run the service you're providing to the customers. Or a typical multi-party, multi-cloud business application, which runs over many parts of different clouds. And also the promising combination of on-premises legacy and cloud systems.
The right resilience patterns allow you to build complex distributed systems that still work. So, a four-step guide to resilience. Step number one, system design: design a system topology with systems, subsystems and compound systems.
Identify the right system borders; assign a dedicated purpose and non-overlapping functions to each part. Step two, rules for system interaction: design a rule set for interaction on each level of your topology and for overall orchestration. Enforce loose coupling. Avoid dependencies as far as possible. Step number three, design extra logic for resilience: you might need to add some missing parts, some adaptive functions or algorithms, like the example I showed you. And step four, enforce resilience.
Make sure that nothing is added which will break the rules above, which means you've got to have a hard talk with your developers, with your architects, and with the ones who are building and changing the system.
So resilience is not easy to build and is hard to maintain. Prefab IT tools may not fit in, or will pull in the wrong direction:
hard dependencies, for example. Implementation will need extra work and uses unusual patterns. Developers may find it hard to understand what you're doing there, because everything they're used to you're not using, and you're using very strange patterns. And operating a complex adaptive system is unusual. Administrators have to get used to it. Your operating staff will have some questions. And resilience is a matter of IT and business: business services have to be ready to work in such an environment.
And if your management ever has the question: we do not have high availability, do we, is this a problem? Well, that doesn't matter, because we've got resilience. That is something to consider.
Yeah, summary: resilience is more than just high availability. It is a concept to cope with complexity. Resilience is an essential part of good architecture, in my opinion. The benefit for business: it enables system design in complex distributed environments, in environments you usually don't want to mess around with. And the practical models I introduced for system design and resilience might give you an idea of how to reduce complexity. The models are based on fictional examples and will need some adaptation to your company's situation, of course. And that's it. Are there any questions?
Yes, please.
Hi, I've got a question about, I think it was sketch number five, where you had pulling data from the source. Yes. So essentially you have an event-based architecture where you're not sending data with the events, but you're then just pulling data from the source. Yes. How are you dealing with availability of the source?
Okay, that's a single point of failure.
No, not so much.
If this get-data is developed in a loose pattern, you can say: well, try to get data. If you can't get data, try again. If you can't get data, wait. If you still can't get data, wait for another, longer term, and then you continue waiting. And at some time, well, you'll give it a timeout, but usually the data will be available at some point, and then you can get it.
At some point you're going to stop waiting, right? Yeah.
At some point you're going to stop waiting and call for a human.
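The retry behavior described in this answer can be sketched as below. It is only an illustration: the exception `SourceUnavailable`, the function `pull_with_backoff` and the wait times are hypothetical, and the `sleep` parameter exists so the waiting can be replaced in tests.

```python
import time


class SourceUnavailable(Exception):
    """Raised by the hypothetical fetch callable when the source cannot be reached."""


def pull_with_backoff(fetch, waits=(1, 5, 30), sleep=time.sleep):
    """Try fetch(); on failure wait a bit longer each time, then give up."""
    for wait in (0, *waits):
        if wait:
            sleep(wait)          # wait longer before each further attempt
        try:
            return fetch()       # success: hand the fresh data to the target
        except SourceUnavailable:
            continue             # source still down, go on to the next wait
    # All attempts used up: stop waiting and escalate to a human operator.
    raise SourceUnavailable("source still unavailable, call for a human")
```

Because the target owns this loop, a temporarily unavailable source only delays that one target and does not block the sender or the other targets.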
So why is that a better pattern compared to the one that actually sends the data with the event? Obviously you need to work out the order of messages, but if you sort that one out, then you have a local copy of the data, and that, to me, means better resilience.
Well, it depends. If you have order in those messages, and even if they're containing the actual payload, I mean if they're containing the data... Well, I put in only one path here. But if you have a complex system, it could happen that there are many messages, get data, get data, get data, and you want to make a decision in this service which one you want to work on first, which one second, and which one you maybe throw away and just don't work on. It gives you more flexibility.
The decision whether you really want to read the data and store it inside the target system, or inside the service, to be a little bit more general, should be the target system's decision, not the decision of the sender. Because the sender doesn't know if the target is ready, if the target is willing to process, if the target wants to process, or whatever. It is an inversion of control, and if you fit it in the right way, you can make it very loosely coupled.
Okay, thank you.
Yes,
Thank you. That was a brilliant presentation.
Okay, beautiful. Now the trouble is that, as you say, resilience is one of these portmanteau words, and in the current environment there is NIS2, which is talking about resilience, and there is DORA, amongst other things. And most organizations are not in a position to go back and redesign everything. Do you have any advice for organizations on how they're going to include extra resilience within the real systems that they have?
Yeah, it's quite easy. If you put in a new system, you should rethink: how am I going to place it? Is it going to be a central system, a leading system? Does it have the leading data, or is it just a system in between others? And what are we going to do if the system is missing? So whenever you put in a new system, you have the ability to place it, to integrate it, in some resilient way. That would be my advice if you put in new systems. If you've got really, really old legacy systems, and every one of us has them, well, I could have done the same patterns using very old legacy technology.
Instead of get data here and send messages, you could just drop data files somewhere and schedule them to be picked up somewhere.
And in there you can arrange loose coupling just by transporting data files, which is something even legacy IT is able to do. And every part in your company which you manage to organize in a resilient way, it could be a very small part, but this part won't do any harm. It will be error-proof and you don't have to spend so much time operating it. So it'll enhance the whole landscape.
It's not a pattern you have to fit all over the whole cyber system or your whole data center. You can start very small, with just one application and one data feed.
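The file drop-off mentioned in this answer could be sketched like this. The function names `drop` and `pick_up`, the JSON format and the `.tmp` naming convention are assumptions for the example; the only point is that the sender and the receiver never talk to each other directly.

```python
import json
import os
from pathlib import Path


def drop(outbox: Path, name: str, record: dict) -> Path:
    """Write a data file so that a reader never sees a half-written file."""
    tmp = outbox / (name + ".tmp")
    tmp.write_text(json.dumps(record))
    final = outbox / (name + ".json")
    os.replace(tmp, final)   # rename into place once the content is complete
    return final


def pick_up(outbox: Path) -> list:
    """Scheduled on the receiving side: read and remove all finished files."""
    records = []
    for path in sorted(outbox.glob("*.json")):
        records.append(json.loads(path.read_text()))
        path.unlink()        # processed, so remove it from the drop-off folder
    return records
```

The sender calls `drop` whenever it has data, and the receiver runs `pick_up` on its own schedule, which keeps the two sides loosely coupled even with very old technology.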
Like MQ Series has some resilience, because they stay there. It's old
technology.
Yeah, sure. But you can also use it to have transactions. So be careful.
I have to repeat it, apparently.
Yeah.
Yeah. You just mentioned MQ Series, which is a very old legacy system in which you can just place data and it can be picked up, which is a very good pattern with old systems. Yeah, it's a very good mechanism, but you also have to be careful with those message queue systems. They usually enforce transactions across services and systems, and that's something which is heavily coupling.
We actually have one question from the online audience, and I won't be reading out any specific company names, so let's rephrase it slightly.
How do you incorporate third-party suppliers, external dependencies and other companies into these resiliency patterns? Well,
One thing is, you just take a low-level, easy, non-transactional drop-off of data, for example. Or look whether they have an API, and you can just put up a mechanism like this. Usually it works. We would have to go into the specific case, but so far we have solved a lot of cases.
Okay, great. Thank you very much.
Thank you.