Yes. Oh that's pretty noisy. Thank you very much.
Yeah, I'm Tosin, if you didn't guess. So we want to talk about externalizing authorization, and what we needed to do to make this work at Zalando.
With every new technology you introduce, once you move beyond the first steps, there are a lot of interesting challenges you need to solve. One in particular that we needed to tackle was the topic of ownership.
It helps to give you some context on why that matters.
At Zalando, we have around 3,000 applications developed by over 2,000 engineers, and we do up to 3,000 production deployments per week.
Our starting point for this journey is probably familiar to a lot of people: it started with configuring access directly in the application itself. But with this number of applications and engineers, and with a rather liberal engineering culture, you see a lot of programming languages, and even more frameworks, in use to configure access.
And with that, reviewing these policies and making sure they fulfill what they need to do becomes a challenge. The solution, of course, is to externalize authorization and move to a common framework. That gives us standardization, which allows us to do reviews and knowledge sharing, but also to create abstractions and introduce libraries that make writing policies easier. It also gives us a single pane of glass for authorization: auditing, feeding the information into the SIEM system, and the ability to reason about access.
So we can now look into why a certain request was allowed or denied. And finally, we can move the concern into the infrastructure, which ideally makes life easier for developers. If you're interested in the details of how we set this up, and especially how we connected it with our governance systems to make access requestable, I encourage you to have a look at the talk two of our colleagues gave two years ago, also here at this conference.
Now that we've solved this problem, did we just create a new one?
One of the things we have to make sure of is that our policy is correct and allows legitimate access. If we prevent legitimate access, we have an operational incident on our hands, which can potentially cost the company millions. In the past this was covered by the same processes and checks we have in place for application code. But with the policy externalized, it now has its own independent lifecycle. As a quick reminder, here is what a development and deployment process can look like.
You start with developing your code and you run local tests. You commit to a version control system, and a continuous integration process makes sure the change can be merged into the main branch. Then you have a review by another engineer, then a deployment, ideally automated. Finally, we make sure the deployment works using our monitoring tools, and if it doesn't, we have a rollback process in place that allows us to go back to the previous version. What do we have for externalized authorization?
We use Open Policy Agent to run the policies and Styra DAS as a control plane. That gives us unit testing using Open Policy Agent and the Rego language, and it gives us control over policy deployments using Styra DAS. But the question now is: is this sufficient?
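To make the unit-testing step concrete, here is a minimal sketch in Python rather than Rego, so it stays self-contained; the policy, resources, and roles are invented for illustration and are not Zalando's actual rules. It mimics what a Rego unit test asserts: that the policy produces the expected allow/deny decision for specific inputs.

```python
# Toy stand-in for a Rego "allow" rule: permit access when the user holds
# the role required by the resource. Purely illustrative.
def allow(inp):
    required = {"orders": "sales", "payroll": "hr"}.get(inp["resource"])
    return required is not None and required in inp["user"]["roles"]

# Mimic Rego unit tests: one small assertion per expected decision.
def test_sales_user_can_read_orders():
    assert allow({"resource": "orders", "user": {"roles": ["sales"]}})

def test_sales_user_cannot_read_payroll():
    assert not allow({"resource": "payroll", "user": {"roles": ["sales"]}})

test_sales_user_can_read_orders()
test_sales_user_cannot_read_payroll()
```

In real Rego, each of these would be a `test_`-prefixed rule executed by `opa test`.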
So, this is interesting. There are some slides missing, but I can talk you through them anyway.
For Zalando, that is not the case. Because of our scale, no single team can manage all the access policies — not just because of the workload, but also because no single team can know all the different access requirements. So we need to delegate this. Another thing to consider is that changing a policy might also change the behavior of the application, so the team that owns the application running in production needs to be aware of the policy change.
So when we started looking for an owner for these policies, the natural candidate quickly became the engineering team that owns the application. The question then becomes: what do these engineers need in order to fulfill this ownership responsibility? For the answer to that question, I hand over to Pushpalanka.
Yeah,
Thank you, Tosin.
To answer the question of what engineering teams need to own authorization policies, we can look at what enables them to own applications, and apply the same to authorization policies. We believe it is the ability of engineering teams to independently take applications through the software development lifecycle that enables them to own applications. So the next question is how we apply the same to authorization policies.
In other words, we are trying to treat policy as code — not just at the authoring level, but throughout the end-to-end software development lifecycle. Some of these steps are straightforward, but some are not. Let's look into the details. Starting with development, we can use version control for authorization policies as well. Similar to applications, we already use GitHub in-house, so we use GitHub as the policy repository too.
Then there is the fact that Rego is a new policy language coming into the company.
We don't want to force 2,000-plus engineers to learn a new policy language. To reduce this overhead on developers and flatten the learning curve, we have introduced several libraries. Their function is to provide an abstraction layer, rather than letting developers deal with the complexities of the policies. We support common functions, such as checking whether a particular user has access to a particular resource — like this has-access function. We also use libraries to share commonly needed data sources, like employee access data, and mocks as well.
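The shape of such a library helper can be sketched as follows. This is a hedged illustration, not Zalando's actual library: the function name `has_access`, the data layout, and the permission strings are all assumptions standing in for the shared "employee access data" source mentioned above.

```python
# Illustrative shared data source (in practice this would be loaded from
# the employee-access data the libraries distribute, not hard-coded).
EMPLOYEE_ACCESS = {
    "alice": {"warehouse:read", "warehouse:write"},
    "bob": {"warehouse:read"},
}

def has_access(user, resource, action):
    """Hide the raw data layout behind the single call policy authors use."""
    return f"{resource}:{action}" in EMPLOYEE_ACCESS.get(user, set())
```

A policy author then writes `has_access("bob", "warehouse", "write")` and never touches the underlying data structure.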
In addition to providing developers an abstraction layer via libraries, we could guide them towards standardization and best practices, so that they adhere to a structured access data format. The next important point for development is the ability to test these policies locally.
The main challenge here is that the access data from the environment is not available locally. For that, we provide mocks.
The mocks basically represent the different personas we have in the workforce, mock the resource types, and mock the HTTP requests we handle via Open Policy Agent. The next thing is that with the new language come new tools and new UIs.
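As a rough sketch of what such mocks can look like: a persona, plus a helper that assembles an OPA-style `input` document from a mocked HTTP request. The field names and persona are illustrative assumptions, not Zalando's actual mock format.

```python
# A mocked workforce persona for local policy tests (illustrative).
PERSONA_WAREHOUSE_OPERATOR = {"id": "mock-op-1", "roles": ["warehouse_operator"]}

def mock_input(persona, method, path):
    """Build an OPA-style input document from a mocked HTTP request."""
    return {
        "user": persona,
        "request": {"method": method, "path": path.strip("/").split("/")},
    }
```

A local test would feed `mock_input(PERSONA_WAREHOUSE_OPERATOR, "GET", "/stock/42")` to the policy instead of a live request.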
Again, we don't want to force 2,000-plus engineers to learn these new tools and commands, or get familiar with the new UIs. For that, we have an in-house CLI tool that wraps the set of tools we are using. This tool handles the library dependencies our policies have, and enables engineers to run tests locally and perform linting, likewise.
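The wrapper idea can be sketched like this. `opa test` and `regal lint` are the actual upstream commands for Rego testing and linting; everything else — the function names, the flag choice, and the way dependencies would be resolved before running — is an assumption for illustration.

```python
import subprocess

def build_commands(policy_dir):
    """Commands the wrapper shells out to after resolving library deps."""
    return [
        ["opa", "test", policy_dir, "--verbose"],  # run Rego unit tests
        ["regal", "lint", policy_dir],             # run the Regal linter
    ]

def run_checks(policy_dir):
    """Fail fast on the first test or lint error, like a single CLI verb."""
    for cmd in build_commands(policy_dir):
        subprocess.run(cmd, check=True)
```

An engineer would then run one wrapper command instead of learning each tool individually.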
Then, if the developer is satisfied with their local development of the policies, they create a PR, at which point our CI/CD platform triggers a pipeline.
This is enabled using the APIs of our control plane and the integration points of the CI/CD platform. This allows us to place central controls on what we take to production, and gives us visibility into which stage these changes are in. As of now, we perform mandatory test-based validations and mandatory lint-rule-based validations here, and if there are violations we block the pipeline so that the changes won't go to production.
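The gating logic itself is simple and can be sketched as below; the verdict shape and field names are invented for illustration, but the rule is the one described above: any test failure or lint violation blocks promotion.

```python
def gate(test_failures, lint_violations):
    """Central pipeline gate: any violation blocks the change."""
    if test_failures or lint_violations:
        return {"status": "blocked",
                "reasons": [*test_failures, *lint_violations]}
    return {"status": "approved", "reasons": []}
```

The control plane records the verdict, which is also what gives central visibility into where each change is stuck.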
The other advantage is that we get deployment authorization for authorization policies, so that only the designated team can take modifications into production. With this, we also peer review the policy modifications going to production. That is a straightforward step, since we use GitHub for the policy repository; the only thing is that we may need stricter approvals for highly sensitive repositories.
Now, at this point, if we switch back to the application development track, we clearly don't want these modifications to go live at PR merge.
We let the artifacts be created, but we keep control over when we take these policies to production — we would rather do a gradual traffic switch than switch everything at once. So how do we support this for authorization policies as well? For that, we have a specific deployment step for policies. Looking at the diagram, we have the policy decision point, which is Open Policy Agent, acting at the ingress, which is our policy enforcement point.
For the policy decision point to make decisions, it needs to know the policies and it needs to have data. We have separated these out into two separate bundles. Open Policy Agent gets these bundles from our bundle registry, which is an AWS S3 bucket.
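The enforcement-point side of this can be sketched as follows. OPA's HTTP Data API (`POST /v1/data/<policy-path>` with an `input` document) is real; the policy path `ingress/allow`, the port, and the input fields are assumptions for illustration.

```python
import json
import urllib.request

# Hypothetical decision endpoint; in OPA the path after /v1/data/ maps to
# the package and rule being queried.
OPA_URL = "http://localhost:8181/v1/data/ingress/allow"

def build_decision_request(user, method, path):
    """Build the HTTP request the enforcement point sends to the PDP."""
    body = json.dumps({"input": {"user": user,
                                 "request": {"method": method, "path": path}}})
    return urllib.request.Request(
        OPA_URL, data=body.encode(),
        headers={"Content-Type": "application/json"})
```

The ingress would send this request per incoming call and enforce the boolean `result` in OPA's response.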
Several bundles arrive in the bundle registry, and at any time only one of them is active for policies and one is active for data. The bundle registry gets these bundles from GitHub: it gathers the policies and publishes modifications to the registry. And then we have context data.
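The registry-with-active-marker idea, and why rollback is cheap, can be sketched in memory like this. In production the bundles would be S3 objects and the "active" marker some pointer OPA's bundle download follows; this class and its method names are illustrative assumptions.

```python
class BundleRegistry:
    """In-memory stand-in for the S3 bundle registry."""

    def __init__(self):
        self.bundles = {}  # (kind, revision) -> bundle bytes
        self.active = {}   # kind ("policy" / "data") -> active revision

    def publish(self, kind, revision, blob):
        # Publishing does NOT activate: artifacts exist before they go live.
        self.bundles[(kind, revision)] = blob

    def activate(self, kind, revision):
        # Deploy and rollback are the same operation: repoint the marker.
        assert (kind, revision) in self.bundles
        self.active[kind] = revision

    def fetch_active(self, kind):
        """What OPA would download on its next bundle poll."""
        return self.bundles[(kind, self.active[kind])]
```

Because old revisions stay in the registry, rolling back is just `activate()` with an earlier revision.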
This comes from our IGA system and some other data sources, and it is continuously updated in Open Policy Agent. This is most analogous to how applications get fresh data from databases: the data is reflected in Open Policy Agent in near real time. For policies, by contrast, we specifically control when changes become active, keeping the control with us. With that, if everything is fine, we take the modifications to production. Now that they are in production, we need to keep an eye on whether everything is going as expected.
Since Open Policy Agent is integrated with our infrastructure, we have eyes on the performance numbers and resource consumption levels. But to get details of the decision rates — for example, abnormal allow or deny rates — we get those stats from the control plane, which has a holistic view of the policy agents running in the environments. The next use case is when engineering teams need to trace one particular request and see which OPA decision caused that request to be allowed or denied.
For that, we have enriched the tracing spans going through Open Policy Agent with the OPA decision ID, so that the correlation can be built up — and this is beneficial for analytics as well. If everything looks fine in monitoring, we are good. But in the unfortunate case that there is an issue, we need to roll back as fast as possible. That is now a simple step: since we have everything in the bundle registry, it is just a matter of marking the old bundle as the new active bundle. With that, I hand back over.
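The correlation step can be sketched as a join between the two streams. OPA's decision logs really do carry a `decision_id` field; the span attribute name, the log shape, and the returned summary are simplified assumptions for illustration.

```python
def index_decisions(decision_logs):
    """Index decision-log entries by their OPA decision id."""
    return {d["decision_id"]: d for d in decision_logs}

def explain_span(span, decisions):
    """Given a trace span tagged with an OPA decision id, return the
    decision that affected this particular request (or None)."""
    d = decisions.get(span.get("opa.decision_id"))
    if d is None:
        return None
    return {"allowed": d["result"], "policy": d["path"]}
```

An engineer tracing a denied request can then jump straight from the span to the exact policy decision behind it.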
We know that was quite a lot of information — maybe too much — but the key point is that at Zalando we still consider externalized authorization an integral part of the application, and with that, the ownership lies with the engineering team that owns the application. To make this work, we had to focus on developer experience. We did this by aligning the development and deployment processes with the ones we have in place for application code.
We did it by integrating with the existing tooling that we have, and by providing abstractions and libraries that make writing policies easier. With this we reduce the learning curve, and we also increase operational excellence — so, ideally, fewer incidents. With that, we are at the end of our talk, and I think we have some time for questions.
Awesome, thank you so much. Yeah, really interesting and I'm glad that you provided some specificity around the rollback and you know, the reality of needing that rollback. So a number of questions have come in in the app. Does anybody here in the room have questions that they didn't put in the app?
Okay, so one question was: do you centrally own any audit-related controls that check how effective the implementation of the policies is in your environment?
Not yet — we are in the process of evaluating this as we roll it out. First we need to gather information, and then we will bring it into a place where we can centrally control it. But as of now, not that much work has been done in this direction.
Okay. Besides the applications that you all have developed in house, what about commercial off the shelf applications?
Are those policies being managed in the same way, or how are they handled?
For this we use a different process. What we showed is targeted at things we develop in-house. For off-the-shelf products, it depends on the product: if we have a way to integrate it with Okta, then we use Okta groups there; if it has its own, more elaborate access model, we use that and integrate it with our governance systems.
So it depends, usually.
Are there things you wish for in trying to align the two processes, or that you might be planning for in the future?
For external, so
External and internal apps?
I mean, it would be nice, but I think it's unrealistic, and maybe also a step too far. We are really focusing on the thousands of applications that we develop for running our platform and have a tool set that is tailored towards that, and another set tailored towards products that we use.
Okay. Yeah.
Sebastian
Apologies, but I just got a hiccup where you said "the thousands of applications that you've developed." Would you mind defining what an application is in that context?
It's a challenge that, of course, a lot of organizations struggle with. Ideally, an application provides some feature to the user — or to our customers, in this case — and it consists of multiple components.
So it's not one single microservice?
No, no.
You're not counting microservices.
Yeah. The thing is, it is decided by the engineering team that creates the application, so it's a self-defined scope — but the strategy is to consolidate into bigger applications that consist of multiple components.
I mean, it's not unheard of to have at least hundreds, and soon you're into the thousands. And for an organization that is very API-first, I imagine it's easy to find another hundred, right?
But just to be sure: even if you consolidate some of these applications into ones with multiple components, we will still end up with more than a thousand, I'm pretty sure.
Alright, any other questions? I think we'll wrap up here. Thank you so much, you two. Thank you.
Thank you.