Okay. Hello, everybody. So I'm going to be talking about how we can use two different Cloud Native Computing Foundation projects, SPIFFE and OPA, to authenticate and authorize workloads.
So first, a little bit about me. My name's Charlie. I work on the developer relations team at Styra. I'm currently an Open Policy Agent (OPA) maintainer and have been contributing since 2019. I'm interested in SPIFFE, which I'll explain more about shortly, and all things authentication and authorization, and I'm going to show how we can use SPIFFE and OPA together to achieve that in a modern environment. Prior to starting at Styra, I was working at Jetstack, which is how I first came into contact with the SPIFFE project and the SPIFFE ideas.
So some of the material today is based on my activities there and what I was interested in at the time. That's a little bit about me. Just a quick show of hands: has anybody here come across SPIFFE before? I've got one hand at the back. Okay. And how about the Open Policy Agent, OPA? Has anybody come across that before?
Okay, few more hands. Okay. Interesting.
So I wanted to start with this question: what is a workload? When I say that, a lot of you will have an idea about what a workload means to you. Something comes to mind, perhaps a virtual machine, a container, or an application. For the purposes of this presentation, when I talk about a workload, I'm talking about a running instance of an application. That might be a long-running web service of some sort, a batch job, a Lambda function, or a container running an application.
That's what I'm referring to. And since we're at an identity conference: workloads need identities to communicate with one another. So what is a workload identity?
Again, you probably have some ideas about what this means in your environments, your businesses, or your platforms, but when I talk about workload identity, I'm really talking about how a workload establishes trust: how it makes sure that it is trusted by other workloads, and how it can trust the other workloads it's interacting with. That's what I mean by workload identity.
So now we're getting onto the more interesting topics. Hopefully we've got some shared vernacular now. What makes a good workload identity?
We can all imagine lots of bad workload identities: long-lived static secrets which are shared and updated manually, or baked into container images. We can think of lots of bad ways to do it, but we're looking forward; we're trying to work out how we might do workload identity in a really good way. What would the criteria for that be?
There's a bit of prior art here: I wrote a blog post while I was working at Jetstack where I thought about this and wrote the criteria down, and I'm going to summarize them for you now. The first factor that makes a good workload identity is that it's based on a short-lived credential which can be automatically rotated.
You can imagine what that might look like in a modern environment, but that's a requirement we're going to be using.
The next requirement I'd like to add to my list of things that make a good workload identity is that it should not be susceptible to man-in-the-middle attacks. While a workload is identifying itself to another workload, it shouldn't be possible for a bad actor who happens to find themselves on the path between those two workloads to intercept the identity and replay it in some way.
Just as an example, you might think, well, I'm already using mutual TLS, I don't need to worry about this; I can share some secrets over my mutual TLS connection. But remember that other services may not be very well behaved: they may log information incorrectly, or something may be misconfigured further downstream in the chain of calls that you're making.
So your credential should be known only by you as the workload and not transferred over the wire at all, regardless of whether or not that's over an encrypted channel.
The next criterion is that it should be possible to carefully bound a workload's identity. We should be able to control where that workload identity is trusted, and that should be up to us. It shouldn't be something that's global by default.
It should be something we can roll out and control exactly as we need it, per environment or per system, however your system is set up. And the final criterion is that a caller's identity should be known by a service when it's being invoked. If I'm a workload that's a web service and someone is calling me, I should know exactly who's calling me.
It shouldn't be abstracted away by some proxy.
The reason I might like to know that is that it allows me to make policy decisions within my application, around how the data is filtered or whether the request should be allowed at all, and it also allows me to forward that identity, should it be relevant, for any downstream calls that I'm making. So here's a little comparison of some different workload identity systems you may be using, or the standard patterns that may form the system that you are using.
First, a shared secret: a password which is known by two parties, for example. It's not short-lived, so it fails on that front. It lasts until it is rotated, usually manually, which is painful. It's replayable: if somebody captures your password or shared secret, they can use it to impersonate you, so it's not a very good identity on that factor either.
The scope is not bounded.
Anybody anywhere in the world can use that shared secret to call the service that you were expecting to talk to and impersonate you; the scope is unbounded. However, it does identify you, so the service which is being called does know who it is; it does okay on that final criterion. The other thing I've seen people do is use publicly trusted certificates to back mutual TLS within their infrastructure. They get up and running with Kubernetes and realize that Let's Encrypt will give them certificates.
They use Let's Encrypt to get certificates for the different services in their infrastructure, and they use them to open up mutual TLS connections. The problem here is that it's not very easy to make that short-lived: Let's Encrypt won't give you a very short-lived certificate; they'll give you a fixed-term certificate, which is usually three months.
It does do well on replayability: it's a mutual TLS connection, and the private key never leaves the workload.
So it's not possible to replay the credentials, and they're not susceptible to a man-in-the-middle attack. However, the scope is unbounded.
Again, it's a publicly trusted certificate, and it's not possible to control exactly where that certificate is valid; it's valid everywhere, by some definition of valid, because it's publicly trusted. The identity is also known by the service which is being invoked. Service meshes, next, get us most of the way there: they can rotate short-lived credentials for us.
They're not replayable, and the scope is carefully bounded for each of the service meshes that we're using, but it's common in the service mesh world and mentality for the workload identity not to be presented to the wrapped service.
You can configure a service mesh to forward identity like that, but I think the mentality is sometimes more of a problem than the technology: people try to offload the whole identity problem to the mesh, without making the application responsible for knowing who it's responding to and what data it's sending back. I think that sometimes creates problems. It's actually useful to know what the calling identity is, so you can process data appropriately.
So what's the method which scores well on all factors? I would like to propose that people consider using SPIFFE-flavored mTLS. So what's SPIFFE? SPIFFE is the Secure Production Identity Framework For Everyone. It's short-lived. It's not replayable. You can bound the scope of SPIFFE identities using trust domains.
And if you are terminating your mutual TLS connections at your applications, the identity is also known by the workload which is being invoked.
There are two flavors of SPIFFE; we're mostly interested in the X.509 flavor today, where two workloads talking to each other have used SPIFFE certificates, the SPIFFE Verifiable Identity Documents (SVIDs), to establish a mutual TLS connection between each other, where the identity is known on both sides. Those certificates look a little bit like this. It's hard to fit an X.509 certificate in full on a slide, but the crucial part is this Subject Alternative Name (SAN), where we have a URI representing the workload identity.
So the URI scheme is spiffe, the host part of the URI is the trust domain, and then the path represents the workload uniquely.
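That structure can be sketched in a few lines of code. This is a minimal illustration only, with a hypothetical trust domain and workload path; real SPIFFE libraries such as go-spiffe apply stricter validation rules from the SPIFFE specification:

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID URI into its trust domain and workload path.

    A sketch only: the real specification imposes additional rules
    (no port, no userinfo, path length limits, and so on).
    """
    uri = urlparse(spiffe_id)
    if uri.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID: scheme must be 'spiffe'")
    if not uri.netloc:
        raise ValueError("SPIFFE ID must have a trust domain")
    return uri.netloc, uri.path

# A hypothetical ID in the style of the train example used later:
trust_domain, path = parse_spiffe_id("spiffe://example.org/station-1/reservation")
print(trust_domain)  # example.org
print(path)          # /station-1/reservation
```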
So let's imagine a world where all of our services have SPIFFE IDs and they're talking to each other, and we're interested in controlling how they talk to each other: some requests should be allowed, some should be denied. Today, we're interested in those decisions. So how can we do that? What could we use to control and authorize requests as they're being made between our different workloads?
Well, that's the other part of the talk: the Open Policy Agent, the project that I work on. I'd like to propose that you consider using OPA for such decisions, so let's see what that looks like. Since very few people had heard of OPA before: OPA is a general-purpose policy engine. It's open source, it's a CNCF graduated project, and at the heart of OPA is a language called Rego.
It's a domain specific policy language specifically created to make it easy to write policy for the kinds of decisions that you might be making. This is an example policy.
It's about the shortest policy I could come up with: it allows users if their role is admin. The way it works is that the policy is evaluated in the context of some data; there's some example data below. A common way to invoke OPA is over its REST API: you send it some input, the policy at the given path is evaluated, and the response is returned. So it's an easy, general-purpose tool to integrate your workloads with when you need to make policy decisions.
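As a sketch of that request/response shape: OPA's Data API takes a JSON body wrapped in `{"input": ...}` and returns `{"result": ...}`. The policy path `example/allow` and the input fields here are illustrative, and the admin rule is mimicked in Python rather than Rego:

```python
import json

def evaluate_allow(input_doc: dict) -> bool:
    # Mimics the slide's Rego rule in Python: allow if role is "admin".
    # Field names are assumptions for illustration.
    return input_doc.get("role") == "admin"

# What a POST to OPA's Data API might carry, e.g.
#   POST /v1/data/example/allow
request_body = json.dumps({"input": {"user": "alice", "role": "admin"}})

payload = json.loads(request_body)
response = {"result": evaluate_allow(payload["input"])}
print(response)  # {'result': True}
```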
It's also a way to standardize the evaluation of policy across a whole number of different workloads, and to distribute policy to your OPAs out of band of your other deployments, if you need to quickly make policy changes, or changes to data that's needed at the time of policy evaluation.
So if we have SPIFFE and we have OPA running, what would that look like? What sorts of things could we actually enforce?
I tried to put together a little example, and in the longer version of this talk I have a demo where I showed how this works. The idea is that we have a train company headquarters where the bookings are made; customers are booking tickets there. Then there's the local station, and the train driver is sitting in the train, wirelessly communicating with a local service at the station. It's a toy example, so bear with me. It will get more technical, but that's the example domain we've created.
In a little more detail, it looks like this: the train driver is sitting in the train, and they have a web client communicating with the service at the station.
The service at the station is called the reservation service, and its job is to provide the list of seats that are booked for the train that the driver is about to drive. So the driver makes a request and says: I'm this driver, I'm running this train, what are the reservations? Which seats are reserved on my train?
The reservation service at the local station then consults OPA to decide whether or not the request made by the train driver is allowed. If it's allowed, it makes a request through to a booking service at the company headquarters, which contains all of the booking information.
Again, the request is validated using OPA, and we also have some options which are set on the request, again using a different OPA policy. The data is then returned back to the reservation service and eventually back to the driver.
So what's beneficial about this setup? What can we do with SPIFFE and OPA? One thing we can do is share some of the logic and run the same validations at different points if we need to.
If something weren't working correctly, or the integration with OPA wasn't running in one of the services for some reason, we could still validate things in different places. The original request looks a bit like this, sorry if the text is a bit small: it's a GET request to the reservations endpoint with the train and driver IDs. The reservation service then provides the train and driver to the local OPA on the station cluster.
And it does some simple validation, just to check that they have actually supplied values for both fields.
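That simple check can be sketched like this. The field names and the deny-reasons style are assumptions for illustration, mirroring a common Rego pattern of collecting deny messages:

```python
def validate(input_doc: dict) -> list[str]:
    """Return deny reasons if train or driver is blank.

    A sketch of the simple validation described in the talk;
    field names are hypothetical.
    """
    denies = []
    if not input_doc.get("train"):
        denies.append("train must not be blank")
    if not input_doc.get("driver"):
        denies.append("driver must not be blank")
    return denies

print(validate({"train": "1C70", "driver": "d-123"}))  # []
print(validate({"train": "", "driver": "d-123"}))      # ['train must not be blank']
```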
But then we can do the same validation again when the reservation service forwards the request to the booking service to get the actual list of bookings.
We can check again that the train and driver are not blank, but at the booking service we can also have some additional logic. The booking service has access to data about which driver is driving each service, and it can use that data to enforce some additional rules.
So the final rule we have here states that the request should be denied if the driver recorded for that service is different from the driver who's requesting it. The idea is that drivers can only request reservations for their own services.
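The rule above can be sketched as follows. The roster data, service names, and driver IDs are invented for illustration; in the real setup this data would live alongside the policy in OPA:

```python
# Assumed data: which driver is rostered to drive each train service.
ROSTER = {"1C70": "d-123", "2A01": "d-456"}

def deny_reasons(input_doc: dict) -> list[str]:
    """Deny the request if the requesting driver is not the driver
    recorded for the train service. A sketch, not the talk's actual policy."""
    reasons = []
    rostered = ROSTER.get(input_doc.get("train"))
    if rostered != input_doc.get("driver"):
        reasons.append("driver may only request reservations for their own service")
    return reasons

print(deny_reasons({"train": "1C70", "driver": "d-123"}))  # []
```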
But crucially, we haven't talked about SPIFFE IDs yet.
We can also do what I'm calling identity-based configuration. The booking service, for example, is called by lots of different services in our infrastructure: it's called from the different local stations, but it might be called from elsewhere in our estate as well. In this example, we're providing, from the booking service back to the reservation service and eventually back to the driver, a list of customer bookings, customer reservations.
What we can have the booking service do is call out to its local OPA and say: I'm being called by this service, what options should I apply to the query, or to the data being sent back? So in this example, the booking service is asking OPA how it should format the response, and when the station one reservation service is calling the booking service, we want to make sure that show email is set to false. We don't want to show the customer email to the driver.
They only need to know whether or not the seat is booked. That's the idea.
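That identity-based configuration can be sketched as a mapping from caller SPIFFE ID to response options. The SPIFFE IDs, option names, and booking fields here are all invented for illustration:

```python
# Assumed mapping from caller SPIFFE ID to response options: the kind of
# per-caller configuration the booking service could fetch from OPA.
OPTIONS_BY_CALLER = {
    "spiffe://example.org/station-1/reservation": {"show_email": False},
    "spiffe://example.org/hq/reporting": {"show_email": True},
}

def options_for(caller_spiffe_id: str) -> dict:
    # Default to the most restrictive behaviour for unknown callers.
    return OPTIONS_BY_CALLER.get(caller_spiffe_id, {"show_email": False})

def format_booking(booking: dict, options: dict) -> dict:
    """Apply the options to a booking record before returning it."""
    out = dict(booking)
    if not options["show_email"]:
        out.pop("email", None)
    return out

opts = options_for("spiffe://example.org/station-1/reservation")
print(format_booking({"seat": "12A", "email": "c@example.com"}, opts))
# {'seat': '12A'}
```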
The final thing I wanted to highlight is that we can use SPIFFE IDs everywhere; I'm trying to encourage people to use SPIFFE IDs everywhere as a standard for workload identity. And finally, it's actually possible to authorize workloads as they are calling and invoking policy decisions within OPA itself. That's these two circled channels of communication between the workloads and the local OPA.
We can configure OPA to say: I want to only allow authenticated clients when they're making a call to OPA. What that looks like is effectively an access control list, where a given SPIFFE ID is only allowed to call particular policy endpoints to make particular policy decisions. It might be that some policies expose more information, or potentially confidential information.
So you often want to control who can evaluate which policies. Here we just check that the reservation service's SPIFFE ID matches the policy path which it is actually requesting.
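The access control list described above can be sketched like this. The SPIFFE IDs and policy paths are hypothetical, and this is only the shape of the check, not OPA's actual authorization policy:

```python
# Assumed access-control list: which SPIFFE IDs may evaluate which
# OPA policy paths.
ACL = {
    "spiffe://example.org/station-1/reservation": {"station1/reservations/allow"},
    "spiffe://example.org/hq/booking": {"hq/bookings/allow", "hq/bookings/options"},
}

def caller_allowed(spiffe_id: str, policy_path: str) -> bool:
    """True if the caller's SPIFFE ID may evaluate the given policy path;
    unknown callers are denied."""
    return policy_path in ACL.get(spiffe_id, set())

print(caller_allowed("spiffe://example.org/station-1/reservation",
                     "station1/reservations/allow"))  # True
```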
As part of preparing for this presentation, I added support for authorizing callers to OPA using SPIFFE IDs. There are a few other things that we'd like to add to the project; I'm getting quite close on time, so I'm not going to go into them in any detail, but they're up there and you can ask me about them afterwards.
This is the final slide, where I've got various links to things I've made reference to. The QR code at the bottom will take you to the page I have for this talk, with all of the links: the links to the code, the slides, and also a recording of this talk where I give a demo, if you're interested in that.
So yeah, thanks for listening to a rather whirlwind tour of SPIFFE and OPA. I don't know if anybody has any questions; I'd be interested to hear them.
Yeah, we can take some questions. Yep.
Thanks. It's great to hear that we are finally communicating about SPIFFE. I do think, though, that we need to differentiate between when we want to use SPIFFE and when we don't. The train driver example, you could argue, maybe doesn't even need that level of security. Maybe it does; I kind of feel maybe it does, it's on that side. But there's a lot of software workloads that don't. So could you give us an opinion: if I'm just downloading a software workload as a Kubernetes container, do I really need SPIFFE, and under what circumstances?
Yes, because there's an overhead to it.
Yeah, that's true, there is an overhead to it, and I haven't gone into any detail about the provisioning of SPIFFE and the different ways you might do that. Some of them are easier than others, but none of them are trivial. Interestingly, when you book an Uber, your Uber app actually uses a SPIFFE ID to communicate back to the backend. So you're right, it's important to think about where SPIFFE IDs are appropriate, and whether it's appropriate to give SPIFFE IDs to users.
I think it's more appropriate to give them to devices, and to use SPIFFE to manage the communication channels, or which user information is distributed. But yeah, you're right, it's up to you, and the challenge of getting a SPIFFE ID to an edge device might be quite difficult.
But start with it where it's easy for you to do it. It's easy in a Kubernetes environment; there are lots of good options there. And if you want to standardize on it, I suppose take it case by case, I would say.
Yep. Maybe you could give a quick answer to this question: how are the certificates renewed in SPIFFE?
Sorry, could you repeat it?
How are the certificates renewed in SPIFFE?

How are they renewed? Yeah, yes. So again, I've kind of skipped over how SPIFFE certificates, or SPIFFE Verifiable Identity Documents, are provisioned in the stack. You can do this in different ways. The most common way that people actually use SPIFFE is via the Istio service mesh, which is based on SPIFFE IDs, even though by default it doesn't expose them to workloads, which is why I gave it a black mark.
But they're renewed either by the service mesh, or by SPIRE, which is a sort of reference implementation of SPIFFE and SPIFFE workload identity. You can also use the cert-manager CSI driver if you are in a Kubernetes environment.
So those are the different ways I'm aware of to provision SPIFFE identities at the moment, and the renewals are handled by those tools, by those components.

Okay, thank you, Charlie. Thank you.