All right. Good morning, good afternoon, good evening, wherever you are in the world. I am Alyssa Keber, and today I'll be presenting with my colleague John. We'll be sharing lessons learned from our first year of building our own identity as a service (IDaaS). So you may be wondering: why would anyone want to build their own IDaaS?
Well, for us, it was a business divestiture, and the ensuing disentanglement of resources meant we had the opportunity to rebuild and modernize the entire enterprise identity stack. We were responsible for SSO and MFA.
Specifically, there are three pillars to a perfect IDaaS: strong federation support, meaning industry standards and SCIM capabilities; automation and low-effort operational support; and developer support for the last mile, thinking application integrations, development guides, and ongoing use of the service. And all of these operate under an unspoken fourth pillar, which is total cost of operation.
Let's review our objectives with this new SSO and MFA program, and why these things led to our decision to build our own IDaaS rather than buy into one.
Our first goal was to improve the user experience for our end users and developers. The old systems were a mix of aging and proprietary protocols with differing logon names, ceremonies, and experiences. While these are concerning, they don't necessarily dictate a need to build your own system quite yet. Adding onto the user experience story, and touching on one of the major technical challenges of most IDaaS offerings and of this project, we wanted to provide a single global IdP for our global workforce that would be performant for everyone regardless of their geographic location.
We want a session originating from India to be respected in New York, and it turns out this isn't done very often amongst turnkey IDaaS providers. You also surrender a good amount of flexibility by going IDaaS. Every situation is a bit different, but we often have to engineer our way out of business challenges on short notice, and I can't think of a feature request pipeline for any IDaaS vendor that would be as responsive. I think we could all agree that no one really enjoys being on late-night operational calls.
We did not have visibility into what the long-term support strategy was for these new services we were building, so we had to assume we would be the ultimate endpoint for support. That was the driver for us to get creative with a robust self-healing design: we wanted nodes to come back up on their own if they hang, to truly treat our instances as cattle, and so on, to keep the operational overhead at a minimum if those hours were ours to bear.
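To make that self-healing idea concrete, here's a minimal watchdog sketch in Python. The health endpoint, port, interval, and service unit name are illustrative assumptions on my part, not details from our production deployment:

```python
"""Minimal sketch of a self-healing watchdog: if the node hangs,
restart it rather than paging a human. Endpoint, thresholds, and
the systemd unit name are all hypothetical."""
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical endpoint
CHECK_INTERVAL = 60          # seconds between probes
FAILURES_BEFORE_RESTART = 3  # tolerate transient blips

failures = 0
while True:
    try:
        # urlopen raises on connection errors and non-2xx responses
        with urllib.request.urlopen(HEALTH_URL, timeout=10):
            failures = 0
    except Exception:
        failures += 1
    if failures >= FAILURES_BEFORE_RESTART:
        # Treat the instance as cattle: restart the service instead
        # of waking someone up for a late-night operational call.
        subprocess.run(["systemctl", "restart", "idp-container.service"])
        failures = 0
    time.sleep(CHECK_INTERVAL)
```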
For all the technical and strategic reasons I've outlined for building our own IDaaS, I come back to cost often.
This is not to say that the organization where we did this work was singularly driven by cost; we actually received plenty of executive and financial support for our initiatives. But cost consideration is important for another reason: if the business units, application teams, developers, and so on did not migrate to the new services, then the organization would be hamstrung by the cost of supporting infrastructure for both the new and legacy systems. If the old stuff isn't forcibly retired, you will never shed it. And that isn't even getting into the per-head cost and total cost of ownership of each of these solutions, where we discovered building our own IDaaS brought some significant savings. The real cost driver for us was that any money we didn't spend buying and running our new solution was funding we could make available to reinvest into the business, to accelerate the migration out of the legacy environments. Driving adoption of the new, cheaper, more secure systems just compounded the savings.
So here's what we came up with.
Our own IDaaS-like platform, made from commercial off-the-shelf software, containerized and retrofitted to run in a major cloud provider. This gets us all the things we wanted out of an IDaaS provider while letting us retain the customizability and flexibility of on-prem commercial off-the-shelf software. We get to take advantage of all the cloud amenities this way, like auto-scaling in and out. The service runs as a global cluster, with regional sub-clusters servicing requests in three different regions. Users are automatically routed to the cluster in the region closest to them for optimal performance. Each region can automatically fail over to the other regions for redundant high availability and disaster recovery, as well as auto-restore itself to working order to resume servicing traffic if a region does go down.
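Here's a hedged sketch of how that nearest-region routing can be wired up with latency-based DNS records, shown with boto3 and Route 53. The zone ID, hostnames, and regional targets are illustrative placeholders, not our real values:

```python
"""Sketch: one latency-based record per regional sub-cluster, so
clients resolve login.example.com to the nearest region."""
import boto3

route53 = boto3.client("route53")

REGIONS = {
    "us-east-1": "idp-useast.example.internal",
    "eu-west-1": "idp-euwest.example.internal",
    "ap-south-1": "idp-apsouth.example.internal",
}

for region, target in REGIONS.items():
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",  # placeholder zone ID
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "login.example.com",
                "Type": "CNAME",
                "SetIdentifier": region,  # one record set per region
                "Region": region,         # latency-based routing key
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )
```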
So one of the challenges we had with trying to shim commercial off-the-shelf software into an IDaaS-like arrangement was maintaining all the app integrations, client configs, and secrets while not storing them in code, because you don't want to store your secrets in code. So one of the first challenges we had to solve was finding a way to keep our containers remembering who they are upon each instantiation. What we came up with was this: as our cloud platform stands up each new container, it launches, reaches into an encrypted bucket to find the latest configuration that we store there, bootstraps that configuration on launch, and then joins the cluster. At intervals, it exports any changes to that configuration back into the encrypted bucket, so we're never more than about half an hour away from the latest changes to our configuration.
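Here's a minimal sketch of that bootstrap-and-export loop, assuming an S3 bucket with server-side encryption. The bucket name, object key, local paths, and the exact interval are illustrative:

```python
"""Sketch of the config bootstrap/export loop each container runs."""
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "idp-config-encrypted"   # hypothetical encrypted bucket
KEY = "latest/config.tar.gz"      # hypothetical object key
LOCAL = "/opt/idp/config.tar.gz"
EXPORT_INTERVAL = 30 * 60         # roughly every half hour

# 1. On launch: pull the latest configuration and bootstrap with it,
#    so the fresh container "remembers who it is" before joining
#    the cluster.
s3.download_file(BUCKET, KEY, LOCAL)
# ...unpack the config, start the COTS software, join the cluster...

# 2. Forever after: export any configuration changes back to the
#    bucket so the next container to launch picks them up.
while True:
    time.sleep(EXPORT_INTERVAL)
    # ...dump the running configuration to LOCAL...
    s3.upload_file(LOCAL, BUCKET, KEY,
                   ExtraArgs={"ServerSideEncryption": "aws:kms"})
```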
Another issue we had to figure out was making sure the randomized DNS names assigned to freshly provisioned resources didn't break expected routes for traffic. You can't rely on IPs in the cloud, at least not in a cost-effective manner, and the DNS config for data stores in commercial off-the-shelf software isn't capable of dynamically refreshing to reflect changes in your cloud environment. So to solve our DNS issues, we used even more DNS: by using the cloud DNS service to issue DNS aliases, and using those aliases inside of the commercial off-the-shelf software for its data stores, we were able to shim and hide the constantly changing DNS names for the resources that would get updated with our cloud modifications.
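A hedged sketch of that stable-alias trick: keep a fixed internal name, which is what the COTS software is configured with, pointed at whatever randomized endpoint the cloud hands you this time. The zone ID and hostnames are hypothetical:

```python
"""Sketch: re-point a stable CNAME at the newly provisioned
resource, so the COTS data-store config never has to change."""
import boto3

route53 = boto3.client("route53")

STABLE_NAME = "userstore.idp.internal."  # what the COTS config references

def point_alias_at(new_endpoint: str) -> None:
    """Called by provisioning automation after a resource is rebuilt."""
    route53.change_resource_record_sets(
        HostedZoneId="Z_PRIVATE_EXAMPLE",  # private zone placeholder
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": STABLE_NAME,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": new_endpoint}],
            },
        }]},
    )

# e.g. with the randomized name the cloud just assigned:
point_alias_at("ec2-12-34-56-78.compute-1.amazonaws.com")
```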
So one of the big challenges that took a lot of time and thought to deliver was the user experience around synchronizing the password we wanted to bring forward into the new SSO system from the old systems of identity. We cared about the user experience for rolling this out, and the users already had too many accounts. We also had a bit of risk here from a technical perspective: we couldn't introduce a password history discrepancy across these new and old data stores. So we had to be very deliberate in terms of how we introduced the password history, and the users, to this new password store.
So we came up with this very simple process (a hedged code sketch of steps five through seven follows the list):

1. Passwords were bidirectionally synced between the new and legacy Active Directories.
2. The governance platform would provision the account into the SSO user store with a randomized password string. This was the first entry for password history in that store, and because of the randomness of that string it was statistically impossible for a user to cause a password-history collision. The user never knew that password.
3. We deployed a popular directory sync tool that conditionally synced Active Directory passwords into the SSO store. The condition was a trigger on the password-change-time attribute, and that attribute is only set the first time the user sets their password here; that's how we controlled how and when that value got tripped.
4. We built a simple webpage that we used for user onboarding.
5. That page authenticated against the legacy AD.
6. If the credentials were valid, we would capture them and write them to the SSO directory.
7. This tripped the password-change-time attribute, bringing that account under scope of our synchronization tool. So now any change to the password in Active Directory or SSO would sync to the other directory.
8. Finally, we were beginning to make good on our promise of consolidating the multiple accounts to bring people a single identity for their user experience.
9. They authenticated and were introduced to the inline registration for MFA. They enrolled their token, completed MFA authentication, and could then go to our new password manager.
10. They completed the password manager profile.
11. That password manager tool is how they then controlled their new password inside of this platform.
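Here's that hedged sketch of steps five through seven: validate the user's legacy AD credentials, then write the captured password into the SSO directory, which trips the password-change-time attribute and brings the account under the sync tool's scope. The server names, DNs, and attribute behavior are illustrative assumptions, not our exact implementation:

```python
"""Sketch of the onboarding page's backend logic using ldap3."""
from ldap3 import Server, Connection, MODIFY_REPLACE

LEGACY_AD = Server("ldaps://legacy-ad.example.com")
SSO_STORE = Server("ldaps://sso-store.example.com")

def onboard(username: str, password: str) -> bool:
    # Steps 5-6: authenticate against the legacy AD with a simple bind.
    legacy = Connection(LEGACY_AD,
                        user=f"{username}@legacy.example.com",
                        password=password)
    if not legacy.bind():
        return False  # invalid credentials, nothing to capture
    legacy.unbind()

    # Steps 6-7: write the captured password into the SSO directory
    # with a privileged service account. The first password write also
    # sets the password-change-time attribute, which is what brings the
    # account under the sync tool's scope from then on.
    admin = Connection(SSO_STORE,
                       user="cn=svc-onboard,ou=svc,dc=example,dc=com",
                       password="...")  # from a secrets store, never code
    admin.bind()
    dn = f"uid={username},ou=people,dc=example,dc=com"
    admin.modify(dn, {"userPassword": [(MODIFY_REPLACE, [password])]})
    admin.unbind()
    return True
```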
So solving some of the deployment and environment problems was tricky, but how we set up our codebase facilitated rapid iteration for solutioning. Our solution is based on a monolithic enterprise app, and we did our best to follow the practices outlined in the twelve-factor app, as written at 12factor.net. That is, we maintain one codebase in revision control that facilitates multiple deployments. So that single codebase is responsible for four different environments, three different regions, and three different container roles. From that single codebase, using parameterization and conditions, we're able to create the 36 different versions of the container we need to produce, just by flipping different switches or feeding different variables into the build process. As an example: as we build each container, we have a whole bunch of variables that we feed in to decide what the container will be. On the other side, we have various parameters at the top of our big YAML file, and that's the only place we reference them; everything else deeper in the YAML file is just a reference to those parameters up top. This makes it very easy to maintain. Conditions and resources determine how and when they're deployed, and combined with the mappings, this makes it very easy to introduce new engineers to the process and maintain the codebase. A sketch of that build fan-out follows.
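To illustrate, here's a hedged sketch of the four-environments by three-regions by three-roles fan-out. The environment, region, and role names and the registry are illustrative; the real switches live as parameters at the top of the deployment YAML:

```python
"""Sketch: one codebase, 4 x 3 x 3 = 36 container variants, each
chosen entirely by the build arguments fed in."""
import itertools
import subprocess

ENVIRONMENTS = ["dev", "qa", "preprod", "prod"]
REGIONS = ["us-east-1", "eu-west-1", "ap-south-1"]
ROLES = ["idp", "datastore", "admin"]  # hypothetical container roles

for env, region, role in itertools.product(ENVIRONMENTS, REGIONS, ROLES):
    tag = f"registry.example.com/idp:{env}-{region}-{role}"
    # Same Dockerfile every time; only the switches differ.
    subprocess.run([
        "docker", "build",
        "--build-arg", f"ENVIRONMENT={env}",
        "--build-arg", f"REGION={region}",
        "--build-arg", f"ROLE={role}",
        "-t", tag, ".",
    ], check=True)
```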
All right. So one of the risks of using commercial off-the-shelf software is the difficulty of upgrading it. Fortunately, the codebase parameterization facilitates the upgrades and testing. Most upgrades are a matter of just editing the handful of files that we modify for each version upgrade and running through our deployment process. If our containers work in the lower environments, then we're confident they will work in the upper environments, because we've learned that having a dev, a QA, and a prod isn't enough; you need a prod copy, a pre-prod. Any code that hasn't been vetted inside of something that is identical to prod means you're releasing code untested into prod. So you really need four environments to do this right. We also learned that some things aren't really worth the hassle.
One of these was CloudFormation; it's just difficult to manage in an automated fashion, and I can go into that later if you have questions. Something else that didn't get us as much value compared to the effort it would've taken to implement was multi-master data stores. You only lose a few milliseconds of performance when authenticating from different regions; I think AWS did a good job on its backbone links to make up the performance gap. If performance above all is something that is necessary for your organization, sure, dive into multi-master. But for us, 40 milliseconds wasn't causing us to lose any sleep.
Finally, our geo-routing to try to get everybody to go to the nodes closest to them sounded good on paper, but we failed to validate our assumptions about how our actual corporate network was designed. Turns out all of our corporate network routes through North America, so everybody went to the North American nodes anyway.
Fortunately, as that network gets replaced, we're starting to realize the value of our initial design. But that's a lesson: always validate your assumptions. Something else that has been a surprise is the culture around change control. You'd think that as you build infrastructure as code with the capability of rapidly iterating, the governance model would keep pace. I think there are opportunities to have a discussion with corporate governance about what constitutes proper change governance versus the traditional CAB boards and two-week cool-off periods.
All right. So let's discuss the impact of building our own IDaaS, and an important lesson on successful service adoption: keeping the enrollment process in line with the same logon ceremony that users would use with the service each day streamlined adoption. This had a measurably major impact for the MFA enrollment component as well. Adoption was dramatically simplified compared to the last MFA initiative the org undertook: adoption happened six times faster, and the user experience was nearly universally praised. Adoption by application developers was great as well. Sticking to standards-based protocols with reference implementations made life easy for developers, and self-service app integration reduced the time barrier for integration compared to the service we replaced. You would think this would take a lot of manpower to run.
And though more staff would let us iterate faster, we are able to support this platform with a nominal staff. We prioritize our feature requests through frequent customer touchpoints and open office hours, so we can be responsive to the voice of our internal customers. Application integrations continue to grow across dev, QA, and prod, and the vast majority are performed with no intervention from operational staff.
Of course, it helps to frame these things in the language the business understands. We did our due diligence and vetted various platforms prior to building our service, so we have figures on what a comparable turnkey IDaaS would've cost. We accomplished all of this at a price point that is $2.2 million less per year than commercial enterprise IDaaS. "Oh, but Alyssa," you say, "nobody pays the retail price." And you're right. We still come in $700,000 less per year in run rate compared to the best-and-final rate for a popular commercial IDaaS.
You can retain the flexibility of on-prem commercial off-the-shelf software, get the benefits of IDaaS, and accomplish it all for less. And we've open-sourced it, so you can try it for yourself. Our hope is others will take a look, benefit from similar models, and ideally contribute some improvements back. This model has been running in production for nearly 18 months, and we are aware of at least three other large organizations who have begun to build similar deployments based on our experiences. I think it's safe to say that this model really does work. We appreciate your time today, and we can open it up for questions.