All right. Creating human readable activity summaries from millions of logs using large language models.
So, hi, I'm Kevin Roundy.
And I'm Lavanya, and we are researchers at Andromeda Security. So here's our agenda.
Alright, why summarize? So we spoke to a large bank about how they analyze their CloudTrail logs and other cloud provider logs. Cloud provider logs, like CloudTrail logs, GCP logs, and Azure logs, have a lot of valuable data about what an identity does in the cloud. Now imagine a team of, let's say, 20 analysts trying to figure out interesting things from CloudTrail data. Picture the fate of a person who has to grab a coffee every single morning and look at millions of logs to find what went wrong.
Wouldn't it be nice if we could replace this with a team of two to three analysts who could look at summaries generated by AI to detect anomalies and interesting things that happened in the cloud user activity? That's exactly what we wanna do. And you might say, now that ChatGPT is here, all we need to do is grab all these logs, throw them at ChatGPT, and it should give us the summary. So we actually did a small experiment to check just this.
Let's see.
So we created a small workflow where the user creates a CloudFormation stack, creates a DynamoDB table, deletes the DynamoDB table, and deletes the CloudFormation stack. We collected the CloudTrail logs on AWS for this workflow, put them in a file, gave it to ChatGPT, and asked it to summarize.
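To make the experiment concrete, here is a minimal sketch of what that test workflow and log collection might look like, assuming boto3 and configured AWS credentials; the stack and table names and the template file are placeholders, not the exact setup from the experiment.

```python
# Minimal sketch of the test workflow (illustrative names; assumes boto3
# and AWS credentials are configured).
import boto3

cfn = boto3.client("cloudformation")
ddb = boto3.client("dynamodb")
trail = boto3.client("cloudtrail")

template_body = open("demo-template.yaml").read()  # hypothetical stack template

# 1. Create a CloudFormation stack.
cfn.create_stack(StackName="summary-demo-stack", TemplateBody=template_body)

# 2. Create, then delete, a DynamoDB table.
ddb.create_table(
    TableName="summary-demo-table",
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
ddb.get_waiter("table_exists").wait(TableName="summary-demo-table")
ddb.delete_table(TableName="summary-demo-table")

# 3. Delete the CloudFormation stack.
cfn.delete_stack(StackName="summary-demo-stack")

# 4. Later (CloudTrail delivers events with a delay), pull the session's
#    events and dump them to the file that gets handed to the LLM.
events = trail.lookup_events(MaxResults=50)["Events"]
```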
And this is what we got. Now, is this a good summary? We would say no, because there is no mention of DynamoDB in this summary. But what do we want in a summary? The least that we would want in this case is the fact that the user worked with a CloudFormation stack and DynamoDB. What else do we want in a summary? Here's an example of a summary that's generated by a model. Security admins care about risk and exposure.
So when you are generating a summary for security admins, it would make sense to characterize the various actions in the session based on their risk profile.
It would be nice to know what went wrong in the session, and of course it would be nice to get a high-level view of what happened in the session and a summary of all the services used, and so on. Alright, so we all know that LLMs are good at summarizing, because we've heard that LLMs are great at NLP tasks such as summarization. Now the question is, how do we get LLMs to produce summaries that are useful for a security admin?
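To make "useful for a security admin" a bit more concrete, a target summary might be structured roughly like this; this is a sketch of our own, and the field names are assumptions rather than any standard schema.

```python
# Hypothetical shape of the summary we want the LLM to produce
# (field names are illustrative, not a standard).
from dataclasses import dataclass, field
from typing import List

@dataclass
class SessionSummary:
    overview: str                    # high-level, human-readable description of intent
    services_used: List[str]         # e.g. ["cloudformation", "dynamodb"]
    high_risk_actions: List[str]     # actions flagged by their risk profile
    errors: List[str]                # what went wrong in the session
    unusual_activity: List[str] = field(default_factory=list)  # anomalies vs. the baseline
```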
All right, so there are a lot of challenges in getting these LLMs to produce the types of summaries we want.
Fortunately, there is a path forward, but we really need to understand the challenges if we're gonna get the LLMs to give us what we need. So the first challenge is that we have megabytes and megabytes of logs. Imagine a scroll of hundreds of pages, and you're trying to reduce that down to something that will fit on a single screen. As you go through this, you're just trying to squeeze a lot of information into a small space. Fortunately, this is possible, because most of the information in the logs is not very interesting.
So we're looking at situations where we have 260 services in AWS, for example, and more than 17,000 possible actions, many of which are repeated over and over again.
There are also blobs of JSON data giving you extra metadata. And so the real challenge is, how do I focus in on what really matters here so that I can surface it? If you were going through this manually, you'd be saying, this is boring, boring, ah, okay, this is interesting, this is interesting, et cetera. So this is kind of what we want to be able to do.
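As a sketch of that idea, one simple way to do the "boring versus interesting" pass is to collapse the raw CloudTrail records into per-action counts and keep the errors; the read-only prefixes used as a "boring" heuristic below are just an illustration, not the actual rules.

```python
# Rough sketch: collapse raw CloudTrail records into per-(service, action)
# counts plus errors, so only the distinct behavior reaches the prompt.
from collections import Counter

BORING_PREFIXES = ("Describe", "List", "Get")  # illustrative heuristic

def condense(cloudtrail_records):
    counts = Counter()
    errors = []
    for rec in cloudtrail_records:
        service = rec["eventSource"].split(".")[0]  # "dynamodb.amazonaws.com" -> "dynamodb"
        action = rec["eventName"]                   # e.g. "DeleteTable"
        counts[(service, action)] += 1
        if "errorCode" in rec:                      # CloudTrail sets this field on failed calls
            errors.append(f"{service}:{action} -> {rec['errorCode']}")
    interesting = {
        f"{svc}:{act}": n
        for (svc, act), n in counts.items()
        if not act.startswith(BORING_PREFIXES)
    }
    return {"action_counts": interesting, "errors": errors}
```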
The challenge is that LLMs don't really understand cloud provider logs, and they don't know what to look for. So if you step back a little bit and think about why LLMs have been so successful, I would argue there are maybe two key things. One is you've got these really fancy model architectures that are capable of retaining high-level concepts and information, but most importantly, as mentioned in the previous talk, you need to feed in good information first, right?
So if you look at the Common Crawl dataset, it's 8 billion web pages with trillions of words in them.
And this is being fed into the model, and it covers a vast range of different types of information collected from so many different sources. Unfortunately, when it comes to cloud provider logs and summaries of those logs, you kind of draw a blank. When you go online, there are no really good datasets of cloud provider logs, and even less so summaries that a security administrator would wanna see that would tell them what matters inside of those cloud provider logs. And so this is why LLMs struggle so badly out of the box to give good results.
And if you think of all the different types of events that they can see, these are not all created equal. Some of these events, such as those that have to do with cryptography, are very important from a security perspective.
Others are not very important from a security perspective. So we need to help the LLM understand what it is that we're feeding into it. That's our second challenge. The third challenge is that, as detailed and as voluminous as those cloud provider logs are, they don't tell us a lot about the identities that are producing the events.
We don't have any information from the HR management system about the individuals and the humans. We don't know much about the patterns of the organization. Do they all log in from the office on a certain network when they access the cloud, or is everyone working from home? All this sort of context is missing. So if we can bring in extra context about the identities, we can give the LLM a much better chance of telling us interesting things about the identities that are producing this behavior.
Finally, it's just expensive to use large language models. The more text you throw at them, which is exactly what we're trying to do in this case, the more expensive they are. And there are a lot of different price ranges for the use of these large language models. You can't just solve the problem by going for the cheap ones, because you do get inferior results.
So you wanna use high-quality models, but you wanna be thoughtful about how you're gonna present the data to them so that you're not spending a fortune every time you wanna summarize logs. Fortunately, the large language model providers have recently extended the length of the input that they'll accept, up to about a million characters. But still, we're talking about multiple megabytes of logs, which won't fit even into those larger context windows.
So you've got some challenges there as well.
Okay, so with these challenges in mind, how are we gonna provide our solution? We need to keep those challenges in mind and also keep in mind what it is that we want the summary to do. The summary is designed to help a security administrator answer certain key questions about what users did when they were using the cloud. In particular, they want to know things like: were there any errors in the session? Was there any unusual activity in the session? Were there any high-risk operations?
What kinds of services were used? And just give me a good, high-level, human-understandable overview of what they were trying to do. So with this in mind, we can now start talking about how we approach the problem, and Lavanya is gonna talk to us about that.
Yeah, so the first thing we need to do when we are trying to solve the problem is to figure out what we want in a good summary. And we just saw that in this particular case of trying to come up with security summaries. Once we have that, like Kevin mentioned earlier, cloud provider logs do not have all the context that we need to generate a good security summary. So we aggregate data from many sources. There are the CloudTrail logs that we talked about.
Then there's cloud inventory data that has a relationship graph between users, policies, and resources, and then the HRMS data, like Okta logs, and then custom curated expert insights as well. And also other context information from external apps, such as Jira tickets, for instance, that could have information about what happened in the session. So the idea is to integrate all this data and put it into the context window of a prompt.
So we wanna put it as a part of the prompt by condensing this data into a form that will fit there.
But apart from this, we also have a suite of custom models: for instance, models that detect anomalies, models for error and failure analysis, workflow detection models, and also risk models that give us valuable insights that are factored into the prompt. We then have a prompt generator that generates a prompt for us that is given to an LLM summarization model, which gives a human-readable summary.
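Here's a hedged sketch of what such a prompt generator might look like; the inputs are plain dicts and lists here, and the field names and layout are assumptions for illustration, since the real pipeline is more involved.

```python
# Sketch of a prompt generator that folds the condensed logs, identity
# context, custom-model insights, and few-shot examples into one prompt.
import json

def build_prompt(condensed_logs, identity_context, model_insights, examples):
    parts = [
        "You are a security analyst. Summarize this cloud session for a security admin.",
        "Cover high-risk actions, errors, unusual activity, and the services used.",
    ]
    for ex in examples:  # few-shot pairs of (session, ideal summary)
        parts += ["Example session:", ex["session"], "Example summary:", ex["summary"]]
    parts += [
        "Identity context (inventory graph, HRMS, tickets):", json.dumps(identity_context),
        "Model insights (anomalies, risk, workflows, failures):", json.dumps(model_insights),
        "Session activity:", json.dumps(condensed_logs),
        "Summary:",
    ]
    return "\n".join(parts)
```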
Now let's look at the basic workflow that we have with an LLM. The first thing that we do is few-shot learning. The idea there is that we give a few examples of the ideal output that we want from the LLM by putting them into the prompt. So we take a few example inputs and the summary that we wanna see for them.
When we do this, it's very important for us to come up with a diverse set of examples, because there could be different situations which require different kinds of summaries. For example, if you have a lot of data, you might want a more high-level view of things, but if there are very few actions in your session, you might want a much more detailed summary of what transpired in the session.
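A minimal sketch of that selection idea follows, with an illustrative threshold and an assumed pool of examples labeled by style.

```python
# Sketch: pick few-shot examples that match the shape of the incoming
# session: tiny sessions get a detailed exemplar, large ones a high-level
# one. The threshold and "style" labels are illustrative assumptions.
def select_examples(example_pool, session_action_count, k=2):
    style = "detailed" if session_action_count < 20 else "high_level"
    matching = [ex for ex in example_pool if ex["style"] == style]
    return matching[:k]
```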
Now, while few-shot learning is great to begin with, it's really hard to fit a very diverse set of examples into your prompt; that would just make your prompt too big, and you can't cover all situations there. And again, we use a lot of the prompt engineering techniques that we talked about in the earlier session, like prompt templates, chain of thought, and so on, to construct a prompt along with the few-shot examples. But that doesn't cover all situations, like we just talked about, and what we want to try next is fine-tuning.
So in order to do fine-tuning, what we want to do is list all the kinds of situations that we want summaries for, and we curate a data set that changes the weights of the LLM a little bit in order to direct it to produce the kind of summaries that we want.
Now again, the challenge here is that handcrafting these examples is really hard. It takes a lot of time to generate the ideal kind of summary you want for a particular session input. So it's important to investigate ways of automating this data generation process, and also to use other forms of data that you get, which may not be ideal summaries in themselves, but any kind of data that you get as feedback to improve the model. And that's something Kevin is gonna talk about again, because generating these examples by hand does not scale, right?
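For illustration, once a set of (session, ideal summary) pairs has been curated or generated, it might be written out as chat-style JSONL records like the sketch below; the exact record layout depends on the fine-tuning API, so treat this as an assumption rather than the format actually used.

```python
# Illustrative sketch: write curated (session, ideal summary) pairs as
# chat-style JSONL for supervised fine-tuning.
import json

def write_finetune_dataset(pairs, path="summaries_train.jsonl"):
    with open(path, "w") as f:
        for session_text, ideal_summary in pairs:
            record = {
                "messages": [
                    {"role": "system",
                     "content": "Summarize this cloud session for a security admin."},
                    {"role": "user", "content": session_text},
                    {"role": "assistant", "content": ideal_summary},
                ]
            }
            f.write(json.dumps(record) + "\n")
```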
Another important aspect of our framework is our evaluation metrics. Evaluation metrics are important for a couple of reasons. They're important to create guardrails that make sure we show only high-quality summaries to customers, to gain customer trust. But they're also important because they can be used to improve the quality of the summaries themselves.
So for instance, if we have a good evaluation metric, we could generate the summary over and over again, pick only the one with the best score, and show that to the customer. Or if we had a synthetic data generation pipeline for training the LLM, we could leverage these evaluation metrics to discard any data points with a low summary score. So these evaluation metrics are useful in many different ways in our pipeline.
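As a small sketch of that generate-and-pick loop, with hypothetical `generate` and `score` callables standing in for the LLM call and the evaluation metrics discussed next:

```python
# Sketch of best-of-n selection with a quality guardrail.
def best_of_n(prompt, generate, score, n=3, min_score=0.8):
    candidates = [generate(prompt) for _ in range(n)]
    scored = sorted(((score(c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    # Guardrail: if even the best candidate scores poorly, show nothing
    # rather than a low-quality summary.
    return best if best_score >= min_score else None
```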
Now, what evaluation metrics do we use here? There are many evaluation metrics for summarization in the NLP world, the natural language processing world, but not all of them are relevant for us from a security summary standpoint. We all know that LLMs generate coherent English. What we want is metrics that are relevant from a security perspective. And we have two types of metrics that we actually use here. One is the recall metrics, which ensure that any important information in the session is actually covered in the summary.
And then we have precision metrics, which ensure that anything that appears in the summary actually happened in the session, which means we are not hallucinating. Specifically, we use critical action recall, which ensures that any critical actions that happened in the session actually show up in the summary, and service recall, which ensures that any services the user used during the session actually show up somewhere in the summary.
And then for precision we have action precision, which measures whether every action that shows up in the summary actually happened in the session, and service precision, which checks whether all the services that appear in the summary were actually invoked by the user in the original session. So let me give you a small example of the evaluation metrics here. We talked about how ChatGPT generated a summary which had no DynamoDB in it. This is exactly what we wanna prevent with these evaluation metrics.
So if you look at the critical action recall metric, it would ensure that, for instance, the DynamoDB DeleteTable, which is an important action in the session, is not missed, right? So this is exactly what we are trying to do here.
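A minimal sketch of these four checks, assuming we have already extracted the sets of services and "service:Action" strings mentioned in the summary (the extraction step itself is elided here):

```python
# Sketch of the recall/precision-style checks over sets of strings.
def _ratio(hits, total):
    return len(hits & total) / max(len(total), 1)

def critical_action_recall(critical_actions, summary_actions):
    # Every critical action (e.g. "dynamodb:DeleteTable") should appear in the summary.
    return _ratio(summary_actions, critical_actions)

def service_recall(session_services, summary_services):
    # Every service used in the session should appear somewhere in the summary.
    return _ratio(summary_services, session_services)

def action_precision(session_actions, summary_actions):
    # Everything the summary claims happened should actually be in the session.
    return _ratio(session_actions, summary_actions)

def service_precision(session_services, summary_services):
    # Every service the summary names should actually have been invoked.
    return _ratio(session_services, summary_services)
```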
All right, so thanks, Lavanya. With these great metrics that she's talking about, you now get into a position where you can really create some very nice positive reinforcement loops, right?
Either with human-generated labels or scores of the summaries that we're creating, or just by using automated metrics to tell us whether we're producing better summaries and encourage the model to get better and better. And so with that in mind, let's just recap. I think getting good-quality summaries out of cloud provider logs is really helpful, because the logs are very painful to look at, and a good summary really gives you a sense of what the identities are doing when they're in the cloud or in any other context. And yet it's not as easy as you might hope, right?
So there are three key challenges that, if we can overcome them, get us good summaries.
The first is we need to know what we're looking for and be able to qualify that, so that we can encourage the model to give us more of what we want. The second is we need to give it the right input and context, so it can understand the log data that we're throwing at it. And finally, we need to create good positive feedback loops by having good metrics to tell us when the model is giving us good results.
So with that in mind, we're hanging out right here in the lobby all the time. We'd love to have you come and talk to us so we can show you some of the summaries we've been able to create out of these logs, and also the work we're doing to help get customers to a position of least privilege in the cloud and in SaaS applications. So thank you, and we're happy to take any questions you might have.
Yes, a big thank you to both of you. A really interesting session here. So thanks for sharing your expertise.
Again, if you have questions, you can find them at their booth just outside. So thank you very much. Thank you.