Yeah, thanks for that intro. Today we're going to talk about homegrown machine learning implementation for the CERT. So for the leaders out there, the analysts, the cybersecurity specialists, and those of you who are just interested in hearing my talk, thanks for joining. It'll be a fun journey today talking about what machine learning means for the security operations center. Before we get started, a little bit about myself: I'm the self-proclaimed data nerd here, and you can find me on Twitter.
I'm the head of data science for our CERT, the security incident response team, at Goldman Sachs, and I have a master's in data science from NYU. I was recently published in 2600: The Hacker Quarterly, where I go by the author name B Logic.
Also soon to be released is a white paper from the artificial intelligence risk and security group, a consortium between the technology and banking sectors plus a few people from academia.
So be on the lookout for that paper. It mostly talks about how to implement AI the right way and what to look out for in the banking sector. More to come on that, and feel free to reach out to me on LinkedIn or Twitter. Just a quick disclaimer here: the opinions expressed in this presentation are those of the presenter, which is me, and do not necessarily reflect the official policy or position of Goldman Sachs. So now that we've got that out of the way, the agenda. First and foremost, cybersecurity is hard.
A lot of you know that, and I'm going to try to narrow in on the data science side of cybersecurity.
At the end, we'll do a deep dive on some really good use cases we have implemented that are machine learning based and data science based, and that have evolved through this homegrown approach my team has taken to help our CERT. Everybody's familiar with this picture: how we funnel in the disparate log sets, the tons of event sources that come in each day, which funnel down into incidents and alerts.
This sets us up for data science initiatives to narrow down that funnel and make it apparent to the analyst which events are worth investigating, rather than false positives or noise that can really annoy them. So with that: cybersecurity is hard, and a handle on security?
You don't necessarily have it. I would argue that none of us have it, because we're all vulnerable. There are tons of avenues for attackers to come in, and data science is just one of many tactics we can use to help solve this problem.
So we look at the SOC, the security operations center, and essentially it is an operation, right? We can use statistical methods to find efficiencies, to find what's effective and what's not, and run a good operation.
It's similar to how you would run a business, right? You make widgets, run them along the assembly line, and hopefully output a widget you can then sell. That widget is going to come with some defects; there may be some false positives. And hopefully, along the route, you can find efficiencies so the widgets come out faster, more efficiently, and with better quality at the end.
Along that assembly line, if you will, for the SOC, there are many teams involved: a SIEM, ticketing, threat intelligence, your research and development team, your reporting team, your detection engineering team. Along those cogs we can find efficiencies, and usually there are data science methods to identify them.
Going down that funnel, what's coming in are all those logs, events, and cases, usually coming into Splunk. And what I've noticed, at least in the financial sector, is that we've migrated to additional types of log management systems: mostly open source, some off-the-shelf products other than Splunk, like Hadoop and big data engineering, some Kafka streaming, and other methods like Elasticsearch.
Now we're kind of weaning ourselves off of Splunk, but we also have to implement tools and pipelines to gather that data and siphon it into our SIEM for more reliable notifications, reports, tickets, and management at the end of the day.
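To make that pipeline piece a bit more concrete, here is a minimal sketch of one gather-and-forward job: read raw events off a stream, normalize the fields, and push them to the SIEM. The topic name, field names, and ingest URL are hypothetical placeholders, not our actual implementation.

```python
import json

import requests
from kafka import KafkaConsumer  # kafka-python client

# Hypothetical stream of raw proxy events; the topic and brokers are placeholders
consumer = KafkaConsumer(
    "proxy-events",
    bootstrap_servers=["broker1:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

SIEM_INGEST_URL = "https://siem.example.internal/api/ingest"  # placeholder endpoint

def normalize(event: dict) -> dict:
    """Map source-specific field names onto the common schema the SIEM expects."""
    return {
        "timestamp": event.get("ts"),
        "src_ip": event.get("client_ip"),
        "domain": event.get("dest_host"),
        "action": event.get("action", "unknown"),
    }

for message in consumer:
    requests.post(SIEM_INGEST_URL, json=normalize(message.value), timeout=5)
```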
Through all of that, I'm always putting myself in the analyst's shoes: how can I make their job easier, frustrate them less, and make them less overwhelmed, so they're more efficient and can investigate their cases with better knowledge, better context, and better understanding?
So the problem statement: many detections, but a lot of noise. This is the original problem you can start with when you're trying to implement data science solutions. Start easy and try to integrate some type of analytics platform. It can be homegrown; there are some off-the-shelf products, but we haven't found success with some of those, so we continue to go down this grow-your-own analytics platform route, which has been successful for us.
At the end of the day, we want to be able to support better identification of vulnerabilities and potential incidents. So for a solution, we started with the easier route, with data science in mind: can we empower analysts to work on higher-fidelity alerts?
So what does that mean?
Well, there's always this balance we're trying to strike, and this isn't necessarily machine learning or AI; it's a problem that all SOCs and cybersecurity teams are trying to figure out. We have our insiders, we have our attackers, and then you've got your analyst sitting on the end of the seesaw. The attackers and insiders are always knocking that analyst off the seesaw, because there's never really a balance, but with some data science methods I feel like we can get closer to finding it.
Moving on to the implementation of machine learning: everyone thinks it's this silver bullet that can help out our defenders, but there's so much surrounding machine learning and its implementation, especially the engineering, the SDLC, the code. You've got machine learning ops on top of your security operations center.
In the middle there you've got your ML code, and usually that's about 10% of the work effort going into a data science problem. But there are other things you have to consider when the implementation comes out: feature extraction, data collection, configuration, monitoring, keeping a watch on how well your predictions and analytics are performing, and then tuning, which is very similar to tuning detections within the SOC.
So it's nice that it takes a similar approach; however, there's a lot behind it. Now for what we've done with the data science pipeline. I'll get to identifying higher-fidelity alerts in a second, because it's a use case, but I want to lay out the underlying platforms and initiatives needed to get to that end goal of an output, a metric, or an analytic.
I mentioned earlier how we're kind of weaning ourselves off of Splunk and diversifying our data pipelines, but with that, the disparate data sets live in all these different locations. So we've had to build custom integrations, what we call our data connector, out to Hadoop, out to another log store, out to Elasticsearch.
That takes some custom code and integration, but it's really set us up for success to pull those data sources in and then figure out how to transform that data, mostly in that data abstraction layer, where we're trying to normalize it. To make it easier, similar to a Splunk search, SQL is one of the methods that is interoperable between some of the big data sources we've found success with.
And we take care of the hard code to integrate.
Then we empower our analysts to write simple SQL queries against these disparate data sets. We set up the authentication, the Python code, the API translation.
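As a rough illustration of what that abstraction layer buys the analyst, here is a hedged sketch; the backend registry, connection strings, and table names are hypothetical, and each connection string assumes the matching SQLAlchemy dialect package is installed. The point is that the analyst only writes SQL, and the connector decides which engine it runs against.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical registry of backends the data connector knows how to reach.
# Authentication and credential handling would live behind this, not in the query.
BACKENDS = {
    "proxy_logs": create_engine("hive://hadoop.example.internal:10000/security"),
    "dns_logs": create_engine("elasticsearch+http://es.example.internal:9200/"),
}

def search(source: str, sql: str) -> pd.DataFrame:
    """Run an analyst-written SQL query against whichever store holds that data."""
    return pd.read_sql(sql, BACKENDS[source])

# A hunt or threat intel analyst only needs the SQL, not the plumbing:
blocked = search(
    "proxy_logs",
    "SELECT domain, COUNT(*) AS hits FROM proxy WHERE action = 'blocked' GROUP BY domain",
)
```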
So if an analyst wants to search, a hunt team wants to search, or the threat intelligence team wants to search among these different data sets, they can do so with a simple SQL query. Then, when we've found a good search or a data source that we want to build some analytics on or run some detections against, that's where we come to feature engineering, and we use a dev sandbox in Jupyter notebooks to identify additional features that may set us up for a model.
That's the algorithm selection and the feature extraction, and a scheduled job handles the automation.
So if we've got new data coming in, we can train on that data and get a model out, depending on the frequency and the accuracy of that model; more data coming in will, more or less, equal a better model with higher accuracy. At the end we have that model, it can be queried via a REST API, and we can evaluate it on a daily basis with a UI and continue to log.
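The serving side can be as simple as a small scoring service in front of the trained model artifact. This is a bare-bones sketch using Flask; the route, port, and payload shape are illustrative rather than a description of our production API.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # artifact written by the scheduled training job

@app.route("/score", methods=["POST"])
def score():
    # Expects a JSON body like {"features": [[f1, f2, ...], ...]}
    features = request.get_json()["features"]
    probabilities = model.predict_proba(features).tolist()
    return jsonify({"probabilities": probabilities})

if __name__ == "__main__":
    app.run(port=8080)
```

Anything that can make an HTTP POST, whether a notebook, a detection, or the case UI, can then ask the model for a score.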
We also pull in some of the other teams to help us with model risk management, making sure the model is acting as it's supposed to. Is it violating any policies? And we can contain that model in its own environment, with things like Docker, to have it running.
That's the engineering part behind the scenes of a model going from a data store to a prediction at the end.
This is more or less the integration part: each of the platforms here, some of the data sources we're working with, and then that data connector going into the sandbox to empower the analysts to either do those searches or create some models themselves. Then, more on the ML platform and MLOps: what it means to go to production.
And this is brand new, even within our technology group, as to how to operate ML the right way. Does it really fall into the SDLC? Sometimes yes, sometimes no. Is data science code really application code? We're still trying to figure this out, but we've made some inroads, partnered to implement it, and have some lessons learned. So now I'll get into the use cases.
Back to the metrics. This is where we first started and had some success, identifying some of the operations within the SOC. Think analytics like: what's the count of cases per day over the last six months? And then breaking that down even further: what's the median case count for each hour of the day?
And yeah, we had some interesting findings. At 3:00 AM and 3:00 PM, it seems, there was a spike in our detections firing off, so should we normalize that to even out the load for our 24x7 operation? And additionally, which days were most popular for cases to fire off? That helped us arrange our teams based on certain days: should we have three analysts on this shift on this day to support the caseload?
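Those first analytics are only a few lines of pandas. This is a hedged sketch; cases.csv and the created_at column stand in for whatever your ticketing export actually looks like.

```python
import pandas as pd

# Hypothetical export from the ticketing system, one row per case
cases = pd.read_csv("cases.csv", parse_dates=["created_at"])

# Count of cases per day over the export window
cases_per_day = cases.set_index("created_at").resample("D").size()

# Median case count per hour of day: count per (date, hour), then take the median per hour
median_per_hour = (
    cases.groupby([cases["created_at"].dt.date, cases["created_at"].dt.hour])
    .size()
    .groupby(level=1)
    .median()
)

# Which weekdays generate the most cases, to help arrange shift coverage
cases_per_weekday = cases.groupby(cases["created_at"].dt.day_name()).size()
```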
So it was really eye-opening, and it helped us better operate our SOC and even out the caseload. More on the analytics: as the threat rate continues to go up, more detections are being created, and this helped with our false positive rate. Should we be firing these cases? It helped us figure out, hey, maybe this rule needs to be evaluated or tuned because it has a high false positive rate. I also want to talk about DGA detection, which has been a successful machine learning exercise for us.
DGAs are domain generation algorithms. A lot of ransomware attackers are using DGA domains to try to fool your proxy controls; the domain may come in looking like a bunch of gobbledygook, but behind the scenes an algorithm is being used to create these domains so the traffic can slip by as just another domain.
And it's tough to identify.
I mean, as a human it's kind of easy to look at one and say, hey, this looks like a DGA. But can you train an algorithm to sift through your 150 million proxy log events per day to identify domains that look like this, which could point to some type of malicious behavior going on in your network? We had some success with that.
So, similar to this domain that you and I can see looks like a suspicious DGA, you can break it apart into features that can then be fed into a model to help predict. For example: the length of the domain, TLDs that may be suspicious,
whether there are any English or otherwise regular words within the domain itself, whether there are any hyphens or weird characters, and what the ratio of consonants to vowels is for a particular domain. It's more or less a linguistic exercise: break apart a domain and predict the likelihood or percentage that yes, it's more likely a DGA rather than a regular domain.
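A minimal sketch of that kind of feature extraction is below; the exact features, the toy word list, and the example domain are illustrative, not our production feature set.

```python
import re

VOWELS = set("aeiou")

def dga_features(domain: str) -> dict:
    """Break a domain into simple linguistic features for a DGA classifier."""
    labels = domain.lower().rstrip(".").split(".")
    name, tld = ".".join(labels[:-1]), labels[-1]
    letters = [c for c in name if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    digits = sum(c.isdigit() for c in name)
    return {
        "length": len(name),
        "num_dots": domain.count("."),
        "tld_length": len(tld),
        "has_hyphen": int("-" in name),
        "digit_ratio": digits / max(len(name), 1),
        "consonant_vowel_ratio": consonants / max(vowels, 1),
        # Toy stand-in for a real dictionary-word check
        "has_common_word": int(bool(re.search(r"(mail|login|news|shop|bank)", name))),
    }

print(dga_features("xj3k9qzlwpd2f.eu"))
```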
Here's an example from Emotet, a piece of malware that uses DGAs for its C2 servers. You'll see it in the red Xs: characters a through y, usually 16 characters in length, using the .eu TLD, along with a brief description. And there are feeds out there publishing known DGA domains; bring those in and mix them with regular domains,
like an Alexa top million list, run them through the features, and you can output a supervised learning model to help predict DGAs in the future. Here's an example of some of the features: our analysts can pop in a domain and these are the results they'll get back.
How many dots are in there?
What's its TLD length? Are there any words in there? What's the word-to-digit ratio? And at the end there, in the red box, we get a likelihood: this one has a 2% likelihood of being benign and a 97% likelihood of being a DGA. In addition to that, we can do a multi-class classification among malware families, to show which DGA-using malware families we think this domain most resembles. This has been really helpful for us, and it helps with context.
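For completeness, here's a hedged sketch of how that kind of supervised model can be put together, assuming you already have a feed of known DGA domains and a benign list such as an Alexa-style top-sites file; the file names and the choice of a random forest are illustrative, and dga_features() is the extraction sketch shown earlier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# dga_feed.txt: one known-DGA domain per line; benign_top.txt: one popular benign domain per line
dga = pd.DataFrame({"domain": open("dga_feed.txt").read().split(), "label": 1})
benign = pd.DataFrame({"domain": open("benign_top.txt").read().split(), "label": 0})
data = pd.concat([dga, benign], ignore_index=True)

X = pd.DataFrame([dga_features(d) for d in data["domain"]])
y = data["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))

# At query time an analyst pops in a domain and gets back P(benign) and P(DGA)
print(clf.predict_proba(pd.DataFrame([dga_features("xj3k9qzlwpd2f.eu")])))
```

The same pipeline extends to the multi-class case by labeling the feed domains with their malware family instead of a single DGA class.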
And we're working on detections to tune down the false positive rates. So, just to recap: we went through some of the behind-the-scenes of what we're doing as a data science team to help our CERT, some of the engineering behind the output of ML,
and then two use cases that have been pretty successful for us with data science in the SOC: evaluating the analytics of our alerts in general, and then using ML to fight fire with fire, using an algorithm to fight another algorithm, with machine learning identifying DGA domains. So I'll stop there and pause for any questions.