I'm a principal security researcher at MasterCard and also an adjunct professor at Utica College. We'll just go ahead and jump right into it. At MasterCard, I do much of the scouting of future capabilities, so I see my fair share of sales pitches from security vendors. Based on much of the marketing material for these products, machine learning often appears to be some sort of magical fairy dust: we sprinkle a little here and a little there, and the next thing you know, it's learned our environment and we're secure. Now, don't get me wrong.
Machine learning does provide many profound benefits to cybersecurity and has proven effective at a lot of cybersecurity tasks. As a matter of fact, the benefits are so profound that we have to embrace it. This presentation hopefully will help demystify machine learning so that security professionals can harness its true potential. Here's what I'm going to cover today. Since we don't have a lot of time, I'm going to skip over the machine learning concepts.
However, those topics are in the slide deck for you to look at later, so make sure you go back and look at those if you're interested. We'll go ahead and jump to where we're applying machine learning to cybersecurity. Employing machine learning for cybersecurity offers us a lot of benefits, and it has become a vital component in many security solutions. We'll look at a few of those use cases.
For instance, for malware and phishing detection, we collect malicious and benign samples and then split them into training and testing data sets. We extract the discriminative features and build classifiers. These classifiers are trained on the training data to learn how to classify, then applied against the test data set to determine the accuracy. These often rely on machine learning algorithms such as naive Bayes and support vector machines, or on deep learning algorithms.
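To make that workflow concrete, here's a minimal sketch in Python with scikit-learn, using synthetic stand-in data; the features and their values are hypothetical placeholders, not real malware or phishing features.

```python
# Minimal sketch of the collect -> split -> train -> evaluate workflow.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Pretend feature matrix: rows are samples, columns are extracted
# discriminative features (hypothetically: API-call counts, URL length, etc.).
X_benign = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
X_malicious = rng.normal(loc=1.5, scale=1.0, size=(500, 4))
X = np.vstack([X_benign, X_malicious])
y = np.array([0] * 500 + [1] * 500)  # 0 = benign, 1 = malicious

# Split into training and testing sets, as in the workflow above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Train a naive Bayes classifier on the training data...
clf = GaussianNB().fit(X_train, y_train)

# ...then measure accuracy on the held-out test set.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```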
A typical approach for intrusion detection also applies machine learning. We begin by collecting the network data, select the features either through engineering or extraction, and build the IDS.
The IDS analysis engine is then applied to discover attacks based on those features. In addition to machine learning algorithms such as random forests and recurrent neural networks, IDSs often employ other AI techniques, such as fuzzy logic and expert systems.
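As a sketch of that analysis-engine step, here's a random forest trained on synthetic network-flow features; the feature set (duration, bytes, packets, flag count) is a hypothetical stand-in, not a real flow schema.

```python
# Minimal sketch: random forest as an IDS analysis engine on flow features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic flow records: normal traffic vs. attack traffic.
normal = rng.normal([10, 500, 20, 1], [3, 100, 5, 1], size=(800, 4))
attack = rng.normal([2, 50, 200, 5], [1, 20, 50, 2], size=(200, 4))
X = np.vstack([normal, attack])
y = np.array([0] * 800 + [1] * 200)  # 0 = normal, 1 = attack

# Random forests handle mixed-scale features well and expose feature
# importances, which helps with the feature-selection step described above.
clf = RandomForestClassifier(n_estimators=100, random_state=1)
print("cross-validated detection accuracy:",
      cross_val_score(clf, X, y, cv=5).mean())
```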
Then, in the area of user behavior analytics, we've seen a lot of application of machine learning. These tools seek to detect threats by defining normal behavior and then detecting deviations. So we use data mining and machine learning to identify what normal behavior is for a user, such as what data the user accesses, what programs he or she runs, what time they access the system, and so forth. Then machine learning algorithms are used to discover outliers. It's important to note that those outliers can be time-based, where we're saying the user's behavior changed over time, or they can be peer-based, where we're looking at that user's activity compared to their peers or others in their role.
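Here's a minimal sketch of that outlier-detection idea using an isolation forest on hypothetical per-session features (login hour, data read, programs run); fitting the same model on a peer group's sessions instead of one user's history would give the peer-based variant.

```python
# Minimal sketch: time-based outlier detection for one user's sessions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)

# Baseline behavior for one user: daytime logins, modest data access.
baseline = np.column_stack([
    rng.normal(10, 1.5, 300),   # login hour (~10:00)
    rng.normal(50, 10, 300),    # MB read per session
    rng.normal(5, 1, 300),      # programs run
])

model = IsolationForest(contamination=0.01, random_state=2).fit(baseline)

# A 3 a.m. session pulling far more data than usual should score as an outlier.
suspicious = np.array([[3.0, 400.0, 12.0]])
print("prediction (-1 = outlier):", model.predict(suspicious))
```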
These same methods, of course, can be applied to computer systems to detect deviations in their behavior as well. An interesting area is recent studies leveraging machine learning to try to predict vulnerability exploitation. The goal there is to assist software vendors with prioritizing patch development, because from what we've seen, less than 3% of all vulnerabilities are exploited.
So how do we prioritize those that will be exploited? Researchers use ML models that leverage multiple data sources to try to predict if and when an exploit of a vulnerability will be seen in the wild, drawing on sources such as Exploit-DB, the Zero Day Initiative, various threat intelligence feeds, scraped news and dark websites, and social media such as Twitter. These approaches to predicting vulnerability exploitation often combine many methods, including artificial neural networks and natural language processing.
Finally, network risk scoring is another area where we see a lot of use of machine learning. It helps organizations prioritize scarce resources by assigning relative risk to various components based on quantitative measures, often using machine learning methods such as k-nearest neighbors and support vector machines.
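As a sketch of the risk-scoring idea, here's a minimal k-nearest-neighbors example with made-up asset features; the feature names and labels are hypothetical stand-ins for whatever quantitative measures an organization actually collects.

```python
# Minimal sketch: KNN risk scoring of network components.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical scored assets: [open_ports, unpatched_cves, internet_exposed]
X = np.array([
    [2,  0, 0], [3,  1, 0], [5,  2, 0],   # previously rated low risk
    [8,  5, 1], [12, 9, 1], [10, 7, 1],   # previously rated high risk
])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = low risk, 1 = high risk

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Score a new asset; predict_proba gives a relative risk estimate.
new_asset = np.array([[9, 6, 1]])
print("risk class:", knn.predict(new_asset)[0])
print("relative risk:", knn.predict_proba(new_asset)[0][1])
```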
So we see a lot of benefit from these techniques, right? Machine learning offers tremendous benefits.
However, we have to be aware of some possible weaknesses, as these methods are themselves vulnerable. Researchers have proven that both traditional machine learning and deep learning, which is a subset of machine learning, are vulnerable to adversarial inputs and attacks. These attacks can be categorized by the phase they target, such as the training phase or the inference phase. Let's look first at attacks in the training phase. Poisoning attacks are a class of attacks that alter the training data or the model parameters.
Since machine learning methods rely on the quality of the training data, they are vulnerable to such manipulation. In poisoning attacks, the attacker injects adversarial samples into the training data set or manipulates the labels to impact the algorithm's performance. The poisoning can be either direct or indirect: direct poisoning targets the training data set, whereas indirect poisoning injects data into the raw data before we extract it for use with machine learning.
Now, since that training data set is often well guarded, poisoning attacks against the original data set can be very difficult. However, in a changing environment such as cybersecurity, our models need to be retrained often so they can adapt. For example, a machine learning model that seeks to detect anomalous network activity must periodically be retrained to identify what normal network traffic is. So attackers seek to exploit that need for retraining by targeting that stage of the machine learning pipeline.
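To illustrate why retraining is a tempting target, here's a minimal sketch of a direct poisoning attack via label flipping on synthetic data; the 20% flip rate and the logistic regression model are illustrative choices, not anything specific from the talk.

```python
# Minimal sketch: label-flipping poisoning and its effect on test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

clean = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Attacker flips 20% of the training labels (label manipulation).
rng = np.random.default_rng(3)
flip = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_poisoned = y_tr.copy()
y_poisoned[flip] = 1 - y_poisoned[flip]

poisoned = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned)

print("clean accuracy:   ", clean.score(X_te, y_te))
print("poisoned accuracy:", poisoned.score(X_te, y_te))
```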
Now, what about the inference phase? Here we have exploratory attacks. They do not tamper with the training data or the model.
Instead, exploratory attacks collect information about the training data and the model, typically during inference. Sometimes exploratory attacks are designed to duplicate the model or to extract training data. The attacker uses reversing techniques to discover how the algorithm works; that knowledge can then be used to launch an evasion or integrity attack.
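As a concrete illustration of the model-duplication idea, here's a minimal sketch, on hypothetical synthetic data, where an attacker queries a black-box "victim" classifier and trains a local surrogate on the responses.

```python
# Minimal sketch: model stealing by querying a black-box classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=4)

# The defender's model: a black box from the attacker's perspective.
victim = RandomForestClassifier(random_state=4).fit(X, y)

# Attacker sends probe queries and keeps only the returned labels.
rng = np.random.default_rng(4)
probes = rng.normal(size=(5000, 8))
stolen_labels = victim.predict(probes)

# A surrogate trained on (probe, response) pairs approximates the victim
# and can then be studied offline to craft evasion inputs.
surrogate = DecisionTreeClassifier(random_state=4).fit(probes, stolen_labels)
agreement = (surrogate.predict(X) == victim.predict(X)).mean()
print(f"surrogate agrees with victim on {agreement:.0%} of samples")
```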
Integrity attacks seek to evade detection by producing a negative or benign result on an adversarial sample. Such attacks often rely on an initial exploratory attack to figure out how the classifier works. As one example, researchers have demonstrated that biometric-based authentication systems are vulnerable to evasion attacks. Another type of integrity attack is analogous to a denial of service.
Such an attack is designed to cause the classifier to misclassify many benign samples as malicious, generating a flood of false positives for the security team to evaluate. And finally, we have output integrity attacks. These are analogous to man-in-the-middle attacks: they do not attack the model or data directly.
Instead, these attacks intercept the result from a classifier and change it. For example, on a malware classifier, the attacker could change the result from malicious to benign, allowing malware to execute. Another rapidly expanding area of concern is the use of machine learning to attack machine learning classifiers. This is often referred to as adversarial machine learning. Essentially, this is the battle royale of machine learning versus machine learning.
We'll take a quick look at how an attacker can use machine learning to conduct an integrity attack on a machine learning classification system. The attacker employs what's called a generative adversarial network. A GAN is a two-player, zero-sum game that pits two different AI models against each other.
In a GAN, the generative model, in this case controlled by the attacker, uses unsupervised learning to discover and learn the patterns in the input data so it can generate new, plausible examples to fool the classifier. The discriminator, in this case the classifier of the security system, tries to classify these examples. With each iteration, the generator is updated based on how well it performed against the discriminator, until it gets better and better at creating these examples. The diagram here shows how GANs often underlie the creation of realistic deepfakes.
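To make the two-player dynamic concrete, here's a heavily simplified GAN sketch in PyTorch (choosing PyTorch is my assumption; any deep learning framework would do). The generator learns to mimic samples drawn from a normal distribution centered at 4; architectures, learning rates, and data are all illustrative.

```python
# Minimal sketch: generator vs. discriminator on 1-D synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) + 4.0        # real samples ~ N(4, 1)
    fake = G(torch.randn(64, 8))           # generator's attempts

    # Discriminator: label real as 1, generated as 0.
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator output 1 on fakes.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After training, generated samples should cluster near the real mean of 4.
print("mean of generated samples:", G(torch.randn(1000, 8)).mean().item())
```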
Now let's look at some of the basic defenses we can employ to protect our machine learning models and data. Much of the effort when we're developing a good supervised learning model is really around the collection and preparation of the data. Since the learning is based on historical data, that data defines the ground truth, and its quality, quantity, and relevance will affect the learning. Often the data must be cleansed, and many of the decisions, such as how to deal with missing data, can greatly impact the learning.
We have to consider: should missing data be ignored, or should it be imputed? And if imputed, by what means? We also have to give consideration to outliers and the distribution of the data. All of that work is done before we even start the training.
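As a small illustration of one of those decisions, here's a sketch comparing two ways of handling missing values on a tiny hypothetical feature matrix: dropping incomplete rows versus mean imputation.

```python
# Minimal sketch: two common answers to "what do we do with missing data?"
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [7.0, 2.0],
    [np.nan, 3.0],   # missing value: ignore the row, or impute it?
    [5.0, np.nan],
    [6.0, 4.0],
])

# Option 1: drop any row with a missing value (loses data).
dropped = X[~np.isnan(X).any(axis=1)]

# Option 2: impute with the column mean (median/mode are other choices).
imputed = SimpleImputer(strategy="mean").fit_transform(X)

print("after dropping:\n", dropped)
print("after mean imputation:\n", imputed)
```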
Basics such as version control and access control are extremely important, especially as they relate to the training data and models; we must ensure that our machine learning model development employs sound change management processes. Next is robust learning. It seeks to make the model inherently less susceptible to adversarial data or outliers. We improve the robustness of learning algorithms to guard against poisoning attacks, and this is an area where a lot of research is being done today. What we're trying to do is actually simulate attacks during the development stage so we can improve the robustness of the model.
Another method used to create a more robust classifier is what's called bagging, where we use an ensemble of classifiers trained on resampled data instead of a single classifier.
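Here's a minimal sketch of the bagging idea using scikit-learn's BaggingClassifier, whose default base estimator is a decision tree; the data is synthetic and the ensemble size is an illustrative choice.

```python
# Minimal sketch: a bagged ensemble vs. a single classifier. Because each
# tree sees a different bootstrap sample, a handful of poisoned points
# typically sways the ensemble less than it would a single model.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=5)

single = DecisionTreeClassifier(random_state=5)
bagged = BaggingClassifier(n_estimators=25, random_state=5)

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```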
Then there's detecting attacks in the training and testing phases. This really needs a lot more research. Current research typically points to things like analyzing model drift. However, there is natural drift associated with changing attack parameters and methods in many of our cybersecurity use cases, so distinguishing drift caused by a low-and-slow attack may prove very difficult. This is another good reason for version control of the training data: we can run the original training data back through and see if we get the same results.
Another area I want to talk about is that we have to use great caution when implementing clustering methods for cybersecurity use cases. Researchers have been studying whether clustering, such as k-means clustering, can be applied safely within cybersecurity, and there really are no definitive answers. If an adversary knows the state of the clusters, it's easy to create a new data point near one of the clusters. So until further research can improve the robustness of clustering, make sure we exercise great caution when employing such methods in cybersecurity.
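Here's a minimal sketch of that weakness using k-means on synthetic 2-D data: an attacker who knows (or has estimated) the cluster centers can place a malicious point right next to the "benign" centroid and have it assigned there.

```python
# Minimal sketch: evading a k-means-based detector with a crafted point.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
benign = rng.normal([0, 0], 0.5, size=(200, 2))
malicious = rng.normal([5, 5], 0.5, size=(200, 2))

km = KMeans(n_clusters=2, n_init=10, random_state=6).fit(
    np.vstack([benign, malicious]))
benign_cluster = km.predict([[0, 0]])[0]

# Attacker crafts a point just beside the benign centroid.
crafted = km.cluster_centers_[benign_cluster] + 0.1
print("crafted point lands in benign cluster:",
      km.predict([crafted])[0] == benign_cluster)
```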
Now, classical machine learning does not consider purposeful misleading by an adversary. Traditionally, machine learning focuses on uncovering knowledge and discovering relationships within the data, assuming a non-adversarial environment. For most machine learning problems, such an approach is fine.
However, cybersecurity is an adversarial environment. Adversaries seek to exploit machine learning vulnerabilities to disrupt systems. When machine learning is used in an adversarial environment such as cybersecurity, it must be designed and built with the assumption that it will be attacked in all phases. Here you can see the major types of attacks we discussed earlier and which components of machine learning they target.
We see poisoning attacks targeting the training and testing data, exploratory attacks targeting the classifier to determine how it works, integrity attacks seeking to evade the classifier, and output attacks seeking to change the result from a classifier. Unfortunately, most of the methods used to assess the performance of machine learning models evaluate the model under normal operation, instead of in an adversarial context.
As I discussed earlier, classical machine learning development does not consider a purposeful adversary seeking to mislead it. However, when these traditional methods are used in cybersecurity, the adversary is seeking to exploit exactly these vulnerabilities. So the adversarial machine learning field seeks to improve machine learning algorithms' robustness, security, and resiliency in adversarial settings such as cybersecurity.
This adversarial machine learning field is built on three main pillars: one, recognizing the training-stage and inference-stage vulnerabilities, all those vulnerabilities we looked at; two, developing the corresponding attacks ourselves to exploit those vulnerabilities; and three, devising the countermeasures. With adversarial machine learning, the security team proactively attempts to exploit the vulnerabilities, much like red teaming or penetration testing used for securing traditional software development.
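As one illustration of pillar two, here's a minimal sketch of a self-developed evasion test: a fast-gradient-style attack against a simple linear model. The model, data, and perturbation budget are illustrative assumptions; for a linear classifier the input gradient is just the weight vector, which makes the attack easy to demonstrate.

```python
# Minimal sketch: red-teaming your own classifier with a gradient-sign step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Pick a sample the model currently flags as class 1 ("malicious").
x = X[(y == 1) & (clf.predict(X) == 1)][0]

# For a linear model, the gradient of the decision score w.r.t. the input
# is the weight vector; stepping against sign(w) is the evasion direction.
w = clf.coef_[0]
epsilon = 0.5  # illustrative perturbation budget
x_adv = x - epsilon * np.sign(w)

print("malicious probability before:", clf.predict_proba([x])[0][1])
print("malicious probability after :", clf.predict_proba([x_adv])[0][1])
```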
As we have seen, machine learning offers tremendous benefits when applied to cybersecurity. However, we have to understand its limits and vulnerabilities. Furthermore, deploying poor machine learning models is easy, but deploying robust, secure machine learning models requires much effort and planning. The training data used for machine learning must be sanitized; proper data collection and preparation can reduce the errors introduced through poor data. Furthermore, strict version and access controls should be employed to protect both the training data and the models.
And when we talk about cybersecurity, we need to use great caution when employing any clustering methods, as these methods are particularly susceptible to evasion attacks. Finally, cybersecurity is an adversarial environment. We have to understand that traditional machine learning methods do not consider willful misleading or disruption by an adversary.
We should therefore employ adversarial development methods when building models for environments such as cybersecurity. With that, I'd like to thank you all; that's the end of my presentation, so I'll turn it over for any questions.