Hello. Although our college does not have "tech" in its name, we do a lot of tech to some extent. For instance, we are participating in the capture the flag, right?
Yesterday I was on stage with Christopher, and yeah, we do a lot. We also have masters; at the end I'm going to quickly give an advertisement for our own course, where we have a specialization in cybersecurity, for instance. Yeah.
Okay, cool. But now to the topic. I'm going to talk a bit about some research that I undertook with a company. Yeah. I cannot tell you the name because they are very cautious in... Ah, thanks. Thanks. That's good. Awesome.
Because they're very cautious, I cannot give you the data, but I hope I can provide some insights into what we did and what our next plans are. And it's about bots. The presentation is not a typical academic one, it's more application oriented. I think that fits the audience. So what are bots?
What I mean is automated threats, as defined in the OWASP Automated Threat Handbook. Yeah. OWASP, everybody knows that, right? The Open Web Application Security Project, with the Top 10 and so on. And they define an automated threat, as I just cited here, as misuse of inherent valid functionality, which is often reported as application denial of service, while the DoS is only a side effect instead of the primary intent.
So what we are dealing with here are attacks where the attacker doesn't want the service to shut down. It is not a typical standard attack like a DoS, or trying to gain access, or trying to extract some data or something.
Their intent is to gain a competitive advantage over human competitors and place an order before humans can do that, because they want to get their hands on the hyped articles. Why do they do this? There are several motivations. One is the reselling business, for instance. They buy the articles relatively cheaply on your platform.
So this is about e-commerce business, and then they resell them with a high margin. Typical examples, you see it here: PlayStation 5. Okay, not anymore, but before. Or sneakers, that is a lot of business. It's also a whole industry: there are enterprises that write those bots, you can subscribe to them, and then they place the order on your behalf, and you get your hands on those highly desired articles. Okay. And of course, a company doesn't want that because it's reputational damage.
You want the legitimate customers to be able to buy the goods. Yeah.
And you also have operational issues. What you observe, and also how the attackers proceed, is that they reverse engineer your platform. So they try to find out how it works, try to find flaws in the design and in the implicit assumptions that every developer makes, and try to exploit that, right? And then you see, for instance, a lot of spike patterns, like a denial of service, but with a different intent. But still you need to scale out, which costs money, and there is operational risk: you can have outages, which also inflict financial damage.
Yeah. And they are smart.
Of course, they obfuscate their techniques. When they started, they sent a lot of requests from one account, but of course then they learned.
You can apply some rate limiting or something. Then they learned that they just need to distribute the effort across multiple accounts.
So, for instance, they create a ton of accounts upfront, or they take over accounts, and then they use a lot of obfuscation techniques, for instance. Like similar names, but slightly different. Also locations that are slightly different. Or they maintain a whole network of delivery addresses. Yeah. And how do we tackle this problem? The ultimate goal is to classify behavior: legitimate versus undesirable. And the focus of this talk is also this classification problem.
However, in step three at least, I also talk about the holistic approach, because I want to remind you, of course, it's a holistic problem. Always. Security needs to be tackled from multiple angles. There is no silver bullet
that solves everything. And in step four, I talk a bit about what the next steps are. Okay? The first thing that comes to your mind is a rule-based approach, right? You have all those observations and you can code them into rules, right?
For instance, during a hype article sale, which usually happens in a defined period of time, you observe things like many accounts that are slightly different, so you need to have some engine that checks for similarity, but that purchase the same article multiple times. Yeah. Then you can go, for instance, to order cancellation, account blocking, something like this. And the second thing that we set up is an automated batch labeling. The hype article sales rule is focused on this event. Yeah. You observe data during this event and respond directly.
The automated batch labeling, in contrast, means you look at the entire customer and order database, look for suspicious patterns there, and code them, right?
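A minimal sketch of what such a similarity rule could look like, assuming orders arrive as dictionaries with the hypothetical keys `account`, `name` and `article`. The similarity measure (`difflib.SequenceMatcher`) and the threshold are illustrative choices, not the rules that were actually deployed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float) -> bool:
    """True if two strings are nearly identical (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def flag_suspicious_accounts(orders, threshold=0.8):
    """Flag accounts with near-identical names that ordered the same article.

    `orders` is a list of dicts with hypothetical keys
    'account', 'name' and 'article'.
    """
    by_article = {}
    for order in orders:
        by_article.setdefault(order["article"], []).append(order)

    flagged = set()
    for same_article in by_article.values():
        # Compare every pair of orders for the same article.
        for first, second in combinations(same_article, 2):
            if first["account"] == second["account"]:
                continue
            if similar(first["name"], second["name"], threshold):
                flagged.update({first["account"], second["account"]})
    return flagged


orders = [
    {"account": "a1", "name": "John Smith", "article": "sneaker-x"},
    {"account": "a2", "name": "John Smiht", "article": "sneaker-x"},
    {"account": "a3", "name": "Maria Garcia", "article": "sneaker-x"},
]
print(flag_suspicious_accounts(orders))
```

The pairwise comparison is quadratic in the number of orders per article, so a real engine would need some blocking or indexing step first; the sketch only shows the shape of the rule.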
This is okay; you do something, it's a first step. But of course the performance is not so good: you have false positives, and it's reactive. Yeah.
Therefore, you want to go into something like deep learning. That's what we did. And we addressed two sub-problems of the bigger problem.
And let's first take a quick step back and take a look at what the problem is about. So, some general observations here.
In general, you have a shortage of gold standard labels; that's the correct term. This means, because you observe something, you can come up with some rules. We do this automatic labeling, but this is always biased. It's noisy, and you are not very confident: is it really a bad guy?
So at the very end, maybe if customer support, for instance, has verified the identity of the customer, you get a label which is of high quality, but you always have a shortage.
It doesn't scale well.
You need to deal with this problem. Then, the issues often have a temporal dimension. You see this here: what's written there are the customer journeys. A customer journey is a sequence of events that you can observe in the backend. For instance, account creation at the very start. Then the customer does something like looking at article pages, putting something in the cart, going to checkout, doing an address change, and so on. You can observe all this in the backend. And yeah, temporal dimension means it's sequential and you have timestamps attached to it.
And this is very similar to a natural language processing problem, as a sentence is also a sequence, right? Understanding the whole customer journey is a similar problem to understanding the whole sentence, the semantics of the sentence.
You need to somehow process this. Therefore it's intuitive to use deep learning methods inspired by natural language processing to tackle this problem as well. Right? And we focus here on the horizontal data view. I distinguish between a vertical and a horizontal data view; horizontal means only one sequence.
Of course, we look at many sequences, but only one at a time. And vertical would mean across multiple accounts. This would be the next step, later. How did we do this? The first use case was to classify bulk account creation. We wanted to have, at the end, an algorithm that gets an email address as input and provides as output a probability distribution over when it was created. Yeah.
And that was the idea. Because, as always, if you set up a machine learning algorithm, you go in with some hypothesis that your data will exhibit certain patterns.
And the patterns we observed manually, which then became the going-in hypothesis, were that the bulk account creation also happens in a short period. So the attack is prepared, right? Upfront, before the article sale, sometimes also a long time before, and they let the accounts sleep so they can use them later, but they're usually created in a short period of time. And we used BERT,
so Bidirectional Encoder Representations from Transformers, a natural language processing inspired approach, with a subsequent feed-forward network. And you see the output there on the left-hand side: if you have a legitimate email, you have a high entropy, which means the distribution is more or less uniform. But if you have a spike in your probability distribution, then you have a low entropy. This is suspicious, and that you can use to classify. Okay. And you can see then, on the right-hand side, it's a bit small.
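The entropy criterion just described can be sketched as follows. This is only an illustration: the probability vectors are made up, and the cut-off `max_entropy_fraction` is a hypothetical parameter, not a value from the project.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def is_suspicious(probs, max_entropy_fraction=0.5):
    """Flag a distribution whose entropy is far below the uniform maximum.

    `max_entropy_fraction` is an assumed cut-off for illustration only.
    """
    max_entropy = math.log2(len(probs))
    return shannon_entropy(probs) < max_entropy_fraction * max_entropy


uniform_like = [0.25, 0.25, 0.25, 0.25]   # no preferred bin: high entropy
spiky = [0.97, 0.01, 0.01, 0.01]          # one dominant bin: low entropy

print(is_suspicious(uniform_like), is_suspicious(spiky))
```

A uniform distribution over four bins has the maximum entropy of 2 bits, while the spiky one sits well below half of that, so only the second is flagged.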
I'm sorry, I was stupid. I took the 4:3 format and not 16:9. I'm sorry.
You see the t-SNE projection,
so the dimensionality reduction, and you see it's well clustered. Yeah.
You see, the red ones are the low entropy ones, and they are really bad. Of course it needs some fine tuning and so on, but this was just to get support for our hypothesis. Okay. Use case two is account takeover detection. Yeah.
There, the going-in hypothesis is as follows. But what is an account takeover, right? Maybe that first.
So it's taken over, meaning it was broken into by brute forcing the password or through a leak. And now the bad guy can act on behalf of, or pretend to be, the legitimate account owner, and of course has control over the account. So you expect some change of behavior in the customer journey, not only in the sequence and so on.
We also used, for instance, buying behavior. What do they buy, for instance? Yeah.
And the other idea was: if you split the customer journey into two subsequent sequences, then for a legitimate account you would expect that the probabilities of observing those subsequences are not independent. Yeah. They're not independent. Whereas for a taken-over account, because the attacker does not care what this guy did beforehand, the probability of observing the second subsequence given the first one is the same as the probability of observing it without that condition. Yeah. So it's independent.
So you have different expectations on the probability distribution, and that's what you can exploit. Therefore we trained a model that predicts the probability of observing a subsequence under the condition of another subsequence.
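Written out, the hypothesis reads roughly as below. The symbols are notation introduced here for illustration: s_1 is the first part of a customer journey and s_2 the part that follows it.

```latex
% s_1, s_2: consecutive subsequences of one customer journey (assumed notation)
\begin{align*}
\text{legitimate account:} \quad & P(s_2 \mid s_1) \neq P(s_2) \\
\text{taken-over account:} \quad & P(s_2 \mid s_1) = P(s_2)
  \;\Longleftrightarrow\; P(s_1, s_2) = P(s_1)\,P(s_2)
\end{align*}
```

One conceivable way to turn this into a score, not something stated in the talk, is the log ratio log P(s_2 | s_1) - log P(s_2): values near zero would point toward independence and therefore toward a takeover.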
Now, how did we do this? It's just a schema.
Okay, high level. So you have the raw data. You start with encoding the attributes, meaning a rule-based transformation of categorical into numerical variables. You get vectors, right? And then this whole slightly darker blue part is all embedding, meaning you give the network the opportunity to learn a semantic representation of what a customer journey is, right? So it's then mapped to a 256-dimensional vector space,
which has a metric, and of course similar customer journeys are close with respect to this metric if they're also similar in the real world, right? So you start with embedding the attributes, and you combine them into events. And then we did this shuffling. It's self-supervised learning, meaning we automatically generate self-supervised targets. The idea was that we randomly take two customer journeys, split them up, and shuffle the tails.
Yeah.
Thereby we simulate as if the account was taken over, right? So we take the tail of the sequence from another customer, another account, and attach it to the other journey. We exchange them, thereby creating labels, because we just mark the ones where we have done this. These are the labels, right?
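The tail shuffling can be sketched like this, assuming each customer journey is simply a list of event names with at least two events. The function names, the label convention (1 for a swapped tail) and the random cut points are assumptions made for illustration.

```python
import random


def swap_tails(journey_a, journey_b, rng):
    """Split two journeys at random points and exchange their tails.

    Each journey needs at least two events so that both the head and
    the tail are non-empty.
    """
    cut_a = rng.randint(1, len(journey_a) - 1)
    cut_b = rng.randint(1, len(journey_b) - 1)
    spliced_a = journey_a[:cut_a] + journey_b[cut_b:]
    spliced_b = journey_b[:cut_b] + journey_a[cut_a:]
    return spliced_a, spliced_b


def build_dataset(journeys, n_pairs, rng):
    """Return (journey, label) pairs: 0 = original, 1 = simulated takeover."""
    data = [(list(journey), 0) for journey in journeys]
    for _ in range(n_pairs):
        first, second = rng.sample(journeys, 2)
        for spliced in swap_tails(first, second, rng):
            data.append((spliced, 1))
    return data


rng = random.Random(0)
j1 = ["a1", "a2", "a3", "a4"]
j2 = ["b1", "b2", "b3"]
j3 = ["c1", "c2"]
dataset = build_dataset([j1, j2, j3], n_pairs=2, rng=rng)
print(len(dataset))
```

The labels come for free from the construction, which is what makes the setup self-supervised: no human has to mark any journey as taken over.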
And then the embedding learns what a normal customer journey is versus a taken-over one. Afterwards you do supervised learning on those labels you have. The prediction was also promising; it worked relatively well. So, holistic approach. This is a general thing, to remind you that of course there is no silver bullet. Yeah. I mean, classification, nice one. Yeah. But you need to have a whole infrastructure to tackle the problem, of course. So what we had, for instance: the applications, oh, sorry.
They submit all these events into an event pipeline. They are eventually also stored in the data lake, and together with other data sources you can do the typical stuff: ETL, then data preparation, machine learning training, and you have a trained model which exposes an API that you can call for inference. Yeah. To get a prediction. On the other hand, you have stream processing, which takes real-time data. Together with this prediction from a pre-trained model, you feed this into the rule engine.
And in the rule engine the rules are stored, which say: when this threshold is hit, and so on, you go to this response. You want to have a gradual response here, meaning if you're more confident, then you are more aggressive, like ultimately order cancellation or account deactivation. If you're less confident, you maybe just do some timeout or slow them down somehow.
Right? Okay.
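A gradual response of this kind could be expressed as a small threshold table. The thresholds and action names below are hypothetical; in practice the business would set them.

```python
# Hypothetical (threshold, action) pairs, ordered from most to least aggressive.
RESPONSE_RULES = [
    (0.95, "cancel_order"),
    (0.70, "timeout"),
    (0.40, "slow_down"),
]


def choose_response(bot_probability: float) -> str:
    """Pick the most aggressive action whose threshold the score reaches."""
    for threshold, action in RESPONSE_RULES:
        if bot_probability >= threshold:
            return action
    return "allow"


for score in (0.99, 0.75, 0.45, 0.10):
    print(score, choose_response(score))
```

Keeping the rules as data rather than code makes it easier to tune how aggressive the platform is without redeploying the model.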
And everything is connected, as always. It's just to give you an idea. I mean, the detection part is only the central detection and response mechanism in the middle there. But of course you need to connect all the other services, like order services. From customer support you want to have the feedback, because you will always have false positives and people will be complaining, and you also get gold standard labels from them, because they are directly in touch with the client. You can of course also apply some lower-level networking controls like rate limiting and DDoS protection.
You get insights from the edge. You want to generally push that to the edge.
I mean, it is okay to classify in the backend, but that is still kind of late in the process, right? Better to protect the accounts upfront and implement something like multi-factor authentication.
And of course you want to revise your architecture and find the flaws that allow them to exploit the vulnerabilities in the first place. Right? So for instance, check all endpoints: do they not reveal too much information? Do they all do input sanitization, and stuff like this? Okay.
And you need to have extensive stakeholder management, because it always causes friction on the customer side. This is not desirable for the company.
However, they of course also want to slow down the bad guys. You always need to explain it to the business. They need to take a conscious decision on what they want, how aggressive the rules should be. Yeah. What are we up to next? I talked about the vertical data view. So we hope we can improve the performance by taking multiple accounts into account, which is quite intuitive.
I think also graph-based machine learning, because this is a graph-based problem.
Like on the right-hand side below, you see for instance bad guys: multiple accounts mapping to the same entity because they are similar. Then you are already confident this is a bad guy, and the yellow one maybe is not so obviously bad.
However, he also ships to the same location, which has an impact on the probability that this guy is bad. Okay? You can model this as a graph. And so the plan is to also use graph-based modeling, because this problem is so graph based. Okay?
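To make the graph view concrete, here is a minimal sketch in which accounts and delivery addresses are nodes and an edge means "this account ships to this address". The node names are made up, and a plain connected-component search stands in for the graph-based machine learning that is still only planned.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected adjacency sets from (account, address) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def connected_component(graph, start):
    """All nodes reachable from `start` via breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


edges = [
    ("acct:bad-1", "addr:warehouse-7"),
    ("acct:bad-2", "addr:warehouse-7"),
    ("acct:yellow", "addr:warehouse-7"),   # not obviously bad, same location
    ("acct:regular", "addr:home-42"),
]
graph = build_graph(edges)
print(sorted(connected_component(graph, "acct:bad-1")))
```

The "yellow" account ends up in the same component as the known bad accounts purely through the shared address, which is the kind of evidence a single-sequence model cannot see.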
And the other thing is to use generative approaches to harden the platform. There the idea is to use some deep reinforcement learning, right? That's the idea. So what we have now, with this deep learning stuff, for some sub-problems at least, is that we have gained a representation, we have gained knowledge, and now we also want to reason about this, plan, and take some action on the environment.
Therefore, the idea is to train an agent whose objective is to maximize the reward.
A positive reward would be that he achieves some sub-goal, maybe ultimately placing the order, but also some sub-goals. A negative reward of course would be if he gets detected. So we want to train an agent that tries to ultimately place the order but undermines our own classification, thereby detecting, or making obvious, which adversarial effects are out there, and use these insights to harden our platform. That's the idea, right? So the slide is a bit full, I guess.
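The reward structure for such an agent might look like the sketch below. The event names, reward values and detection penalty are all invented for illustration; the actual design is still open research.

```python
# Hypothetical sub-goal rewards and detection penalty.
SUBGOAL_REWARDS = {
    "account_created": 1.0,
    "item_in_cart": 2.0,
    "order_placed": 10.0,
}
DETECTION_PENALTY = -20.0


def step_reward(event: str, detected: bool) -> float:
    """Reward for one step: penalty if detected, else the sub-goal reward."""
    if detected:
        return DETECTION_PENALTY
    return SUBGOAL_REWARDS.get(event, 0.0)


def episode_return(steps, discount=0.99):
    """Discounted return of an episode; the episode ends on detection."""
    total = 0.0
    for t, (event, detected) in enumerate(steps):
        total += (discount ** t) * step_reward(event, detected)
        if detected:
            break
    return total


stealthy = [("account_created", False), ("order_placed", False)]
caught = [("account_created", False), ("item_in_cart", True), ("order_placed", False)]
print(episode_return(stealthy, discount=1.0), episode_return(caught, discount=1.0))
```

An agent maximizing this return is pushed toward placing the order while staying under the classifier's radar, which is exactly the behavior one wants to surface in order to harden the platform.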
But you get the feedback cycle, probably. The challenge, of course, is that you require more or less a copy of this web application, because this is the environment that you want to model, and it requires extensive training. Let's see.
But we want to try it.
It's research.
I hope... Yeah. Happy for your feedback. Yeah. What do you think about it? Summary: overall a pretty challenging security problem. Remember, classification is only one sub-problem. It requires a holistic approach, right?
But the first deep learning based approaches were very promising. Next steps are graph-based and deep reinforcement learning approaches. And we should also take additional concepts into account, like explainability, of course. Yeah. If the customer complains, he would like to know why. And you cannot tell him: yeah, computer says no.
Yeah, right. Sorry. It's okay. And there are some other concepts we didn't consider yet, like few-shot learning or learning with corrupted labels. These are some papers that came out in 2020, and yeah, we didn't consider them yet. Okay. Cool. And the final advertisement slide, I hope it's okay. I just wanted to make a quick advertisement for our school because you mentioned it, right?
What do you have to do with tech? Right. We also have a master, this Master of Digital Transformation, and our USP is that we treat digital transformation holistically.
Not only tech, though tech is a big focus, but also, for instance, societal impact and business. And it's dual, so the people are applying what they learn in theory directly in practice, because they have a partner company. And we have a research focus and three focus areas: cybersecurity, AI and robotics, and data science. Okay. Thanks.
That's it. Thank you.