Stop, stop, stop. Quick note before we start the show. You might have already realized that we were not as regular in publishing this podcast as we have been previously, and we have skipped some episodes, and this is for a good reason. We are currently producing lots of great content running up to our event, the cyberevolution, and also the cyberevolution is really a lot of work but great work to prepare for, and that's the reason why we will reduce the cadence of the publication for this podcast temporarily. So it will be bi-weekly starting now and we will return to a weekly schedule as soon as possible. There will be lots of great video content anyways because there is lots of great video content resulting for and from cyberevolution. And so please stay tuned. Watch this channel. Watch our now bi-weekly, temporarily bi-weekly, podcast and we will return to a weekly schedule, as mentioned, very soon. So enjoy the show. Thank you for listening.
Welcome to the KuppingerCole Analyst Chat. I'm your host. My name is Matthias Reinwarth, I'm the Director of the Practice IAM here at KuppingerCole Analysts. My guest today once again - and this is really an interesting topic - is research analyst Marina Iantorno. Hi, Marina, great to have you back.
Hi Matthias, thanks for having me here. It's my pleasure.
It's my pleasure, not only because of that, but especially because of the topic, because we have something that is rather new, rather fascinating. And this is really an outcome of developments that have happened in the last three years or so, and that ended up with the term synthetic data. First of all, please give me an introduction to synthetic data. What is it, how is it different from real-world data, and how is it generated?
Well, as the word indicates, it is something synthetic. So it stands for artificially generated datasets. So pretty much artificial intelligence creates datasets that mimic the real data. That means it reproduces the same statistical properties, the same patterns, the same characteristics as the real data that companies could have. So in the end, it is actually something that is showing information, like [...] information about individuals but not revealing their identities, which is actually great. And the idea is that AI creates this data, giving the advantage of mitigating bias and making algorithms more robust. And there will be, you know, a more balanced use of data from underrepresented groups, which is actually very good in terms of AI ethics. It will improve fairness. And also this kind of data could actually empower testing models in different scenarios. So it's something great that is actually on the rise right now.
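To make "the same statistical properties" concrete, here is a minimal sketch in Python. It is not how any particular vendor generates synthetic data; it simply assumes two numeric columns, fits their means, standard deviations and correlation, and samples new rows from a bivariate Gaussian with those statistics. The column meaning (age, yearly spend), the helper names `fit_model` and `sample_synthetic`, and all values are invented for illustration.

```python
import random
import statistics


def fit_model(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then Pearson correlation.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y, "sd_x": sd_x, "sd_y": sd_y,
            "corr": cov / (sd_x * sd_y)}


def sample_synthetic(model, n, rng):
    """Draw n artificial rows from a bivariate Gaussian with the fitted statistics."""
    rho = model["corr"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = model["mean_y"] + model["sd_y"] * (
            rho * z1 + max(0.0, 1 - rho * rho) ** 0.5 * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)
# Hypothetical "real" records: (age, yearly spend). Values are invented.
real = []
for _ in range(500):
    age = rng.uniform(20, 70)
    real.append((age, 200 + 15 * age + rng.gauss(0, 100)))

real_model = fit_model(real)
synthetic = sample_synthetic(real_model, 5000, rng)
synthetic_model = fit_model(synthetic)
print(real_model)
print(synthetic_model)
```

The two printed summaries should be close to each other, although no synthetic row is taken from the real records. Real generators use far richer models; a Gaussian is only the simplest stand-in.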
Right. You mentioned it's something great. We need to look at it from different angles and from different aspects. But first of all, where is it useful? Of course, we as KuppingerCole Analysts are mainly, but not only, concentrating on cybersecurity and identity and access management. Let's start with cybersecurity. Where can we use synthetic data in the context of cybersecurity, and what are the benefits of using that data?
Well, we know that AI is actually touching a lot of industries nowadays, and I would say that this is something that is ongoing and will continue, and ultimately it will touch almost all the industries in the market. And cybersecurity is one of them. Now, cybersecurity includes, for example, automation in monitoring and in supply chain risk management. So when we use synthetic data, we can actually use this data to train and test different models. So here is where cybersecurity gets a benefit from using synthetic data, because sometimes, for example, we can have datasets that have some gaps. And with synthetic data we can cover those gaps, and it would allow us to have more robust training, test and optimization sets. So in this sense, what I would say is the important part is the quality and the quantity of the data that we use in the training set, the test set and the optimization set. So then this will actually give us a lot of possibilities to improve the performance and the accuracy when we are creating, for example, a model to do automatic monitoring. And besides, synthetic data can also create realistic but non-sensitive threat intelligence feeds. And this could be useful for organizations, especially for training employees, you know. So then if you want to do awareness training for cybersecurity, you can actually use the synthetic data to create these threat intelligence feeds, or simulations of them, right? And they can actually simulate phishing attempts, malware injections, data breaches, for instance. And we can see how the employees react to that. So then if there are some weaknesses, companies could actually act very quickly on that.
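One simple way to picture "covering the gaps" is to top up a rare class with artificial examples. The sketch below, in Python, interpolates between random pairs of existing minority samples. This is a simplification in the spirit of SMOTE-style oversampling (which interpolates towards nearest neighbours), not the method of any specific product. The feature meaning (failed logins, megabytes uploaded), the function name `synthesize_minority`, and all numbers are hypothetical.

```python
import random


def synthesize_minority(samples, n_new, rng):
    """Create n_new points by interpolating between random pairs of real samples."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)   # two distinct real minority samples
        t = rng.random()                # position on the line segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


rng = random.Random(7)
# Hypothetical features per login session: (failed_logins, megabytes_uploaded).
benign = [(rng.randint(0, 2), rng.uniform(0, 5)) for _ in range(200)]
malicious = [(12, 80.0), (15, 95.5), (9, 60.2), (20, 120.0), (11, 70.3)]

# Add synthetic malicious sessions until both classes are the same size.
extra = synthesize_minority(malicious, len(benign) - len(malicious), rng)
balanced_malicious = malicious + extra
print(len(benign), len(balanced_malicious))
```

Every generated point lies between two observed malicious sessions, so the training set becomes balanced without inventing behaviour far outside what was actually seen. That is also the limit of the approach: it cannot add attack patterns that are missing from the original data.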
Right. And in several episodes that I did recently, there was always the topic of exchanging threat data, exchanging training data between organizations, and that always comes with some kind of hesitancy, to say, I don't want to share my data because of privacy concerns, because of giving away data that might hint at my intellectual property. So that would be an area of application where synthetic data could help. So a bit like anonymization on steroids.
Yes, totally. And I would say that privacy is actually the part where we get the most benefit out of this. Because we know that there are many industries in the market that have the obligation to protect their data. So let's talk about the medical industry or banking. These are industries that actually cannot reveal the identity of their clients or the people they have in the dataset. So synthetic data will be able to actually pick the most important features that the dataset has. Talking a little bit technical, this is similar to principal component analysis, where, you know, the machine is actually able to take the most important features and then replicate them, creating a mimic of this reality, without mentioning these people and without using, you know, their real names or their real information. Now, many companies could actually benefit from this in terms of proving that they are complying with their regulations. And we know that the data regulations are actually changing constantly. There are updates all the time. And with synthetic data, it could actually be very easy to demonstrate that none of the real data is shown, that none of the personally identifiable information is there. So then it's a great advantage for industries that are highly regulated.
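Demonstrating that "none of the personally identifiable information is there" still needs a check. The Python sketch below shows one naive check, assuming records are dictionaries: it flags synthetic rows whose quasi-identifiers exactly match a real person. The function name `leaked_records`, the field names and the records are all invented for illustration, and an exact-match test like this is far weaker than a real privacy assessment.

```python
def leaked_records(real_rows, synthetic_rows, quasi_identifiers):
    """Return synthetic rows whose quasi-identifier values exactly match a real row."""
    real_keys = {tuple(row[k] for k in quasi_identifiers) for row in real_rows}
    return [row for row in synthetic_rows
            if tuple(row[k] for k in quasi_identifiers) in real_keys]


# Hypothetical patient records; names and values are made up.
real_rows = [
    {"name": "Alice Example", "birth_year": 1980, "postcode": "10115", "diagnosis": "A"},
    {"name": "Bob Example", "birth_year": 1975, "postcode": "60311", "diagnosis": "B"},
]
# Hypothetical generator output: no names, but the second row copies a real person.
synthetic_rows = [
    {"birth_year": 1982, "postcode": "10117", "diagnosis": "A"},
    {"birth_year": 1975, "postcode": "60311", "diagnosis": "B"},
]

leaks = leaked_records(real_rows, synthetic_rows, ["birth_year", "postcode"])
print(leaks)
```

Here the second synthetic row is flagged even though it carries no name, because birth year and postcode together point back to a real record. Dropping direct identifiers alone is therefore not enough to call a dataset synthetic in the privacy sense.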
Absolutely. And I highlighted that we want to look at this from different angles. So preserving privacy is an important aspect. But of course, there might be the other side, the bad side of the “force”, so to say. What does generative AI mean in cybersecurity when it comes to threat actors? Are there potential threats associated with the use of generative AI and of this synthetic data when it comes to malicious actors?
Well, always when we have something new that is coming up, there are goods and bads, and there are pros and cons. So then, unfortunately, yes. On the positive side, as we said, there is boosting anomaly detection, helping with automation, creating robust models, malware detection augmentation, password information, so there are many things that we can mention as something positive. But on the other hand, we need to consider as well that just as data, AI and generative AI can be used for good reasons, they could also be used for bad reasons. And the flip side of the coin is that generative AI could be used to create content that is actually not appropriate. For example, deepfakes. This is something that everybody is talking about nowadays. And it is because attackers are using this more and more. Now, if an attacker uses a deepfake, it would be very easy to actually attack vulnerable systems afterwards, because with generative AI they can replicate the voice of the person, or how the person writes, etc. And it would actually be a problem. In the sense of data, also, we need to consider that - we mentioned this in other podcasts - we train the models, right? So when we train the models, we are using data, and according to the quality of our data, we will get certain results. If the data is good, we will get good results. If the data is bad, we get bad results. If the data is biased, we will get a biased result. And in this sense, if an attacker gets access to a dataset, then the attacker can actually poison it, using generative AI to inject synthetic data into the training datasets. And of course, that would produce a bad outcome, you know, according to whatever this person wants to show. So then, yes, it would be an issue. But again, we can never be a hundred percent sure, you know, in this sense.
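The poisoning scenario can be shown with a toy detector. The Python sketch below assumes a one-dimensional "risk score" and a nearest-centroid rule that places the decision threshold halfway between the mean benign and mean malicious score. The scores, labels and injected points are hypothetical, and real detectors are far more complex. The mechanism is the same, though: mislabeled synthetic records in the training set shift what the model learns.

```python
import statistics


def train_threshold(labelled_scores):
    """Nearest-centroid rule in one dimension: threshold halfway between class means."""
    benign = [s for s, label in labelled_scores if label == "benign"]
    malicious = [s for s, label in labelled_scores if label == "malicious"]
    return (statistics.mean(benign) + statistics.mean(malicious)) / 2


def classify(score, threshold):
    """Flag anything at or above the threshold as malicious."""
    return "malicious" if score >= threshold else "benign"


# Hypothetical clean training data: (risk score, label).
clean = [(1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
         (7.0, "malicious"), (8.0, "malicious"), (9.0, "malicious")]
# Attacker injects synthetic high-score records that are labelled benign.
poison = [(9.5, "benign")] * 6
suspicious_score = 6.5

clean_threshold = train_threshold(clean)
poisoned_threshold = train_threshold(clean + poison)
print(clean_threshold, classify(suspicious_score, clean_threshold))
print(poisoned_threshold, classify(suspicious_score, poisoned_threshold))
```

With the clean data the threshold is 5.0 and the score 6.5 is flagged as malicious. After the six injected records the benign mean rises to 7.0, the threshold moves to 7.5, and the same score passes as benign.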
Right. Understood. Now, this is a topic that seems to be very state of the art. So we have artificial intelligence in there. We have cybersecurity in there, and you've mentioned all these different patterns when it comes to training the models. That sounds very much like a hype topic, something that we as analysts look at as something that happens in the future. I think this is not true. This happens right now. Are there real-world implementations, are there vendors? Are there use cases that you can mention where this is already being used and is really helpful?
Well, something that I would say here, and I agree with you, is that it is happening now. I remember, you know, many years ago when we watched the movies and we saw that maybe, you know, there were robots or machines that could actually replicate data, etc., we saw it as something that maybe was not that possible, or we saw it as very far away. But it is actually happening. And there are many companies that are using this. A very famous case is Zalando, a German online retail company. They use synthetic data to actually improve the customer experience when people buy online. So then this is something very good, because remember, the synthetic data could actually replicate the important features that show the behavior of the customers. So then it is something good to train the models and improve the experience. OpenAI, for instance: everybody is talking now about ChatGPT, right? Or everybody who is in the field of technology. Now, OpenAI is the creator of ChatGPT, and what they do to actually train ChatGPT is use synthetic data to train the model and then improve the robustness and the generalization of the models, and not only in English but across different languages. And this is fantastic. You know, I would say that there are, of course, many other companies that are using this, but these two cases, I believe, are very relevant because, you know, all of us actually know them. And since many people are now using ChatGPT on a daily basis, it is good to know what data was used to prepare this large language model that can actually interact with you. Well, this model is using synthetic data.
Right. So now we know that this is happening right now, that this is a thing, that synthetic data is here to stay. Now let's take a look into the future. With this rapid advancement in AI and machine learning - and you've watched that development rather closely from a data perspective, but also from the data analysis perspective - what are your expectations? How is this evolving, and what are the challenges and opportunities?
Well, I believe that eventually all companies will actually move towards synthetic data instead of using real data. And this is because of the privacy regulations. The privacy regulations get stricter year by year, and in different countries, not only in Europe. In Europe, the one that we have is GDPR, but there are many other countries that have their own privacy regulations. And in order to avoid issues, I believe that it is very helpful for companies to start using synthetic data. Now, what we also know is that with time, everything tends to improve - or we expect so, right? So then the creation of this synthetic data will, with time, actually be offering us more realistic datasets, datasets that are really showing what is happening in reality. And there will be platforms that will offer the opportunity to customize the dataset to see exactly what you want to see. And here I would say that the main challenge is the quality and the diversity of the dataset. Because as we mentioned earlier, if we have a dataset that is very simplistic or biased, then of course it can lead to poor performance, right? Like in the end, in the outcome. And the idea would be actually to have good quality in the dataset and to understand that AI is actually here to help us. Someone asked me recently, do you think that AI will replace the humans at their workplace? And we mentioned this in a previous podcast, Matthias, we talked about this before. And you know my thought on this: AI is not here to replace the humans. It is here to augment their capabilities, but it is also here to augment the capabilities of businesses. And I believe that this is what will happen with synthetic data. Businesses will actually gain an advantage from this, will use it in their favor, and they will start using synthetic data to create more robust models and to improve their business in general, because they can improve the operations and the customer experience, as Zalando is doing, by training models that actually perform better. So then that's what I think.
Fascinating. And I think this discussion needs to be continued. We have just completed our first episode on synthetic data, and this topic is gaining more and more traction and it will get more and more important. And when we look at the intersection of synthetic data and cybersecurity, that is absolutely a topic that will be covered at our KuppingerCole November event in Frankfurt, the cyberevolution. And I think you will talk about that there as well, and there will be the opportunity to talk with you about that topic and learn from peers. We learn from peers, and maybe we can help other peers to learn from us, to learn from the analysis that you carried out. So I'm looking forward to seeing how that topic evolves, running up to cyberevolution and beyond. But it will be a main topic at cyberevolution in Frankfurt in November. Any thoughts from your side? Can you give a quick glimpse of what will happen there?
Yes, of course, there will be an entire block about artificial intelligence and how it is used in cybersecurity. And of course, synthetic data will be one of the topics that will be there as well. So we are really looking forward to meeting you there. And Matthias, I am actually counting the days to see you in person again there, too.
Same here. So I'm looking forward to meeting you there, Marina, and hopefully to meeting some of the audience that listens to this podcast there as well. And it's really about learning, talking to peers and socializing, about really building communities to learn from each other for the benefit of cybersecurity, for the benefit of using technologies like synthetic data for the best purposes possible. Thank you, Marina, for being my guest today. That was a really interesting new topic also for me, and I'm really looking forward to learning more about that. And I think there will be research out about that anyway rather soon, right?
Thank you, Matthias. Thanks a lot. And yes, I will see you in November then and hopefully we see some of our audience there. Thank you so much.
Thank you. Bye bye and have a great day.