Hello everyone, and welcome to today's webinar. My name is Anne Bailey. I'm a Senior Analyst with KuppingerCole Analysts, and, in conjunction with Marina Iantorno, I published a report on synthetic data: the market, the landscape, the trends, and of course the vendors who are participating in it. In today's webinar, I'm going to give you an overview of the findings from that recent report, the market analysis, and our insights into it. So let's dive in. First, I have some notes for you.
First of all, your audio is controlled centrally and you are muted so you can sit back and relax but don't get too comfortable please. We would love to accept your questions. You can submit those at any time during the webinar and please look down at your control panel. You'll find the questions tab there at the bottom so please use that and I'll receive those questions and take those at the end. We are recording this webinar so you will have access to both the recording and the presentation in a few days. So let's take a look at where we're going. We'll keep it simple today.
We'll really focus on the results of the report, so both giving an introduction and an overview of the synthetic data market and presenting the results of the report. I'd like to begin today by sharing some assumptions that we have going into the synthetic data market and how it's changing and evolving. So first of all, where do we stand today? We are in the middle of the data-driven economy, and of course this has been in full swing for 20 years, but not every organization can participate in the data-driven economy even partially, and even fewer can participate fully.
So let's take a look first at some principles, some must-haves, for the data-driven organization. These organizations must be working with privacy-protected data. They must be able to adhere to the relevant privacy regulations, and this can be done, of course, by anonymizing sensitive data or by turning to alternative options. Data portability is a must, to enable data to move in a compliant way across the organization, where data is typically siloed and cut off from the teams who may need to use it and from other teams who could benefit from it.
Connected to this is freeing up the bottlenecks in data projects, which are filled with a lot of manual, resource-intensive effort, so data-driven organizations really need support to speed up the time to value on their data projects. Built into those assumptions, you can start to hear some of the pain points that organizations face, and I'll walk you through just a handful of them to set up the call and response that we have between the needs of the market and the solutions which are becoming available.
So if we start with quality issues and test data: the reality is that production data isn't perfect. It often has gaps, and identifying those gaps and adjusting for them requires time and manual resources. Traditional test data management solutions aren't always able to adequately address those quality issues. These data gaps feed into the most time-consuming step in data projects, which is data preparation. And these data projects can span the range from software testing to machine learning and generative AI model development and validation.
So not only are gaps in the data an issue, but the time and effort it takes to identify and gather siloed data, and then to mask and de-identify sensitive data, making it compliant and able to be used for other purposes, really slows down the time to value on those projects. And as a rule, bottlenecks raise operational costs and can slow down innovation. Now, when we look specifically at machine learning and generative AI, they have some specific issues here. They are, as a rule, data hungry.
They require distinct data sets for training, separate from those used for testing and validation. And of course, these models also require continuous improvement. One reason is that these models are subject to drift, where they produce results that shift over time according to changing inputs, but also perhaps away from the purpose of the model, which is itself a moving target; this purpose may shift over time as well. Now, as I've mentioned before, non-compliant or high-risk data is a showstopper for data portability, for using data for other purposes.
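To make the drift idea a bit more concrete, here is a minimal, illustrative Python sketch; the toy data, thresholds, and function names are my own assumptions, not from any vendor's product. It flags drift when the mean of incoming data moves away from the training baseline, measured in baseline standard deviations.

```python
import random
import statistics

def drift_score(baseline, current):
    """Standardized shift of the current mean relative to the baseline.

    A simple proxy for model drift: a large value means the incoming
    data no longer looks like the data the model was trained on.
    """
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(current) - mu) / sigma

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(1000)]  # training-time data
stable   = [random.gauss(0.0, 1.0) for _ in range(200)]   # same distribution
shifted  = [random.gauss(1.5, 1.0) for _ in range(200)]   # inputs have moved

print(drift_score(baseline, stable))   # small: no drift detected
print(drift_score(baseline, shifted))  # large: retraining may be needed
```

Real drift monitoring compares full distributions (for example with population-stability or KS-style tests), but the retrain-when-inputs-shift logic is the same.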
Production and real data often contain sensitive information which must be protected. Some methods, for example pseudonymization, are not always resilient against re-identification of data subjects. And so the protection that this data requires has to be very robust. This is one factor that feeds into why data portability is so challenging. The data must be compliant, but it also must be secure against unauthorized access or data breach. And a continuing challenge is that the responsibility for data is often unclear, and therefore so are the stewardship of data and the monitoring of how it is protected.
So let's continue and finally connect the pieces. Let's start to consider solutions which are being developed and are coming onto the market, for example, synthetic data. First of all, what is it? In a nutshell, you can think of synthetic data as artificially generated data, but it's not just random data. It specifically replicates the statistical properties of real data or of desired data. It also contains no actual or sensitive information.
Therefore, you're springboarding directly into a compliant data set, even though it may be imitating sensitive information fields. So now we can have our call and response here: what is the market need, and what do synthetic data solutions provide? A common use case here is data augmentation. Synthetic data can be used to fill and augment the gaps in real data, addressing sparsity issues, class imbalance, and bias, and of course tackling the challenge of sensitive data. And this is not the same as generating an entirely synthetic data set.
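As an illustration of augmentation for class imbalance, here is a minimal Python sketch. Real synthetic data solutions fit a generative model to the under-represented class; this toy version just perturbs sampled records, and every name and number in it is illustrative.

```python
import random

random.seed(1)
# Imbalanced toy data: (feature value, class label).
majority = [(random.gauss(0, 1), "ok") for _ in range(95)]
minority = [(random.gauss(3, 1), "fraud") for _ in range(5)]

def augment(records, target_count, jitter=0.1):
    """Generate synthetic records for the minority class by sampling
    existing ones and perturbing the feature slightly. Real solutions
    fit a generative model instead of jittering, but the goal is the
    same: close the gap so the classes are balanced."""
    synthetic = []
    while len(records) + len(synthetic) < target_count:
        x, label = random.choice(records)
        synthetic.append((x + random.gauss(0, jitter), label))
    return synthetic

synthetic = augment(minority, target_count=len(majority))
balanced = majority + minority + synthetic
print(sum(1 for r in balanced if r[1] == "fraud"))  # 95: classes now balanced
```

The key point is that the synthetic records fill the gap rather than replace the real data, which is exactly the distinction drawn above.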
This is augmenting real data with the information that may be missing or may need to be adjusted, so that you have the target testing data set that you need. Synthetic data solutions can really support data sharing by helping real data become compliant so that you can share it more securely, even moving towards monetization if that's a goal or a target.
A different model for data sharing is also possible with synthetic data because instead of sharing the data itself that's been generated or the compliant de-identified data you can also have the option of sharing the model used to generate data. So if you have teams spread across your organization working on different data projects the model can be shared with each to then fine-tune and generate fit-to-purpose data for each of those separate projects. Test data generation is of course a strong use case here.
Really looking at the time to value: test data can be prepared in a fraction of the time that traditional methods may require. On privacy protection, we can see that many vendors of synthetic data solutions work in parallel, using privacy-enhancing technologies alongside generating artificial data to replace sensitive data entirely. And using a combination of these can really fit a wide variety of needs.
And finally, specifically targeting machine learning and generative AI development: synthetic data solutions are really specialized in this area, generating high volumes of fit-to-purpose training and validation data. Moving more concretely towards the report that we published, we've identified certain capabilities that are really essential for synthetic data solutions, and these have been summarized here. There are of course many more, and we can go into much more detail in further conversations, or you can look to our research for those further details.
But to begin, it's quite important that the solutions have data de-identification capabilities, and again, this is the combination of privacy-preserving measures, including things like data anonymization, tokenization, masking, encryption, subsetting, and many others. And this, in combination with generating artificial or synthetic data, is a really strong offering. Now, we can separate the generation of data into two very broad categories.
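Here is a tiny Python sketch of two of those de-identification measures, masking and tokenization. This is illustrative only; production solutions ship vetted, configurable implementations of these techniques, and the salt and field names below are invented for the example.

```python
import hashlib

def mask_email(email):
    """Mask the local part of an email, keeping only its first character."""
    local, domain = email.split("@")
    return local[0] + "*" * (len(local) - 1) + "@" + domain

def tokenize(value, salt="demo-salt"):
    """Replace a sensitive value with a stable, non-reversible token.
    The same input always yields the same token, so joins across
    tables keep working after de-identification."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"name": "Jane Doe", "email": "jane.doe@example.com"}
safe = {"name": tokenize(record["name"]), "email": mask_email(record["email"])}
print(safe["email"])  # j*******@example.com
```

Masking keeps data recognizable for testing, while tokenization preserves referential integrity; as the talk notes, it is the combination of such measures with generation that makes a strong offering.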
One would be generation from source data and this would involve ingesting and profiling real or production data in order to generate synthetic data which replicates that source data. Another category of data generation is from scratch or from next to nothing. And here the generation would be supported by data models, by metadata, or dictionaries. Considering the type of data that can be generated is of course essential for your use cases but also to find the right match in vendor where they typically specialize in one or a few of these different data types.
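Those two categories can be illustrated with a short Python sketch: one column is generated from source data by profiling and replicating its statistics, and one row is generated from scratch from a small schema, which stands in for the data models, metadata, or dictionaries just mentioned. All names and values here are illustrative.

```python
import random
import statistics

random.seed(2)

# Category 1: generation from source data -- profile the real column,
# then sample new values that replicate its statistical properties.
real_ages = [random.gauss(40, 12) for _ in range(1000)]
mu, sigma = statistics.mean(real_ages), statistics.stdev(real_ages)
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1000)]

# Category 2: generation from scratch -- no source data at all, just a
# schema describing what each column should look like.
schema = {
    "age":    lambda: random.randint(18, 90),
    "status": lambda: random.choice(["active", "inactive"]),
}
row = {column: generate() for column, generate in schema.items()}

print(round(statistics.mean(synthetic_ages), 1), round(mu, 1))
print(sorted(row))  # ['age', 'status']
```

Real products use far richer models (correlated columns, generative networks), but the split between "profile and replicate" and "generate from a specification" is the one that matters when matching a vendor to your use case.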
Most common are generating free text, tabular data, and image data. Scalability is a very important feature to look at; otherwise, the bottleneck issue is not addressed. And essential to the rest of the organization and its success in participating in the data-driven economy, there must be very good integrations with existing data systems and workflows, including things like your data storage, streaming, BI and analytics platforms, and your enterprise applications. Workflows are also a big plus here.
So with this context let's move forward and consider the report itself and the results. I'll take a brief moment and make sure we're all on the same page. Here we're talking about our Leadership Compass report. This is a market comparison report where we assess the market, the major trends and changes that are happening in regulation, in technology, and any other factors that may affect the particular market segment that we're covering.
We also take a very close look at the vendors who are active in this space: what use cases they're best suited for and targeting, their geographies, their capabilities, and what's on their roadmaps, and we go through a very involved process to then present this to all of you, our readers. Our process begins at the research phase, where we identify the vendors and receive in-depth briefings and demonstrations. We also receive responses to our technical questionnaire.
We then analyze this information, prepare it in a draft, and go through a fact check with each participating vendor, where they review and update any factual information. We then move on to publication. You can see any of our past Leadership Compasses and other research at kuppingercole.com/research. I encourage you to check it out. Here you can see the vendors that we did cover in the most recent Leadership Compass. You can see, compared to some of our others, this is a very short list.
Of course, that reflects the fresh and dynamic nature of this market. The vendors on the left are the ones that have participated fully in a report. That means that they did provide a briefing, a demo, a questionnaire, and they went through the fact check process as well. What that looks like in our report, that they have a full profile available, and they're also included in our overall leadership chart and have been rated on each of those important capabilities that we discussed earlier. The vendors to watch are on the right.
Now, these are vendors that readers should know about, especially if you're considering going through an RFP. They are relevant enough to the market space, but perhaps don't fulfill all of the inclusion criteria that we require. Now we can go to the most iconic part of the report and the quickest summary of the market: our overall leadership chart. You can read it from left to right, moving from the follower section, through challenger, to the leadership section. This is a composite view of our ratings, combining product leadership, innovation leadership, and market leadership.
That means if we start at the right, we take the farthest right, Cloud TDMS. This vendor has a very strong product and has some very innovative features there. That's placed it very high in the rating, and it's followed next by AWS, dare I say, a household name. It has a very specialized product focusing on image generation, but its market leadership also rates it very highly. We get a mix of industry incumbents and specialists here in this view, which makes it a very interesting read.
Now, I encourage you to take a look at each of these vendors, because although we do our best to bring in a dynamic rating that takes into account these many different features, product, innovation, and market, it can lead to a one-dimensional view of the market. I would point out, depending on your use case, these different vendors can really shine. If we take, at the far left, Rendered AI, for example, this is another very specialized vendor in image generation.
At the moment, this may change in the market, but at the moment, image generation and privacy-preserving measures, these use cases are fairly decoupled. What that means, then, for our rating is that Rendered AI placed farther to the left than maybe it would otherwise.
However, if you're an organization needing image generation, it's possible that your needs are more focused in that direction and less on privacy-preserving measures. That makes this an interesting vendor to look at. The same goes for Octopize, right there in the middle. They have very strong privacy-preserving measures and other strong features that make their product very well-rounded.
So, depending on your needs and your use case, which of these vendors makes the better choice for you can really change. If you're an organization that's looking at image generation, for example, those specialized vendors could make it onto your shortlist. We offer other ways to assess the vendors that we include.
So, in each vendor profile, there's quite a bit of technical information that would help inform you on their methods, their technologies, their deployment, and licensing options. And we also include, for each vendor, a summary of their capabilities and how they perform.
And, for example, you can see here, at a quick glance, where the strengths lie and where their product is not as fleshed out. Of course, depending on your use case, this may or may not be a deal-breaker.
So I encourage you: move beyond the initial graphic and explore these vendors, depending on your needs. Of course, it's very important to define your requirements and your needs before jumping into the exploration and the building of your longlist and shortlist. What I can recommend is turning to OpenSelect. This is a dynamic version of our Leadership Compass that allows you to filter by use case and to indicate which capabilities are most critical to your organization, which then filters through the different vendors and offers a shortlist for your further investigation.
Now, this is just scraping the top of what we could talk about with synthetic data solutions. I'm very glad you're here, and I'm glad that I can also point you to a reading list where you can dive in a bit deeper.
Of course, at the top of the list is the Leadership Compass where you can see the full results, read a bit more about any vendor that may have caught your eye. We're also happy to support you with our other services as well. I'm a proud member of our research team where we handle all different types of research and present that in a variety of forms. We also have our events and webinars where you can engage with thought leadership, both on the stage and from the audience.
And we have an advisory department as well that really specializes in supporting IT professionals in their decision-making process and in doing maturity assessments. And with that, I thank you very much. I invite you to please submit your questions; we can handle them here, or you can also reach out to me at ab@kuppingercole.com. I'd be very happy to make a connection and continue the conversation. We look forward to seeing you at future webinars and future events, particularly the EIC, the European Identity and Cloud Conference, which will take place in Berlin in May this year.
So I look forward to meeting you then. And take care. Have a wonderful day.