Well, good afternoon from England to everybody and welcome to this KuppingerCole webinar on acing the upcoming GDPR exam. My name's Mike Small, and I'm a Senior Analyst with KuppingerCole, and my co-presenter today is Elca from Delphix, who is a data masking solution engineer. So this is going to be a very interesting presentation and webinar on a little-thought-about aspect of GDPR, which is controlling access to non-production data. KuppingerCole is an analyst company that was founded in 2004.
We have a specific focus on areas relating to information security, governance and compliance, and everything around digital transformation. We provide vendor-neutral research into all of the major topics, which is constantly kept up to date. We run events, with some details of the upcoming events on the next slide, and we also provide advisory services, both to vendors and to end users of products.
There are a number of events that are planned. In May this year, we will have the annual European Identity and Cloud Conference in Munich, which is not to be missed.
Then from September onwards there will be the Consumer Identity World tour, starting in the US, coming back to Europe and then moving on to Asia Pacific. And last but not least, in November there will be the Cybersecurity Leadership Summit. So make a note in your diaries and don't forget to attend. Now, for today's webinar.
You, the participants, will all be muted. You don't have to mute or unmute yourself; this will be controlled by the organizer. The webinar is being recorded, and a podcast of this recording will be available tomorrow. If you want to ask any questions, you will find a questions panel, and if you put your questions in there, I'll make sure that we address them at the end.
So now, without any further ado, let's start. The agenda today is: I am going to start off by outlining some of the risks and challenges and some of the steps that you need to take in order to ensure that you have full compliance for non-production data. This will be followed in the second part by Elca, who will discuss a number of technologies and will also give a demonstration of how you can use some of the things described in the first part of the webinar to minimize the risks of handling personal data in non-production environments.
So I don't think anybody can have forgotten that from May 2018 the European General Data Protection Regulation comes into force, which means that organizations around the world will all have to take much more stringent steps over how they deal with personal data. So how are you going to be ready to meet the challenge?
Now, KuppingerCole has defined six important areas in which you need to take action. And the first of those areas is actually to discover the personal data that you hold.
This is really quite critical because the definition of personal data under the GDPR has widened considerably from what it was before; there is now a very clear and all-encompassing definition. The second question is that, when you have found that data, you need to really understand what you collected it for and whether or not this is a reason that is valid under GDPR.
And if you are depending upon having consent, have you got an audit trail that allows you to prove that you have obtained that consent?
And if you've managed to achieve those things, then you need to bring that data under some form of control. In a sense, it is the area of control that this webinar will mainly be focusing on. So that control means that you have to be able to make sure that it is processed only in ways that comply with the lawful and fair processing set out in the GDPR principles. It also means that you have to be able to satisfy the data subject rights, including the rights of disclosure and erasure, that are included in GDPR.
If you are using cloud services, then you need to be able to make sure that all your data that moves into the cloud is also discovered and controlled in the same ways.
And that leads you on to questions around what you would look for in a cloud service provider, in terms of certifications, membership of communities and so forth. Then, all organizations, or the vast majority of organizations, are going to have to appoint a data protection officer. The regulation also calls for data protection by default and by design.
And one of the areas that we will be covering in this webinar is one of the techniques that can be used to achieve that. And finally, you need to prepare yourself for notification when, and if, a breach occurs, and many organizations have not got well-tested plans for how to do that. If you do have a tested plan, then you need to make sure that it is updated to include the new breach notification requirements.
So that's what we are saying people should be doing, but what is actually happening?
Well, when we look at large organizations, what we see is that the majority of them have taken this subject very seriously, but have looked at it from the point of view of process review and re-engineering: finding out what processes they're using, what tools they are using, and how these processes should now work under this new regime. The large ones have also mostly appointed a data protection officer if they did not already have one.
Then the first thing that we are seeing people do, as part of the discovery exercise, is going through a process of refreshing consent. This might mean sending out mail shots, it might mean sending out emails, and when you now visit a website you are certainly asked to re-give your consent.
And so that is what we are seeing now. The interesting thing here is that the whole focus, quite understandably, has been on production data, but production data is only part of the problem.
And so if we look at the model that the GDPR is really based on, the idea is that data subjects have their personal data lawfully collected. This goes into some kind of a database where there is lawful processing, whether you buy something or sign up to some kind of website or social media, and eventually, at some point, your personal data is considered to be no longer valuable for that purpose, so it is in fact deleted.
And at the same time, you have to satisfy the subject access rights, so that if the data subject requests to know what you're holding, you are able to provide it.
Now, that's only part of the story, because underneath all of this we find that a lot of what is now defined as personal data is included in things like security events. Because the GDPR defines, for example, an IP address as being personal data, security data that includes that kind of information is being processed and potentially comes under GDPR. Some of it is being used for market analysis.
And again, if it includes personal data, then that processing is subject to GDPR because of the wide scope of the definition. One of the forgotten areas is DevOps and test, and also quality assurance of applications. Since organizations are all trying to get closer to their customers, they're developing apps and going through digital transformation, which means writing lots of new applications. All of those applications are being developed, and some of that development is taking place on premises.
And some of it is taking place off premises and through third parties.
So let's just take the question of DevOps: will this comply with GDPR? If you look here, you can see what could be happening: you've got some master customer data somewhere or other, and you may be making selective copies of that for test purposes or for DevOps. You may even be copying some of that into the cloud to give to third parties to help you, and they may be making copies of copies. Once you've sent your data out, you no longer know where it is.
So the question is, first of all, is the data that you are using for DevOps personal data? The answer is almost certainly yes. Then you would ask: is the use of this data for DevOps processing? Since the definition of processing in the GDPR is so wide, the answer is almost certainly yes. Now, is that processing lawful? Is it fair and lawful according to GDPR?
Well, there it's more difficult. The answer may be yes, and it may be no. So let's look at that area a little more deeply.
So is the use of this data in DevOps and QA for the performance of a contract?
Well, I don't think most people have given you their personal data under a contract for you to use it for development and test, so I don't think that's particularly likely. Did you ask for consent to use their personal data for DevOps, tests and so forth?
Well, if you did, then under GDPR there has to be explicit consent per purpose. So you can't hide this in some kind of long and tedious thousand-word end user license or terms and conditions; there has to be an explicit and short statement of it. So I think most people will find they haven't actually got subject consent for that use. Can you then say this is a legitimate interest? One of the fair and lawful bases is that the data controller has a legitimate interest; well, that may be the case.
I think you would need to consult your lawyers as to whether that would stand up if you were challenged. So let's now continue. If you are going to use this data, and let us assume that you have in fact been able to establish that the processing is fair and legitimate, then if you are going to hand that processing out to a third party, have you thought of the fact that you need a GDPR-compliant processing contract, just like you would have if you gave that data to an outsourcer, to a cloud or to a managed service provider?
If you do use the data in this way, then can you support data subject rights? For example, if the data subject says, "I want you to delete my personal data," can you find all the copies that there might be of that data in order to successfully delete it? If they want to know where it is, can you actually provide that kind of information? And finally, last but not least, are there appropriate security measures used by the DevOps teams and by the outsourced development, so that you could prove, if challenged, that the data is secured in the same way as it should be?
And if the DevOps and test copies were breached or lost, would you be able to detect it and then satisfy the requirements for breach notification? So I think that whilst most people can understand that all of these complexities are really justifiable when you are processing the data for some clear purpose, like e-commerce or whatever, once you start using it for all of these other things, these heavy demands can be less easy to justify. So is there a way around this?
Well, first of all, you can say that one way of dealing with it might be through access governance, and we at KuppingerCole have a lot of knowledge on access governance. This tends to be good for protecting the production processes, and maybe what you have on premises, against some of the threats, but it certainly doesn't protect against data leakage. You may be able to protect some of the data in the cloud that way, if you've got things like cloud access security brokers, or you are using, for example, SAML or identity federation.
So access governance is good, but it's not a complete solution. The next strategy, again one that is advised by KuppingerCole for cloud data, is to use encryption. Generally speaking, you can encrypt your master copy, and that's good for protecting against unauthorized actors.
So if the data is leaked, then it's no use unless you have the keys. The challenge is, first of all, that in order to use the encrypted data for DevOps and tests, you need to share the keys.
And so sharing the keys becomes a challenge in itself. In any case, what is happening is that unauthorized actors are now trying to attack by finding a way of hijacking accounts and getting authorized access, because clearly if you can get at the application and gain legitimate access to it, then you are going to get unencrypted access to the data, through SQL injection, for example.
So there is another strategy that's not been much talked about, but which is actually recommended by the GDPR itself, and that is pseudonymization.
And Article 4 says that if personal data is processed in such a manner that it can no longer be attributed to a specific data subject without the use of additional information, that is pseudonymization. Then, under Article 6, the existence of appropriate safeguards, which may include encryption and pseudonymization, is acceptable as a justification for processing for another purpose for which you did not originally have consent. So that's a very important proviso in the GDPR, and one which really everybody should take account of.
So if you use this pseudonymization or anonymization, then it gives you a way of transforming that data so that even if it is lost or stolen, it is not actually possible to recover the original data from the pseudonymized data.
And so the personal data is no longer personal data. Even the UK Information Commissioner's Office has a report, which you can see a link to there, where they basically say that to protect privacy it is better to use or to disclose anonymized data than personal data.
So if you do that for your DevOps, for your outsourced development, and for your QA and test purposes, then you must transform that data in a way that does not allow the recovery of the original personal data. You are then absolved from the need to meet all the other requirements, which can only be good. So how can you do this?
Well, there are many techniques, which include things like removing key attributes, randomization, and generalization; the second part of this webinar will talk in more depth about those, and a small sketch follows below. But what you really have to note is that you must be very careful, because there are many well-publicized examples where people thought they had anonymized data and in fact they hadn't: there were all kinds of ways, which even the newspapers could find, to identify individuals from this kind of data.
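To make the first two of these techniques concrete, here is a minimal Python sketch of attribute removal and generalization. The field names and the chosen generalizations are illustrative assumptions, not a prescription from the webinar.

```python
# A minimal, hypothetical sketch of two common anonymization steps:
# dropping direct identifiers and generalizing quasi-identifiers.
# Field names ("name", "postcode", "date_of_birth") are illustrative only.

def generalize_record(record: dict) -> dict:
    anonymized = dict(record)
    # Remove direct identifiers outright.
    anonymized.pop("name", None)
    anonymized.pop("email", None)
    # Generalize quasi-identifiers: keep only the outward postcode
    # and the year of birth, so single individuals are harder to pick out.
    if "postcode" in anonymized:
        anonymized["postcode"] = anonymized["postcode"].split(" ")[0]
    if "date_of_birth" in anonymized:
        anonymized["date_of_birth"] = anonymized["date_of_birth"][:4]  # keep the year only
    return anonymized

print(generalize_record({
    "name": "Jane Doe",
    "email": "jane@example.com",
    "postcode": "SK10 4AB",
    "date_of_birth": "1972-03-27",
    "purchase_total": 42.50,
}))
```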
So use anonymization, but do it right. So how do you decide whether a technique is acceptable?
So, helpfully, the EU Article 29 Working Party produced an opinion on anonymization techniques, and you can find it by following the link from this slide, or you can Google it. This sets out three tests. The first test is that it should not be possible to single out an individual from looking at that data, i.e. from the other attributes.
So if you look at something like where I live, my postcode, my sex and my date of birth will identify me quite clearly within the 30 or 40 people identified by that postcode. The second test is: is it possible to link records relating to an individual? If you can use multiple sources as a way of gathering data together to infer an identity, then that data is not fully pseudonymized.
And finally, the third test: can you simply infer the data from some kind of public records or similar?
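As a rough illustration of the first test, singling out, the sketch below counts how many records share each combination of quasi-identifiers; any combination held by exactly one record means an individual can still be picked out. The data and field names are invented for illustration only.

```python
# A rough sketch of the "singling out" test: for each combination of
# quasi-identifiers, count how many records share it. Any group of size 1
# means an individual can still be singled out. Purely illustrative.
from collections import Counter

def singling_out_groups(records, quasi_identifiers):
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [combo for combo, count in groups.items() if count == 1]

data = [
    {"postcode": "SK10", "birth_year": "1950", "sex": "M"},
    {"postcode": "SK10", "birth_year": "1950", "sex": "M"},
    {"postcode": "M1",   "birth_year": "1972", "sex": "F"},
]
print(singling_out_groups(data, ["postcode", "birth_year", "sex"]))
# [('M1', '1972', 'F')] -> this combination identifies exactly one person
```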
So you need to be able to pass all three of those tests, and if you've done that, then it will be acceptable under that opinion. However, don't forget what the ICO said: everything is proportional, and you have to take account of the risk, not necessarily expect an absolute solution. So our advice is that you need to properly plan for GDPR in order to prevent the possibility of extreme pain from the information commissioners and the data protection authorities, and to stop your organization suffering penalties.
And that means that you need to find and manage that personal data iceberg of non-production data. It is better to use and disclose pseudonymized or anonymized data than personal data, so use pseudonymization for your DevOps and tests and for your non-production purposes, where you don't need to have the personal data exposed. But if you do it, make sure you do it right, and as part of a privacy by design strategy. So with that, I'm going to hand over to Elca, who will now give the second part of this presentation.
So Levi, can you change the screen over please? Okay, go ahead.
Fantastic. Thank you so much for that content, in terms of explaining why anonymization is potentially useful for non-production environments. I'm going to spend a few minutes here to give you some more context about how anonymization can be used with GDPR. I'm also going to do a quick demo, so you can see what we are talking about using real live data. Just so it's clear what we mean by data masking, I want to level-set before we start: what are we talking about when we mask the data?
There are a lot of different verbs used, whether it is scrubbing or de-identification; ultimately the information is changed consistently across different data sources, whether those data sources are the same or different relational databases, or whether they're files. Anywhere there is personal data, it is impacted by the governing laws around GDPR.
And what we can do with data masking for non-production environments specifically is take the data that exists in those data sources, which is what you're seeing on the left side, and end up with fictitious but realistic data. So instead of the name George, you have a generated name, Roman, and that's consistently transformed: if you look at the SQL Server employee table, row six was also George.
And that was also consistently transformed to Roman, which gives us the ability to have data with integrity across these different data sources. It's irreversible in this way, so that you can in fact use it without any fear, and if there were a potential breach of this specific type of data set, it will not get you into any kind of compliance issue.
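A minimal sketch of how consistent masking across data sources can work, assuming a deterministic keyed hash that picks a replacement from a list of fictitious names. This is an illustration of the idea, not Delphix's actual algorithm; the replacement list and secret are invented for the example.

```python
# A minimal sketch of consistent masking: the same input value always maps to
# the same fictitious value, whichever data source it appears in, by hashing
# the input with a secret key to pick an entry from a replacement list.
import hashlib
import hmac

REPLACEMENT_NAMES = ["Roman", "Frank", "Priya", "Chen", "Amara"]  # normally thousands of entries
MASKING_SECRET = b"rotate-me"  # kept apart from the masked data

def mask_name(original: str) -> str:
    digest = hmac.new(MASKING_SECRET, original.lower().encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]

# The same source value masks identically in the CRM table and the HR table.
print(mask_name("George"), mask_name("George"))
```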
To be more specific about the articles we are working within in the context of GDPR: first and foremost is the data protection impact assessment. In order to be able to protect anything, you need to be able to say where the impacted data is, and in the case of GDPR it's quite broad: it could be individually identifying information, it could be financially identifying information.
It could be healthcare information. As a result, the ability to identify, out of the box, a set of identifiers across all your data sources, whether there are a hundred apps or thousands of apps, is the first step in assessing the potential impact. In fact, many of our clients who started their GDPR journey started initially with: how much impact do I have from a GDPR perspective? I know I have this number of data sources; what is the actual impact?
How many systems need to actually be updated and changed from a process perspective in order for us to be able to comply with GDPR?
The other article that we actually work within relates to data protection by design, having a designed and developed process built into your overall business processes. Article 25 is the specific article that requires you to have a system of record that you can refer back to if there's ever a compliance evaluation or an audit.
Having the process in place is the first step in that compliance mechanism, and being able to anonymize or pseudonymize with that process in place is the second step in that proof of GDPR-specific outcomes. Using data masking for anonymization, or tokenization in the case of pseudonymization, are the two ways we actually approach this. In order for this process to work, the protection mechanism should be able to be applied across all your data source types.
You cannot say that you've achieved your objectives if you're not able to mask an Excel spreadsheet or a mainframe file, just as well as masking a database like Oracle.
And lastly, Article 33. I don't know if you're surprised to see that this is in fact one used by our clients in the context of GDPR. Article 33 alludes to how the information is going to be notified: what personal data was actually breached in the process.
I've been part of very few cases involving a breach, but in the ones that I've had the opportunity to observe, the first question that comes up is: what did we lose? In the context of Article 33, being able to report on what was lost relies on the system of record, in this case a system that identifies what's sensitive, retains what's sensitive and updates what's sensitive on an ongoing basis.
And then the next question was whether the data was protected: anonymized, pseudonymized, encrypted, et cetera. The lifecycle for achieving these objectives tends to follow the three basic steps that you see. The first item is to define the policy, in this case being able to state the identifiers that you're interested in protecting for your customer base, for the specific countries you're involved in.
So if you're in the UK, the UK National Insurance number, and if you are in Australia, the associated national identifiers in Australia need to be part of the overall portfolio you are managing as a policy. Being able to apply this policy across all data sources should be something that you're looking for in your product set, so that you can in fact assess the impact of GDPR on your systems and identify exactly what's sensitive. Being able to take that policy and discover it using regular expressions is the second component of this.
This discovery process is automatic; it's able to connect through a protocol across all of your data sources so that you can aggregate your overall data risk across the enterprise and look at it from the perspective of GDPR. And lastly, we provide the ability to anonymize or pseudonymize the data, and having that capability without having to do any more work is the reason why solutions like this are quickly becoming a standard process to implement for non-production data sets.
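As an illustration of the policy-plus-discovery step, the sketch below runs a small set of regular expressions over sample column values to flag likely sensitive columns. The patterns, table and column names are assumptions for the example, not the product's shipped policy.

```python
# A hypothetical profiler sketch: run a policy of regular expressions over
# sample values to flag likely sensitive columns for later masking.
import re

POLICY = {
    "email":       re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){16}\b"),
    "uk_nino":     re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b", re.IGNORECASE),
}

def profile_table(table_name, columns, sample_rows):
    findings = []
    for col in columns:
        values = [str(row.get(col, "")) for row in sample_rows]
        for domain, pattern in POLICY.items():
            if any(pattern.search(v) for v in values):
                findings.append((table_name, col, domain))
                break  # one domain per column is enough for this sketch
    return findings

rows = [{"contact": "chris.miller@example.com", "card": "4111 1111 1111 1111"}]
print(profile_table("accounts", ["contact", "card"], rows))
```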
Here's a basic deployment chart with our data masking product.
The objective is really what you see on the right side of this picture, the word non-production. Mike has done a great job talking about the different ways in which to reduce the data set or transform it in such a way that the protection is clearly understood.
Inference is not possible, the linkage between the data sets is not easily achieved, and being able to do all of that for non-production data on the right side is the bedrock of what we're trying to provide in the context of GDPR. You can anonymize that production data in a production zone, where you have production-level controls, then securely replicate only the masked data associated with your overall anonymization process, and load those non-production environments.
And usually this results in an 80% reduction of exposed production data in the enterprise. So it has a tremendous impact, when you apply this process, on the amount of data that you're actually going to be working with from the perspective of GDPR.
I want to show you a quick demo of the product, so you have a sense of what it looks like.
The objective is to understand firsthand what we are trying to accomplish here. I have a SugarCRM app running on my laptop. Here we have an individual with his personal information: his name, his birth date, which is in month/day/year format, although it can be any format, day first, month second, year last, etcetera. There is some credit card information with 16 digits; there are two cards in this particular record. I'm going to demonstrate anonymization versus tokenization.
I'm going to anonymize this credit card data so that it can never be traced back to the original. I'm also going to tokenize the credit card data on the bottom right, so that we are able to reverse it if necessary.
Phone numbers can be masked or anonymized. I have some national identifying information here, email addresses, as well as some unstructured data sets. From the perspective of GDPR, whether the data is located in files or databases, whether it's in production or non-production, whether it's structured or unstructured, is irrelevant.
Ultimately, all of the rules and regulations designed by GDPR to protect individuals' rights over their own data are the driving factor. So the fact that this data still exists in non-production environments, where it's unnecessary, is the primary thing we are going to go after in this demo. So I have this data set.
What I want to show you is our ability to define, in one place, a set of domains which we are going to use to search with. In the context of those domains, I'm going to connect to a specific environment where I have a connection to the data source associated with this particular application, SugarCRM, but you can access a variety of different data sources just by being able to connect to them.
And once you're able to connect to it, the next thing you can do is pull the specific objects in that data source using the connection parameters that you provided. In this case, we are going to be working with the metadata of this data source, which is a database. Once you have the rule set, the next part of the process is to say: okay, out of all these tables, what exactly is sensitive from the perspective of GDPR? To achieve that, I just ran a profiler job.
We run these regular expressions against the metadata and the data of that specific data source, so that we can aggregate in one place specifically what's sensitive, and so that we can identify the impact of GDPR and what kind of risk exists in this particular data source, whether it's production or non-production. At this stage, the relevant point is just that we know what the risks are.
And then, secondarily, we have now assigned an appropriate domain and a transformation algorithm, and whether we choose to anonymize or pseudonymize is our decision. I have not changed any of the data yet. If you look at the application, it still has Chris Miller, his birth date is still March 27th, 1972, and the credit card numbers, the social security numbers, et cetera, look as they are; nothing has changed. To actually protect the data, what you can do is generate masked or anonymized data.
And as part of this, I'm also going to tokenize some of the data, as well as mask an unstructured data set.
So, in the context of our value proposition here, you can anonymize the information in your production environment once and for all, and then use that data to your heart's content in non-production environments, of which there are usually six or eight times as many as you have in production, and basically eliminate all of the data sets that could potentially be subject to GDPR compliance by anonymizing them, in fact providing the compliance metrics for that data set.
You can see I've just changed this data. I no longer have someone named Chris Miller here.
His name is now Frank TMA. The way that we transform these data sets, there is no relationship to the original, and you can consistently transform the data set across different data sources: this person's name was Chris Miller, now it's Frank TMA, and it's consistently masked to Frank everywhere.
Let me just start my unstructured data masking here as well while we're talking, so that you can see what I'm doing. I've masked this credit card; it is not the original credit card anymore. It still has a Visa signature designation, it still passes the check digit, and it's still valid from a format perspective.
It's just that the data is completely fictitious. The secondary card here has been pseudonymized by tokenizing the data set. I can choose to reverse this if I want to, but in its current state, without an additional identifier, I cannot state what that data is.
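A small sketch of the format-preserving idea described in the demo: keep the issuer prefix so a Visa still looks like a Visa, replace the remaining digits with deterministic fictitious ones, and recompute the Luhn check digit so the masked number still validates. This is illustrative only and not the vendor's implementation.

```python
# Format-preserving card masking sketch: keep the 6-digit issuer prefix,
# generate fictitious middle digits deterministically, recompute the Luhn
# check digit so the result still passes format validation.
import hashlib

def luhn_check_digit(partial: str) -> str:
    digits = [int(d) for d in partial][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 0:      # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def mask_card(pan: str) -> str:
    digits = "".join(c for c in pan if c.isdigit())
    seed = hashlib.sha256(digits.encode()).hexdigest()
    middle = "".join(str(int(c, 16) % 10) for c in seed)[: len(digits) - 7]
    body = digits[:6] + middle          # keep the issuer (e.g. Visa) prefix
    return body + luhn_check_digit(body)

print(mask_card("4111 1111 1111 1111"))  # 16 digits, same prefix, Luhn-valid
```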
And we are achieving the objectives set by the working group that Mike was talking about. We masked the address, line one and ZIP plus four; in this case, we could have masked the entire address if we wanted to. We masked email addresses. We haven't looked at the unstructured data yet.
What I want to show you is the fact that that job also ran. Now, when I update, you will see masked structured data and unstructured data, with the information redacted where it is sensitive. The data that you're seeing now is compliant with GDPR rules, and the individuals whose data we are using for testing will no longer be impacted.
So, in terms of some use cases, before I turn over the mic, I wanted to give you some sense of how our customers are using these data sets.
We have multiple customers globally, in Europe and Canada as well as in the US, that are working on GDPR compliance. This specific customer applied the anonymization mechanism across all of their data sources. They're generating data sets on an ongoing basis and very quickly reduced their refresh and anonymization process from 11 days to within half a day. That half-day process transforms the data sets and delivers them, regardless of size, because of our virtualization software.
Because this is combined with virtualization software, they've also had tremendous savings in terms of their storage, because we are projecting the data to the target.
We are not actually moving it, and they also avoided the GDPR compliance risk associated with this data set. The other customer is an even larger financial services firm.
It was taking them even longer than the first firm that we talked about, three weeks, to get the masking and the delivery of the data done. By using our data platform, they were able to accelerate that timeframe, and more importantly, in the context of our presentation today, they were able to deliver that data very securely and in compliance.
We are an enterprise software company: we deliver data platforms that deliver data in an agile fashion as well as in a secure fashion, and many of our customers combine the two to achieve those objectives. With that said, Mike, it's back to you.
Okay. Thank you very much indeed for that, Elca. So now we're going to move on to the questions and answers. The participants can ask questions using the question screen, and at the moment I don't see any questions, so you must all be being very shy. So I'll ask a question.
So the first question is: we've talked about several different things here, encryption, data masking, pseudonymization, anonymization and tokenization. Could you give a better or more detailed explanation of the difference between all of those?
Absolutely. Mike.
So, just like the tools in your tool bag, you would use a specific tool for a specific objective, and all of those mechanisms, whether they're encryption, tokenization or masking, serve a different outcome and provide you with different protection mechanisms. If the data in question needs to be protected from certain kinds of access, you can provide that for a specific role. Let's say we take an Oracle database.
You can use Oracle TDE to provide an encryption mechanism for specific columns and tables, so that when a database administrator accesses that data set, what they will see is the encrypted data set instead of the actual, real data set in the production environment. If you're sending the data between point A and point B, you want to prevent a man-in-the-middle attack; you want to encrypt the data set and share the keys appropriately, with mechanisms that exist in place, so that you are in fact using the encryption mechanism to prevent anybody else seeing it in the middle.
However, if you want to protect the data for a non-production environment, where the data is going to be shared openly between the users, whether they are developers, testers or QA, the number of users in a non-production environment, compared to a production environment, tends to be somewhere between four and five times as many, all with full stack access. You can certainly encrypt the data set, but that's going to have an impact on the testing mechanisms across those environments.
So, in place of encryption, many of our customers actually use anonymization for the non-production environment, so that the data set is irreversible, realistic but fictitious, and consistent: names still look like names, addresses look like addresses, salaries look like salaries, instead of a construct that is protected but complete gibberish.
So anonymization, or masking, is a better fit for the non-production environment.
Lastly, tokenization is a tool set we've used with our customers to extend their reach of control when they send the data out. Let's say you have a break/fix scenario and you have some bugs in a vendor-specific app. When you're sending the backup of that production environment, are you sending that data in such a way that it's protected? Many of our clients actually tokenize the data set, so when they send it out, the sensitive data is tokenized. If there is a particular issue with a piece of data that they need to reverse, they have control over that.
They can say: okay, this particular token is where the person specific to this record is causing the issue, and they can always reverse that back. Another example is with vendors. Often enough, when you're using vendors where there have been known instances of breaches occurring, you can in fact tokenize or pseudonymize this data. When you're using a vendor for fraud analysis, for pricing, et cetera, revealing the individual's identity is not necessary for that vendor.
You can make it consistently transformed.
The vendor can reach its objectives, the analysis of the data set that they do for you, and you can reverse it back. So all of those mechanisms, encryption, tokenization, and anonymization or masking, are there; it's just that the usage varies between production, non-production, or outside the company, and the type of protection you want to provide and the type of data you want to provide it for. Thanks for that, Mike.
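To illustrate the reversible tokenization pattern just described, here is a toy token vault: sensitive values are swapped for random tokens, and the mapping is kept separately so an authorized party can reverse it later. A real deployment would secure and persist the vault; the in-memory dictionary here is purely for illustration.

```python
# A toy tokenization sketch: values are replaced with random tokens and the
# mapping is held in a separate vault so the process can be reversed later.
import secrets

class TokenVault:
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        if value in self._value_to_token:          # consistent: same value, same token
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

vault = TokenVault()
t = vault.tokenize("4111 1111 1111 1111")
print(t)                      # the vendor only ever sees the token
print(vault.detokenize(t))    # the data controller can reverse it if needed
```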
Oh, okay. So we've now got a question from Barbara Remand, and that question is basically: how are you ensuring that it's not reversible?
Now, let me add something to this. First of all, you've described three things, and tokenization is clearly reversible and encryption is clearly reversible. So I think the question is relating to the data masking.
So how are you ensuring that the data masking is non-reversible?
Great question, Mike, and I really like that setup, because you're absolutely right: there are technologies where reversibility is desired, like encryption and tokenization, and we do provide tokenization out of the box. But with masking, or anonymization, our objective was to create an irreversible outcome. How do we provide that? It's basically two mechanisms: the architecture, and the way in which we transform data.
The architecture is designed in such a way that when we read the data in, it's read into memory and transformed in memory at the time the anonymization is occurring, and once the masking is done, it reaches the target data set.
Let's say I took the name John and turned it into Frank. Now, Frank is the fictitious version. Essentially, the system does garbage collection on what's in memory and forgets that John ever existed. As a result, nowhere in the product set is that original data retained or stored anymore.
All we have in the target right now is Frank, which has no association with John at all. Now, the secondary part is that the transformation process has been designed in such a way that it's irreversible.
It uses the input to determine a couple of parameters, and based on those parameters it identifies a new value from a list of masked values that have been collected. So a QA person would say, "I want to test with these 10 million records." We take those 10 million records, we apply an algorithm to the input that we receive from the source data set, and from that algorithm we derive a particular index value, let's say 175. So out of the 10 million values, we look at entry 175 consistently and use that as the value in place of the original.
And we determine this input-to-output mapping only during the anonymization process, and then we delete all of the production data.
When you do it that way, the data is irreversible, because we never have the original anymore, and the data set we are using is the only data set they want to test with. So you have boundaries, you have plus and minus edge cases in terms of what you're looking at, as well as realistic and consistent data sets.
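Here is a compact sketch of the substitution flow as just described: derive an index from each source value, pick the replacement at that index from a prepared list of fictitious values, and write only the masked rows to the target, keeping no mapping back to the originals. The list, hashing choice and field names are assumptions for illustration, not the product's internals.

```python
# Irreversible substitution sketch: index into a prepared replacement list,
# keep source rows only transiently in memory, persist only masked output.
import hashlib

REPLACEMENTS = ["Frank", "Grace", "Hiro", "Ines", "Jonas"]  # stand-in for a much larger curated list

def pick_replacement(original: str) -> str:
    index = int.from_bytes(hashlib.sha256(original.encode()).digest()[:4], "big") % len(REPLACEMENTS)
    return REPLACEMENTS[index]      # e.g. always "entry 175" of the list for this input

def mask_rows(source_rows):
    masked = []
    for row in source_rows:                      # originals live only in memory here
        masked.append({**row, "name": pick_replacement(row["name"])})
    return masked                                # only fictitious data reaches the target

target = mask_rows([{"id": 1, "name": "John"}, {"id": 2, "name": "John"}])
print(target)   # both rows get the same fictitious name; no mapping to "John" is stored
```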
Okay.
So, in terms of vocabulary, the process you are describing is the pseudonymization technique known as substitution. You have to do this correctly, because let me just give you a silly example. Let's say you had my name in this long record of yours. If you simply substituted my name, Mike Small, for Joe Bloggs, but left all of the other data unmasked, then you would still be able to identify me.
So, in a sense, I'm saying that you have to use the product correctly in order to ensure irreversibility, and presumably you help people to achieve that.
Absolutely. We keep the status of this in the product itself. The first part is to say what's sensitive here, based on the policy, so we do know all instances of that data set, not just the name, as you're indicating, Mike, but also addresses, phone numbers, dates of birth, etcetera: what we deem the complete picture of it.
You're absolutely right in your original statement: just with your date of birth, your postal code and your gender, it is not hard to identify an individual in a given area, because the data points start to triangulate on a particular individual, especially if those postal codes do not contain a lot of individuals. As a result, it's important that you are able to profile, so that the discovery process is complete, and then able to anonymize or pseudonymize, so the process gets you the outcome that is irreversible or reversible based on the decisions that you make.
And when you run the process, it's going to tell you whether it was successful. You can run the discovery process on an ongoing basis, Mike, so that every three months you're still asking the question: what is sensitive in this environment? Because, as you know, data environments are dynamic; they don't always stay the way that they are. Being able to run the discovery process on an ongoing basis is the other capability, which allows us to report on what was deemed sensitive, whether PFI, PHI or PII.
Yes, that's great. The reason why this is so important is that if you look at the Article 29 Working Party opinion, they effectively review every conceivable way of doing pseudonymization and anonymization, and basically, according to their opinion, they all have some failure mode associated with them. That is not a reason not to use them, but it does mean that you have to be careful about how you use them.
It's easy to think that you've solved the problem, but you have to think carefully about the tools that you are using and so forth. So clearly you have a tool that uses a non-linear process for substitution which, if used correctly, can provide very substantial safeguards for this non-production use of data, where in fact you don't need the real data, for example, to do market analysis and so forth.
Exactly.
Mike, I couldn't have said it better.
Yeah.
So now there are some other questions that have come up. Rather than dwelling too much on this, I'm going to take the next one, which came from Ben Hilda. He asks: will daily use of the protected and encrypted data have an impact on the load on the server, in place of the old way, implying more cost in keeping the system usable? That's the question.
What's the question there?
Well, I think we'll take the first part: if you use the masked data, does it impose a load on the server from which you are using the masked data?
So let me answer it this way.
And Mike, if you want to add anything, please feel free. We don't mask production data directly, because we would be persisting the data that we generate onto it; that would mean the production data would no longer be production, and you could not serve your clients anymore. As a result, where we persist this data tends to be either a gold copy or a staging environment in production.
So the data set is separated from its original. In terms of putting this data onto a target database server in a non-production environment, we find that the data is used in full sizes, so that you'll have all of the cases that you want to test against. We find that there are more instances of the data set this way, because the fear of losing the data fades away; you're ultimately using fictitious data sets.
And in fact, because of the frequency of usage and the size of the underlying data set, it actually tends to create a higher server load, but those server loads are addressed by adjusting the server resources of the testing systems to achieve the testing objectives from a performance perspective, if need be. So we do see a change in the usage, and potentially the resources, of the testing systems on the target. Anything else you want to add to that, Mike?
No.
No, I think I'd like to move on to the next question, because I think it's an interesting one, and Ben has just replied to say that your answer is clear. So this is an interesting one: how long does it take to run the preliminary project of getting everything set up, anonymizing the data, looking through the content of the data, investigating the data? If you went to an organization and said, "We're going to use this," how long would you tell them it would take to set themselves up to use it successfully?
Sure.
So there are three steps in the process. The first is to define your policy, and that takes somewhere between two hours and two days.
The difference between two hours and two days is that two hours is where you can use the out-of-the-box policies we provide, and two days is where you're in a different industry. For instance, we've recently seen an entity that was collecting data but wasn't the direct entity collecting it, so they didn't know everything that they needed to protect, and it took longer to analyze what data was really in there. But if you are using the GDPR policy out of the box, it's two hours and you're ready to go.
And within two days if you have more customization. In terms of the next step, let's say the impact assessment:
We ran discovery on about 1,400 data sources with four people in four weeks. So if you have a couple of hundred of these, you can easily accomplish that within a week. If you are a larger entity with a lot of applications, within a month you should have your impact assessment: how many applications are sensitive, what kind of data is contained, and the aggregated outcomes of that.
Lastly, the masking process: developing the masking process takes about two to three days. Running a masking process against a data source varies with the quantity of data; it could take from a couple of minutes, like what you've seen run in the demo today, to upwards of 18 to 20 hours where you're masking billions of records in a large transaction system. Anything you want to add to that, Mike?
No, no, I think that was also interesting. And in response to that question, I would remind the questioner of the timings you mentioned.
So, going off that, you gave two examples in your description of the case studies, which gave an idea of how long it took, in particular the very large Fortune 500 company. And so that seemed to me to give a measure of the amount of time it takes. So it sounds like it can be quite quick, really, in comparison to some projects that you see.
Well, okay. So I see we've come to the top of the hour now.
So I would like to say thank you very much, Elca, for your very interesting presentation and your very clear description of what your product does, and I'd like to thank everyone for joining. Perhaps just to remind everybody of that comment from the UK Information Commissioner: if you have to share personal data, it's much better to share it in some kind of anonymized form than just to give it out. And so with that, I wish everyone success with their GDPR projects, and thank you for joining this KuppingerCole webinar. Good afternoon, everyone.
Thank you for having me, Mike. Take care.