Good morning everyone. I guess it's morning out there, so good morning everyone. I'm Sray Agarwal. I work as a senior director at Fractal AI, and I'm also the author of a book with the same title, Responsible AI, published by Springer a couple of years back.
And I do a lot of conferences. I'm also a member of the United Nations. Yeah.
Let's get started. If you have any questions or want to raise any queries, please feel free to ask and interrupt me during the presentation. You don't have to wait until the end.
So this is one thing I always start with. GDPR is one regulation which is very, very popular, and I guess it's the most advanced regulation when it comes to AI and the usage of AI.
It has clear-cut guidelines on the different guardrails you need to follow when you implement any kind of AI project.
Similarly, there are the EU guidelines, which talk about the different kinds of risk AI carries and which AI projects are unacceptable, low risk, or medium risk going forward. So this is very important when we talk about the reliability and safety of AI systems.
So when it comes to the reliability and safety of AI systems, we have seen many cases of data leakage, many cases of hallucinations, and many cases where an AI system gives a wrong answer. In those cases, when we look at generative AI in particular, we need to be very careful about the different principles of AI that we need to be very sure about. For example, bias and discriminatory output is very common.
For example, take gender bias in a generative AI algorithm: if I ask it to give me five jokes about women, it will not give me any jokes, but if I ask for five jokes about men, it will give me the five jokes immediately.
Similarly, deepfakes: we have seen how deepfakes generated by advanced AI algorithms are being used in courts, are being used during wars and in political situations,
and even to create a negative perception about a political party or a person of interest. On data privacy and security, we have seen the case where Samsung's code got leaked in public because some employees and developers at Samsung used ChatGPT to debug their code. They pasted their code into ChatGPT, ChatGPT used it for retraining itself, and that code then leaked onto a public platform. Copyright infringement is another very, very important concern which I come across.
While talking to a lot of my clients, they always bring up this very aspect of generative AI. They are very concerned about copyright, because copyright infringement can lead not only to regulatory risk but also to a lot of reputational risk.
Not only will you face a lot of backlash from people via social media like Twitter, Instagram, Facebook and so on and so forth, but you will also see huge fines being imposed on the company just because of copyright infringement.
Similarly, there are the legal, reputational, and regulatory implications I just talked about: there are issues when it comes to copyright, there are issues when it comes to hallucinations, and those may lead to a lot of reputational risk, regulatory risk, and of course legal risk. This is over and above the revenue loss you will face. Then there are security threats to AI. There are four kinds of threats, which I will not deep-dive into. One is model poisoning, where the model is disturbed by an adversarial attack.
The second is data tampering, wherein someone can modify or delete some of the data used by the machine learning model and make it give a prediction which it should not have given. The third is deliberate attacks: hacking, phishing, malware and other tactics. And the fourth is AI-driven attacks.
Using AI to attack another AI is also very, very common these days. Now, coming to large language models: how might a business manage customized LLMs responsibly across diverse operations while complying with industry and geographic regulations? This is very important in the era of generative AI.
How would you be able to use a customized generative AI algorithm or system which is not only responsible and ethical but also complies with industry or geographic regulations? For example, how do I use generative AI while still following the UK, EU and other countries' guidelines and regulations imposed on me, and still be able to use LLMs or generative AI to create profit or productivity benefits for my company?
So these are a couple of things which we need to understand.
Number one, which I always believe in, is data minimization: you should only collect the data which is required. I've seen that a lot of data scientists ask for the entire database, but can we confine ourselves to just the data which is mandatorily required, rather than handing over the entire database? Then purpose limitation: only use PII data where it is absolutely necessary.
This is very, very important. Do not use personal data, let alone PII, where it is not really needed. If you can do away with personal data, just do not use it. If you can use synthetic data, use synthetic data. And don't store anything which is no longer needed: I have seen that a lot of the time data scientists keep all the versions of the models, all the experiments, all the data which has been discarded by the model. You don't have to store it.
It is very important to purge or delete data immediately, especially data which is not being used.
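To make that concrete, here is a minimal sketch of what data minimization, purpose limitation, and purging could look like in a pandas workflow. The column names, file name, and the directory paths are purely illustrative assumptions, not something from the talk.

```python
import os
import shutil

import pandas as pd

# Data minimization: read only the columns the model actually needs,
# instead of pulling the entire database.
REQUIRED_COLUMNS = ["age", "income", "tenure_months", "churned"]  # illustrative
df = pd.read_csv("customers.csv", usecols=REQUIRED_COLUMNS)

# Purpose limitation: explicitly drop any PII-like columns if they slipped in.
PII_COLUMNS = ["name", "email", "phone", "address"]  # illustrative
df = df.drop(columns=[c for c in PII_COLUMNS if c in df.columns])

# Storage limitation: purge discarded experiments and stale data snapshots
# once they are no longer needed, rather than keeping every version around.
for stale_dir in ["./experiments/discarded", "./data/raw_snapshots"]:  # illustrative paths
    if os.path.isdir(stale_dir):
        shutil.rmtree(stale_dir)
```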
Accuracy: keep everything accurate, because the more accurate the data is, the more accurate the model will be, and the more accurate the output will be. Similarly, we need to be proactive and not reactive. Take regulations, take responsible AI, and do it proactively, before somebody asks you to do it. Do not wait for regulations; just do it as soon as possible. Privacy should be a default setting.
You should ensure that privacy is practiced across the entire data science lifecycle and is not an optional thing. Privacy should be embedded in the design of the whole workflow. So when you define your business objective, when you start EDA, when you do model training, model testing, and model deployment, privacy should be part of the entire process, right?
Again, the last one, which is very important: visibility and transparency. It is not only important to have privacy, but also to make sure everybody knows that you are practicing the right kind of privacy by design, and that it is visible to everybody. What is visible is the fact that you practice the right kind of privacy, not the data itself, for sure. And make sure it is transparent, so people are able to audit and monitor the practice. I'll skip this slide. So there are a couple of privacy-enhancing techniques. One is AI-generated synthetic data.
Rather than using real data, you can use AI-generated data, which is as good as real data; we will come to that. So generate synthetic data to ensure privacy. You can take a snapshot of your data and use methods like GANs, variational autoencoders, simulators, rule-based generators, or generative AI to create data which is almost like the real data, and use it for model training rather than the real data. Because if you use synthetic data, even if the data gets leaked, you will not face major difficulties or regulatory problems. I'll pause for a second and see if anybody has questions.
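As a toy illustration of the idea (not the speaker's actual setup), here is a minimal sketch that fits a simple generative model to a numeric table and samples synthetic rows from it. The columns and the Gaussian mixture model are illustrative stand-ins; in practice you might use a GAN, a variational autoencoder, or a dedicated synthetic-data library.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Illustrative "real" numeric table (in practice, a snapshot of your data).
rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.normal(40, 12, 1000).clip(18, 90),
    "income": rng.lognormal(10, 0.5, 1000),
})

# Fit a simple generative model to the joint distribution of the real data.
gmm = GaussianMixture(n_components=5, random_state=0).fit(real.values)

# Sample synthetic rows that mimic the real data's statistics.
synthetic_values, _ = gmm.sample(1000)
synthetic = pd.DataFrame(synthetic_values, columns=real.columns)

# Train on `synthetic` instead of `real`: a leak of synthetic rows does not
# expose any actual individual's record, while summary statistics stay close.
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```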
Okay, I'll move forward. Okay, go ahead... yeah, sorry, go ahead.
Okay, different types of synthetic data. So what are the different types of synthetic data? You can have text, which is like proper documentation; you can have sound; you can have images; or you can have proper tabular data.
Okay, so what is the benefit of synthetic data? At a high level, it ensures a huge amount of privacy.
It allows your data scientists to have improved data availability, it reduces the risk of data breaches and privacy violations, and it allows you to have better-quality data by removing noise, outliers and things like that, right? I will again skip this. So how do you protect privacy? I just talked about this. For example, you can add noise to your data. You need to add noise to the data in such a way that the probability, or the output, from the original data is almost equal to the output you get from the noisy data.
This technique is called differential privacy: you add an adequate amount of noise to the data so that your results and outcomes are still valid, but you are not able to identify somebody from the data itself. Similarly, for a machine learning algorithm, you add the right kind of noise to the data to ensure that the algorithm's output is similar to what it would be if you had not added noise. So we'll come to this.
So I did this experiment where I added noise to the data. The green bars are the output from the true data and the yellow bars are the output from the data after treating it with a privacy-preserving algorithm like differential privacy, and in most cases the outputs are almost the same. They do not get disturbed too much.
And that is what I was trying to say: if you add noise to the data in an adequate amount, you will, first of all, not be able to identify somebody, but you will still have valid outcomes.
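A minimal sketch of the idea behind that experiment, using the Laplace mechanism on a single query. The data, the clipping bounds, and the epsilon value are illustrative assumptions, not the speaker's actual setup.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sensitive attribute: incomes of 10,000 individuals.
incomes = rng.lognormal(mean=10, sigma=0.5, size=10_000)

def dp_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Differentially private mean via the Laplace mechanism.

    Values are clipped to [lower, upper] so the sensitivity of the mean
    is bounded by (upper - lower) / n, then calibrated Laplace noise is added.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

true_mean = incomes.mean()
private_mean = dp_mean(incomes, lower=0.0, upper=200_000.0, epsilon=1.0)

# The two results stay very close, but the private one does not let you
# pin down any single individual's contribution.
print(f"true mean:    {true_mean:,.2f}")
print(f"private mean: {private_mean:,.2f}")
```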
Similarly, there are other techniques: you can create machine learning models which are themselves differentially private. For example, most of us have heard about tree-based models and linear models; there are open-source packages which allow you to add noise to the model itself. So even if your model gets leaked or ends up on a public platform, you will not face data leakage or any kind of privacy breach. I'll not go deep into it, but on the left-hand side you see the normal objective function of a classification model.
And here what we have done is add z as a noise term.
So the same objective function is here; the only difference is that we have added some noise to it. The outcome will be similar, but with noise, which means the model is protected. So even if the model gets leaked, you will not have access to the original outcomes and data, and thus there will be no issue of data leakage. And I repeat, the outcomes from both of them are very, very similar.
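Reconstructing the two objectives on the slide in notation, as a sketch of standard objective perturbation for a regularized linear classifier (the exact form on the slide may differ):

```latex
% Standard (non-private) regularized classification objective
J(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(w;\, x_i, y_i\right) + \lambda \lVert w \rVert_2^{2}

% Differentially private version: the only change is an extra term
% involving a random noise vector z drawn by the training algorithm
J_{\text{priv}}(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(w;\, x_i, y_i\right) + \lambda \lVert w \rVert_2^{2} + \frac{1}{n}\, z^{\top} w
```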
Yeah, the outcomes from both of them are very, very similar. So I will pause here and wait for any questions.