Redefining CyberSecurity

Guidelines for Evaluating Differential Privacy Guarantees: NIST SP 800-226 | Differential Privacy and Its Potential in Protecting Sensitive Data | A Conversation with Damien Desfontaines | Redefining CyberSecurity Podcast with Sean Martin

Episode Summary

Tune into this episode as host Sean Martin and Damien Desfontaines discuss the transformative potential and challenges of differential privacy (DP) for cybersecurity. Learn how organizations can harness the power of DP to share and monetize their data while ensuring robust privacy, and what the forthcoming guidelines from NIST mean for data privacy.

Episode Notes

Guest: Damien Desfontaines, Staff Scientist at Tumult Labs

On Linkedin | https://www.linkedin.com/in/desfontaines/

On Twitter | https://twitter.com/TedOnPrivacy

On Mastodon  | https://hachyderm.io/@tedted

____________________________

Host: Sean Martin, Co-Founder at ITSPmagazine [@ITSPmagazine] and Host of Redefining CyberSecurity Podcast [@RedefiningCyber]

On ITSPmagazine | https://www.itspmagazine.com/itspmagazine-podcast-radio-hosts/sean-martin

____________________________

This Episode’s Sponsors

Imperva | https://itspm.ag/imperva277117988

Devo | https://itspm.ag/itspdvweb

___________________________

Episode Notes

This episode of Redefining CyberSecurity features a deep discussion between host Sean Martin and guest Damien Desfontaines on the topic of differential privacy (DP) and its implications for the field of cybersecurity. Damien, who currently works at the startup Tumult Labs, focuses primarily on differential privacy and brings rich prior experience from leading the anonymization team at Google. He shares key insights on how differential privacy, a technique for anonymizing sensitive data, can be used effectively by organizations to share or publish data safely, opening doors to new business opportunities.

They discuss how differential privacy is gradually becoming a standard practice for companies wanting to share more data without incurring additional privacy risk. Damien also sheds light on the forthcoming guidelines from NIST regarding DP, which will equip organizations with a concrete framework for evaluating DP claims. He also discusses the potential pitfalls of implementing differential privacy and the need for solid data protection strategies.

The episode concludes with an interesting conversation about how technology and risk mitigation controls can pave the way for more business opportunities in a secure manner.


___________________________

Watch this and other videos on ITSPmagazine's YouTube Channel

Redefining CyberSecurity Podcast with Sean Martin, CISSP playlist:

📺 https://www.youtube.com/playlist?list=PLnYu0psdcllS9aVGdiakVss9u7xgYDKYq

ITSPmagazine YouTube Channel:

📺 https://www.youtube.com/@itspmagazine

Be sure to share and subscribe!

___________________________

Resources

Inspiring post: https://www.linkedin.com/feed/update/urn:li:activity:7140071119859957762/

Guidelines for Evaluating Differential Privacy Guarantees: https://csrc.nist.gov/pubs/sp/800/226/ipd

___________________________

To see and hear more Redefining CyberSecurity content on ITSPmagazine, visit:

https://www.itspmagazine.com/redefining-cybersecurity-podcast

Are you interested in sponsoring an ITSPmagazine Channel?

👉 https://www.itspmagazine.com/sponsor-the-itspmagazine-podcast-network

Episode Transcription

Guidelines for Evaluating Differential Privacy Guarantees: NIST SP 800-226 | Differential Privacy and Its Potential in Protecting Sensitive Data | A Conversation with Damien Desfontaines | Redefining CyberSecurity Podcast with Sean Martin

Please note that this transcript was created using AI technology and may contain inaccuracies or deviations from the original audio file. The transcript is provided for informational purposes only and should not be relied upon as a substitute for the original recording, as errors may exist. At this time, we provide it “as it is,” and we hope it can be helpful for our audience.

_________________________________________

Sean Martin: [00:00:00] And hello everybody. You're very welcome to a new episode of the Redefining CyberSecurity podcast here on the ITSPmagazine Podcast Network. This is Sean Martin, your host, and if you follow me, you know that I'm all about helping organizations, and teams within those orgs, operationalize security and privacy to enable the business to achieve new heights and protect the revenue and growth that they generate. That's the show, Redefining CyberSecurity. 
 

But certainly there's a tremendous overlap with privacy in terms of what's possible, what's legal, what's ethical and not. And therefore you have policies and controls and measurements and audits and all kinds of stuff on both sides of the fence. And in the middle is the Venn diagram, if you will, where things cross over. 
 

Today, we're going to talk a bit more about privacy, and I [00:01:00] suspect we'll dip into security as well, because it's hard not to. And I'm thrilled to have Damien on today. Damien, it's a pleasure to have you on the show. 
 

Damien Desfontaines: Uh, thanks, Sean.  
 

And I'm really glad to get the chance to talk about this stuff. 
 

Sean Martin: Absolutely. And, uh, quick shout out to Katarina Korner, who introduced us. So thanks to her for making the introduction. And speaking of introductions, Damien, if you can share a few words with our audience about some of the things you've done, your current role, and why this topic is one of interest to you. 
 

Damien Desfontaines: That sounds good. So, I'm Damien. I'm French. I live in Switzerland, in the beautiful city of Zurich. And my main focus is on this concept that we call differential privacy, which is an anonymization technique, a tool to anonymize data that helps you share or publish data derived from sensitive information in a safe way.[00:02:00]  
 

I focus on this in my current role, which is at a startup called Tumult Labs, which specializes in building and selling differential privacy technology and advice to organizations. Before that, I was leading the anonymization consulting team at Google. 
 

And in parallel, I was working on my PhD thesis on anonymization, computer science, and differential privacy at ETH Zurich, the big university in Zurich. 
 

Sean Martin: Wow, that's  
 

a lot, a lot of research you've done. We'll include a link to your website with links to some of that stuff. And a lot of work, and a lot of partnerships as well, I think, with folks looking at this. 
 

So I'm very excited to hear what you have to share; I think you have a lot of experience to bring today. [00:03:00] We're going to focus in on NIST SP 800-226, an initial public draft of the Guidelines for Evaluating Differential Privacy Guarantees. And I saw a post from Katarina that prompted me to say, well, I want to learn a little bit more about this. 
 

And I figured my audience would want to know more about it as well. So I think before we get into the draft itself (which is open for review and comments until the 25th of January, by the way), can you describe what differential privacy is? Maybe kind of what it is, and how it relates to, or differs from, other things that you think folks would need to know. 
 

I think that would be a great place to start.  
 

Damien Desfontaines: That sounds good. So differential privacy is a technology that solves the [00:04:00] problem of sharing or publishing information, typically aggregate statistical information or machine learning models, derived from very sensitive data. Typically, when you're an organization that collects a whole bunch of data, maybe about usage of your app, or demographic information if you're a government agency, or any kind of data, really, you might want to take some of that data, compile it, and share it with a third party. 
 

But of course, you can't just do that, like take your whole database and give it to anybody who asks, for obvious security and privacy reasons, right? So one of the ways you can achieve that kind of business objective is by anonymizing the data, making the data so that it's no longer personal data. 
 

You can share it with somebody and they can learn useful information out of it, do some analytics, statistics, science, but not actually learn that a specific person in the [00:05:00] database has some specific attributes, like their demographic information or their credit card spend or something like this. 
 

So in the past, or at least currently, even today, a lot of times when it comes to anonymizing data, people still use legacy techniques like, oh, we can just remove the names and the obvious identifiers from the data before sharing it. The problem is that we know, we as in computer science experts, that it doesn't quite work this way. 
 

If you just remove some obviously identifying information, like names, email addresses, phone numbers, and so on, from the data, it's not going to prevent a motivated adversary from looking at the de-identified data (de-identification is the more common term in the US; in Europe it's more often called anonymization) and then finding who exactly is part of that dataset and [00:06:00] recovering sensitive data from that information. 
 

And there have been many, many years of research showing that even techniques that seem a little bit safer, like things we call k-anonymity, or just aggregation, just compiling statistics, even those can often leak some information, especially when you have large amounts of statistics about the database in question. 
 

One very high-profile example where this happened is the 2010 decennial census in the US. Every 10 years, the US collects a bunch of demographic information about every single person living in the country, and then compiles a whole bunch of statistics that it then publishes. 
 

After they released some data in 2010, they realized that actually, when you take all of that data and all of these statistics, you can use them to do what we call a reconstruction attack: take these statistics and go back [00:07:00] to the original data, allowing malicious people to take the statistics and retrieve information that was supposed to be secret. 
 

Differential privacy is a mathematical technique that solves these problems, one that provably, using a strong mathematical foundation, guarantees that you can't do this anymore. The statistics that you're releasing have a controlled amount of sensitive information in them, in a way that lets you guarantee that reconstruction attacks are impossible, or that you can't go back to a single person's attributes based on the shared data. 
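
For listeners who want to see what "a controlled amount of sensitive information" looks like in code, here is a minimal illustrative sketch (not from the episode, and not tied to any particular product or library) of the classic Laplace mechanism: a count query is answered with noise calibrated to a privacy parameter epsilon, so that adding or removing any one person can only shift the released value's distribution by a bounded amount. The function name and the example data are made up for illustration.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Release a differentially private count of records matching a predicate.

    Adding or removing one person's record changes the true count by at
    most 1, so Laplace noise with scale 1/epsilon bounds how much the
    released number can reveal about any single individual.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative only: a noisy count over a tiny fictional dataset.
incomes = [42_000, 105_000, 87_000, 230_000, 99_000]
print(dp_count(incomes, lambda x: x > 100_000, epsilon=1.0))
```

Smaller values of epsilon mean more noise and stronger privacy; larger values mean more accurate answers and a weaker guarantee.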
 

Sean Martin: So let me ask you this, Damien. A guideline, or something from this that turns into a framework an organization can follow, is important, but only valuable, in my opinion, if people follow it. And [00:08:00] where I think things become really interesting, in the age of AI and massive data sets out there, is that multiple entities could come together, or even a single entity could pull multiple data sets together, and purposefully not follow that guideline, right? 
 

So what are your thoughts on where this fits in? Is it designed to help organizations that want to follow the right practices and achieve a level of trust with their customers and their partners? Is that who it's targeted at? Or is there a different purpose for this? 
 

Do you think? 
 

Damien Desfontaines: I think you got this right. Basically, the kind of people who are going to be very interested in the guidelines, and in understanding where the draft ends up after the [00:09:00] comments, are organizations who want to solve this problem of publishing or sharing more data: for example, government agencies who want to publish more statistical information about the people they serve, or nonprofits who compile sensitive information and then give it to lawmakers or public policy folks, or organizations that want to monetize the data that they have on their own customers or users. 
 

If you're in this situation today, especially with GDPR and all of the other changes in the compliance space around security and privacy, you might want to make sure that before you do that, you have a clear picture of the additional risk you're incurring by sharing or publishing more data. You want to be able to make sure that the business value you're getting out of this, or [00:10:00] the mission value, the fact that you're accomplishing more of your mission by publishing more data if you're a government agency, for example, makes sense when weighed against the additional privacy risk you're taking on and imposing on the people in your databases. 
 

So differential privacy is a way of solving that, but it's still a fairly recent concept. It was invented a bit over 15 years ago, but it's only in the last few years that there have really been software libraries and tools that let you use it and actually generate differentially private data more easily. 
 

So I think a lot of people who are not experts, a lot of organizations who do not have the deep expertise that very big tech companies like Google, or vendors like us at Tumult Labs, have, they want to ask: okay, if a certain vendor tells me they're doing the right thing with my data, if they're helping me 
 

generate differentially private statistics, or if [00:11:00] one of my partners claims they're anonymizing data in a certain way, how can I evaluate this claim? How can I make sure that I'm doing the right thing, or that my partners are doing the right thing? And this is what I think is really exciting about the fact that NIST is now going to publish guidelines about this. 
 

It's going to give people the tools to understand how to evaluate claims of being responsible with data sharing or data publication, without needing to get into the deep weeds of how stuff works behind the scenes and what exactly is being done to the data. 
 

Sean Martin: So maybe some thoughts on how organizations might approach the guide. I guess, who's the intended audience? Is it limited to privacy folks? Or, I mean, the other group that comes to mind clearly is data [00:12:00] owners, so the DPO or the folks who are responsible for privacy and data. Certainly, obviously, based on my introduction, people in security have the controls and the dials to protect the data in certain cases. 
 

But we're talking about enabling data, so maybe controls is the best way to look at this. So who should be reviewing the guidelines and understanding how they fit into the organization's operations?  
 

Damien Desfontaines: So I think the two types of roles that come to mind are the ones you mentioned. On one side, you've got data owners, people like data protection officers, who want to understand how a technology like differential privacy fits within their overall approach to 
 

privacy governance, data governance, and how they can use it to help with their compliance obligations while enabling the business use cases they're interested in. [00:13:00] So those are a perfect audience for the guide, because it's going to help them understand what decisions need to be made on a specific use case in order to use differential privacy. 
 

When somebody tells me they've used differential privacy, what do they actually mean? There are subtleties to the definition and to the context in which it can be applied. What's the exact threat model? What are the exact privacy parameters? What is what we call the unit of privacy? 
 

Do we have a reason to believe that the implementation actually satisfies that definition? Because implementing differential privacy correctly is really hard, like cryptography. So the guidelines not only provide a list of which questions to ask and how to answer them, but also give readers an understanding of what they should be checking for, what questions they should be asking, and what the best answers and best practices in the field are. 
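
As a point of reference (this is not read out in the episode), the definition behind those privacy parameters and the unit of privacy can be stated compactly. A randomized mechanism M is (epsilon, delta)-differentially private if, for every pair of datasets D and D' that differ in one unit of privacy (for example, one person's records) and every set of possible outputs S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \Pr[M(D') \in S] + \delta
```

The privacy parameters are epsilon and delta, and the unit of privacy is whatever defines the neighboring pair D and D': one record, one person, one device, and so on. These are exactly the choices the guidelines ask organizations to make explicit.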
 

[00:14:00] The other audience is the people who are actually working with data: data analysts, data scientists, and so on, because these are often the folks on the ground who, in practice, are going to be generating data 
 

using differential privacy to share it with other companies or other organizations or with the public at large. They are going to be compiling the data, they are going to be using the libraries. So they're going to be the first ones that have to answer these questions in practice. 
 

Like, what should my exact privacy notion be? What are the privacy parameters I should use, and so on. Or they might also be the ones that receive data that was shared by some other organization using differential privacy, and so understanding what happened behind the scenes helps them know how well they should be protecting the data. 
 

Is it really anonymized, or are there some things that indicate care should be taken when handling and passing the [00:15:00] data around? I think all of those questions are typically what the guidelines are tackling and offering some answers for. 
 

Sean Martin: So let me ask you this. Things like ChatGPT, and I know there are other models and interfaces and APIs people can plug into, but that's the one I'm most familiar with. I'm just wondering, can something like that access pools and buckets of data that potentially could expose data that should otherwise remain private? 
 

Do you think?  
 

Damien Desfontaines: I mean, I think we're beyond "possibly" now, right? I think it's known now that these very large language models trained by various companies are trained mostly on publicly available data, and [00:16:00] now that they are also being secretive about it, we don't quite know what data they're being trained on. 
 

But one of the hard bits about it is that we know these models memorize verbatim some information from the training data sets. And that has very big implications for at least two aspects. One of them is copyright, which I'm not an expert in, but we've seen it with the recent lawsuit from the New York Times against OpenAI, where they say, look, if we pass certain prompts to ChatGPT, it just spits out the entire text of New York Times articles. 
 

So that's a problem for copyright, right? But the same problem is also a problem for privacy, because if you start writing a person's name and getting their phone number back, well, that feels like something that should not happen with these large language models. And my understanding is that these companies are developing sort of ad hoc protections, ad hoc mitigations, that try to prevent their users from doing that. 
 

But it's fundamentally a very hard problem, because [00:17:00] I think memorization is very much a phenomenon that's going to happen with classical training of these machine learning models. And so the only question is, can you craft a prompt that's smart enough to extract that information from the model? 
 

And there, you know, attackers are often much, much better at doing that than defenders are at predicting it. So, to come back to differential privacy here: techniques like differential privacy can potentially be used in this context. There are a lot of papers on how to train a machine learning model in a differentially private way to prevent inadvertent leakage of information. 
 

But it's more science than press-a-button-and-deploy-it-in-practice as of now, right? Especially at the scale of these very large language models, I don't think things are quite ready for prime time. [00:18:00] But this is definitely a direction the AI community should be, and is, paying attention to: trying to understand whether we could use these formally proven techniques to make sure that the machine learning models, while learning about the global distribution of the data, learning useful statistics, are able to do what they need to do in terms of either generating outputs or being a classifier or something like this, 
 

while preventing the risk that some information about specific individuals is leaked in the process.  
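
For readers curious what "training a model in a differentially private way" roughly involves, here is a simplified, illustrative sketch of the core step of DP-SGD (per-example gradient clipping plus Gaussian noise). This is not material from the episode, it omits the privacy accounting a real deployment needs, and the function names and numbers are invented for illustration.

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr):
    """One simplified differentially private gradient update.

    Each example's gradient is clipped so that no single person can move
    the model too far, then Gaussian noise proportional to the clipping
    norm is added before averaging and applying the update.
    """
    clipped = [
        g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        for g in per_example_grads
    ]
    noisy_sum = np.sum(clipped, axis=0) + np.random.normal(
        0.0, noise_multiplier * clip_norm, size=weights.shape
    )
    return weights - lr * noisy_sum / len(per_example_grads)

# Illustrative call with random arrays standing in for real per-example gradients.
w = np.zeros(3)
grads = [np.random.randn(3) for _ in range(8)]
w = dp_sgd_step(w, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1)
```

In a real system, a privacy accountant tracks how the noise level, batch sampling, and number of training steps translate into an overall epsilon, which is exactly the kind of claim the NIST draft helps readers evaluate.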
 

Sean Martin: So, for the organizations that provide the large language models and interfaces, be it web or APIs, I know the answer is that it's their responsibility, but you said they're doing ad hoc controls. 
 

So, to me, that [00:19:00] sounds like, crap, we forgot and we need to add this, or, we uncovered something, so now we need to go back. In the absence of that, for individuals, perhaps, but most certainly organizations who are using these large language models and the data that comes out of their prompts, is there any risk to them in not following something like this NIST guideline, given they're actually receiving information now? 
 

Perhaps not even knowingly. Yeah.  
 

Damien Desfontaines: So I should probably offer a couple of disclaimers here. I'm not a machine learning expert. I know how stuff works from a distance, and I have a vague idea of how differential privacy can be implemented in model training, but it's really not my area of expertise. 
 

I really am typically focusing on statistical releases, and this is what Tumult Labs, the [00:20:00] company I work for, is also focusing on. And I think in cases of an organization receiving data, like how to handle the data you're receiving from a third party, very often it's sort of a compliance question. 
 

And I'm also not a lawyer, so I don't know what the implications of this are. I completely agree with you that, in principle, the responsibility lies on the shoulders of the people who are making information available in a certain way, or compiling information and then giving it out in a certain way, and so on. 
 

I think one of the places where the guidelines could be helpful, or at least could push the industry in a certain direction, is that sometimes you want a vendor to help you build machine learning models on your own data, so that then you can use that machine learning model. 
 

Either you can [00:21:00] share it with a third party, monetize it, or you can put it in your app as part of a feature or something like this. So in that case, if the data that you have is personal data, if you're covered by GDPR and you don't have a strong legal basis 
 

to achieve that, and so you're interested in doing some anonymization to be able to claim, from a compliance standpoint, that your data is no longer covered by GDPR because it's now fully anonymized data, then training the machine learning model in a differentially private way could be a really good option. 
 

And in some cases, I think, especially as we make progress on making it easier to use and as we make it more of a best practice, my hope is that it's going to become a standard practice, not just something that you do if you're in a very high compliance-risk use case or if you're very, very sensitive about your [00:22:00] data. 
 

I think today, the use cases of differential privacy are growing, but still fairly small for now. At Tumult Labs, we see interest from companies that have already well-established privacy programs and who really want to get ahead of regulation, get ahead of industry trends when it comes to what their privacy story looks like, as well as from extremely regulated areas, like government agencies and what data they can publish, such as the U. 
 

S. Census Bureau or the IRS in the US, and so on and so forth. 
 

Sean Martin: Can you describe, if you think it makes sense, a few of the use cases? I'm thinking of a few: financial services, um,  
 

Damien Desfontaines: yeah.  
 

Sean Martin: So in financial services, um, housing market, real estate market, healthcare. Yeah.  
 

Damien Desfontaines: Yeah. So [00:23:00] basically, I think listeners of your podcast who do security and privacy are probably familiar with the situation where they want to share data with a third party, but they just can't. 
 

So to give concrete use cases: in the financial markets, you could imagine wanting to take some data about credit card transactions, or some financial information about individuals, and then monetizing that data in order to learn something useful about market trends, or training a machine learning model, in collaboration with others, to detect fraud, or to 
 

do financial forecasting to understand how certain trends are going to evolve in terms of spend, and so on. In healthcare, we've seen people exploring differential privacy to basically cross the compliance barrier: [00:24:00] wanting to train a model, or get scientific information about how well a certain medication is working, or 
 

build a machine learning model that does some classification on diagnostic data, in use cases where this would not have been possible without a very strong anonymization story, because healthcare data is so sensitive, and hospitals or people who hold medical information don't want it to inadvertently leak outside of the trusted environments. 
 

A last use case that I can mention, one that we've been working on at Tumult and have published case studies and papers and presentations about, is the Wikimedia Foundation, which was interested in releasing data about how many people are visiting Wikipedia pages every day. 
 

This is [00:25:00] super sensitive, because you can imagine that if you take some statistics like this and you figure out that a specific person read a specific article about, I don't know, LGBT topics or certain political topics and so on, especially in certain areas of the world, this could be really, really sensitive information. 
 

So they wanted to make sure that by publishing more statistics about that, they would not endanger their users. And they came to us, and we helped them publish, every day, tens of millions of statistics on Wikipedia page views. So this is the kind of use case that differential privacy typically helps with, in addition to the government use cases I mentioned earlier. 
 

Sean Martin: Yeah, I love it. And what I want to maybe get your thoughts on now is this word guarantee. It strikes me, as I read the title of the publication and see that word guarantee in [00:26:00] there, it makes me cringe a bit. Nothing is absolute, even in technology, in my opinion. The only way is to unplug the power, and, you know, even then somebody could plug it back in. 
 

So the absolute is gone from that, even. So talk to me a little bit about the guarantee. What does that mean, and to whom, I guess, is the big question.  
 

Damien Desfontaines: So I'll give you a two-sided answer to that question: one positive, and then I'll add some caveats. The positive answer is that you can see differential privacy as the equivalent, for anonymization, of where modern cryptography was maybe 20 or 30 years ago. 
 

For a long time in cryptography, the way the field would move forward is that people would come up with new techniques to obfuscate or encrypt data, but they wouldn't formally prove, they wouldn't have a mathematical proof, that [00:27:00] decrypting this was impossible under certain computational assumptions, right? 
 

They would build a scheme that sounded reasonable to them, and then people would come along and break it, because it turns out that when you don't have a strong mathematical foundation for what you're doing, often you're not doing things very well. Today, if you come into a room full of cryptographers and you say, I designed a new cryptographic technique, here's how it works, and I think it looks really hard to decrypt, so I think it's a good one, people will just laugh you out of the room, right? 
 

Nobody will take you seriously. Thirty years ago, it wasn't like that. It took a long time for people to realize you need a strong mathematical foundation to actually make strong security happen. Differential privacy is the same thing happening in the anonymization space. 
 

For almost 20 years, people had been coming up with new definitions of what it means for data to be anonymized, only for a paper to come out the year after saying, [00:28:00] oh, look, here's a very concrete example in which this solution is actually completely broken. It still allows me to retrieve information about individuals. 
 

And then differential privacy came around. And this is fundamentally different, because all of a sudden you could formalize what it means for data to be protected: what exactly is the maximum information that an attacker can gain? And it has some really nice guarantees, like it doesn't depend on what the attacker can do. 
 

It doesn't depend on what auxiliary data the attacker has. If you use it multiple times, you can still keep track of your total risk over time. All of these nice properties, you couldn't show them, and in fact they were wrong, with the 
 

previous definitions that were proposed. So in a sense, this is what the authors of these guidelines are referring to when they say guarantee: it means that there is a strong mathematical foundation for it. And this is why we're pushing very strongly in favor of adopting this for all of the [00:29:00] anonymization use cases that we have right now, because with all the other ad hoc stuff, either we know it's broken, or we will know in five years that it's broken. 
 

So that's, that's the positive side.  
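
The "keep track of your total risk over time" property Damien mentions is known as composition. In its simplest, most conservative form for pure epsilon-differential privacy, the privacy parameters of separate releases computed from the same data simply add up:

```latex
M_1 \text{ is } \varepsilon_1\text{-DP and } M_2 \text{ is } \varepsilon_2\text{-DP} \;\Longrightarrow\; (M_1, M_2) \text{ is } (\varepsilon_1 + \varepsilon_2)\text{-DP}
```

Tighter accounting methods exist, but this additive bound is what lets an organization budget its total privacy loss across repeated data releases.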
 

Sean Martin: Let me, let me ask you this. Oh, that's fine. All right, now you're going to go to the negative. All right, go to the negative, then I'll ask my question.  
 

Damien Desfontaines: Yeah. You said something super true earlier: nothing's absolute, right? There's no system that's going to be completely 100 percent certain. 
 

And this is where I fully agree, and this is exactly what this guideline is actually saying. If you look at the way they formalize it, they say: differential privacy, from a mathematical standpoint, here's what it is, and here's what it protects, what it provides you. And they have this concept of a pyramid, where at the top of the pyramid is the mathematical definition, with the privacy parameter epsilon, the unit of privacy, and the math. But [00:30:00] then, in order for the top of the pyramid to make any sense, you need to also make sure that the bottom of the pyramid is covered. And there are things like: the data that you're doing anonymization on, well, it should be well protected, because if an attacker manages to get to the original data, it doesn't matter how well you did your anonymization, right? 
 

You've completely lost. The anonymization is just completely pointless. You should understand your threat model. You should understand, okay, who's getting the data? What are they going to use it for? Who are they going to share it with? And understand how this exposure matters for the system you're building. 
 

And then there's the gap between the mathematical definition and the actual implementation. Again, it's like cryptography, right? RSA is very simple to explain to somebody with a pretty basic background in math; it's not very complicated from a theoretical standpoint. But implementing it correctly, in a way that's safe against all possible [00:31:00] attacks, for example, fault injection and timing side-channel attacks and so on and so forth, 
 

that's very hard. If you do this naively, you will 100 percent write code that is not actually safe. Differential privacy is exactly the same way. And the guideline explains, here's what you should be looking for: here are the typical vulnerabilities that this implementation step can produce, and you should be aware of them and make sure that you have a strong story to mitigate them, and so on. 
 

So this is where we have to be careful. We have to recognize how much better having a strong mathematical foundation makes our privacy story, our compliance story, even our ethical story, when it comes to sharing and publishing data. And on the other side, we shouldn't just say, oh, this uses differential privacy, everything's fine, 
 

and we don't have to look at it closely to make sure that the basics are [00:32:00] also covered. 
 

Sean Martin: Yeah, I'm glad you described that whole other side of the coin, because I was thinking about PKI, which is super difficult to deploy at scale, especially, right? And there's a lot of room for error. And then you have the, oh, by the way, I posted my key on GitHub anyway, so anybody can access the decryption key regardless. 
 

And then you have the whole data access point: if you're not putting controls around the data itself, and then that middle layer, the business logic, even if you have the best processes at either end, you can be exposing things in different ways in the middle through the logic that touches the data. 
 

Damien Desfontaines: Absolutely. In the NIST guideline, they even go sort of one level below. The very bottom of the conceptual pyramid they have is, you know, [00:33:00] the most privacy-preserving system is the one where you don't collect the data in the first place. So, even if you're using all of these fancy techniques, if what you want to do is completely minimize your privacy risk, and do right by your compliance obligations and by your users, you should only collect the bare minimum you need. 
 

You shouldn't collect data that you don't need just because you can. And I really love that, even in this guideline, where you would think that's not the main topic, this is still the fundamental basis of all privacy work, right? Be responsible with what you collect in the first place, even if you consider yourself part of the trusted circle, not the attacker. 
 

Sean Martin: Yeah, so let's wrap here. Unless there's something else you want to share before you go, and you can certainly do that, I want to touch on how this fits into an organization's operations. How [00:34:00] disruptive a change is it to adopt the guidelines that are being defined? Do apps have to be rewritten? 
 

Do databases have to be redeployed, restructured? I don't know, I mean, I can go up and down the whole stack, right? But how impactful is this?  
 

Damien Desfontaines: I think it's mostly impactful for the use cases in which you're using anonymization, like the use cases that were already identified as: we are sharing data with third parties, 
 

we are publishing data to the world, or even, we are anonymizing some data so we can keep it forever in our own systems and still be within our compliance obligations. So this is where differential privacy is going to improve the privacy and compliance story. 
 

For all of the stages around data collection, access controls, and so on, this is not really going to be any change. It's really going to be about [00:35:00] the use cases of data sharing, data publication, maybe data retention. One thing that we observe at Tumult Labs is that we see our customers adopt differential privacy, and the way it typically works is they have a particularly tricky data release that they're maybe not comfortable doing. 
 

And they're like, oh, if we use differential privacy, this would make us more comfortable about doing this. And then they go and hire us to help them do a feasibility study and/or a pilot. And then we show them, here's what you can do, here's what you can publish, here's 
 

what the privacy risk is, what the utility of this is, the business value it gives you, et cetera. And then once they start doing that, they realize, oh, this can actually work in practice. Now here are, like, five other use cases where we'll be interested in either replacing our existing anonymization methodology, which either we don't really feel good about, or it's just not giving us the accuracy and the utility that we would like. 
 

Let's try to see if we can do something better with differential privacy. [00:36:00] And, oh, we have these things that we wanted to do for a long time, but we couldn't because it felt too iffy from a privacy or compliance standpoint. But if we have a rock-solid compliance story using differential privacy, this would make us feel 
 

much better about doing this. So let's try to do that to increase business value, accomplish more objectives, and so on. The one place we've seen that which I can talk about, because most of what we do is obviously confidential, is with the Wikimedia Foundation. We started helping them publish statistics about which articles from Wikipedia were most visited, and since we started that one pilot with them two years ago, they've published, I think, four, five, or six new data releases, because now they understand how it works. 
 

They can systematize the process, and they can publish more data and share more data, and faster than they used to, because [00:37:00] one of the advantages of differential privacy, besides all of the mathematical stuff, is that it gives you a systematic way of handling data releases. 
 

It gives you a framework that simplifies it. It makes concrete some things that are very fuzzy, like: how dangerous is what I'm doing? How much utility am I getting out of it? What can I tell the people who are receiving the data about what has been done to it? It simplifies a lot of this stuff. 
 

So once we train data analysts and engineers to do one differentially private release, they feel empowered to do two, three, and then people start proactively identifying new use cases that unlock value for the organization. And so this is what we're seeing, and this is what makes me super excited about being in this space. 
 

Sean Martin: Well, it's a point that I try to end with, or at least touch on, in every episode, which is that there's an opportunity for technology and [00:38:00] risk mitigation and controls to actually enable business to take place safely and securely. And this seems like one of those opportunities for sure, to unlock some things, 
 

unlock some things that weren't possible before. So I love it. Well, Damien, it's a pleasure to meet you. For me it seems like we covered a lot, but I'm sure we only scratched the surface with this. So we'll leave links to your profile; folks can get in touch with you there. And of course I'm going to link to SP 800-226 and the post from Katarina who shed some light on this as well. And, 
 

yep, it's another area for people to investigate and learn and adopt and apply. It seems like we have an opportunity to do some really cool stuff here. 
 

Damien Desfontaines: That sounds good. And if any of your listeners has follow-up questions about this stuff, or if some of the things I said, like, oh, I want to share or publish some data, but I can't for compliance reasons, 
 

if that resonates with you, come talk to us. We're pretty good at solving this kind of stuff. 
 

Sean Martin: Very good. Thank you, Damien. And thanks, everybody, for listening and watching. And of course, please share with your friends and enemies, subscribe, and stay tuned for many more conversations here on Redefining CyberSecurity. 
 

Thanks everybody. Cheers.  
 

Damien Desfontaines: Uh, thanks again for the invitation.