In cryptography, differential privacy aims to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying their records.



Differential privacy is a mathematical definition of the privacy loss that individuals incur when their private information is used in the creation of a data product. The term was coined by Cynthia Dwork in 2006,[1] but the definition itself comes from an earlier publication by Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith.[2] The work is based, in part, on work by Nissim and Irit Dinur,[3] which showed that it is impossible to publish information from a private statistical database without revealing some amount of private information, and that the entire database can be revealed by publishing the results of a surprisingly small number of queries.

A result of Dinur and Nissim's "database reconstruction" work was the realization that the approach of providing privacy in statistical databases using semantic definitions of privacy (mostly dating to work of Tore Dalenius in the 1970s) was impossible, and that new techniques for limiting the increased privacy risk resulting from inclusion of private data in a statistical database needed to be developed. A result of the work and subsequent research is the development of techniques that make it possible, in many cases, to provide very accurate statistics from the database while still ensuring high levels of privacy.[4][5]

Principle and illustration

Differential privacy is achieved by a process that introduces randomness into the data.

A simple example, developed especially in the social sciences,[6] is to ask a person to answer the question "Do you possess attribute A?", according to the following procedure:

  1. Toss a coin.
  2. If heads, then answer honestly.
  3. If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.

The confidentiality arises from the deniability of the individual responses.

In aggregate, however, these responses are still informative: a person who does not have attribute A answers "Yes" with probability one quarter, while a person who does have it answers "Yes" with probability three quarters. Thus, if p is the true proportion of people with A, then we expect to obtain (1/4)(1-p) + (3/4)p = (1/4) + p/2 positive responses, and hence p can be estimated as twice the observed proportion of "Yes" answers minus one half.
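The estimator derived above can be checked with a short simulation (a sketch; the population size, seed, and true proportion are illustrative choices):

```python
import random

def randomized_response(has_attribute):
    """Answer "Do you possess attribute A?" via the coin-toss protocol."""
    if random.random() < 0.5:        # first toss: heads -> answer honestly
        return has_attribute
    return random.random() < 0.5     # tails -> second toss decides the answer

def estimate_proportion(answers):
    """Invert E[yes rate] = 1/4 + p/2 to recover p from the noisy answers."""
    yes_rate = sum(answers) / len(answers)
    return 2 * (yes_rate - 0.25)

random.seed(0)
true_p = 0.3
population = [random.random() < true_p for _ in range(100_000)]
answers = [randomized_response(x) for x in population]
estimate = estimate_proportion(answers)   # close to 0.3
```

Each individual answer remains deniable, yet the aggregate estimate converges to the true proportion as the sample grows.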

In particular, if attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, since every person has some probability of a "Yes" response, whatever their true status.

Although this example, inspired by randomized response, might suggest that the technique applies to microdata (i.e., releasing datasets containing each individual response), differential privacy by definition excludes microdata release and applies only to queries (i.e., aggregations of individual responses into one result), since releasing individual responses would violate its requirements, more specifically the plausible deniability that a subject did or did not participate.[7][8]

Formal definition and example application

Let ε be a positive real number and A be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let im A denote the image of A. The algorithm A is ε-differentially private if for all datasets D1 and D2 that differ on a single element (i.e., the data of one person), and all subsets S of im A,

  Pr[A(D1) ∈ S] ≤ exp(ε) · Pr[A(D2) ∈ S],

where the probability is taken over the randomness used by the algorithm.[6]

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, assume we have a database of medical records D1 where each record is a pair (Name, X), where X is a Boolean denoting whether a person has diabetes (1) or not (0):

Name Has Diabetes (X)
Ross 1
Monica 1
Joey 0
Phoebe 0
Chandler 1

Now suppose a malicious user (often termed an adversary) wants to find out whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query Qi that returns the partial sum of the first i rows of column X in the database. In order to find Chandler's diabetes status the adversary executes Q5(D1) and Q4(D1), then computes their difference. In this example, Q5(D1) = 3 and Q4(D1) = 2, so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.
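The differencing attack above can be reproduced in a few lines (a sketch; the data and the query follow the example):

```python
# The toy medical database from the example above.
db = [("Ross", 1), ("Monica", 1), ("Joey", 0), ("Phoebe", 0), ("Chandler", 1)]

def Q(i, database):
    """Partial-sum query: the sum of the first i values of column X."""
    return sum(x for _name, x in database[:i])

# Knowing Chandler sits in row 5, two innocuous-looking aggregate
# queries expose his individual record.
chandler_status = Q(5, db) - Q(4, db)
print(chandler_status)  # 1
```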

Continuing this example, if we construct D2 by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish D2 from D1 by computing Q5 for each dataset. If the adversary were required to receive the values Qi(D) via an ε-differentially private algorithm, for a sufficiently small ε, then he or she would be unable to distinguish between the two datasets.


Sensitivity

Let d be a positive integer, 𝒟 be a collection of datasets, and f : 𝒟 → ℝ^d be a function. The sensitivity[9] of a function, denoted Δf, is defined by

  Δf = max ‖f(D1) − f(D2)‖₁,

where the maximum is over all pairs of datasets D1 and D2 in 𝒟 differing in at most one element and ‖·‖₁ denotes the ℓ1 norm.

In the example of the medical database above, if we consider f to be the function Q5, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one.
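This can be checked by brute force: flip each entry of the column in turn and observe how much the query output moves (a sketch over the toy database; for the partial-sum query the worst case over all databases happens to be the same):

```python
db = [1, 1, 0, 0, 1]   # the X column of the toy medical database

def q5(values):
    """The partial-sum query over the first five rows."""
    return sum(values[:5])

def max_change_over_neighbors(f, values):
    """Largest |f(D1) - f(D2)| over neighbors of `values` that differ
    from it in exactly one (Boolean) entry."""
    base = f(values)
    changes = []
    for i in range(len(values)):
        neighbor = list(values)
        neighbor[i] = 1 - neighbor[i]   # flip one person's record
        changes.append(abs(f(neighbor) - base))
    return max(changes)

print(max_change_over_neighbors(q5, db))  # 1
```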

There are techniques (described below) that allow us to create a differentially private algorithm for functions with low sensitivity.

Trade-off between utility and privacy

There is a trade-off between the accuracy of the statistics estimated in a privacy-preserving manner and the privacy parameter ε.[10][11][12][13] This trade-off must also account for the number of queries (already made and expected in the future): under sequential composition, the effective privacy parameter is the per-query ε multiplied by the number of queries.

Other notions of differential privacy

Since differential privacy is considered to be too strong for some applications, many weakened versions of privacy have been proposed. These include (ε, δ)-differential privacy,[14] randomised differential privacy,[15] and privacy under a metric.[16]

Differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism[17] and posterior sampling,[18] sample from a problem-dependent family of distributions instead.

The Laplace mechanism

Many differentially private methods add controlled noise to functions with low sensitivity.[9] The Laplace mechanism adds Laplace noise, i.e. noise from the Laplace distribution, which can be expressed by the probability density function noise(y) ∝ exp(−|y|/λ), and which has mean zero and standard deviation √2·λ. In our case we define the output function of A as the real-valued function (called the transcript output by A)

  TA(x) = f(x) + Y,

where Y ~ Lap(λ) and f is the original real-valued query/function we planned to execute on the database. Now clearly TA(x) can be considered to be a continuous random variable, where

  pdf(TA,D1(x) = t) / pdf(TA,D2(x) = t) = noise(t − f(D1)) / noise(t − f(D2)),

which is at most exp(|f(D1) − f(D2)| / λ) ≤ exp(Δ(f)/λ). We can consider Δ(f)/λ to be the privacy factor ε. Thus T follows a differentially private mechanism (as can be seen from the definition above). If we try to use this concept in our diabetes example, then it follows from the fact derived above that in order to have A as the ε-differentially private algorithm we need to have λ = 1/ε. Though we have used Laplace noise here, other forms of noise, such as Gaussian noise, can be employed, but they may require a slight relaxation of the definition of differential privacy.[19]
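A minimal sketch of the Laplace mechanism in Python (the function names are mine, and Laplace noise is sampled as a difference of two exponentials rather than via a dedicated library routine):

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release true_answer plus Laplace noise of scale sensitivity/epsilon."""
    scale = sensitivity / epsilon    # lambda = Delta(f) / epsilon
    # The difference of two iid exponential samples with mean `scale`
    # is Laplace-distributed with scale `scale` and mean zero.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_answer + noise

random.seed(0)
db = [1, 1, 0, 0, 1]   # the X column of the toy medical database
# Q5 has sensitivity 1, so scale = 1/epsilon gives an epsilon-DP release.
noisy_count = laplace_mechanism(sum(db), sensitivity=1, epsilon=0.5)
```

A smaller ε means a larger noise scale, and therefore stronger privacy at the cost of accuracy.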


Sequential composition

If we query an ε-differentially private mechanism t times, and the randomization of the mechanism is independent for each query, then the result would be (ε·t)-differentially private. In the more general case, if there are n independent mechanisms A1, …, An, whose privacy guarantees are ε1, …, εn differential privacy, respectively, then any function g of them, g(A1, …, An), is (ε1 + ⋯ + εn)-differentially private.[20]
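A sketch of budget accounting under this theorem (the accountant class and its names are illustrative, not a standard API; the noise step reuses the Laplace construction above):

```python
import random

class PrivacyAccountant:
    """Track the total epsilon spent under sequential composition."""

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0

    def query(self, true_answer, sensitivity, epsilon):
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon   # per-query epsilons add up across queries
        scale = sensitivity / epsilon
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return true_answer + noise

acct = PrivacyAccountant(budget=1.0)
for _ in range(4):
    acct.query(3, sensitivity=1, epsilon=0.25)  # 4 queries spend 4 * 0.25
```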

Parallel composition

Furthermore, if the previous mechanisms are computed on disjoint subsets of the private database, then the function g would be (maxi εi)-differentially private instead.[20]

Group privacy

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can tell whether one particular participant submitted their information. This is also extendable if we want to protect databases differing in c rows, which amounts to ensuring that no adversary with arbitrary auxiliary information can tell whether c particular participants submitted their information. This can be achieved because if c items change, the probability dilation is bounded by exp(ε·c) instead of exp(ε),[19] i.e., for D1 and D2 differing on c items:

  Pr[A(D1) ∈ S] ≤ exp(ε·c) · Pr[A(D2) ∈ S].

Thus setting ε instead to ε/c achieves the desired result (protection of c items). In other words, instead of having each item ε-differentially private protected, now every group of c items is ε-differentially private protected (and each item is (ε/c)-differentially private protected).

Stable transformations

A transformation T is c-stable if the Hamming distance between T(A) and T(B) is at most c times the Hamming distance between A and B for any two databases A, B. Theorem 2 in [20] asserts that if there is a mechanism M that is ε-differentially private, then the composite mechanism M ∘ T is (ε·c)-differentially private.

This could be generalized to group privacy, as the group size could be thought of as the Hamming distance h between A and B (where A contains the group and B doesn't). In this case M ∘ T is (ε·c·h)-differentially private.

Adoption of differential privacy in real-world applications

Several uses of differential privacy in practice are known to date:

  • U.S. Census Bureau, for showing commuting patterns.[21]
  • Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings[22] (RAPPOR's open-source implementation).
  • Google, for sharing historical traffic statistics.[23]
  • On June 13, 2016 Apple announced its intention to use differential privacy in iOS 10 to improve its intelligent assistance and suggestions technology.[24]
  • Some initial research has been done into practical implementations of differential privacy in data mining models.[25]

References


  1. ^ Dwork, Cynthia (2006). "Differential Privacy": 1–12. 
  2. ^ Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third conference on Theory of Cryptography (TCC'06), Shai Halevi and Tal Rabin (Eds.). Springer-Verlag, Berlin, Heidelberg, 265–284. DOI=10.1007/11681878_14
  3. ^ Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202–210.
  4. ^ Hilton, Michael. "Differential Privacy: A Historical Survey" (PDF).
  5. ^ Dwork, Cynthia (2008-04-25). "Differential Privacy: A Survey of Results". In Agrawal, Manindra; Du, Dingzhu; Duan, Zhenhua; Li, Angsheng. Theory and Applications of Models of Computation. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–19. doi:10.1007/978-3-540-79228-4_1. ISBN 9783540792277. 
  6. ^ a b The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth. Foundations and Trends in Theoretical Computer Science. Vol. 9, no. 3–4, pp. 211‐407, Aug. 2014. DOI=10.1561/0400000042
  7. ^ Dwork, Cynthia. "A firm foundation for private data analysis." Communications of the ACM 54.1 (2011): 86–95, supra note 19, page 91.
  8. ^ Bambauer, Jane, Krishnamurty Muralidhar, and Rathindra Sarathy. "Fool's gold: an illustrated critique of differential privacy." Vand. J. Ent. & Tech. L. 16 (2013): 701.
  9. ^ a b Calibrating Noise to Sensitivity in Private Data Analysis by Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith In Theory of Cryptography Conference (TCC), Springer, 2006. DOI=10.1007/11681878_14
  10. ^ A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, pages 351–360. ACM New York, NY, USA, 2009.
  11. ^ H. Brenner and K. Nissim. Impossibility of Differentially Private Universally Optimal Mechanisms. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2010.
  12. ^ R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong. Publishing set-valued data via differential privacy. The Proceedings of the VLDB Endowment (PVLDB), 4(11):1087–1098, August 2011. VLDB Endowment.
  13. ^ N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 493–501, San Diego, CA: ACM Press, August 2011.
  14. ^ Dwork, Cynthia, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. "Our data, ourselves: Privacy via distributed noise generation." In Advances in Cryptology-EUROCRYPT 2006, pp. 486–503. Springer Berlin Heidelberg, 2006.
  15. ^ Hall, Rob, Alessandro Rinaldo, and Larry Wasserman. "Random differential privacy." arXiv preprint arXiv:1112.2680 (2011).
  16. ^ Chatzikokolakis, Konstantinos, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. "Broadening the scope of Differential Privacy using metrics." In Privacy Enhancing Technologies, pp. 82–102. Springer Berlin Heidelberg, 2013.
  17. ^ F. McSherry and K. Talwar. Mechanism Design via Differential Privacy. Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, 2007.
  18. ^ Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, Benjamin Rubinstein. Robust and Private Bayesian Inference. Algorithmic Learning Theory 2014
  19. ^ a b Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1
  20. ^ a b c Privacy integrated queries: an extensible platform for privacy-preserving data analysis by Frank D. McSherry. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD), 2009. DOI=10.1145/1559845.1559850
  21. ^ Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. "Privacy: Theory meets Practice on the Map". In Proceedings of the 24th International Conference on Data Engineering, (ICDE) 2008.
  22. ^ Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova. "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response". In Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS), 2014.
  23. ^ Tackling Urban Mobility with Technology by Andrew Eland. Google Policy Europe Blog, Nov 18, 2015.
  24. ^ "Apple - Press Info - Apple Previews iOS 10, the Biggest iOS Release Ever". Apple. Retrieved 16 June 2016. 
  25. ^ Fletcher, Sam; Islam, Md Zahidul (July 2017). "Differentially private random decision forests using smooth sensitivity". Expert Systems with Applications. 78: 16–31. arXiv:1606.03572 . doi:10.1016/j.eswa.2017.01.034. 
