Privacy-preserving data mining
R. Agrawal and R. Srikant
SIGMOD Rec.  29  439--450  (2000)
A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
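The perturb-then-reconstruct idea can be illustrated with a minimal moment-matching sketch. This is a deliberate simplification: the paper's actual procedure is an iterative Bayesian reconstruction of the full distribution, and the attribute and noise parameters below are invented for illustration.

```python
import random
import statistics

random.seed(0)

# Original sensitive values (e.g., an "age"-like attribute); the miner
# never sees these directly.
original = [random.gauss(40, 10) for _ in range(10000)]

# Each record is perturbed with independent noise drawn from a publicly
# known distribution before it leaves the user's machine.
NOISE_MEAN, NOISE_SD = 0.0, 25.0
perturbed = [x + random.gauss(NOISE_MEAN, NOISE_SD) for x in original]

# Individual perturbed records look nothing like the originals, but the
# aggregate distribution is still estimable: with independent noise,
# means add and variances add, so both can be subtracted back out.
est_mean = statistics.fmean(perturbed) - NOISE_MEAN
est_var = statistics.variance(perturbed) - NOISE_SD ** 2
```

Here `est_mean` and `est_var` recover the original mean 40 and variance 100 to within sampling error, even though each perturbed record is dominated by noise.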
On the design and quantification of privacy preserving data mining algorithms
D. Agrawal and C. C. Aggarwal
    247--255  (2001)
The increasing ability to track and collect large amounts of data with the use of current hardware technology has led to an interest in the development of data mining algorithms which preserve user privacy. A recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. This method is able to retain privacy while accessing the information implicit in the original attributes. The distribution reconstruction process naturally leads to some loss of information which is acceptable in many practical situations. This paper discusses an Expectation Maximization (EM) algorithm for distribution reconstruction which is more effective than the currently available method in terms of the level of information loss. Specifically, we prove that the EM algorithm converges to the maximum likelihood estimate of the original distribution based on the perturbed data. We show that when a large amount of data is available, the EM algorithm provides robust estimates of the original distribution. We propose metrics for quantification and measurement of privacy-preserving data mining algorithms. Thus, this paper provides the foundations for measurement of the effectiveness of privacy preserving data mining algorithms. Our privacy metrics illustrate some interesting results on the relative effectiveness of different perturbing distributions.
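A toy version of such an EM reconstruction, assuming a small discrete attribute and a publicly known perturbing distribution. The support, probabilities, and sample size are invented for illustration; the paper works with general distributions.

```python
import random
from collections import Counter

random.seed(1)

X_VALS = [0, 1, 2]                   # support of the original attribute
TRUE_P = [0.2, 0.5, 0.3]             # unknown to the miner
NOISE = {-1: 0.25, 0: 0.5, 1: 0.25}  # public perturbing distribution

# Each user reports z = x + y, with y drawn from the public noise distribution.
nv, np_ = list(NOISE), list(NOISE.values())
z = [random.choices(X_VALS, TRUE_P)[0] + random.choices(nv, np_)[0]
     for _ in range(10000)]
freq = Counter(z)

# EM: the E-step computes each record's posterior over its original value;
# the M-step re-estimates the distribution as the average posterior.
theta = [1.0 / len(X_VALS)] * len(X_VALS)
for _ in range(30):
    counts = [0.0] * len(X_VALS)
    for zi, n_zi in freq.items():
        post = [theta[k] * NOISE.get(zi - x, 0.0) for k, x in enumerate(X_VALS)]
        s = sum(post)
        for k, p in enumerate(post):
            counts[k] += n_zi * p / s
    theta = [c / len(z) for c in counts]
```

With 10,000 perturbed reports, `theta` converges near the hidden [0.2, 0.5, 0.3], which is the maximum-likelihood behavior the paper proves.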
An Architecture for Privacy Preserving Collaborative Filtering on Web Portals
W. Ahmad and A. Khokhar
    273--278  (2007)
Popular e-commerce portals such as Amazon and eBay require user personal data to be stored on their servers in order to serve these users with personalized recommendations. These recommendations are derived by means of collaborative filtering. Collaborative filtering (CF) is a method for performing automated recommendations based upon the assumption that users who had similar interests in the past will have similar interests in the future. Storing user personal information at such servers has given rise to a number of privacy concerns [13] which are affecting the business of these services [15]. In this paper, we present a novel architecture for privacy-preserving collaborative filtering for these services. The proposed architecture attempts to restore user trust in these services by introducing the notion of 'distributed trust': instead of trusting a single server, a coalition of servers is trusted. Distributing trust makes the proposed architecture fault-resilient and robust against security attacks. Moreover, the architecture employs an efficient crossing-minimization-based biclustering algorithm for collaborative filtering. This algorithm is easily amenable to a privacy-preserving implementation, which makes use of a threshold homomorphic cryptosystem. The proposed algorithm is fully implemented and evaluated with encouraging results.
Potential impact of the HIPAA privacy rule on data collection in a registry of patients with acute coronary syndrome
D. Armstrong and E. Kline-Rogers and S. M. Jani and E. B. Goldman and J. Fang and D. Mukherjee and B. K. Nallamothu and K. A. Eagle
Archives of Internal Medicine  165  1125  (2005)
Background: Implementation of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule has the potential to affect data collection in outcomes research. Methods: To examine the extent to which data collection may be affected by the HIPAA Privacy Rule, we used a quasi-experimental pretest-posttest study design to assess participation rates with informed consent in 2 cohorts of patients eligible for the University of Michigan Acute Coronary Syndrome registry. The pre-HIPAA period included telephone interviews conducted at 6 months that sought verbal informed consent from patients. In the post-HIPAA period, informed consent forms were mailed to ask for permission to call to conduct a telephone interview. The primary outcome measure was the percentage of patients who provided consent. Incremental costs associated with the post-HIPAA period were also assessed. Results: The pre-HIPAA period included 1221 consecutive patients with acute coronary syndrome, and the post-HIPAA period included 967 patients. Consent for follow-up declined from 96.4% in the pre-HIPAA period to 34.0% in the post-HIPAA period (P<.01). In general, patients who returned written consent forms during the post-HIPAA period were older, were more likely to be married, and had lower mortality rates at 6 months. Incremental costs for complying with the HIPAA Privacy Rule were $8704.50 for the first year and $4558.50 annually thereafter. Conclusions: The HIPAA Privacy Rule significantly decreases the number of patients available for outcomes research and introduces selection bias in data collection for patient registries.
Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography
L. Backstrom and C. Dwork and J. Kleinberg
    181--190  (2007)
In a social network, nodes correspond to people or other social entities, and edges correspond to social links between them. In an effort to preserve privacy, the practice of anonymization replaces names with meaningless unique identifiers. We describe a family of attacks such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes.
A face is exposed for AOL searcher no. 4417749
M. Barbaro and T. Zeller and S. Hansell
New York Times  9    (2006)
Enhancing privacy and preserving accuracy of a distributed collaborative filtering
S. Berkovsky and Y. Eytani and T. Kuflik and F. Ricci
    9--16  (2007)
Collaborative Filtering (CF) is a powerful technique for generating personalized predictions. CF systems are typically based on a central storage of user profiles used for generating the recommendations. However, such centralized storage introduces a severe privacy breach, since the profiles may be accessed for purposes, possibly malicious, not related to the recommendation process. Recent researches proposed to protect the privacy of CF by distributing the profiles between multiple repositories and exchange only a subset of the profile data, which is useful for the recommendation. This work investigates how a decentralized distributed storage of user profiles combined with data modification techniques may mitigate some privacy issues. Results of experimental evaluation show that parts of the user profiles can be modified without hampering the accuracy of CF predictions. The experiments also indicate which parts of the user profiles are most useful for generating accurate CF predictions, while their exposure still keeps the essential privacy of the users.
Practical privacy: the SuLQ framework
A. Blum and C. Dwork and F. McSherry and K. Nissim
    128--138  (2005)
We consider a statistical database in which a trusted administrator introduces noise to the query responses with the goal of maintaining privacy of individual database entries. In such a database, a query consists of a pair (S, f) where S is a set of rows in the database and f is a function mapping database rows to {0, 1}. The true answer is Σ_{i∈S} f(d_i), and a noisy version is released as the response to the query. Results of Dinur, Dwork, and Nissim show that a strong form of privacy can be maintained using a surprisingly small amount of noise -- much less than the sampling error -- provided the total number of queries is sublinear in the number of database rows. We call this query and (slightly) noisy reply the SuLQ (Sub-Linear Queries) primitive. The assumption of sublinearity becomes reasonable as databases grow increasingly large. We extend this work in two ways. First, we modify the privacy analysis to real-valued functions f and arbitrary row types, as a consequence greatly improving the bounds on noise required for privacy. Second, we examine the computational power of the SuLQ primitive. We show that it is very powerful indeed, in that slightly noisy versions of the following computations can be carried out with very few invocations of the primitive: principal component analysis, k-means clustering, the Perceptron Algorithm, the ID3 algorithm, and (apparently!) all algorithms that operate in the statistical query learning model [11].
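A minimal sketch of the noisy-sum primitive described above, assuming Gaussian noise and an invented `age` attribute; the row count, subset, and noise scale are illustrative choices, not values from the paper.

```python
import random

random.seed(2)

# A toy statistical database of rows with a single sensitive attribute.
rows = [{"age": random.randint(18, 90)} for _ in range(5000)]

def sulq(rows, subset, f, noise_sd):
    """The SuLQ-style primitive: the sum of a {0,1}-valued function f over
    the selected rows, released with additive noise."""
    true_answer = sum(f(rows[i]) for i in subset)
    return true_answer + random.gauss(0, noise_sd)

# A {0,1}-valued query as in the paper: count rows with age over 65,
# with noise on the order of the count's own sampling error.
noisy_count = sulq(rows, range(len(rows)),
                   lambda r: 1 if r["age"] > 65 else 0, noise_sd=30)
```

The released answer is useful at the aggregate level (the true count is around a third of the rows) while the noise masks any single row's contribution, which is the primitive the paper builds its algorithms on.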
Collaborative filtering with privacy via factor analysis
J. Canny
    238--245  (2002)
Collaborative filtering (CF) is valuable in e-commerce, and for direct recommendations for music, movies, news etc. But today's systems have several disadvantages, including privacy risks. As we move toward ubiquitous computing, there is a great potential for individuals to share all kinds of information about places and things to do, see and buy, but the privacy risks are severe. In this paper we describe a new method for collaborative filtering which protects the privacy of individual data. The method is based on a probabilistic factor analysis model. Privacy protection is provided by a peer-to-peer protocol which is described elsewhere, but outlined in this paper. The factor analysis approach handles missing data without requiring default values for them. We give several experiments that suggest that this is the most accurate method for CF to date. The new algorithm has other advantages in speed and storage over previous algorithms. Finally, we suggest applications of the approach to other kinds of statistical analyses of survey or questionnaire data.
Collaborative filtering with privacy
J. Canny
    45--57  (2002)
Server-based collaborative filtering systems have been very successful in e-commerce and in direct recommendation applications. In the future, they have many potential applications in ubiquitous computing settings. But today's schemes have problems such as loss of privacy, favoring retail monopolies, and hampering the diffusion of innovations. We propose an alternative model in which users control all of their log data. We describe an algorithm whereby a community of users can compute a public "aggregate" of their data that does not expose individual users' data. The aggregate allows personalized recommendations to be computed by members of the community, or by outsiders. The numerical algorithm is fast, robust and accurate. Our method reduces the collaborative filtering task to an iterative calculation of the aggregate requiring only addition of vectors of user data. Then we use homomorphic encryption to allow sums of encrypted vectors to be computed and decrypted without exposing individual data. We give verification schemes for all parties in the computation. Our system can be implemented with untrusted servers, or with additional infrastructure, as a fully peer-to-peer (P2P) system.
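Canny's protocol uses homomorphic encryption with verification; as a simplified stand-in, additive secret sharing over a prime field illustrates the same core idea that per-user vectors can be summed without any single party seeing an individual's data. The three-aggregator topology and the `share` helper are invented for this sketch.

```python
import random

random.seed(3)

P = 2 ** 61 - 1  # prime modulus for the shares

def share(value, n):
    """Split an integer into n additive shares mod P; any n-1 of them
    are uniformly random and reveal nothing about the value."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Three users' rating vectors; the community wants only the element-wise sum.
vectors = [[5, 0, 3], [1, 4, 0], [2, 2, 2]]
n = len(vectors)

# Each user sends one share of each rating to each aggregator, so no
# aggregator ever holds a complete rating.
aggregators = [[0] * 3 for _ in range(n)]
for vec in vectors:
    for j, v in enumerate(vec):
        for a, s in enumerate(share(v, n)):
            aggregators[a][j] = (aggregators[a][j] + s) % P

# Combining the aggregators' partial sums recovers only the aggregate.
total = [sum(agg[j] for agg in aggregators) % P for j in range(3)]
print(total)  # [8, 6, 5]
```

The recovered `total` equals the element-wise sum of the three vectors, which is exactly the "addition of vectors of user data" the abstract describes, performed without exposing any single user's vector.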
Revealing information while preserving privacy
I. Dinur and K. Nissim
    202--210  (2003)
We examine the tradeoff between privacy and usability of statistical databases. We model a statistical database by an n-bit string d_1, ..., d_n, with a query being a subset q ⊆ [n] to be answered by Σ_{i∈q} d_i. Our main result is a polynomial reconstruction algorithm of data from noisy (perturbed) subset sums. Applying this reconstruction algorithm to statistical databases, we show that in order to achieve privacy one has to add perturbation of magnitude Ω(√n). That is, smaller perturbation always results in a strong violation of privacy. We show that this result is tight by exemplifying access algorithms for statistical databases that preserve privacy while adding perturbation of magnitude Õ(√n). For time-T bounded adversaries we demonstrate a privacy-preserving access algorithm whose perturbation magnitude is ≈ √T.
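A brute-force toy version of the reconstruction attack, with exhaustive search standing in for the paper's polynomial-time algorithm. The database size, query count, and noise bound are invented for illustration; the noise is deliberately kept well below the Ω(√n) magnitude the paper shows is necessary for privacy.

```python
import itertools
import random

random.seed(4)

n = 8
db = [random.randint(0, 1) for _ in range(n)]

# The curator answers random subset-sum queries, perturbing each answer
# by a small bounded amount.
queries = [[i for i in range(n) if random.random() < 0.5] for _ in range(40)]
answers = [sum(db[i] for i in q) + random.uniform(-0.4, 0.4) for q in queries]

def fit(candidate):
    """Squared deviation of a candidate database's subset sums from the
    noisy answers."""
    return sum((sum(candidate[i] for i in q) - a) ** 2
               for q, a in zip(queries, answers))

# Adversary: pick the bit string whose subset sums best fit the noisy
# answers (exhaustive here; the paper achieves this in polynomial time).
recovered = list(min(itertools.product([0, 1], repeat=n), key=fit))
```

Because the perturbation is too small, `recovered` agrees with the hidden `db` on essentially every bit, which is the "strong violation of privacy" the abstract refers to.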
Differential privacy
C. Dwork
ICALP  2  1--12  (2006)
In 1977 Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius' goal along the lines of semantic security cannot be achieved. Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs suggests a new measure, differential privacy, which, intuitively, captures the increased risk to one's privacy incurred by participating in a database. The techniques developed in a sequence of papers [8, 13, 3], culminating in those described in [12], can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.
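One standard mechanism achieving ε-differential privacy for counting queries is the Laplace mechanism; the sketch below is illustrative of the definition rather than taken from this paper, which is about the measure and the impossibility result. The data and parameter choices are invented.

```python
import math
import random

random.seed(5)

def laplace_noise(scale):
    """Inverse-CDF sample from the Laplace(0, scale) distribution."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(values, predicate, epsilon):
    """A counting query has sensitivity 1 (one person changes the count
    by at most 1), so Laplace noise of scale 1/epsilon yields
    epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon)

ages = [random.randint(18, 90) for _ in range(10000)]
noisy_over_65 = private_count(ages, lambda a: a > 65, epsilon=0.5)
```

The released count differs from the truth by only a few units, yet the distribution of the output changes by at most a factor of e^ε whether or not any one individual participates, which is precisely the participation-risk guarantee the abstract describes.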
Limiting privacy breaches in privacy preserving data mining
A. Evfimievski and J. Gehrke and R. Srikant
    211--222  (2003)
There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
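Randomized response is the simplest instance of the randomize-then-compensate approach discussed here; the sketch below shows record-level randomization with aggregate-level compensation. The flip probability and population parameters are invented, and this is a one-bit simplification of the transaction-level randomization the paper studies.

```python
import random

random.seed(6)

P_KEEP = 0.7  # probability a user reports the true bit

def randomize(bit):
    """Warner-style randomized response: report the truth with
    probability P_KEEP, otherwise report the flipped bit."""
    return bit if random.random() < P_KEEP else 1 - bit

# Hidden sensitive bits (true proportion of 1s is about 0.3).
true_bits = [1 if random.random() < 0.3 else 0 for _ in range(20000)]
reported = [randomize(b) for b in true_bits]

# Aggregate compensation: E[reported mean] = p*P_KEEP + (1-p)*(1-P_KEEP),
# an affine map of the true proportion p, so invert it.
r = sum(reported) / len(reported)
p_hat = (r - (1 - P_KEEP)) / (2 * P_KEEP - 1)
```

Any single reported bit is deniable (it is wrong 30% of the time), yet `p_hat` estimates the true proportion closely; the paper's contribution is bounding what an adversary can still infer about individual records from such randomized disclosures.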
Private distributed collaborative filtering using estimated concordance measures
N. Lathia and S. Hailes and L. Capra
    1--8  (2007)
Collaborative filtering has become an established method to measure users' similarity and to make predictions about their interests. However, prediction accuracy comes at the cost of users' privacy: in order to derive accurate similarity measures, users are required to share their rating history with each other. In this work we propose a new measure of similarity, which achieves prediction accuracy comparable to the Pearson correlation coefficient and can successfully be estimated without breaking users' privacy. This novel method works by estimating the number of concordant, discordant and tied pairs of ratings between two users with respect to a shared random set of ratings. In doing so, neither the items rated nor the ratings themselves are disclosed, thus achieving strictly-private collaborative filtering. The technique has been evaluated using the recently released Netflix prize dataset.
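A sketch of the concordant/discordant/tied pair counts that the similarity measure is built on, computed here directly on shared items for clarity. The paper's actual contribution, estimating these counts against a shared random reference set so that neither user reveals real ratings, is not shown; the users and rating model are invented.

```python
import random

random.seed(7)

# Two users' 1-5 star ratings over a shared item set; Bob's tastes track
# Alice's to within one star.
alice = {i: random.randint(1, 5) for i in range(30)}
bob = {i: min(5, max(1, r + random.choice([-1, 0, 1])))
       for i, r in alice.items()}

def concordance(a, b):
    """Concordant minus discordant rating pairs over shared items,
    normalized by the total number of pairs (in the spirit of
    Kendall's tau)."""
    shared = sorted(set(a) & set(b))
    conc = disc = tied = 0
    for x in range(len(shared)):
        for y in range(x + 1, len(shared)):
            da = a[shared[x]] - a[shared[y]]
            db = b[shared[x]] - b[shared[y]]
            if da * db > 0:
                conc += 1
            elif da * db < 0:
                disc += 1
            else:
                tied += 1
    return (conc - disc) / (conc + disc + tied)

sim = concordance(alice, bob)
```

Because the measure depends only on the relative ordering of rating pairs, it can be estimated from comparisons against a reference set rather than from the ratings themselves, which is what makes the private variant possible.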
How To Break Anonymity of the Netflix Prize Dataset
A. Narayanan and V. Shmatikov
As part of the Netflix Prize contest, Netflix recently released a dataset containing movie ratings of a significant fraction of their subscribers. The dataset is intended to be anonymous, and all customer identifying information has been removed. We demonstrate that an attacker who knows only a little bit about an individual subscriber can easily identify this subscriber's record if it is present in the dataset, or, at the very least, identify a small set of records which include the subscriber's record. A successful deanonymization of the Netflix dataset has possible implications for the Netflix Prize, which promises $1 million for a 10% improvement in the quality of Netflix movie recommendations. Given movie ratings of Netflix users who have made their ratings public, or perhaps obtained from public sources such as the Internet Movie Database (IMDb), a contestant may be able to identify their records in the Netflix dataset (if they are among the Netflix subscribers whose anonymized records have been released). With the complete knowledge of a subscriber's IMDb ratings, it becomes much easier to "predict" how he or she rated any given movie on Netflix.
Health Services Research and the HIPAA Privacy Rule
Health services researchers conduct studies designed to improve the quality of health care, reduce its cost, improve patient safety, decrease medical errors, and broaden access to essential services. The evidence-based information produced by these researchers helps health care decision-makers make more informed decisions and improve the quality of health care services. Studies in health services research are often accomplished by analyzing large databases of health care information collected or maintained by health care providers, institutions, payers, and government agencies. With the implementation of the Federal Privacy Rule, health services researchers and database custodians have sought information about the Rule and how it may affect the use and disclosure of data for health services research. As of April 14, 2003, the Privacy Rule requires many health care providers and health insurers to obtain additional documentation from researchers before disclosing personal health information for research and to scrutinize researchers' requests for access to health information more closely. Although the Privacy Rule introduces new rules for the use and disclosure of health information by covered entities, researchers can help to enable their continued access to health data by understanding the Privacy Rule and assisting health care entities covered by the Privacy Rule in meeting its requirements. This factsheet discusses the Privacy Rule and how it permits certain health care providers, health plans, and other entities covered by the Privacy Rule to use and disclose personal health information for health services research.
Additional information about the Privacy Rule can be found in related publications, including: Protecting Personal Health Information in Research: Understanding the HIPAA Privacy Rule; Clinical Research and the HIPAA Privacy Rule; Research Repositories, Databases, and the HIPAA Privacy Rule; Institutional Review Boards and the HIPAA Privacy Rule; and Privacy Boards and the HIPAA Privacy Rule.
Privacy-preserving collaborative filtering using randomized perturbation techniques
H. Polat and W. Du
    625--628  (2003)
Collaborative filtering (CF) techniques are becoming increasingly popular with the evolution of the Internet. To conduct collaborative filtering, data from customers are needed. However, collecting high quality data from customers is not an easy task because many customers are so concerned about their privacy that they might decide to give false information. We propose a randomized perturbation (RP) technique to protect users' privacy while still producing accurate recommendations.
Preserving privacy in collaborative filtering through distributed aggregation of offline profiles
R. Shokri and P. Pedarsani and G. Theodorakopoulos and J.-P. Hubaux
    157--164  (2009)
In recommender systems, usually, a central server needs to have access to users' profiles in order to generate useful recommendations. Having this access, however, undermines the users' privacy. The more information is revealed to the server about the user-item relations, the lower the users' privacy. Yet, hiding part of the profiles to increase privacy comes at the cost of recommendation accuracy or difficulty of implementing the method. In this paper, we propose a distributed mechanism for users to augment their profiles in a way that obfuscates the user-item connection to an untrusted server, with minimal loss in the accuracy of the recommender system. We rely on the central server to generate the recommendations. However, each user stores his profile offline, modifies it by partly merging it with the profiles of similar users through direct contact with them, and only then periodically uploads it to the server. We propose a metric to measure privacy at the system level, using graph matching concepts. Applying our method to the Netflix prize dataset, we show the effectiveness of the algorithm in solving the tradeoff between privacy and accuracy in recommender systems in an applicable way.
Summary of HIPAA Privacy Rule
United States Department of Health and Human Services
    1--23  (2003)
State-of-the-art in privacy preserving data mining
V. S. Verykios and E. Bertino and I. N. Fovino and L. P. Provenza and Y. Saygin and Y. Theodoridis
SIGMOD Rec.  33  50--57  (2004)
We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made.
Local perspective of the impact of the HIPAA privacy rule on research
M. S. Wolf and C. L. Bennett
Cancer  106  474--479  (2005)
Background: The operational and economic impact of the Health Insurance Portability and Accountability Act (HIPAA) of 1996 was evaluated. The setting was a natural experiment which involved a single-site, clinical research study that was initiated before the enactment of HIPAA and subsequently modified to be compliant with the new policy. Methods: A formative assessment was conducted of the recruitment process to a clinical trial evaluating the efficacy of an educational strategy to inform Veterans about the National Cancer Institute/Department of Veterans Affairs cosponsored Selenium and Vitamin E Cancer Prevention Trial (SELECT). Personnel time and costs were determined based on weekly accrual for study periods before and after the implementation of HIPAA. Root cause analysis was used to assess the recruitment protocol and to identify areas for improvement. Results: The implementation of HIPAA resulted in a 72.9% decrease in patient accrual (7.0 patients/wk vs. 1.9 patients/wk, P < 0.001), and a threefold increase in mean personnel time spent recruiting (4.1 hrs/patient vs. 14.1 hrs/patient, P < 0.001) and mean recruitment costs ($49/patient vs. $169/patient, P < 0.001). Upon review of the modified HIPAA-compliant protocol, revisions in the recruitment procedure were adopted. The revised protocol improved weekly accrual (1.9 patients/wk vs. 7.1 patients/wk, P < 0.001) and resulted in improvements in personnel time (5.4 hrs/patient) and recruitment costs ($65/patient). Conclusions: Enactment of HIPAA initially placed a considerable burden on research time and costs. Establishing HIPAA-compliant recruitment policies can overcome some of these obstacles, although recruitment costs and time are likely to be greater than those observed before HIPAA. Cancer 2006. © 2005 American Cancer Society.