RESEARCH ARTICLE Journal of Medical Imaging and Health Informatics

Copyright © 2017 American Scientific Publishers All rights reserved Printed in the United States of America

Vol. 7, 1–8, 2017

Towards Preserving Privacy of Outsourced Genomic Data Over the Cloud

Sidra tul Muntaha1, Abid Khan1∗, Umar Manzoor2, Kinza Sarwar1, Mansoor Ahmed1, Mouzna Tahir1, Adeel Anjum1, Saif ur Rehman1, Masoom Alam1, Nadeem Javaid1, and Mohammed A. Balubaid2

1 Department of Computer Science, COMSATS Institute of Information Technology, Park Road, Chak Shahzad, Islamabad, Pakistan
2 King Abdul Aziz University, Jeddah, KSA

Genomic data holds sensitive information such as ancestry information as well as information regarding tendencies to particular diseases. Such privileged information should not be revealed to an unauthorized user, as it can be abused. Nowadays, it is quite common to outsource genomic data for the purpose of accelerating genomic research. In this paper, we propose a method to protect genomic data privacy using the Paillier cryptosystem and order preserving encryption (OPE). The proposed scheme has been evaluated using performance and security analysis. The results reveal that there is some performance overhead; however, the leakage of single nucleotide polymorphisms (SNPs) is controlled, which was a problem in existing privacy-preserving schemes for genomic data. We have also compared our approach with an existing technique in terms of the percentage of leaked SNPs that are out of the requested range. In existing techniques, the whole short read in which the requested range falls is returned, so all those SNPs that are not in the requested range are considered leaked SNPs.

Keywords: Genomic Data, Cigar String, Single Nucleotide Polymorphism (SNPs), Paillier Encryption, Order Preserving Encryption.

1. INTRODUCTION
In modern molecular biology and genetics,1 2 the genome contains the complete hereditary information of an organism. It is encoded either in deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). The human genome consists of nucleotides, whose sequence represents the genetic information of an individual. The nucleotides that constitute an individual's genome are adenine (A), cytosine (C), guanine (G), and thymine (T); approximately 3.2 billion of these base pairs (nucleotides) combine to form the genome of an individual. The DNA sequence produced by a DNA sequencing technology consists of billions of short reads. Typically, each short read consists of 100 to 400 nucleotides (A, C, G, T), depending on the type of sequencer, and the reads are randomly sampled from the human genome. To produce a sequence alignment map (SAM) file, each read is aligned to its genetic location. A patient's SAM file contains hundreds of billions of short reads. To carry out large-scale research in the medical domain, organizations that hold genomic information need to share genomic sequences that are specific to a person without breaching the privacy of the data subjects. As discovered in recent studies,3 a person's tendency to a variety of diseases can be deduced from his SNPs.

∗ Author to whom correspondence should be addressed.


Cloud computing has become a cheap alternative; as a result, many companies and individuals are tempted to store their data using various cloud storage services. However, privacy-sensitive data,4 5 e.g., personal health records and financial data,6 7 among many others, may contain private information which should not be revealed to unauthorized parties. When genomic data is outsourced, the user loses control over that data. The cloud service provider (CSP) is always assumed to be untrusted.8 Its misbehavior can be either intentional (e.g., revealing the sensitive genomic data to an unauthorized medical unit by colluding with it) or unintentional (e.g., leaking some extra information that the researcher has not requested). Several studies9–11 have attempted to address the privacy concerns of genomic data. However, these schemes have the following limitations:
• They require very active involvement of the patient; e.g., the patient himself performs the decryption of the end results. Furthermore, it is quite risky to keep sensitive information such as secret keys at the patient side.12
• If a single key is decomposed and shared by many users, there is a risk of collusion; e.g., a collusion attack may occur if the secret key is shared between researchers and the CSP.


doi:10.1166/jmihi.2017.2186


The same collusion risk arises with secret key sharing schemes, in which N parts of the key are distributed among multiple parties and only K parts are sufficient to recover the key.13


So our goal is to propose a scheme in which not only is the patient's involvement minimized, but the privacy of the genomic data is also preserved. Our proposed scheme first obfuscates and then encrypts the genomic data before it is outsourced to a cloud storage server. A medical unit transfers genomic data to a CSP. The CSP can then answer the researcher's queries by collaborating with a mediator and a genome repository manager (GRM). The mediator is responsible for de-obfuscating, masking, and re-encrypting the data.
Challenges in performing genetic tests: To perform computational genetic tests, a number of challenges must be addressed.
1. Accessibility: Due to the sensitivity and the huge size of genomic data, deciding how and where the data should be stored is challenging. Should it be given to an individual or to a trusted party? Would it be safe there? Answering these questions is a challenge.
2. Privacy: The DNA information of any individual is so sensitive that it should not be disclosed. Only authorized users should be allowed to run specific genetic tests, and a test should not surrender one's whole genomic sequence, but only a subset of it.
3. Long-term data safety: A genomic sequence, besides identifying its owner, also reveals a lot of information about his family members. This raises the problem of long-term data safety, even if the genomic data is in encrypted form, as encryption schemes gradually weaken over time.
4. Accuracy and accountability: Genomic tests should be conducted accurately. Furthermore, guarantees must be provided to assure that tests are performed on the intended genomic data only.
In order to address these challenges, in this paper we use order-preserving encryption along with the Paillier cryptosystem, so that the privacy of genomic data is preserved.

Specifically, the contributions of this paper are:
1. Unlike existing schemes,9 10 12 35 which assume a weaker attacker model, we assume a stronger attacker model in which the entities involved can collude with each other to infer a patient's genomic data.
2. We present a framework for genomic privacy using OPE and obfuscation. We provide algorithms to obfuscate a SAM file, encrypt a SAM file, and query an encrypted SAM file.
3. We evaluate our proposed scheme and provide a comparison with an existing scheme. Performance and privacy analysis of the proposed scheme is also provided.
• For performance evaluation, we measure the obfuscation time, the encryption time, and the overhead of query time over encrypted genomic data for different key sizes.
• For evaluating privacy preservation, we compare the privacy leakage of our scheme with a recently proposed scheme to show how the proposed scheme avoids leakage of SNPs.
Organization of the paper: The rest of the paper is organized as follows. Related work along with its limitations is presented in Section 2. Necessary background is provided in Section 3. Section 4 describes our order preserving genomic privacy approach. In Section 5, we provide an evaluation of the proposed scheme: performance and privacy leakage analysis is provided, and a comparison with the existing scheme is made to demonstrate the effectiveness of our approach. Section 6 concludes the paper and provides possible future extensions of our work.

2. RELATED WORK

Broadly speaking, research on genomic privacy techniques can be classified into software-based and hardware-based approaches. Software-based approaches can be classified into cryptographic and non-cryptographic approaches. Cryptographic approaches can be further classified into those based on homomorphic encryption schemes and those based on garbled circuits (GC). There are a number of non-cryptographic techniques as well, such as those based on de-identification,14 DNALA (DNA lattice anonymization),15 differential privacy,16 and program specialization.10
The framework of Ayday et al.9 17 18 protects a patient's genomic data at the storage and processing unit by using homomorphic encryption and privacy-preserving integer comparison to conduct tests on genomic data. Homomorphic encryption allows operations to be performed on encrypted data without needing to decrypt it; the operations performed on encrypted data thus match the results of the corresponding operations performed on plaintext. For homomorphic encryption, a variant of the Paillier cryptosystem is used in Refs. [9, 17, 18], which is a public key cryptosystem supporting some homomorphic operations. The proposed scheme also uses the DGK cryptosystem, an optimized integer comparison cryptosystem. Moreover, the relationship between storage cost and patient privacy is also studied by Ayday et al.18 The scheme of Kantarcioglu et al.19 performs scientific examinations on coherent genomic data using a homomorphic public key encryption (HPE) scheme. The scheme achieves data quality by avoiding data perturbation, in which noise is added to the original data to provide privacy; data privacy is instead achieved using homomorphic encryption. To calculate the likeness of DNA sequences, Jha et al.20 proposed a technique that uses garbled circuits to privately compute the edit distance of two strings. Garbled circuits (GC) is a method proposed by Yao21 for secure multiparty computation. Yao's GC protocol is only secure against an adversary in the honest-but-curious adversary model. This protocol was scalable, but extremely slow, requiring huge memory. The scheme of Bruekers et al.22 performed search and matching in DNA databases in a privacy-preserving manner using the private matching protocol of Freedman et al.;23 the scheme is based on homomorphic encryption. De Cristofaro et al.24 achieved genomic privacy using conditional oblivious transfer, garbled circuits, and homomorphic encryption. The protocol has also been implemented in a toolkit called "GenoDroid," in which real-world practicality and usability were investigated for some of these methods. Canim et al.25 used cryptographic hardware to store and secure biomedical data. The presented technique removes the need for multiple third parties by collocating the services that store and process sensitive biomedical data on a secure co-processor (SCP). The presented protocol can be used to process and perform a series of experiments on genomic data, and it was claimed that such an approach can run efficiently for typical biomedical investigations. Non-cryptographic approaches to genomic privacy include methods based on de-identification,14 DNALA (DNA lattice anonymization),15 differential privacy,16 and program specialization.10


Malin et al.14 used the de-identification technique (i.e., removal of direct identifiers such as name, residential address, or social security number) to protect the privacy of genomic data. However, such a protection mechanism is considered weak and vulnerable to various re-identification attacks.26 27 There are identifying software programs that can link de-identified records to named people.28 Improved techniques such as generalization lattices have proved to be more effective. Malin et al.15 proposed DNALA (DNA lattice anonymization), an extension of the k-anonymity scheme. The proposed scheme anonymizes DNA sequences by generalizing each sequence and its most similar sequence to a common sequence, providing the protection of k-anonymity with k equal to 2. However, such techniques rely on sanitizing and generalizing genomic data, which reduces the information in the data, resulting in less usability for bioinformatics testing. Recently, the use of differential privacy has been presented by Fienberg et al.16 to ensure that the statistics released from a genomic database do not reveal whether any particular individual's record is included. However, the privacy level achieved is not high enough. The scheme of Chen et al.29 outsources genomic data to the cloud and preserves the privacy of sensitive DNA information. The scheme uses a seed-and-extend method for sequence alignment (its main focus is alignment). Furthermore, Wang et al.10 proposed a privacy-preserving architecture based on program specialization. This program specialization is performed on sanitized data, since reads do not carry identifying information before mapping onto the reference genome.

3. PRELIMINARIES
In this section, we provide the necessary background knowledge. Since our scheme is based on order preserving encryption and the Paillier cryptosystem, we explain the working of these two schemes; for further details, we refer the reader to Refs. [30, 31].
3.1. Order Preserving Encryption Scheme
Order preserving encryption (OPE) is an encryption scheme in which the encryption function retains in the ciphertexts the numerical ordering of the corresponding plaintexts. Historically, OPE appeared in the form of one-part codes, where a single codebook serves for both encryption and decryption because the plaintexts and their corresponding ciphertexts are both arranged in alphabetical or numerical order; one-part codes were used as early as World War I. Agrawal et al.37 proposed a formal description of OPE. Such schemes are of great interest because they allow range queries over encrypted data: a distant, untrustworthy database server can index the (sensitive) data it receives in encrypted form in a data structure that allows efficient range queries. By efficient, we mean logarithmic in the size of the database, since performing linear work on each query is prohibitively slow for large databases. Besides allowing efficient range queries, OPE also allows indexing and query processing to be done as efficiently as for unencrypted data: using standard tree-based data structures, the server locates the requested ciphertexts in logarithmic time, since a range query for [a, b] simply consists of the encryptions of the positions a and b. Since then, there have been many applications of OPE.
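For intuition, Listing 1 is a toy Java sketch of the property we rely on: if alignment positions are mapped, in sorted order, to strictly increasing random values, then range predicates can be evaluated directly on the ciphertexts. This stateful mapping is only illustrative; it is not the OPE construction of Agrawal et al.,37 it is not secure, and all names and values in it are hypothetical.

Listing 1 (Illustrative order-preserving mapping in Java).

import java.security.SecureRandom;
import java.util.Map;
import java.util.TreeMap;

public class ToyOpeDemo {
    public static void main(String[] args) {
        // Hypothetical SAM start alignments to be protected.
        long[] startAlignments = {15, 4000, 250, 1};
        SecureRandom rng = new SecureRandom();

        // Build the order-preserving table: plaintexts in sorted order, each ciphertext
        // equal to the previous one plus a random positive gap, so order is preserved.
        TreeMap<Long, Long> table = new TreeMap<>();
        for (long p : startAlignments) table.put(p, 0L);
        long cipher = 0;
        for (Map.Entry<Long, Long> e : table.entrySet()) {
            cipher += 1 + rng.nextInt(1000);
            e.setValue(cipher);
        }

        // A range query on ciphertexts between E(15) and E(250) selects exactly the
        // records whose plaintext start alignment lies between 15 and 250.
        long low = table.get(15L), high = table.get(250L);
        for (Map.Entry<Long, Long> e : table.entrySet()) {
            boolean inRange = e.getValue() >= low && e.getValue() <= high;
            System.out.println("position " + e.getKey() + " -> E = " + e.getValue()
                    + (inRange ? " (returned)" : " (filtered out)"));
        }
    }
}

In the proposed scheme, only the CI and the MD hold the OPE key; the GRM merely compares encrypted alignments against the encrypted bounds E (U and L) supplied by the MD, without learning the underlying positions.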

3.2. Paillier Cryptosystem
The Paillier cryptosystem48 is a probabilistic public key cryptographic technique, named after Pascal Paillier, who invented it in 1999. Its security rests on the assumption that computing n-th residue classes is computationally hard. We chose the Paillier cryptosystem in our work because we deal with human genomic data, which is huge in volume, and this cryptosystem is efficient in managing large amounts of sensitive data. It is an additively homomorphic cryptosystem, meaning that given only the public key and the encryptions of messages m1 and m2, one can compute an encryption of m1 + m2. The Paillier cryptosystem48 has three steps: key generation, encryption, and decryption.
3.2.1. Key Generation and Encryption in the Paillier Cryptosystem
Choose two large prime numbers p and q at random, which are kept private, and set the public parameter n = p · q. Choose g ∈ Z*_(n^2) such that n divides the order of g (g = n + 1 is a common choice). To encrypt a plaintext m ∈ Z_n into a ciphertext c ∈ Z*_(n^2):
1. Choose a random r ∈ Z*_n.
2. Compute c = g^m · r^n mod n^2.
3.2.2. Decryption
Let c ∈ Z*_(n^2) be the ciphertext to decrypt. With λ = lcm(p − 1, q − 1), L(u) = (u − 1)/n, and μ = (L(g^λ mod n^2))^(−1) mod n, the plaintext message is recovered as m = L(c^λ mod n^2) · μ mod n.
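To make the additive homomorphism and the formulas above concrete, Listing 2 gives a minimal, self-contained Java sketch of the textbook Paillier cryptosystem, using the common choice g = n + 1. It is illustrative only: the class and method names are our own, and it omits the hardening and optimizations (e.g., the efficient variant of Ref. [34]) that a deployed implementation would need.

Listing 2 (Textbook Paillier key generation, encryption, decryption, and homomorphic addition in Java).

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierSketch {
    private static final SecureRandom RNG = new SecureRandom();

    final BigInteger n, nSquared, g;      // public key
    private final BigInteger lambda, mu;  // private key

    PaillierSketch(int bits) {
        // Key generation: n = p*q, g = n + 1, lambda = lcm(p-1, q-1),
        // mu = (L(g^lambda mod n^2))^-1 mod n.
        BigInteger p = BigInteger.probablePrime(bits / 2, RNG);
        BigInteger q = BigInteger.probablePrime(bits / 2, RNG);
        n = p.multiply(q);
        nSquared = n.multiply(n);
        g = n.add(BigInteger.ONE);
        lambda = lcm(p.subtract(BigInteger.ONE), q.subtract(BigInteger.ONE));
        mu = L(g.modPow(lambda, nSquared)).modInverse(n);
    }

    // Encryption: c = g^m * r^n mod n^2 for a random r in Z*_n.
    BigInteger encrypt(BigInteger m) {
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), RNG); }
        while (r.signum() == 0 || r.compareTo(n) >= 0 || !r.gcd(n).equals(BigInteger.ONE));
        return g.modPow(m, nSquared).multiply(r.modPow(n, nSquared)).mod(nSquared);
    }

    // Decryption: m = L(c^lambda mod n^2) * mu mod n.
    BigInteger decrypt(BigInteger c) {
        return L(c.modPow(lambda, nSquared)).multiply(mu).mod(n);
    }

    // Additive homomorphism: E(m1) * E(m2) mod n^2 decrypts to m1 + m2.
    BigInteger addCiphertexts(BigInteger c1, BigInteger c2) {
        return c1.multiply(c2).mod(nSquared);
    }

    private BigInteger L(BigInteger u) { return u.subtract(BigInteger.ONE).divide(n); }

    private static BigInteger lcm(BigInteger a, BigInteger b) {
        return a.divide(a.gcd(b)).multiply(b);
    }

    public static void main(String[] args) {
        PaillierSketch paillier = new PaillierSketch(1024);   // 1024-bit key, as in our evaluation
        BigInteger c1 = paillier.encrypt(BigInteger.valueOf(15));
        BigInteger c2 = paillier.encrypt(BigInteger.valueOf(27));
        // Prints 42: the product of the ciphertexts decrypts to the sum of the plaintexts.
        System.out.println(paillier.decrypt(paillier.addCiphertexts(c1, c2)));
    }
}

Running the main method prints 42, illustrating the homomorphic addition property that public key homomorphic schemes such as those of Refs. [9, 17, 18] build on.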

4. ORDER PRESERVING ENCRYPTION BASED GENOMIC DATA PROTECTION
In this section, we describe the proposed protection scheme. Firstly, we describe the entities involved in the scheme and their roles. Secondly, we present the attacker model considered for our scheme. Then we describe the detailed working of our proposed scheme.
4.1. Entities and Their Roles
The entities involved are: (i) patient, (ii) medical unit (MU), (iii) genome repository manager (GRM), (iv) cloud service provider (CSP), and (v) mediator (MD). The entities and their roles are shown in Table I.

Table I. Roles of involved entities.

Entity    Role
Patient   A person whose genomic sequence is to be outsourced
CSP       An entity responsible for providing cloud services such as storage, applications, and computations over genomic sequences
MU        Sends genomic patterns to CSP, and answers queries by collaborating with a mediator
GRM       Assists in retrieving and storing patient genomic sequences
MD        A unit that carries out de-obfuscation, masking, and re-encryption of genomic sequences


4.2. Attacker Model
Unlike previous schemes,9–12 a stronger attacker model is considered in our proposed scheme. We assume that the involved parties can collude with each other to infer genomic data. We consider an attacker with the following capabilities:
• An honest-but-curious party at the GRM may attempt to deduce a patient's genomic sequence stored in the repository. Furthermore, he may attempt to determine the type of test being performed.
• A curious party at the MU, such as a disgruntled employee or an attacker, may be interested in obtaining private genomic data that he/she is not permitted to access.
• MU and GRM may collude to infer information about genomic data.
It is assumed that GRM, MD, and MU are honest-but-curious entities: they follow the protocol and provide only authorized and correct information.
4.3. Proposed Privacy Preserving Scheme for Outsourced Genomic Data
Genomic tests must be performed in a privacy-preserving manner, as genomic data holds private information that is specific to a person's genetic condition and provides information about tendencies to particular diseases. This privileged information could be misused by unauthorized parties; therefore, protecting the privacy of genomic data is of utmost importance. Existing privacy-preserving schemes are mostly based either on homomorphic encryption, such as the Paillier cryptosystem,9 11 or on program specialization techniques.10 These schemes require the frequent involvement of the patient, who holds the keys for decryption, and it is quite risky to keep sensitive information like secret keys at the patient side.12 In Ref. [12], the privacy of genomic data is preserved by partitioning the secret key and giving the parts to different entities to perform encryption and querying. This distribution of partitioned keys may result in collusion, and an intruder may obtain the whole information. Data stored at the central repository is only in encrypted form, so any key leakage puts the central repository at risk. By outsourcing obfuscated and encrypted data, we limit the privacy risk of genomic data, as the central repository contains encrypted data that is also in obfuscated form. So, if an attacker obtains the private key and attacks the central repository, he cannot get any useful information. In our proposed approach, genomic data is outsourced to the GRM in encrypted plus obfuscated form. The unit that obfuscates the genomic data has no direct link with the researcher or MU who queries the genomic data, so it is impossible for the MU or a researcher to obtain the de-obfuscation keys, which an attacker would need in addition to the secret keys for decrypting the genomic data. Without both types of keys, i.e., the private and the de-obfuscation keys, an attacker cannot gain sensitive information from the GRM. At the end, when a response is to be sent, unauthorized SNPs are clipped to avoid privacy leakage. This approach makes the genomic data more secure, which is the utmost requirement of the patient.
4.4. Working of Proposed Scheme
In our scheme, a medical unit (MU) sends genomic data to a cloud service provider (CSP). User queries can be answered by the CSP by collaborating with an MD and GRM. Genomic data is stored at the GRM in a form which is meaningless and has to be transformed into meaningful information, as depicted in Figure 1.


We now discuss how these attacks can be prevented by the proposed scheme. SAM files are generated from binary alignment map (BAM) files and, after obfuscation and encryption, these files are stored at the GRM. To prevent the manager from inferring information and associating it with the conducted test, the genomic data is kept hidden: the identities of the patient are hidden and the genomic data is obfuscated, so it cannot provide useful information to the GRM. If a dishonest employee of the MU tries to infer information, the obfuscated data he retrieves will be of no use. Similarly, if MU and GRM collude, genomic information can be obtained with the help of the private key of the MU, but in the absence of the obfuscation keys the data thus obtained is of no use. In Figure 1, the patient sends his genomic data to a certified institute (CI), which is a trusted party. The CI aligns the genomic data and generates SAM files. It sends the aligned genomic data to the obfuscation unit (OU), which obfuscates the genomic data by adding some raw data into it. After obfuscation, this data is sent back to the CI. An OPE-based scheme is used at the CI to encrypt the cigar string (CS), while the Paillier cryptosystem is used to encrypt the remainder of the short read. The CI then outsources the encrypted, obfuscated genomic data to the GRM, and the mapping tables are sent to the MD. When the MU queries the GRM, it coordinates with the MD. The mediator holds the mapping tables (which hold the start and end alignments of the short reads in the SAM file), the decryption keys, and the de-obfuscation keys. With the help of the keys and mapping tables, the mediator generates the encrypted upper and lower bounds, against which the GRM will provide the data to the MU. This result does not contain any extra information that is out of the requested range. Figure 2 shows the flow of information in the proposed scheme: how a query is processed and how the end results are generated. The following steps illustrate the flow of information; at the end, the MU gets the encrypted results from the GRM.
Step 1: An authorized entity can query the GRM for a particular range. The GRM forwards the encrypted range to the MD.
Step 2: The mapping tables are used by the MD to generate E (U and L), which is then sent to the GRM.
Step 3: The GRM retrieves the E (obfuscated) short reads (short genomic sequences).
Step 4: The retrieved short reads are sent to the MD.
Step 5: The CI keys are used to decrypt and de-obfuscate the short reads. The sensitive bits of data are masked for privacy; furthermore, the unwanted bits of data are also masked so that the returned length remains the same as given by the GRM.
Step 6: The end result is encrypted via the MU public key.
Step 7: The MD sends E (end result) to the GRM.
Step 8: The GRM sends E (end result) to the MU.
The MU makes a request to the GRM for its required genomic sequence. After verifying the authenticity of the requesting source, the GRM forwards E (requested range) to the MD, which generates the encrypted upper and lower bounds E (U and L) of the requested sequence. E (U and L) is then sent back to the GRM. Based on E (U and L), the GRM retrieves the encrypted, obfuscated E (obfuscated) short reads that fall within the range provided by the MD. The retrieved short reads are then sent back to the MD, which performs decryption and de-obfuscation using the key of the CI. The MD then masks the useless and sensitive bits that are not to be shown to the MU. Because of the masking, the length of the short reads remains the same as that of the short reads provided by the GRM, so no curious party can predict the sensitive bits. The end result is generated after masking the short reads. Encryption is performed on the end result via the public key of the MU, and it is delivered to the MU through the GRM.

Fig. 1. Working of the proposed scheme.

4.5. Algorithms
Our proposed scheme is based on the following algorithms.
• GenKeys → CI generates encryption keys (ECI) and obfuscation keys (KO). CI shares its obfuscation key (KO) with the OU for performing obfuscation. Similarly, CI's private key is shared with the MD, which is considered a trusted entity; with this key the MD can perform decryption and de-obfuscation. This algorithm is executed to generate the OPE keys, obfuscation keys, and encryption keys, which will be used to preprocess the SAM file.
• PreprocessFile (File, Keys) → CI preprocesses the input SAM file's short reads one by one. The cigar string of each short read is obfuscated, OPE is used to encrypt the start and end alignments, and Paillier encryption is used to encrypt the remaining information.
• Query (SAMFiles, startPos, endPos) → GRM executes the query for the range requested by the MU over the preprocessed SAM files.

The SAM file is preprocessed before it is handed over to the genome repository manager. The input SAM file is a collection of short reads, so in the preprocessing phase each short read's cigar string is obfuscated and its start and end alignments are encrypted with OPE. These encrypted alignments allow the GRM to query the ranges requested by the MU. The rest of the short read is then encrypted with the Paillier cryptosystem, and the preprocessed record is written to the output file. This process is repeated for all records, and at the end of preprocessing we have another SAM file, which is obfuscated and encrypted. After preprocessing, the SAM file is available to the GRM and the MU can request a nucleotide range from the GRM. Since the alignments are encrypted, they do not match the ranges requested by the MU; the GRM passes these ranges to the MD to obtain the upper and lower bounds, and using these bounds the GRM executes the query over the SAM file. The short reads returned by the query reveal no information to the GRM, which then passes the results to the MD for de-obfuscation and decryption. The query response returns only the cigar strings containing the nucleotides of the requested range, so the MD also gets no additional information from the short reads.

Algorithm 1 (Preprocess File).
1: procedure PreprocessFile(samFile, outputFile, obfuscationKey, opeKey, encryptionKey)
2:   ObfuscateSAMFile(samFile, outputFile, obfuscationKey)
3:   EncryptSAMFile(samFile, outputFile, opeKey, encryptionKey)

Algorithm 2 (Obfuscate SAM File).
1: procedure ObfuscateSAMFile(samFile, outputFile, obfuscationKey)
2:   samfileReader = SAMFileReader(samFile)
3:   samfileWriter = SAMFileWriter(outputFile)
4:   samRecord = samfileReader.readSamRecord()
5:   while samRecord != EOF do            ▷ End of file not reached
6:     if samRecord.isHeader() then
7:       continue
8:     else
9:       obfuscatedCigar = ObfuscationUtil.obfuscate(samRecord.getCigarString(), obfuscationKey)
10:      samRecord.setCigarString(obfuscatedCigar)
11:    endif
12:  endwhile

Algorithm 3 (Encrypt SAM File).
1: procedure EncryptSAMFile(samFile, outputFile, opeKey, encryptionKey)
2:   samfileReader = SAMFileReader(samFile)
3:   samfileWriter = SAMFileWriter(outputFile)
4:   samRecord = samfileReader.readSamRecord()
5:   while samRecord != EOF do            ▷ End of file not reached
6:     if samRecord.isHeader() then
7:       continue
8:     else
9:       startAlignment = OPE.encrypt(samRecord.getStartAlignment(), opeKey)
10:      endAlignment = OPE.encrypt(samRecord.getEndAlignment(), opeKey)
11:      samRecord.setStartAlignment(startAlignment)
12:      samRecord.setEndAlignment(endAlignment)
13:      samRecord = EncryptionUtil.encrypt(samRecord, encryptionKey)
14:      samfileWriter.write(samRecord)
15:    endif
16:  endwhile

Algorithm 4 (Query Encrypted SAM File).
1: procedure QuerySAMFile(encSamFile, startPosition, endPosition)
2:   samfileReader = EncryptedSAMFileReader(encSamFile)
3:   samRecord = samfileReader.readSamRecord()
4:   CigarString list
5:   while samRecord != EOF do            ▷ End of file not reached
6:     if samRecord.isHeader() then
7:       continue
8:     else
9:       startAlignment = samRecord.getStartAlignment()
10:      endAlignment = samRecord.getEndAlignment()
11:      if startAlignment or endAlignment is between startPosition and endPosition then
12:        list.add(samRecord.getCigarString())
13:      endif
14:    endif
15:  endwhile
16:  return list

Fig. 2. Flow of the information in the proposed scheme.

5. EVALUATION
To evaluate our proposed scheme, we conducted multiple tests on CentOS 6.3 (64-bit) with an Intel Core (TM) i3-M330 CPU @ 2.13 GHz, 2 GB RAM, and 2 processors (2 cores each), using JDK 1.7 and Apache. The implementation was done in the Java and Ruby on Rails programming languages, and we used the Picard API.30 For encryption, we used Paillier asymmetric encryption with a key length of 1024 bits. Our data consists of an 82 MB SAM file which is freely available on GitHub.32 33 We conducted experiments for performance analysis and privacy leakage analysis.
5.1. Performance Analysis
We have performed tests on encryption with different numbers of short reads. We evaluate the performance of our scheme as follows:
1. Time needed to obfuscate the file with different numbers of short reads.
2. Time needed to encrypt the file with different numbers of short reads.
3. Time taken in query response (with and without encryption).
4. Encryption with different key sizes.
Table II shows the effect of changing the number of short reads on the performance parameters. When we increase or decrease the number of short reads and keep the start and end alignments the same, the number of records returned differs, and the greater the number of short reads, the more time is taken to obfuscate, encrypt, and return the results. We calculated the overhead between query time with encryption and query time without encryption; Figure 3 shows the overhead introduced by the proposed scheme. Starting with 100 short reads, the number of returned records is 100; without encryption, the time taken to perform the query on plaintext and return the results to the medical unit is 95 ms, whereas if the short reads are encrypted, the time taken to perform the query on ciphertext and return the results to the medical unit is 198 ms.

Table II. Effect of varying the number of short reads on performance.

Short read #   Query time without    Obfuscation   Encryption   Start      End        No. of records   Query time with
               encryption (ms)       time (ms)     time (sec)   position   position   returned         encryption (ms)
100            95                    885           22           1          4000       100              198
200            117                   946           44           1          4000       111              201
300            140                   990           65.5         1          4000       111              202
400            157                   1030          87.5         1          4000       111              205
500            165                   1066          109          1          4000       111              206

Fig. 3. Overhead (query time with and without encryption).

We performed encryption with different key sizes (64, 128, 256, 512, and 1024 bits). A longer key length means a greater search space for someone trying to brute-force the key. To make a brute force attack (possible against any encryption algorithm) unworkable, the key should be large enough: a brute force attack tries all probable keys until a match is found, and expanding the space of possible keys, i.e., using longer keys, is the only defense against exhaustive search.
5.2. Privacy Leakage Analysis
To evaluate privacy leakage, we have compared our results with the work of Ayday et al.11 In our technique, all nucleotides that are out of the range specified by the user in the query are clipped, and only the desired nucleotides are returned. In the existing technique, by contrast, all nucleotides of a short read whose start or end alignment falls in the user-requested range are returned. These unwanted nucleotides returned in the query response of the existing technique are considered as "leakage." We have executed multiple queries by changing the requested range of nucleotides on the same data. Figure 4 gives the comparison of our technique with the existing technique: the red bar represents the count of nucleotides returned by the query in the scheme of Ayday et al.,11 and the green bar shows the nucleotides returned by the query in our technique. The difference between the two counts represents the leakage of nucleotides.

Fig. 4. Comparison of the query results.

In our technique, there is no leakage. To better illustrate this point, consider a short read in a SAM file with:
Start Alignment: 15
End Alignment: 64
Nucleotides: GTTCCTGCATAGATAATTGCATGACAATTGCCTTGTCCCTCCTGAATGTG
Suppose the user queries the nucleotides within the range 25–64. With our technique, only AGATAATTGCATGACAATTGCCTTGTCCCTCCTGAATGTG is returned, and GTTCCTGCAT is removed from the response. With the existing technique, all nucleotides are returned, and GTTCCTGCAT is considered leaked.
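Listing 3 is a small Java sketch of this clipping step; the method name and the standalone setting are ours (in the proposed scheme the MD performs the clipping/masking after decryption and de-obfuscation), and it simply reproduces the example above.

Listing 3 (Clipping nucleotides outside the requested range in Java).

public class ClipDemo {
    // Keep only the nucleotides of a short read that fall inside [queryStart, queryEnd];
    // positions are 1-based genomic coordinates, as in a SAM record.
    static String clip(String nucleotides, int readStart, int queryStart, int queryEnd) {
        int readEnd = readStart + nucleotides.length() - 1;
        int from = Math.max(readStart, queryStart);
        int to = Math.min(readEnd, queryEnd);
        if (from > to) return "";                      // read entirely outside the range
        return nucleotides.substring(from - readStart, to - readStart + 1);
    }

    public static void main(String[] args) {
        String read = "GTTCCTGCATAGATAATTGCATGACAATTGCCTTGTCCCTCCTGAATGTG"; // aligned at 15..64
        // Query for range 25..64: the first ten nucleotides (positions 15..24) are clipped.
        System.out.println(clip(read, 15, 25, 64));
        // Prints: AGATAATTGCATGACAATTGCCTTGTCCCTCCTGAATGTG
    }
}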

6. CONCLUSION
We have described a genomic privacy-preserving scheme that can be used for storing, retrieving, and processing aligned, obfuscated raw data. We have used Picard tools (which convert BAM files to SAM files), obfuscation, and the Paillier encryption scheme. Using Picard tools,49 SAM files are generated, and obfuscation is then performed on the generated CS. We use OPE for the positions, so that the order of nucleotide positions is preserved, and Paillier encryption is finally applied to the remaining genomic sequence. The proposed scheme stores the patient's genomic data at the GRM and lets authorized medical units privately retrieve the data from the GRM for genetic tests. Even if data leakage at the GRM occurs, it will not be understandable by the attacker, because any such leaked data will be in encrypted and obfuscated form. The proposed scheme will accelerate genomic research, because patients who are guaranteed that their genomic privacy will be preserved will be more willing to store their genomes at the GRM; by enhancing users' trust, the proposed scheme will thus accelerate genomic research. We have evaluated the proposed scheme by providing performance and privacy analysis. The results reveal that there is some performance overhead, but the leakage of SNPs is controlled. Comparing our technique with the existing technique of Ayday et al.,11 it is clear that the percentage of leaked SNPs that are out of the requested range is reduced. In the existing technique,11 the whole short read in which the requested range falls is returned, so all those SNPs that are not in the requested range are considered leaked SNPs. In our work, only the requested SNPs are returned and the rest of the unwanted SNPs are clipped, resulting in zero percent leakage of SNPs. By using the Paillier encryption scheme, OPE, obfuscation, and, at the end, clipping of the unwanted nucleotides, the risk of privacy breaches for outsourced genomic data is reduced. One obvious limitation of the proposed scheme is the overhead introduced by the Paillier cryptosystem; ideally, we would like to reduce this overhead. In the future, we would like to implement an efficient variant of the Paillier cryptosystem such as Ref. [34] and provide a quantitative evaluation of the overhead introduced.

References and Notes


1. J. H. E. Cartwright, S. Giannerini, and D. L. González, DNA as information: At the crossroads between biology, mathematics, physics and chemistry. Phil. Trans. R. Soc. A 374.2063, 20150071 (2016).
2. P. Ball, The problems of biological information. Phil. Trans. R. Soc. A 374.2063, 20150072 (2016).
3. Y. Duan, N. Youdao, J. Canny, and J. Zhan, P4P: Practical large-scale privacy-preserving distributed computation robust against malicious users, Proceedings of the 19th USENIX Security Symposium (2010).


4. H. Löhr, A. R. Sadeghi, and M. Winandy, Securing the e-health cloud, Proceedings of the 1st ACM International Health Informatics Symposium, ACM (2010).
5. A. M.-H. Kuo, Opportunities and challenges of cloud computing to improve health care services. Journal of Medical Internet Research 13.3 (2011).
6. L. Ogiela, Intelligent techniques for secure financial management in cloud computing. Electronic Commerce Research and Applications (2015).
7. S. Das, The Cyber Security Ecosystem: Post-global Financial Crisis, Managing in Recovering Markets, Springer, India (2015), pp. 453–459.
8. A. Shraer, C. Cachin, A. Cidon, I. Keidar, Y. Michalevsky, and D. Shaket, Venus: Verification for untrusted cloud storage, Proceedings of the 2010 ACM Workshop on Cloud Computing Security, ACM (2010), pp. 19–30.
9. E. Ayday, J. L. Raisaro, P. J. McLaren, J. Fellay, and J.-P. Hubaux, Privacy-preserving computation of disease risk by using genomic, clinical, and environmental data, Presented as part of the 2013 USENIX Workshop on Health Information Technologies, Berkeley, CA, USENIX (2013).
10. R. Wang, X. Wang, Z. Li, H. Tang, M. K. Reiter, and Z. Dong, Privacy-preserving genomic computation through program specialization, Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS '09, ACM, New York, NY, USA (2009), pp. 338–347.
11. E. Ayday, J. Raisaro, U. Hengartner, A. Molyneaux, and J.-P. Hubaux, Privacy-preserving processing of raw genomic data, Data Privacy Management and Autonomous Spontaneous Security, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2014), pp. 133–147.
12. E. Ayday, J. L. Raisaro, J.-P. Hubaux, and J. Rougemont, Protecting and evaluating genomic privacy in medical tests and personalized medicine, Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society, WPES '13, ACM, New York, NY, USA (2013), pp. 95–106.
13. A. Shamir, How to share a secret. Communications of the ACM 22, 612–613 (1979).
14. B. A. Malin, Technical evaluation: An evaluation of the current state of genomic data privacy protection technology and a roadmap for the future. JAMIA 12, 28 (2005).
15. B. Malin, Protecting DNA sequence anonymity with generalization lattices. Methods of Information in Medicine 687 (2005).
16. E. Ayday, E. De Cristofaro, J. Hubaux, and G. Tsudik, The chills and thrills of whole genome sequencing. arXiv 1306.1264 (2013).
17. E. Ayday, J. L. Raisaro, and J.-P. Hubaux, Privacy-enhancing technologies for medical tests using genomic data. NDSS, The Internet Society (2013).
18. E. Ayday, J. L. Raisaro, and J.-P. Hubaux, Personal use of the genomic data: Privacy versus storage cost, Proceedings of the IEEE Global Communications Conference, Exhibition and Industry Forum (Globecom) (2013).
19. M. Kantarcioglu, W. Jiang, Y. Liu, and B. Malin, A cryptographic approach to securely share and query genomic sequences. IEEE Transactions on Information Technology in Biomedicine 12, 606 (2008).
20. S. Jha, L. Kruger, and V. Shmatikov, Towards practical privacy for genomic computation, IEEE Symposium on Security and Privacy, SP 2008, May (2008), pp. 216–230.


21. O. Goldreich, S. Micali, and A. Wigderson, How to play any mental game, Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, ACM (1987), pp. 218–229.
22. F. Bruekers, S. Katzenbeisser, K. Kursawe, and P. Tuyls, Privacy-preserving matching of DNA profiles. IACR Cryptology ePrint Archive 2008, 203 (2008).
23. M. Freedman, K. Nissim, and B. Pinkas, Efficient private matching and set intersection, Advances in Cryptology (EUROCRYPT 2004), Lecture Notes in Computer Science, Vol. 3027, Springer (2004).
24. E. De Cristofaro, S. Faber, P. Gasti, and G. Tsudik, GenoDroid: Are privacy-preserving genomic tests ready for prime time? Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society, WPES '12, ACM, New York, NY, USA (2012), pp. 97–108.
25. M. Canim, M. Kantarcioglu, and B. Malin, Secure management of biomedical data with cryptographic hardware. IEEE Transactions on Information Technology in Biomedicine 16, 166 (2012).
26. G. J. De Moor, B. Claerhout, and F. de Meyer, Privacy enhancing technologies: The key to secure communication and management of clinical and genomic data. Meth. Info. Med. 42, 14853 (2003).
27. L. Sweeney, Uniqueness of Simple Demographics in the U.S. Population, Technical Report LIDAP-WP4, Data Privacy Laboratory, Carnegie Mellon University, Pittsburgh, PA (2000).
28. B. Malin, Re-identification of familial database records, AMIA Annu. Symp. Proc. (2006).
29. Y. Chen, B. Peng, X. Wang, and H. Tang, Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds. NDSS, The Internet Society (2012).
30. Y. Huang, D. Evans, J. Katz, and L. Malka, Faster secure two-party computation using garbled circuits, Proceedings of the 20th USENIX Conference on Security, USENIX Association, Berkeley, CA, USA (2011), pp. 35–35.
31. P. Paillier, Public-key cryptosystems based on composite degree residuosity classes, Advances in Cryptology (EUROCRYPT '99), Springer, Berlin, Heidelberg (1999).
32. https://www.github.com/wtsi-npg/illumina2bam/tree/devel/testdata/bam, accessed on 7/8/2014.
33. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/, accessed on 8/14/2014.
34. L. Nguyen, R. Safavi-Naini, and K. Kurosawa, A provably secure and efficient verifiable shuffle based on a variant of the Paillier cryptosystem. J. UCS 11.6, 986 (2005).
35. E. Ayday, E. De Cristofaro, J. Hubaux, and G. Tsudik, The chills and thrills of whole genome sequencing. arXiv 1306.1264 (2013).
36. Picard tools, https://www.github.com/broadinstitute/picard, accessed on 7/19/2014.
37. Y. Huang, D. Evans, J. Katz, and L. Malka, Faster secure two-party computation using garbled circuits, Proceedings of the 20th USENIX Conference on Security, USENIX Association, Berkeley, CA, USA (2011), pp. 35–35.

Received: xx Xxxx xxxx. Accepted: xx Xxxx xxxx.
