TY  - JOUR
T1  - Secure Wavelet Matrix: Alphabet-Friendly Privacy-Preserving String Search
JF  - bioRxiv
DO  - 10.1101/085647
SP  - 085647
AU  - Hiroki Sudo
AU  - Masanobu Jimbo
AU  - Koji Nuida
AU  - Kana Shimizu
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/11/04/085647.abstract
N2  - Motivation: Privacy-preserving substring matching is an important task for searching sensitive biological and biomedical sequence databases. It lets a user obtain only a substring match while the query remains concealed from the server. The previous approach to this task runs in time linear in the alphabet size |Σ|. A more efficient method is therefore needed for strings over large alphabets, such as protein sequences, time-series data, and clinical documents. Results: We present a novel algorithm that searches a string in time logarithmic in |Σ|. Our algorithm, named secure wavelet matrix (sWM), uses additively homomorphic encryption to build an efficient data structure called a wavelet matrix. In an experiment using a simulated string of length 10,000 whose alphabet size ranged from 4 to 1024, the sWM ran an order of magnitude faster than the previous method. We also tested the sWM on all sequences of one protein family in Pfam (9,826 residues in total) and on clinical texts written in a natural language (77,712 letters in total). Using a laptop computer for the user and a desktop PC for the server, we found that the run time was ≈ 2.5 s (user) and ≈ 6.7 s (server) for the protein sequences, and ≈ 10 s (user) and ≈ 60 s (server) for the clinical texts. Availability: https://github.com/cBioLab/sWM
ER  - 