Abstract
Motivation The β-sheet is an element of protein secondary structure, and intra- and inter-molecular β-sheet interactions play pivotal roles in biological regulatory processes including scaffolding, transporting, and oligomerization. In nature, β-sheet formation is tightly regulated, because dysregulated β-stacking often leads to severe diseases such as Alzheimer’s, Parkinson’s, systemic amyloidosis and diabetes. Thus, the identification of intrinsic β-sheet forming propensities could provide valuable insight into protein design for the development of novel therapeutics. However, structure-based design methods may not be generally applicable to such amyloidogenic peptides mainly due to high structural plasticity and complexity. Therefore, an alternative design strategy based on complementary sequence information is of great significance.
Results We developed B-SIDER (β-Sheet Interaction DEsign for Reciprocity), a database search method for the design of complementary β-strands. The method makes use of structural database information and generates a query-specific score matrix. The discriminatory power of the B-SIDER score function was tested on representative amyloidogenic peptide substructures against a sequence-based score matrix (PASTA2.0) and two popular ab initio protein design score functions (Rosetta and FoldX). B-SIDER distinguished wild-type amyloidogenic β-strands as favored interactions more consistently than the other methods. B-SIDER was then prospectively applied to the design of complementary β-strands for the splitGFP scaffold. Three variants were found to interact more strongly than the original sequence selected by directed evolution, emitting higher fluorescence intensities. Our results suggest that B-SIDER is applicable to the design of other β-strands, assisting in the development of therapeutics against disease-related amyloidogenic peptides.
Availability B-SIDER is freely available at http://bel.kaist.ac.kr/research/B-SIDER.
Introduction
The β-sheet is one of the major units of protein structure (Bhattacharjee and Biswas, 2010), and plays a variety of functional roles in transportation, recognition, scaffolding and enzymatic processes (Marcos, et al., 2018). Recently, the mechanism of β-sheet formation has received much attention because of its close relations with several critical diseases such as Alzheimer’s disease, Parkinson’s disease, type 2 diabetes and systemic amyloidosis (Chiti and Dobson, 2017; Richardson and Richardson, 2002). Such diseases are known to be linked to the precipitation of dysregulated β-stacking between neighboring β-strands (Colletier, et al., 2011; Liu, et al., 2012; Matthes, et al., 2014). In this regard, the information about the amino acid propensity of intrinsic β-sheet forming motifs and its use in the design of their complementary sequences are crucial for understanding the mechanism of β-sheet formation and developing potential therapeutics specifically targeting aggregation-prone regions (Giorgetti, et al., 2018).
While structure-based protein design approaches have shown notable successes in several cases (Huang, et al., 2016), their application to de novo β-sheet designs still remains challenging (Dou, et al., 2018; Marcos, et al., 2018). Structure-based design approaches require a well-defined protein structure, but amyloidogenic peptides usually have highly disordered structures (Jang, et al., 2016). Structural identification of such peptides has long been hindered by high degrees of structural plasticity, transiency, and complexity due to self-oligomerization (Dovidchenko and Galzitskaya, 2015; Zheng, et al., 2016). It is thus necessary to exploit the complementarity across neighboring β-strand pairs using sequence information.
Intriguingly, significant conservation and covariation of residue pairs between neighboring β-strands have been identified in many protein families (Mandel-Gutfreund, et al., 2001). For instance, pairs of β-branched residues and cysteines are preferred at non-hydrogen-bonded positions, and aromatic residues tend to be paired with valine or glycine (Steward and Thornton, 2002). Several computational algorithms have been developed to predict aggregation-prone regions based on internal β-sheet forming patterns. While they differ in detail, they make use of either statistical potentials, such as Tango (Fernandez-Escamilla, et al., 2004), PASTA (Trovato, et al., 2006), SALSA (Zibaee, et al., 2007), BETASCAN (Bryan Jr, et al., 2009), and Waltz (Maurer-Stroh, et al., 2010), or physicochemical properties of amino acids (Tartaglia and Vendruscolo, 2008). Additionally, consensus methods and machine-learning approaches have been developed (Kim, et al., 2009; Tsolis, et al., 2013), showing good agreement with experimental results.
It has been reported that β-strand interactions can be stabilized by introducing β-sheet favored pairs (Kortemme, et al., 1998; Minor Jr and Kim, 1994; Quinn, et al., 1994; Stranges, et al., 2011) and charge pairing between neighboring β-strands (Shammas, et al., 2011; Wang and Hecht, 2002; West, et al., 1999). Recent studies showed that fragments derived from the amyloidogenic region can be used for β-stacking modeling (Gallardo, et al., 2016; Liu, et al., 2012). While the use of amino acid pairing information in protein design has been attempted elsewhere, practical applications of such patterns have been limited, mainly owing to the lack of comprehensive quantification of residue pairing and the noisy patterns of β-sheet forming residue pairs (Bhattacharjee and Biswas, 2010; Fujiwara, et al., 2012; Hutchinson, et al., 1998). β-sheet forming peptides appear to have little sequence commonality and imperfect repeats (Bryan Jr, et al., 2009). Therefore, careful curation of meaningful patterns is required for a practical design strategy for complementary β-strands.
Herein, we present a database search method, B-SIDER (β-Stacking Interaction DEsign for Reciprocity), to design complementary β-strands. The method generates a query-specific score matrix from the structure database. To utilize the pairing information and overcome the pattern noise, we hypothesized that significant complementary pairs can be amplified by superposing a subset of sequence fragments. Moreover, the recent rapid growth in the number of available β-sheet structures (Marcos and Silva, 2018) allows solid statistics of β-sheet forming residue pairings to be derived (Sormanni, et al., 2015). Based on this hypothesis and these statistics, we developed a fast and reliable computational method for the design of complementary β-strand sequences. The methodology augments β-sheet forming residue preferences by overlaying complementary fragment sequences (Fig. 1). We retrospectively validated our approach using a set of curated amyloidogenic targets and compared it with two widely used structure-based methods (Rosetta and FoldX) and a sequence-based aggregation prediction algorithm (PASTA2.0). Our algorithm clearly distinguished favorable β-sheet forming sequences based entirely on the query sequence, whereas the structure-based energy functions gave inconsistent results depending on the target. The utility and potential of our method were demonstrated by designing novel complementary peptides for splitGFP. The designed sequences showed stronger interactions with neighboring strands of the scaffold and consequently higher fluorescence emissions than the original peptide selected by directed mutagenesis (Cabantous, et al., 2005).
Methods
Computational algorithm for the design of complementary sequences
Collection of β-strand information
Non-redundant structures determined by high-resolution X-ray crystallography were collected from the PDB (< 90 % sequence identity, < 3 Å resolution). Given the query target sequence, this non-redundant structure database was used to extract pairing information from matched sequences. First, the target sequence was divided into contiguous moving windows ranging in length from 3 residues to the full length of the target sequence. Structures containing fragments identical to these windows were collected and then filtered based on the definition of β-sheet secondary structure (distance between backbone nitrogen-oxygen atom pairs < 5 Å). To remove redundancy, the protein structures containing the matches were compared using TMalign (Zhang and Skolnick, 2005). If the TM-score was > 0.7 and the sequence identity was > 90 %, one of the matched sequences was removed.
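The window generation and the redundancy rule described above can be sketched as follows. This is a minimal illustration, not the B-SIDER implementation: the function names `moving_windows` and `is_redundant` are hypothetical, the query "KLVFFA" is only a toy input, and the TM-score and sequence identity are assumed to have been computed beforehand by an external alignment tool.

```python
def moving_windows(target, min_len=3):
    """Yield (start, fragment) for every contiguous window of the target,
    from min_len residues up to the full target length."""
    n = len(target)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            yield start, target[start:start + length]


def is_redundant(tm_score, seq_identity, tm_cutoff=0.7, id_cutoff=0.90):
    """Redundancy rule from the text: a match is dropped when BOTH the
    TM-score and the sequence identity exceed their cutoffs.
    seq_identity is a fraction between 0 and 1 (assumed convention)."""
    return tm_score > tm_cutoff and seq_identity > id_cutoff


# Toy query: a 6-residue sequence gives 4 + 3 + 2 + 1 = 10 windows.
windows = list(moving_windows("KLVFFA"))
print(len(windows))
print(windows[0], windows[-1])
```

Every window keeps its start offset so that a matched neighboring strand can later be mapped back onto the positions of the query.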
While the method is applicable to both parallel and anti-parallel β-sheets in theory, we mainly focused on anti-parallel β-sheets in this study since anti-parallel cases are more frequently observed compared to parallel ones (Hubbard, 1994). Disease-related amyloidogenesis is also known to be initiated with anti-parallel β-sheets and soluble oligomeric amyloid species mainly exist as anti-parallel (Cerf, et al., 2009; Gordon, et al., 2004).
Complementary sequence score
The β-sheet complementarity score function is derived from the environment-specific substitution score (Choi and Deane, 2010). We assumed that each position of a β-strand is independent of the others, and that complementarity is determined by residue pairs from neighboring strands. Given the query sequence, all of the identified neighboring sequences are pooled together as described in the previous section. The amino acid frequency at each complementary position is then counted: Ai,p is the frequency of amino acid i at complementary position p, and Oi,p is the total count of that amino acid at p. The background frequency of each amino acid (Bi) is counted from the HOMSTRAD database (Mizuguchi, et al., 1998).
The complementary sequence score of amino acid i at position p (Si,p) is calculated from Ai,p and Bi.
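The three quantities defined above can be written compactly. The block below is a reconstruction under stated assumptions, not a formula quoted from the text. The two frequencies follow directly from the definitions of A, O and B. The negative log-odds form of S (natural logarithm, lower values more favorable) is assumed from the substitution-score origin of the function and from the convention, used throughout, that preferred sequences receive low scores. N_i is a hypothetical symbol for the count of amino acid i in HOMSTRAD.

```latex
% Observed frequency of amino acid i at complementary position p
A_{i,p} = \frac{O_{i,p}}{\sum_{j} O_{j,p}}

% Background frequency of amino acid i (N_i: count in HOMSTRAD, assumed symbol)
B_{i} = \frac{N_{i}}{\sum_{j} N_{j}}

% Complementarity score (assumed negative log-odds form)
S_{i,p} = -\ln\!\left(\frac{A_{i,p}}{B_{i}}\right)
```

Under this form, an amino acid that is enriched at position p relative to the background (A > B) receives a negative score, and one that is depleted receives a positive score.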
It should be noted that the complementarity score is completely data-driven, i.e. if an amino acid never appears at a certain position, a high penalty score is imposed. We only consider complementary amino acids that are found at least once among the pooled sequences. The final score of a candidate sequence is the sum of its scores over all positions.
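A position-specific score matrix with a penalty for unobserved residues, and the additive final score, can be sketched as below. The sketch rests on several assumptions that are not stated in the text: the score takes a negative log-odds form with the natural logarithm, the penalty value of 10.0 is arbitrary, and the uniform background is a stand-in for the HOMSTRAD frequencies. The names `build_score_matrix` and `score_sequence` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PENALTY = 10.0  # assumed value; the text only says "a high penalty score"


def build_score_matrix(columns, background, penalty=PENALTY):
    """columns[p] holds the residues observed at complementary position p.
    Returns one {amino acid: score} dict per position."""
    matrix = []
    for observed in columns:
        counts = Counter(observed)
        total = sum(counts.values())
        scores = {}
        for aa in AMINO_ACIDS:
            if counts[aa] == 0:
                # Never observed at this position: impose the penalty.
                scores[aa] = penalty
            else:
                freq = counts[aa] / total
                # Assumed negative log-odds: enriched residues score low.
                scores[aa] = -math.log(freq / background[aa])
        matrix.append(scores)
    return matrix


def score_sequence(candidate, matrix):
    """Final score: sum of the per-position scores (positions independent)."""
    return sum(matrix[p][aa] for p, aa in enumerate(candidate))


# Toy data: two complementary positions and a uniform background.
background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
matrix = build_score_matrix(["VVVI", "TTSS"], background)
print(score_sequence("VT", matrix), score_sequence("IS", matrix))
```

In the toy data, valine is observed more often than isoleucine at the first position, so "VT" receives a lower (more favorable) total than "IS", while any sequence containing an unobserved residue is pushed up by the penalty.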
Protein expression and complementation assay
Gene construction
The gene coding for splitGFP (Cabantous, et al., 2005) consists of the 1-10th strands (GFP1-10) and the 11th strand (GFP11) template (Table S1). They were cloned into the pET-28a (Novagen) vector between the Nde-I and Xho-I restriction sites. We introduced additional mutations into GFP1-10 to inhibit aggregation and facilitate expression (Kim, et al., 2015). The GFP11 strand was fused with a P22 virus-like particle scaffolding protein (McCoy and Douglas, 2018) for soluble and stable expression. Mutations in GFP11 were introduced by PCR using mutagenic primers (Table S2), and the resulting genes were cloned into the pET-28a vector. Six histidine residues were fused to the N-terminus of the GFP1-10 and GFP11 genes as an affinity purification tag.
Protein expression and purification
All of the constructs were transformed into BL21 (DE3) E. coli strains. The transformed cells were grown overnight and inoculated into Luria-Bertani medium containing 50 μg/ml of kanamycin at 37 °C. The cells were grown until the optical density at 600 nm reached 0.6-0.8, followed by addition of 0.7 mM IPTG (isopropyl β-D-1-thiogalactopyranoside) for induction. After incubation for 16-18 hours at 18 °C, the cells were harvested and suspended in a lysis buffer containing 50 mM Tris (pH 8.0), 150 mM NaCl, and 5 mM imidazole. The suspended cells were disrupted by sonication, and insoluble fractions were removed by centrifugation at 18,000 g for 1 hour. The supernatants were filtered using 0.22 μm syringe filters and purified by affinity chromatography with Ni-NTA agarose Superflow (Qiagen). The solutions were applied to the resin-packed columns and washed with a buffer containing 50 mM Tris (pH 8.0), 150 mM NaCl, and 10 mM imidazole until no protein was detected by Bradford assay. Then, an elution buffer (50 mM Tris (pH 7.4), 150 mM NaCl, 300 mM imidazole) was applied to the column. The buffer was exchanged to PBS (phosphate buffered saline, pH 7.4) using a PD-10 column (GE Healthcare). The concentrations of the proteins were determined by measuring the absorbance at 280 nm. All purification steps were performed at 4 °C. The purity of the proteins was then evaluated by SDS-PAGE.
Complementation assay
The assembly of splitGFP variants was monitored and measured by a fluorescence complementation assay. An excess amount of GFP1-10 (50 pmol) in 180 μl and 20 μl of each GFP11 strand at equal molar concentration (3 pmol) were co-incubated in PBS buffer (pH 7.4). Fluorescence kinetics (λex = 488 nm / λem = 530 nm) were monitored for 12 hours at 25 °C with a TECAN infinite M200 microplate reader at 5-minute intervals (Cabantous, et al., 2005), with shaking for 2 seconds between intervals. Each experiment was performed in triplicate in Nunc F 96 Micro-well black plates, which were blocked with a solution of PBS containing 0.5 % bovine serum albumin (BSA) for 30 minutes before the assay.
Results and Discussion
Overview of the design process
We hypothesized that repeatedly observed amino acid pairing patterns are a “smoking gun” for a strong preference for β-sheet formation. We also assumed that the sequence with the most frequent patterns would directly form a β-sheet, without considering other environmental contributions.
There are two major steps in the algorithm: 1) the extraction of β-sheet complementarity information and 2) the construction of the scoring matrix (see Fig. 1 and Fig. S1). When a query sequence is given, it is fragmented into short peptides of at least three residues in length, and the matched neighboring strands are collected. These fragmentation and overlaying processes naturally impose weights on complementarity-prone positions and amplify pattern signals (Fig. S1). After the collection of matched sequences, a position-specific complementarity scoring matrix is constructed. The resulting scoring matrix is used to evaluate and design the complementarity of β-strand interactions.
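The overlaying step can be pictured with a short sketch. It assumes that each database hit has already been reduced to a pair (start, partner), where `start` is the offset of the matched fragment in the query and `partner[k]` is the neighboring-strand residue facing query position `start + k`; handling of strand direction and register is assumed to happen upstream. The function name and the toy hits are hypothetical.

```python
from collections import Counter


def overlay_partners(query_len, hits):
    """Pool partner residues onto query positions.
    hits: iterable of (start, partner) pairs as described above.
    Returns one Counter of observed partner residues per query position."""
    columns = [Counter() for _ in range(query_len)]
    for start, partner in hits:
        for offset, aa in enumerate(partner):
            columns[start + offset][aa] += 1
    return columns


# Toy hits for a 5-residue query: three 3-residue fragments and one
# full-length fragment.
hits = [(0, "VTV"), (1, "TVY"), (0, "VTVYI"), (2, "VYI")]
columns = overlay_partners(5, hits)
coverage = [sum(c.values()) for c in columns]
print(coverage)
```

Because overlapping windows cover the same query position several times, positions that take part in many matched fragments accumulate more observations, which is one way to read the statement that fragmentation and overlaying impose weights on positions and amplify pattern signals.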
Validation of the score function on retrospective cases
In an effort to validate the complementarity score, we manually curated a test set of naturally occurring β-strand pairs whose environmental effects are minimal. It is known that β-strand pairing is in general greatly hindered by local environments (Zaremba and Gregoret, 1999), but amyloidogenic peptide segments are known to form natural β-sheets in themselves (Trovato, et al., 2006). We thus selected a set of widely known amyloidogenic structures whose aggregation-prone regions have been identified (Table 1 and Table S3).
To assess the complementarity of the native sequences, we compared their scores with those of random sequences. The natural amyloidogenic segments are known to be highly aggregation-prone, so they are expected to be highly preferred, i.e., to have low scores within the random sequence score distributions. Figure 2 shows that the native sequence scores rank very low in all of the distributions. On average, the native sequences fall within the lowest 4.1 % of the distributions (Fig. 2). The results indicate that the scoring function is effective in detecting favorable β-strand counterparts.
We also compared the B-SIDER score with two structure-based all-atom energy functions and a sequence-based score matrix. For structure-based methods, we picked Rosetta (Talaris 13) (Alford, et al., 2017; Kuhlman, et al., 2003) and FoldX (Schymkowitz, et al., 2005), which have been popularly used in de novo protein designs (Fleishman, et al., 2011; Rocklin, et al., 2017). PASTA2.0 (Walsh, et al., 2014) is a method to predict aggregation-prone regions using the scoring matrix derived from residue pairing patterns of β-sheets. To avoid any biases, 1,000 random sequences were newly prepared per target. “FastRelax” protocol (Tyka, et al., 2011) from Rosetta (Ver. 3.7), “BuildModel” command from FoldX (Ver. 4.0) and the scoring matrix from PASTA2.0 were used against the native and the random sequences. The predictive power of a score function was assessed by the percentile value of the native sequence score against the random sequence score distribution.
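The percentile-based assessment can be sketched as follows. The sketch assumes that lower scores are more favorable and defines the percentile as the share of random sequences that score strictly lower than the native sequence; the exact tie handling and the way random sequences were sampled are not specified in the text, so uniform sampling over the 20 amino acids is an assumption. Function names are hypothetical, and the toy scores stand in for the output of any of the four scoring methods.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequences(length, n=1000, seed=0):
    """Draw n random sequences of the given length (uniform sampling assumed)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(n)]


def native_percentile(native_score, random_scores):
    """Percentage of random sequences with a strictly lower (better) score
    than the native sequence. A small value means the native ranks well."""
    better = sum(1 for s in random_scores if s < native_score)
    return 100.0 * better / len(random_scores)


decoys = random_sequences(length=6, n=1000)

# Toy score distribution: 1,000 evenly spaced scores.
toy_scores = [float(i) for i in range(1000)]
print(native_percentile(41.0, toy_scores))
```

The same function can be applied to the scores from each method, so that energy functions and score matrices with different units are compared on a common 0 to 100 scale.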
Figure 3 shows that the structure-based score functions are in general less accurate than the sequence-based scoring matrices. The results indicate that the Rosetta energy score function is not sufficiently accurate for ranking complementary β-strands (35.8th percentile on average), whereas the predictive powers of PASTA2.0 and FoldX were moderate (10.8th and 14.7th percentiles, respectively). B-SIDER was the most accurate and was highly consistent across targets. While the assessment of PASTA2.0 is also fairly consistent, the query-specific nature of B-SIDER may give better results.
Considering that the Rosetta relax protocol performs flexible-backbone refinement, a fixed-backbone calculation seems to be better suited to the evaluation of β-sheet complementarity. The inconsistent results of Rosetta also imply that FoldX predictions would depend strongly on structure preparation, i.e. design with ill-defined models may not be generally successful. On the other hand, B-SIDER and PASTA2.0 do not depend on query structures, and thus they can be applied to general cases such as β-sheet interactions with high structural plasticity and poor structural integrity, which are common features of amyloidogenic peptides. Furthermore, the B-SIDER process of collecting complementary motifs also appeared to be powerful, making it possible to distinguish favorable complementary sequences that are not easily detected by one-to-one residue pairing.
Prospective application of the algorithm to splitGFP
As shown in the retrospective test, B-SIDER is effective in discriminating sequences that naturally form β-strands. As a proof of concept, we prospectively designed novel complementary β-strands for splitGFP. SplitGFP is a fragmented protein pair derived from superfolderGFP (Cabantous, et al., 2005), comprising a scaffold containing 10 β-strands (GFP1-10) and its complementary β-strand peptide (GFP11). GFP11 specifically interacts with GFP1-10, and the strand forms a tight, stable β-sheet structure, which facilitates chromophore maturation in an irreversible manner (Köker, et al., 2018). This assembly process results in the emission of green fluorescence. Because GFP11 is known to be disordered in solution, its conformational transition from a disordered state to an induced β-sheet is similar to that of amyloidogenic peptides (Ito, et al., 2013; Xu, et al., 2005). This model system thus efficiently assesses whether sequences designed by B-SIDER have favorable β-sheet interactions.
The original GFP11 was designed by directed mutagenesis, and it shows a high intrinsic propensity to form hydrogen bonds with the neighboring β-strands of GFP1-10 (Miller, et al., 2015). In our case, the queries are the neighboring strands of GFP1-10 (Fig. 4A). It is known that the residues pointing inward (1st, 3rd, 5th, and 7th positions) directly interact with the chromophore, and thus they were not subject to mutation. B-SIDER identified 2,637 non-redundant sequences from the structure database. The native sequence has a modest score among 1,000 randomly chosen sequence variants (46th percentile), indicating that there could be room for complementary sequences with stronger interactions than the original one (Fig. 4B). We then selected the 10 sequences with the lowest B-SIDER scores (top_vars; Table 2). The 10 variants are composed mostly of hydrophobic or branched amino acids (Fig. S2). Additionally, one sequence with a high score (> 75th percentile) was randomly selected as a negative control (neg_var, 77th percentile).
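Selecting the lowest-scoring variants while keeping chosen positions fixed can be sketched as below. This is an illustration under assumptions and not the procedure used for GFP11: the template, the three-letter alphabet and the score values are toy data, the search is a plain exhaustive enumeration that is only practical for small cases, and the function name is hypothetical. Positions are 0-based in the code, so the 1st, 3rd and 5th positions of the text correspond to indices 0, 2 and 4.

```python
import heapq
import itertools


def lowest_scoring_variants(template, free_scores, allowed, k=10):
    """Enumerate variants that differ from the template only at the free
    positions (the keys of free_scores) and return the k lowest-scoring
    (score, sequence) pairs. Fixed positions are left untouched and add
    nothing to the score, since they are identical in every variant."""
    free = sorted(free_scores)
    scored = []
    for combo in itertools.product(allowed, repeat=len(free)):
        seq = list(template)
        score = 0.0
        for pos, aa in zip(free, combo):
            seq[pos] = aa
            score += free_scores[pos][aa]
        scored.append((score, "".join(seq)))
    return heapq.nsmallest(k, scored)


# Toy example: 5-residue template, indices 0, 2 and 4 fixed,
# indices 1 and 3 free, with made-up per-position scores.
free_scores = {
    1: {"V": -2.0, "F": -1.5, "T": -1.0},
    3: {"T": -2.5, "V": -0.5, "F": 0.5},
}
top = lowest_scoring_variants("RDHMV", free_scores, allowed="VFT", k=3)
print(top)
```

Because the score is additive over positions, the best residue at each free position can also be chosen independently; the enumeration above simply makes the ranking of the runner-up variants explicit.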
The selected variants were successfully expressed and purified (Fig. S3), except for four clones (top_var6, 7, 8, and 10) that were insoluble, perhaps due to aggregation. Among those expressed, three variants (top_var1, 2, and 9) showed faster assembly and higher signals than the original GFP11 (Fig. 5). No functional aberration in excitation or emission was observed (Fig. S4). All of the successful variants, which emitted stronger fluorescence than the original one, had phenylalanine and threonine at positions 6 and 8, respectively. These results demonstrate that the designed variants indeed formed complementary β-strands more favorably than the original peptide, as predicted. The other variants showed slightly lower signals than the original one, but still gave rise to clear fluorescence signals (Fig. 5). The negative control (neg_var) barely emitted any signal, suggesting that the score indeed reflects the complementarity of β-stacking interactions. We also scored the GFP11 variants using the other scoring methods. As in the retrospective test set, Rosetta and FoldX were not able to discriminate top_vars as favorable (Fig. S6). However, PASTA2.0 was again fairly accurate in this case.
Conclusion
β-sheet forming patterns are crucial for understanding the aggregation mechanism of disease-related β-sheets and developing potential therapeutics against them. Unlike for α-helices, however, there has been no established design principle for the complementarity of β-sheets. In this study, we developed B-SIDER, a database search method for the design of complementary β-strands based on intrinsic β-sheet forming propensities. Statistical patterns of interacting residue pairs between neighboring β-strands enable the complementary interaction to be quantified. We demonstrated that the statistical potential can be directly applied to the design of complementary β-strand sequences. Using splitGFP as a model system, we successfully designed fragment variants that led to stronger fluorescence emissions than the native one originally identified by directed mutagenesis. The results indicate that B-SIDER can be useful for the detection and design of β-stacking interactions between unstructured fragments. Therefore, our approach can find wide application in protein design where structure-based methods are not effective, including the development of protein binders that specifically target disease-related intrinsically disordered proteins.
Funding
This work was supported by the Korea Research Fellowship Program [2016H1D3A1938246 to Y.C.], Global Research Laboratory (NRF-2015K1A1A2033346), and Mid-Career Researcher Program (NRF-2017R1A2A1A05001091) of the National Research Foundation (NRF) funded by the Ministry of Science and ICT.
Acknowledgements
This work was performed using the Alphacom high-performance computing cluster in the Department of Biological Sciences at the Korea Advanced Institute of Science and Technology (KAIST).