Abstract
The prevalence of antibiotic resistance in urinary tract infections (UTIs) often renders the prescribed antimicrobial treatment ineffective, highlighting the need for personalized prediction of resistance at time of care. Here, crossing a 10-year longitudinal dataset of over 700,000 community-acquired UTIs with over 6,000,000 personally-linked records of antibiotic purchases, we show that the resistance profile of infections can be predicted based on patient-specific demographics and clinical history. Age, gender, and retirement home residence had strong, yet differential and even non-monotonic, associations with resistance to different antibiotics. Resistance profiles were also associated with the patient’s records of past urine samples and antibiotic usage, with these associations persisting for months and even longer than a year. Drug usage selected specifically for its own cognate resistance, which led indirectly, through genetic linkage, also to resistance to other, even mechanistically unrelated, drugs. Applying machine learning models, these association patterns allowed good personalized predictions of resistance, which could inform and better optimize empirical prescription of antibiotics.
The resistance of bacterial pathogens to commonly used antibiotics is a growing public health concern, threatening the efficacy of life-saving antibiotic drugs 1,2. Antibiotic use and misuse can benefit resistant strains, exacerbating the problem over time 3–5. At the single patient level, the efficacy of antimicrobial treatment is critically dependent on correctly matching antibiotic choice to the specific susceptibilities of the pathogen 6–8. Ideally, correct prescription should be based on direct measurement of the antibiotic susceptibilities of the infecting pathogen. In practice, though, to save time and resources, drugs are often prescribed empirically in absence of culture susceptibility measurements, risking incorrect and ineffective treatment.
This problem is of particular importance in Urinary Tract Infections (UTIs) which are often treated empirically despite substantial levels of resistance. UTIs are among the most common bacterial infections, with over 150 million annual cases globally 9. One of three women will have at least one symptomatic UTI by age 24, and more than one-half will be affected during their lifetime 10 Treatment of these infections accounts for about 8% of non-hospital usage of antibiotics, often as part of empirical prescription 11,12. The common etiological agents of UTIs are diverse, including Escherichia coli, Klebsiella pneumoniae and Proteus mirabilis, as well as gram-positive bacteria such as Enterococcus faecalis 13–18 These pathogens are often resistant to several antibiotics, with resistance rates of infections exceeding 20% for commonly used drugs 14,17,19, emphasizing the challenge of choosing the most appropriate antibiotic treatment for each patient 20.
The risk of an infection being resistant to different antibiotics can be correlated with patient demographics and clinical and epidemiological factors. Known demographic factors associated with resistance include older age 21, gender 22, ethnicity 23–26 and residence in a retirement home 22. Known clinical and epidemiological factors associated with resistance include presence of a urinary catheter 18,22,27, immunodeficiency 22, diabetes 22 and travel to developing countries 25 Notably, most of these associations were identified based on small patient cohorts, typically with high frequencies of antibiotic resistant infections, such as retirement homes, rehabilitation centers, or hospitals.
Beyond the patient’s demographics and current clinical factors, antibiotic resistance can also be associated with the patient’s past clinical history, including recurrent UTIs and hospitalizations as well as past resistance profiles and antibiotic prescriptions. Risk of resistance to specific drugs have been shown to increase for patients with recurrent UTIs 22,26,28 and past hospitalizations 22,29 Focusing on hospital settings, studies have further shown that past resistance profiles can be used to predict resistance in future infections 30 Availability of antibiotic purchase data reveals patterns of antibiotic use and misuse 31 and shows that risk of resistance increases with short-term prior use of antibiotics 5,21,22,29,32–34. At the population level, a recent large-scale study revealed correlations between risk of trimethoprim/sulfa resistance and the volume of past purchases of the same drug (cognate) as well as of other drugs of different pharmaceutical classes (noncognate) 17 Such associations of usage of a given antibiotics with future resistance to other antibiotics can appear indirectly through genetic linkage among resistance mechanisms, but mathematically resolving direct and indirect selection for resistance has been challenging. Negative associations, where drug use is anti-correlated with resistance, have also been observed, but it has been difficult to discern the direction of causality 17,35. Finally, the time extent of these positive and negative associations of resistance with prior antibiotic usage and prior resistant samples is not well resolved, and it is also unclear whether and how these associations vary across resistances to different antibiotics.
Here, we present a systematic big-data analysis of a large population of UTI patients to unravel predictive features of antibiotic resistance. We analyze a patient-level longitudinal dataset of community and retirement-home acquired UTI cultures collected by Maccabi Healthcare Services (MHS), Israel’s second largest Health Maintenance Organizations, serving a diverse population of ~2 million patients. We first analyze correlations between demographic factors and antibiotic resistance. Then, comparing resistance data of multiple infections from the same patient, we unravel long-term “memory” of resistance over time. We also combine these culture records with patient-linked records of antibiotic use to quantify the extent and time of direct and indirect impacts of antibiotic use on resistance at the single-patient level. Finally, we develop machine learning models combining these demographic and historical factors for personalized predictions of resistance.
Results
We retrieved data of all positive urine cultures of MHS patients for the ten-year period between 01-July-2007 and 30-June-2017, as well as patient demographics and record of antibiotic purchases for these patients (Methods). Among all ~2 million MHS patients, there were 711,099 recorded positive urine samples from 315,047 patients total. For each positive sample, one or more bacterial species were isolated and characterized (total of 736,793). The dataset included the species identification of these isolates as well as VITEK2 antibiotic resistances reinterpreted in accordance with CLSI guidelines (Sensitive, Intermediate, and Resistant). For each UTI sample, we also defined “sample resistance” as the maximal resistance for each antibiotic across all isolates from the same sample (96.4% of samples were identified as single species and their resistance profile is simply defined as the resistance profile of their single isolates). In our analysis, we focus on the 16 most frequently measured antibiotic resistances (Table 1 and Supplementary Fig. 1). All of MHS’s country-wide clinical tests are performed centrally (Methods), allowing reliable comparison across patients and time.
At the population level, resistance frequencies of different pathogens to different antibiotics varied mildly across time. Three species, E. coli, K. pneumoniae and P. mirabilis, account for 85% of isolates (70%, 10%, 5%, respectively; Fig. 1a). These pathogens varied in their resistance profiles (Fig. 1b). Notably, for almost any of the 16 considered antibiotics, the chance of resistant infection is significant, indicating that antibiotic treatment efficacy could often be undermined. Notable exceptions are amikacin, meropenem and ertapenem, for which resistance frequencies were negligible (<1% of UTIs); yet these are last-line antibiotics of restricted use. These population-level frequencies of resistance were fairly static over time and only mild changes were observed in certain antibiotics and specific species (Fig. 1c and Supplementary Fig. 2). The diversity of pathogens and resistance patterns underscores the widely accepted notion that antibiotic prescriptions must be tailored based on the resistance profile of the infection, motivating the development of methods to better predict resistance 20
The effect of demographics on antibiotic resistance
Considering the distribution of demographic factors across the UTI dataset, we find gender-specific age distributions (Fig. 2a). UTIs were much more common for females than males (~88% females, consistent with previous literature 9,23) and had qualitatively different age distributions. For females, UTIs appeared primarily in three distinct demographic groups: young children; reproductive-age women, including pregnant women; and the elderly, including retirement-home patients (Fig. 2a, top). In contrast, for males, there were only two peaks: an early-life peak of UTIs, dominated by infants, and a late peak of events for the elderly (Fig. 2a, bottom). Considering geography, the dataset spanned multiple cities which varied in their distribution of demographic factors (Supplementary Fig. 3). In total, this set of community-acquired UTIs showed diversity of demographic factors, allowing analysis of their impacts on resistance.
Age, gender, pregnancy and residence in retirement home had strong, yet differential, association with resistances to the 16 antibiotics. For each antibiotic, we performed multivariate logistic regression for the odds of resistance (η = PResistance/(Psensitive + Pintermediate) as a function of age (quadratically), gender, retirement home residence, pregnancy, city, date of sampling (quadratically) and season of sampling (Methods: Logistic regression “Demographics” model). For all of the 16 antibiotics, risk of resistance strongly increased with age and with retirement-home residence and decreased for females and pregnancy (Fig. 2b; and Supplementary Table 1 for regression coefficients and 95% CI). The effect of age on the odds of resistance differed widely among the 16 measured antibiotics, ranging from odds ratios (OR) of 2 in trimethoprim/sulfa and amoxicillin to more than 16 in antibiotics such as ciprofloxacin and amikacin (Fig. 2b and Supplementary Table 1). For most antibiotics, females had lower odds of resistance, yet the odds ratios varied substantially among the different antibiotics (from 0R=0.94 CI: 0.92-0.96 for trimethoprim/sulfa to 0R=0.12 CI: 0.10-0.15 for meropenem). These lower odds of resistance for females were even lower with pregnancy (as much as 0R=0.46; 95% confidence interval, 0.44-0.49 for ciprofloxacin; Supplementary Table 1). While the city of residence, when considered alone (Methods: Logistic regression “City, unadjusted”), was correlated with risk of resistance, these associations stemmed from differences in demographics among cities; they were much reduced in the complete regression model (compare adjusted and unadjusted coefficients, Supplementary Table 1). The date of sample had some association with resistance to specific antibiotics, most notably ampicillin and ceftriaxone, while season had almost no effect on resistance for any of the drugs (Fig. 2b).
The dependence of resistance on age showed complex and antibiotic-specific patterns. To better understand how the demographic factors interact and what is the effect of age as a continuous variable, we calculated, for each of the 16 antibiotics the odds of resistance η of the urine samples across ten age groups for five categories: men not residing in retirement homes, men residing in retirement homes, women not pregnant and not residing in retirement homes, women in retirement homes, and pregnant women (Fig. 2c and Supplementary Fig. 4). For some antibiotics, the odds of an infection being resistant were non-monotonic with age, with an additional peak of higher resistance risk at infancy or young age (e.g., nitrofurantoin, piperacillin/tazobactam; Fig. 2c and Supplementary Fig. 4). The effect of age was not always independent of other variables. We identified interaction between gender and age leading to heterogeneous patterns for males and females (e.g. amikacin, meropenem) and even to opposite effects of gender on odds ratios in specific ages groups (e.g. ciprofloxacin; Fig. 2c and Supplementary Fig. 4). While, across all antibiotics, resistance was higher for residents of retirement homes, the correlation with age within this group was reversed: the odds of resistance for retirement home residence decreased with age (Fig. 2c and Supplementary Fig. 4). We concluded that among the different demographic factors associated with risk of resistance, age, gender and residence in retirement homes are the strongest, with resistances to different antibiotics differentially affected by these factors and by interactions among them.
Correlation of resistance among same-patient urine samples
Moving from demographics to clinical history, we analyzed correlations of resistance across same-patient infections, revealing short- and long-term “memory” and a constant patient-specific tendency for resistance. For each antibiotic A, we analyzed all same-patient pairs of samples and calculated the odds ratio for resistance of the second sample given the resistance of the first sample (ORpairs = (NR→R/NR→S)/(NS→R/NS→S), where the N’s are number of same-patient sample pairs with the specified resistance phenotypes; for example, NR→S is the number of sample pairs in which the first sample is resistant to antibiotic A and the second sensitive; Methods). Calculating ORpairs as a function of the time difference t = tl – t2 between the two samples in each pair, we find that, for all antibiotics, these odds ratios are highest for short time differences between the samples, then decay with two distinct time scales and finally converge, at the longer time differences, to a non-zero asymptotic effect (the odds ratios are well fitted by the sum of a fast exponent, a slow exponent and a constant, ORpairs ≃ C0et/τ0 + C0et/τm + C∞; Fig. 3a,b and Supplementary Fig. 5). The shortest time scale τ0, which is about a week, stems from successive samples likely taken from the same infection. For all antibiotics, resistance of a prior sample within this short time increased the odds of resistance of the second sample by not less than two orders of magnitude (Fig 3a,b,d, red). The second time scale τm represents much longer “memory” of correlations among samples, lasting more than six months for most antibiotics and even longer than a year for some (Fig. 3c). These long-term associations among sample measurements are consistent with repeated infection with the same or similar strains. The effect of this memory component was much more consequential and long lasting than previously thought. The maximal odds ratios of this memory term (Cm) ranged between 6 and 14 for different antibiotics and typically remained larger than 2 even for samples taken a year apart (Fig 3a,b and Supplementary Fig. 5). At much longer times, this memory decayed, and ORpairs converged to a constant, but interestingly it did not diminish, but rather converged to values larger than 1 (Fig. 3a,b,d, green), representing timeless patient specific tendencies for resistance. In sum, we found an antibiotic-specific long-term memory of resistance between infections that can last for months to a year, as well as timeless correlations among even more remote samples, which can greatly improve predictions of resistance.
Association of resistance with past antibiotic purchase
Next, we crossed the infection dataset with patient-linked antibiotic purchase data. For each patient with recorded UTI samples, we retrieved all records of antibiotic purchase made during the twenty year period from 1-Jan-1998 to 30-Jun-2017. For analysis, we used 34 drug groups (Methods; Supplementary Table 2). For each UTI sample, we counted the number of purchases made by the same patient of each of the 34 drugs at distinct time intervals prior to the sample (Methods). Then, we applied multivariate logistic regression to correlate resistance to each of the 16 antibiotics with these drug purchase counts (Methods: Logistic regression “Purchase history”; Fig. 4a and Supplementary Fig. 6a).
We identified strong long-term patient-level associations of resistance with past purchase of both cognate and noncognate antibiotics. For 22 out of the 34 drugs, we identified strong associations of prior drug purchase with future resistance, often lasting for months and even more than a year (Fig. 4a and Supplementary Fig. 6a). For example, the associations between purchase of ciprofloxacin and its cognate resistance having odds ratios of 1.5 after half a year and remaining as large as 1.2 even two years past purchase (Fig. 4a). Some weak negative associations were also identified (ciprofloxacin resistance was negatively correlated with past use of amoxicillin, Fig. 4a). Yet, the magnitude of these negative correlations decreased after adjusting for demographics, suggesting that they stemmed indirectly from correlations of purchases and resistance with demographics (Methods: Logistic regression, “Purchase history adjusted for demographics”; Supplementary Fig. 6b). Notably, drug purchases were associated not only with their expected cognate resistances. Indeed, use of both first-line antibiotics (such as ciprofloxacin, ofloxacin and levofloxacin) and last-line antibiotics (such as amikacin), increased the risk of a future resistance to a wide range of mechanistically diverse antibiotics. These abundant long-term positive associations between resistances and noncognate drugs did not stem from correlations of purchases and resistance with patient demographics; they remained strong even when adjusting for demographics (Supplementary Fig. 6b). Together, these results support strong and long-lasting associations of antibiotic resistance with past use of both cognate and noncognate antibiotics.
Exposing direct drug-to-resistance associations by disentangling linkage among resistances, showed that drug usage specifically selects for its cognate resistance at the single-patient level. Across the sample dataset, resistances to different antibiotics within class and even resistances to antibiotics of different classes were highly correlated (cross resistance; Supplementary Fig. 7). These inherent correlations among resistances suggest that observed associations between resistance to a given drug A and past purchase of a different drug B may arise indirectly through selection for resistance B and association between resistance to B and resistance to A. Mathematically discerning these direct from indirect effects is not possible when only a single resistance is considered 17,36 Yet, as our dataset contained measurements of multiple resistances for each sample, we were able to disentangle direct from indirect associations by adjusting the logistic regression for other measured resistances (Methods: Logistic regression “Purchase history adjusted for cross-resistance”). In this cross-resistance adjusted analysis of purchase-resistance associations, the noncognate associations between drug purchases and resistance much diminished and even disappeared while the associations between cognate drug-to-resistance pairs persisted (Fig. 4b and Supplementary Fig. 6c). For example, considering association between purchases of trimethoprim/sulfa and ciprofloxacin to their cognate resistances, we observed that the total (unadjusted) effect of cognate drugs remained when adjusting for cross-resistance (Fig 4c,d, thick solid vs. thick dashed lines), while the total effect of drugs on their noncognate resistance was much diminished when removing indirect effect (Fig 4c,d, thin solid vs. thin dashed lines). Our analysis therefore suggests both direct and indirect selection for resistance at the single-patient level lasting months and even a year following drug use.
Predicting antibiotic resistance at the single-patient single-infection level
Finally, as resistance is strongly associated with demographics, sample history and purchase history, we wanted to determine the predictive power of these factors individually and when combined together. Models of Logistic Regression and Gradient Boosting Decision Trees (GBDT) were trained and tested on temporally separate periods: training period of 9 years from 1-July-2007 to 30-June-2016 and testing period of the following year, from 1-July-2016 to 30-June-2017 (for some antibiotics, training and testing periods were modified to avoid time periods during which resistance was not routinely measured, Supplementary Fig. 1, green/red horizontal bars). This temporal separation between training and testing data emulates forecasting resistance, as would be the case in real-life implementation of such a method. Area Under the Curve (AUC) of Receiver Operating Characteristic was used as a standard measure for predictive power 37.
Individually considering demographics, sample history and purchase history, we find that each of these sets of features had significant predictive power, yet their dominance varied across the different antibiotics (Fig. 5). For demographic parameters (Methods: Logistic regression “Demographics”), AUC ranged from barely 0.58 for ampicillin to values as high as 0.85 for amikacin, reflecting their weak and strong demographic dependencies, respectively (Fig. 2b). For sample history (Methods: Logistic regression “Sample history”), the predictive power was more consistent, with AUC varying between 0.64 for piperacillin/tazobactam to 0.83 for ertapenem. Purchase history (Methods: Logistic regression “Purchase history”) had barely any predictive power in drugs like meropenem and nitrofurantoin (AUC of 0.52 and 0.59, respectively), yet was higher for other drugs, especially for the fluoroquinolones (AUC of 0.78 for ofloxacin and 0.75 for ciprofloxacin). Comparing the relative strengths of these feature sets, we find that while they vary in dominance across the antibiotics, resistance to any given antibiotic was often well predicted by more than one set, suggesting that even higher predictability could be achieved by combining them.
Combining these feature sets, using logistic regression or GBDT models, much increased predictability of resistance. Using all features together in a complete logistic regression model (Methods: Logistic regression “Complete”), provided good to excellent predictability to all antibiotics (AUC ranged from 0.7 for amoxicillin/CA to and beyond 0.85 for ertapenem, amikacin and ofloxacin; Fig. 5). Predictability of resistance across all antibiotics was increased even further by the complete GBDT models (Methods; Fig 5), most notably for meropenem, which was prone for overfitting (Supplementary Fig. 8). A closer look at the decision trees of the GBDT models identified the relative importance of the different features (Supplementary Fig. 9). While some features, like age and past resistance were important in all models, others like gender and time interval from last antibiotic purchase where more specific to each resistance. Together, these results demonstrate how machine learning models, trained on the combined dataset of demographics, sample history and drug purchase history, can provide high predictability of antibiotic resistance at the single-patient and single-infection levels.
Discussion
Analyzing a large longitudinal medical dataset, we demonstrate the predictability of antibiotic resistance in UTIs based on demographics, antibiotic resistance profile of past UTIs and purchase history of antibiotic drugs. These culture-free predictions could guide appropriate antibiotic prescription. Considering demographics, we found that - age, gender, pregnancy, and residence in a retirement home were strongly associated with resistance, showing complex even non-monotonic patterns specific to each of the different antibiotics. Utilizing repeated same-patient cultures in our database, we identified and characterized a personal component of “memory” of resistance, lasting for many months and even over a year and further contributing to predictability of resistance. Long memory was also observed between resistance and past drug purchases. Resistance to a given drug had long-lasting associations not only with past usage of this same drug, but also with other, even mechanistically unrelated, drugs. Yet, adjusting for linkage among resistances exposed direct selection where drug use leads specifically to its own cognate resistance at the single patient level. Taken together, these results are consistent with drug use directly selecting, at the single-patient level, for strains resistant to it and thereby selecting indirectly, likely through genetic linkage, to resistance to other antibiotics. Once resistance strains occurs they persist in the body affecting the chance of resistant infections months ahead.
Some aspects of the data may complicate the interpretation of our results, especially wherever causality may be inferred. As purchase of a drug does not fully guarantee its concurrent use, later usage of a purchased drug may bias our results towards higher odds ratio for purchases made long before infection. Conversely, we can not exclude that some patients have used antibiotics they did not purchase through MHS, which will bias our results towards lower odds ratio for drug purchases. Also, although culture data is routine for suspected UTIs, sending urine for a culture test is not obligatory. As a result, we assume some UTIs would be empirically treated without any culture record, and there could be enrichment for culture tests following treatment failure. Such enrichment towards cultures following treatment failure can generate bias towards measurements of more resistant samples, resulting in overestimation of the total frequency of resistance, especially for first-line treatment. It can further contribute to the strong short-term association of drug purchases with resistance, especially for first-line antibiotics. The level of enrichment for urine culture tests following treatment failure could itself depend on demographics, which can bias correlations of demographics with resistance. While we cannot exclude these biases, our analysis firmly demonstrates that, with all of these potential biases, resistance of urine samples can be well predicted based on the patient’s specific demographics and clinical history.
The predictability of infection resistance profiles lays the basis for a future paradigm where clinicians will routinely consult machine learning algorithms for prescription of antibiotic treatment tailored by the patient records and their clinical history. While the key factors identified here can serve as the basis of such approach, we expect that the specific model, the exact coefficients and relative weights of predictors, must be tailored for each country or region. Indeed, these algorithms can also be dynamically and adaptively updated in real time as new data is acquired. We also expect that inclusion of additional patient specific factors, such as comorbidities, can further increase resistance predictability. In the longer term, these clinical-record based approaches could be integrated with genomics of the patient as well as of the pathogen 38–244 Implemented in the clinic, machine-learning guided personalized empirical prescription can reduce treatment failure as well as lower the overall use and misuse of antibiotics thereby assisting in the global effort of impeding the antibiotic resistance epidemic.
Methods
Data
Anonymized clinical records of urine culture tests (“culture reports”) and records of antibiotic purchases (“purchase reports”) were obtained from Maccabi Health Services (MHS) for the time period from July 2007 to June 2017. Random patient identifiers were used to link culture reports and antibiotic purchase reports.
Culture reports
Antibiotic resistance profiling of bacterial pathogens isolated from urine cultures was carried out centrally (in two locations until 2010, and in one central lab since). We retrieved 711,099 culture reports of positive samples from 315,047 patients total. Each report included: (1) Unique patient code; (2) Date of sample; (3) List of isolates cultured with species identification (typically one isolate per sample; 3.6% of samples had more than one isolate); (4) Resistance profile of the isolates from processed results of a VITEK 2 system given as Sensitive, Intermediate and Resistant for each drug tested. Not all antibiotics were measured for each sample, and only the 16 most commonly measured antibiotics were included for further analysis (NResistances = 16, Supplementary Fig. 1, Table 1). Additionally, for each of the 16 included antibiotic resistances, we removed time intervals where only a small percentage of samples were measured to avoid sampling bias (Supplementary Fig. 1, gray). (5) Demographics: age, gender, pregnancy of the patient, as well as a code for the reporting physician which we used to identify patients residence in retirement homes. (6) Referring lab identifier, which was used to assign patients to a city of residence.
Antibiotic purchase reports
All drug purchases by prescription are routinely recorded in MHS databases. We identified and retrieved all purchases by patients with culture reports by converting internal MHS drug codes to ATC classifications of antibiotics (Supplementary Table 2). Each purchase record included: (1) Unique patient code to be linked to the code of the culture record; (2) Internal MHS product code, which was translated to an ATC drug code, (3) Date of purchase.
Choice of drugs for analysis
We focused on 34 groups of antibiotic compounds (NATC = 34), including the 30 most purchased during the time of the study (groups 1-30, Supplementary Table 2) as well as less frequently purchased groups of compounds manually identified as cognate drugs of the 16 antibiotic resistances analysed (groups 31-33 in Supplementary Table 2).
Feature definition
For each cultured sample, we define the following parameters used for the logistic regression and the gradient boosting decision trees:
Sample resistance profile
For each urine sample, we define Yk as 0 for sensitive and intermediate and 1 for resistant to antibiotic k (1 ≤ k ≤ NResistances). If the sample had multiple isolates, Yk was assigned 1 if at least one isolate was resistant. Missing resistance measurements are defined as N/A, and for each antibiotic k only samples which have defined resistance to it are used when training or testing its Logistic Regression or Gradient Boosting Decision Trees (GBDT).
Demographics
XGender: 0/1 for males/females; XPregnancy: 0/1 indicating pregnancy; XRetHome: 0/1 indicating residence in retirement homes; XAge: patient age in years at time of UTI sampling; XDate: date of sample in days starting at 1-July-2007; XSeason: the day of sample within the calendar year, where 1 is 1-Jan and 365 is 31-Dec; : 0/1 indicating residence in city j (1 ≤ j ≤ NCity, NCity = 20, Supplementary Fig. 2 for City list; we use Tel-Aviv as a reference); for GBDT, nominal XCity ∊ {1, ‥,20} was used instead.
Sample history
For a given sample, we consider all earlier samples taken from the same patient (if any). We bin the time difference between any such earlier sample and the current sample, t = tpast sample – tSample (t is negative designating past events), into one of 8 logarithmically spaced time bins (i = 1,2,…,Ntime bins Ntime bins = 8 a bin i is defined by ti ≤ t < tt–1, where the boundaries of these time bins are t0 = 0, t1 = −1, t2 = −2, t3 = −4,…, t8 = −128 weeks). We then calculated and as the number of prior cultures within time bin i, with sample resistance Yk equal to 0 or 1 (Resistant or Sensitive), respectively.
Drug purchase history
For each urine sample, we consider all earlier drug purchases made by the same patient. We bin the time difference between the urine sample date and a given past purchase, t = tPurchase − tSample, into 8 time bins as above. For each sample, we then calculate as the number of purchases of drugs of a given group j (1 ≤ j ≤ NATC, Supplementary Table 2) made by the patient during time bin i. For the GBDT model, we also included as additional features the types j of the 3 latest purchases (nominal features) and their purchase-to-sample time difference (t, in days)
Cross-resistance
To resolve direct versus indirect associations of drug purchase and resistance, we adjusted the logistic regression of resistance to a given antibiotic k as a function of past drug purchases by the resistances to all other drugs j which are non-analogous to k. We define Ak,j as a binary variable equals 0 and 1 for analogous versus non-analogous drug pairs, respectively. “Analogous” pairs are defined as antibiotics which have exceptionally high cross-resistance (Ak,J = 0 for Corr(Yk, Yj) > Athreshold; we use Athreshold = 0 7 which corresponds to drug pairs of the same class; see pairs labeled with ‘x’ in Supplementary Fig. 7). We then add as features in the regression analysis of a given antibiotic k the resistance measurements Yj to all antibiotics j for which Ak,j = 1. Note: These cross-resistance features provide information from the focal sample and were used only in the analysis of direct/indirect effect of purchases (Fig. 4 and Supplementary Fig. 6c). Importantly, they were not used when evaluating predictability of resistance (Fig. 5).
Logistic regression
Logistic regression of resistance for each antibiotic was performed via the Matlab glmfit function. For each of the 16 antibiotics k, the probability of resistance Pk was fit to the sample resistance Yk for all urine samples which had measurement of resistance to k either across the entire 10 year dataset (for Figs. 2,4), or across the “training period” (for the analysis of predictive power of Fig. 5; see Supplementary Fig. 1 for definition of the training period for each of the 16 antibiotics). The different logistic models included combinations of the following 10 terms:
Different combination of the above terms were used in the different regression models as follows (each row in the Table represents a logistic model that was applied to each of the 16 antibiotics):
Calculating odds ratios from logistic regression
For each antibiotic k, odds ratios were calculated from the coefficients of above logistic regressions.
Binary variables
For the binary variables Gender, Pregnancy and Retirement Home, odds ratio were defined as: female versus male, pregnant versus non-pregnant, retirement home residence versus patients not residing in retirement homes.
Categorical variables
For City, which is a categorical variable, odds ratios for each city j relative to the reference (Tel-Aviv) is given by , where are reported in Supplementary Table 1. In Fig. 2, we report for City the odds ratio , where jmax is the index of the city with largest effect, accounting for error (jmax is defined for each antibiotic k as the city j with maximal, where are the standard errors for the regression coefficients
Oscillatory variables
For seasonality, the regression coefficients and and their CIs are given Supplementary Table 1. In Fig. 2 we provide the odds ratios of the oscillation amplitude: .
Quadratic variables
For the quadratic variables Age and Date, we again report all the individual regression coefficients and their CIs in Supplementary Table 1. In Fig. 2b, we report, for each antibiotic k, effective odds ratios defined as the ratios between the maximal and minimal expected odds taken across the relevant range of these continuous variable (0 ≤ XAge ≤ 100 and 0 ≤ XDate ≤ 10 · 365):
Note that when these quadratic dependencies are monotonic within the relevant range (0 ≤ x ≤ 1), the above formulas become simply: ,.
Analysis of “memory” across sample pairs
To analyze “memory” of resistance across samples, we considered all pairs of samples from the same patient (across all patients with 2-10 samples) and binned them according to their time difference t = t1 – t2 (where t1 and t2 are the sample dates of the early and late sample; t is always negative, indicating information on current sample from past samples) into time bins as indicated by the bars in Fig. 3. In each time bin and for each antibiotic, we counted NR→R, NR→S, NS→R, and NS→S as the number of urine sample pairs where the early and late samples are Resistant, or Sensitive (for example NR→S is the number of same-patient sample pairs, within the time difference bin, where first sample is Resistant and the second Sensitive to the given focal antibiotic. For any antibiotic, only samples for which resistance to the antibiotic was measured were considered). We then calculated for each time difference bin the resistance odds ratio: ORpairs = (NR→R/NR→S) / (NS→R/NS→S).
Gradient Boosting Decision Trees (GBDT)
GDBT is an ensemble method combining regression trees with weak individual predictive performances, into a single high-performance model. This is done by iteratively fitting decision trees, each iteration targeting the prediction residuals of the preceding tree. The final model is built by combining weighted individual tree contributions, with weights proportional to their performances. For each of the 16 antibiotics, a boosted decision tree ensemble was fitted using all features as defined above (demographics, sample history and drug purchase history) on the training set as defined by the training time period (Supplementary Fig. 1, green bars). This training dataset was sampled to balance resistant/sensitive label frequency. For parameter tuning, a validation dataset was sampled from the training set to be used for model selection (20%). For the estimator of the ith iteration, a decreasing learning rate ηi was used such that ηi = η0αi, with an annealing rate α =0.99 and an initial learning rate η0 = 0.1. To further promote a diverse ensemble of individual estimators, a 0.9 feature-sampling and observation-sampling rates were used. Fitting of interaction effects is controlled by varying the size of the individual regression trees, with tree estimator of depth k producing models with up to k-way interactions. The model was tuned to match data complexity by iteratively increasing tree depth limit of all ensemble estimators while evaluating performance on the validation set, selecting the best depth for each antibiotic.
Ethical approval
The study protocol was approved by the ethics committee of Assuta Medical Center.
Contributions
V.S. and R.Kishony perceived the study; I.Y, O.S., G.K., V.S. and R.Kishony designed the study; R.Katz, M.P. and V.S. retrieved and interpreted electronic health records; I. Y., O.S., G.N. and R.Kishony analysed the data; I.Y, O. S., G.C., V.S. and R.Kishony interpreted the results; I.Y and R.Kishony wrote the manuscript with comments from all authors.
Acknowledgements
We thank M. Datta, A. McAdam, G. Priebe and P. Ramesh for thorough reading of the manuscript and important comments. This work was supported in part by US National Institutes of Health grant R01 GM081617 (to RK) and European Research Council FP7 ERC Grant 281891 (to RK).