Abstract
The Bene Israel Jewish community from West India is a unique population whose history before the 18th century remains largely unknown. Bene Israel members consider themselves descendants of Jews, yet the identity of these ancestors and the time of their arrival in India are unknown, with speculations on arrival time varying between the 8th century BCE and the 6th century CE. Here, we characterize the genetic history of Bene Israel by collecting and genotyping 18 Bene Israel individuals. Combining these with 438 individuals from 32 other Jewish and Indian populations, and additional individuals from worldwide populations, we conducted comprehensive genome-wide analyses based on FST, PCA, ADMIXTURE, identity-by-descent sharing, admixture LD decay, haplotype sharing and allele sharing autocorrelation decay, and contrasted patterns between the X chromosome and the autosomes. Bene Israel individuals resemble local Indian populations, while at the same time constituting a clearly separated and unique population in India. They share genetic ancestry with other Jewish populations to an extent not observed for any other Indian population. Taken together, the results point to Bene Israel being an admixed population with both Jewish and Indian ancestry, with the genetic contribution of each of these ancestral populations being substantial. The admixture took place in the last millennium, about 19-33 generations ago. It involved Middle-Eastern Jews and was sex-biased, with more male Jewish and local female contribution. It was followed by a population bottleneck and high endogamy, which has led to increased prevalence of recessive diseases in this population. This study also provides an example of how genetic analysis advances our knowledge of human history in cases where other disciplines lack the relevant data to do so.
Introduction
How well does the oral history of a group reflect its origins? The Bene Israel community in West India is a unique community whose historical background before the 18th century, other than their oral history, remains largely unknown (1-3). The Jewish philosopher Maimonides, in a letter written 800 years ago (circa 1200 CE), briefly mentioned a Jewish community living in India and may have referred to them (4). In the 18th century the Bene Israel lived in villages along the Indian Konkan coast and were called Shanwar Teli (Marathi for ‘Saturday oil pressers’), as they were oil pressers who did not work on Saturdays. After 1948, most of the community immigrated to Israel. At the beginning of the 21st century, approximately 50,000 members lived in Israel, whereas about 5,000 remained in India, mainly in Mumbai (2). Oral tradition among the Bene Israel holds that they are descendants of Jews whose ship was wrecked on the Konkan shore, with only seven men and seven women surviving (2, 3, 5). The exact time of this event, as well as the origin and identity of the survivors, are not known. Some date it around two millennia ago (2), whereas others suggest a specific date and origin: around 175 BCE, where the survivors were Jews living in the northern parts of the land of Israel who left their homes during the persecutions of Antiochus Epiphanes (5). Adding to the vagueness of Bene Israel origin is the fact that a similar story of seven surviving couples is found in the myths of other Indian populations (2, 3). Others suggest that the ancestors of Bene Israel arrived in India earlier (as early as the 8th century BCE) or later (from Yemen during the first millennium CE, or from Southern Arabia or Persia in the 5th or 6th century CE) (4). However, beyond vague oral traditions and speculations, there has been no independent support for any of these claims. Indeed, it is unclear whether Bene Israel are related at all to other Jewish groups, and their origin remains “shrouded in legend” (4).
In the last decades, genetic information has become an important source for the study of human history and has been applied numerous times for various Jewish populations, first by using uniparental Y chromosome and mitochondrial DNA (mtDNA) markers (6-11) and later by using genome-wide markers (12-15). These studies found that most Jewish Diasporas share ancestry that can be traced back to the Middle-East, in accordance with historical records (12-15). Some of these studies included Bene Israel members, though with inconclusive results (15). Bene Israel mtDNA pool was strongly dominated by a local Indian origin (8, 11, 13) although a few haplogroups found in Bene Israel samples were not present in local Indian populations, but were present in several Jewish populations (11). Y chromosome analysis suggested some paternal link between Bene Israel and the Levant, but the study was based on only four Bene Israel males (13). Another Y chromosome analysis showed that a common Indian haplogroup was almost absent in Bene Israel males, whereas the Cohen Modal Haplotype (CMH) (16) was common in Bene Israel (and other Jewish) males though also present at lower frequencies in other Indian populations (3). These results suggest that the founding males of this population might have had Middle Eastern, possibly Jewish, origins. On the contrary, analysis of the autosomes or the X chromosome did not find any evidence of Jewish origin of Bene Israel community, and it has been concluded that they resembled other Indian populations (13, 15). Thus, these studies also left their genetic history largely unknown.
The complex genetic structure of Indian populations imposes further challenges for genetic analysis of Bene Israel. Previous studies showed that most contemporary Indian populations are a result of ancient admixture (64-144 generations ago) of two genetically divergent populations: ancestral north Indians (ANI), who are related to west Eurasians, and ancestral south Indians (ASI), who are not closely related to populations outside India and related to indigenous Andaman Island people (17, 18). Different Indian populations vary in the proportion of admixture between these two ancestral populations (18). As the ANI component is related to west Eurasia, which includes the Middle East and Europe, a study analyzing the connection between Bene Israel and other Jewish or Middle Eastern populations needs to examine whether such a connection reflects a unique ancestry component, rather than simply being a result of the ANI component that is shared by other Indian populations. To study the genetic history of Bene Israel, while addressing this challenge, we present here the largest collection of Bene Israel individuals that has been assayed genome-wide to date (18 individuals), and we use the collection in conjunction with genotype data of 438 individuals from 32 other Jewish and Indian populations, as well as samples from various worldwide populations. We apply an array of genome-wide population genetic tools that characterize the origins of Bene Israel and their relations to both Indian and Jewish populations and uncover unknown parts of their history.
Results
Bene Israel cluster with Indian populations but as a distinct group
We genotyped Bene Israel individuals and combined them with 14 other Jewish populations from the worldwide Diaspora previously genotyped using the same array (12, 14). We applied various quality control (QC) steps on these samples, resulting in 18 individuals of the Bene Israel community together with 342 samples from the other 14 Jewish populations. We also applied the same QC steps to a different dataset with samples from 18 different Indian populations (96 individuals), as well as HapMap3 populations, that were genotyped previously on the same array (17) and merged the two datasets. We also merged the data with Middle-Eastern non-Jewish populations (Druze, Bedouin and Palestinians) from the HGDP panel (19) (Table S1). As the merging with HGDP resulted in a considerable reduction in the number of SNPs available for analysis, we only considered it for some analyses. PCA (Principal Component Analysis) together with four HapMap populations (YRI, CEU, CHB and JPT; total 809 individuals) showed that Jewish populations cluster together with Europeans, while Indian populations form their own cluster, between East Asians and Jews/Europeans (Figure 1A). Bene Israel clustered with the Indian populations, similar to the results of a previous study (13), but were the closest to the Jewish/European cluster as compared to all other Indian populations (Figure 1A). When focusing only on Jewish and Indian samples (Figure 1B), the first PC separated Indian from Jewish populations while the second PC spanned the Jewish populations. Indian populations were projected based on their ANI-ASI admixture (17, 18) such that populations with higher ANI proportion were closer to Jews in general, as expected by the Middle-Eastern origins of the latter (12-15), and specifically to Middle-Eastern Jews. Bene Israel members were located close to members of other Indian populations but were also the closest to samples from Jewish populations among all Indian samples (Figure 1B).
Motivated by this observation, we applied PCA focusing only on Bene Israel and Indian populations, in which Bene Israel members formed a distinct cluster that is clearly separated from other Indian populations (Figure 1C). PCA with only Bene Israel and Jewish populations showed their separation from other Jewish populations (Figure 1D). To avoid bias in PCA due to differences in the number of samples between populations (20), we repeated the analysis while limiting the number of samples from each population to 4, and obtained similar results (Figure S1).
We also examined the relation between Bene Israel and other Indian and Jewish populations using the FST statistic, which measures genetic drift between populations based on differences in allele frequencies (17, 21) (Figure S2 and Table S2). This analysis revealed the isolation and genetic drift of Bene Israel from both Indian and Jewish populations: While the average FST between pairs of Jewish populations was 0.011, the average FST between Bene Israel and other Jewish populations was significantly higher (0.04, Wilcoxon rank sum P-value=1.97e-9). Similarly, while the average FST between different Indian populations was 0.011, the mean FST between Bene Israel and other Indian populations was significantly higher (0.033, P-value=8.63e-12, Wilcoxon rank sum test).
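As a minimal illustration of how an FST value is obtained from allele frequencies, the sketch below uses Hudson's estimator in its ratio-of-averages form. This is an assumption made for illustration: the text above does not state which estimator was used, and the function name and inputs are hypothetical.

```python
def hudson_fst(freqs_pop1, freqs_pop2, n_chrom1, n_chrom2):
    """Hudson's FST between two populations (illustrative sketch).

    freqs_pop1, freqs_pop2: per-SNP sample allele frequencies (same SNP order).
    n_chrom1, n_chrom2: number of sampled chromosomes in each population.
    The estimator choice is an assumption; the study may have used another one.
    """
    if len(freqs_pop1) != len(freqs_pop2):
        raise ValueError("frequency lists must cover the same SNPs")
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += ((p1 - p2) ** 2
                      - p1 * (1 - p1) / (n_chrom1 - 1)
                      - p2 * (1 - p2) / (n_chrom2 - 1))
        # Probability that two alleles, one from each population, differ.
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("no polymorphic SNPs between the two populations")
    # Ratio of sums across SNPs rather than an average of per-SNP ratios.
    return numerator / denominator
```

Summing numerator and denominator across SNPs before dividing keeps rare variants from dominating the estimate, which is why this form is often preferred for genome-wide data.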
ADMIXTURE analysis suggests Bene Israel members have Middle-Eastern ancestry
ADMIXTURE (22) assigns for each individual its proportion in any of a set of hypothetical ancestral populations and hence can reveal relations between different populations. We used this tool on our dataset for varying values of K (the number of hypothetical ancestral populations) on a set of Jewish, Indian, non-Jewish Middle-Eastern (Druze, Bedouin and Palestinian), and three HapMap populations (CEU, CHB and YRI; Figure 2 and Figure S3). For K=3 (Figure 2 and Figure S3), we observed three clusters: East Asian, Sub-Saharan African and Middle Eastern/European. Indians were mainly composed from Middle-Eastern/European and East-Asian components. The proportion of the Middle-Eastern/European component in the Indian populations was correlated to their ANI component in the ANI-ASI admixture reported previously (17) (R=0.91, P-value<e-16, Spearman correlation). Among the Indian populations, the Bene Israel had the highest proportion of Middle-Eastern/European component. At K=4 an Indian cluster emerged which reflects the ASI component in these populations and most Indian populations were composed from this component and the Middle-Eastern/European component, while some of them also had an East-Asian component. Again, Bene Israel population had the highest proportion of Middle-Eastern/European component. At K=5, which provided the best fit based on cross-validation, the European/Middle-Eastern cluster was divided into two clusters: European (reflected by the European population CEU) and Middle-Eastern (reflected by Jewish and Middle-Eastern populations). Importantly, Bene Israel population exhibited a different trend as compared to other Indian populations: While the ANI component of Indian populations was now reflected in the European component, the Bene Israel showed a significantly higher proportion of a Middle-Eastern component as well (mean 31% as compared to less than 16% in all other Indian populations, Wilcoxon rank sum P-value=4.79e-12). 
For K=6 a new cluster emerged which was mainly found in North-African Jews (Djerban, Libyan and Tunisian Jews). Interestingly, for K=7 Bene Israel formed their own hypothetical ancestral component, marking again the uniqueness of this population and its deviation from other populations. This ancestral component was present, in very small proportions, in many Indian populations but also in some Middle-Eastern populations (Jewish and non-Jewish).
Identity-by-descent analysis suggests Bene Israel members are related to Jewish populations
Next, we analyzed the relations between populations based on identity-by-descent (IBD) sharing of their individuals. An IBD segment shared by two individuals represents a segment inherited from a common ancestor. Higher IBD sharing suggests a more recent common ancestor (23). Following a similar previous analysis of the Jewish populations examined here (except Bene Israel) (12, 14), we used GERMLINE (24) to detect IBD segments between individuals and defined the IBD sharing between individuals to be the total length (in cM units) of IBD segments shared between the two individuals. IBD sharing between populations was defined as the average IBD sharing of unrelated individuals from these populations. As expected from previous results, each Jewish population exhibited significantly higher IBD sharing with other Jewish populations than with Indian populations and each Indian population exhibited higher IBD sharing with other Indian populations than with Jewish populations (Wilcoxon P-value<0.05 for all populations; Figure 3A). Having these two IBD clusters of Indian and Jewish populations, we observed that compared to all Jewish populations, Bene Israel had the highest IBD sharing with all Indian populations, and compared to all Indian populations, Bene Israel had the highest IBD sharing with all Jewish populations (Figure S4A). Furthermore, Bene Israel was the only population with no significant difference in IBD sharing between the two clusters of Jewish and Indian populations (mean IBD sharing=18.24 cM vs. 18.19 cM with Indian and Jewish populations, respectively; P-value=0.61, Wilcoxon rank sum test; Figure 3B). The closest populations to Bene Israel in terms of IBD sharing were Middle-Eastern Jews (Georgian, Iraqi and Syrian Jews) from the Jewish side and Velama, Lodi and Bhil from the Indian side. Interestingly, the closest Indian populations to Bene Israel were not those with the highest ANI component.
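The two definitions above (per-pair sharing as total segment length, population sharing as the average over pairs) can be sketched as follows. The input layout, the function names, and the choice to count pairs with no detected segment as 0 cM are assumptions for illustration; this is not GERMLINE's output format or the authors' pipeline.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_ibd_totals(segments):
    """Sum IBD segment lengths (cM) per unordered pair of individuals.

    segments: iterable of (individual_a, individual_b, length_cm) tuples,
    a hypothetical simplified layout.
    """
    totals = defaultdict(float)
    for ind_a, ind_b, length_cm in segments:
        totals[frozenset((ind_a, ind_b))] += length_cm
    return totals


def population_ibd_sharing(segments, population_of, pop_x, pop_y):
    """Average total IBD sharing (cM) over pairs of individuals.

    population_of: dict mapping individual ID to population label; it is
    assumed to contain only unrelated individuals that passed QC.
    """
    totals = pairwise_ibd_totals(segments)
    inds_x = [i for i, pop in population_of.items() if pop == pop_x]
    inds_y = [i for i, pop in population_of.items() if pop == pop_y]
    if pop_x == pop_y:
        pairs = list(combinations(inds_x, 2))   # within-population sharing
    else:
        pairs = [(a, b) for a in inds_x for b in inds_y]
    if not pairs:
        raise ValueError("no pairs of individuals to average over")
    # Assumption: pairs with no detected segment contribute 0 cM.
    return sum(totals.get(frozenset(pair), 0.0) for pair in pairs) / len(pairs)
```

Calling the function with the same population for both arguments gives a within-population average, which is the kind of quantity used when comparing endogamy levels.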
To examine whether the relatively high IBD sharing of Bene Israel with Jewish populations and mainly Middle-Eastern Jews is Jewish specific or Middle-Eastern in general, we repeated our analysis on a merged dataset containing non-Jewish Middle-Eastern populations as well. The IBD sharing of Bene Israel and these non-Jewish populations was lower as compared to their sharing with all other Jewish populations (Figure 3C; Figure S4B), implying that the link between Bene Israel and Jewish populations is at least in part Jewish-specific and not only Middle-Eastern. Although there were differences between the IBD sharing in the two datasets due to the different set of SNPs, there was overall significant correlation between the ranking of IBD sharing of Jewish (R=0.85, P-value=1.36e-4; Spearman correlation) and Indian (R=0.52, P-value=0.03, Spearman correlation) populations in the two datasets. Bene Israel showed higher IBD sharing with Middle-Eastern Jewish populations in this dataset as well (Figure 3C). Higher sharing with Middle-Eastern Jewish populations was also observed when we restricted the analysis to longer segments of IBD that reflect a more recent ancestor (Figure S5). In the Indian side we did not see a clear trend for one of the populations, although in the original dataset Velama showed the highest IBD sharing also in respect to longer IBD segments (Figure S5).
Bene Israel as an admixed population of Jewish and Indian ancestral populations
Motivated by the above results, we next examined whether the Bene Israel community was an admixture of Indian and Jewish ancestral populations, using two different approaches as implemented in the ALDER (25) and GLOBETROTTER (26) tools. Given a putative admixed population and two populations that are taken as surrogates for the true ancestral populations, ALDER computes an admixture linkage disequilibrium (LD) statistic in the admixed population and uses it to examine whether the population is indeed an admixture of the ancestral populations (25). In most cases of one Jewish and one Indian population taken as surrogate ancestral populations, there was consistent and significant evidence for Bene Israel being an admixture of these two populations (147 out of 252 possible pairs; Table S3 and Figure S6; the only population that did not show significant evidence for being an ancestral population for Bene Israel was the Indian Kashmiri Pandit). Repeating the same analysis with pairs of Indian populations or pairs of Jewish populations, or replacing Bene Israel with any other Indian or Jewish population, did not find any pair with significant and consistent evidence for admixture, suggesting that the observed admixture for Bene Israel did not reflect the ANI-ASI admixture but a unique admixture between Jewish and Indian populations. The admixture time estimated by ALDER varied across the 147 significant pairs of populations, from ~19 (Iraqi Jews and Satnami) to ~33 (Georgian Jews and Mala) generations ago (650-1050 years ago, assuming 29 years per generation (18, 27)) with an average of ~25 generations (~820 years) ago (Table S2). These estimates place the admixture between a Jewish and an Indian population well after the estimated time for the ANI-ASI admixtures of Indian populations (64-144 generations ago (18)) and after the establishment of many Jewish Diasporas (15) (Figure 4A).
Turning to the admixture proportions estimated by ALDER, those estimated for Indian populations, varying between 44% (Vaish) and 20.2% (Kharia), were generally higher than those for Jewish populations, varying between 23% (Georgian Jews) and 15.5% (Libyan Jews; Figure 4B). When repeating the ALDER analysis using the merged dataset with non-Jewish Middle-Eastern populations, the results were less significant, as expected given the smaller number of markers, but many pairs of one Jewish/Middle-Eastern population and one Indian population were still significant. Importantly, the results were more significant for Jewish populations as compared to non-Jewish Middle-Eastern populations: While 3 of the 17 (17.6%) Jewish/Middle-Eastern populations examined were non-Jewish, only 8 of the 113 (7.1%) significant pairs contained a non-Jewish population, while all other pairs contained a Jewish population (Table S4 and Figure S7).
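The idea behind dating admixture from LD decay can be sketched with a simplified model: the admixture LD statistic is assumed to decay as A·exp(−n·d), where d is the genetic distance in Morgans and n is the number of generations since admixture. The sketch below fits n by least squares on the log scale. It omits the affine term and the weighting, jackknife, and significance tests that ALDER applies, so it is an illustration of the principle rather than ALDER's procedure; all names and the synthetic curve are hypothetical.

```python
import math


def fit_admixture_time(distances_morgans, ld_values):
    """Fit ld = A * exp(-n * d) by linear regression of log(ld) on d.

    Returns (n_generations, amplitude). Assumes strictly positive LD values
    and no constant offset, which is a simplification.
    """
    if len(distances_morgans) != len(ld_values) or len(ld_values) < 2:
        raise ValueError("need at least two matching (distance, LD) points")
    if any(v <= 0 for v in ld_values):
        raise ValueError("log-linear fit requires positive LD values")
    log_ld = [math.log(v) for v in ld_values]
    count = len(log_ld)
    mean_d = sum(distances_morgans) / count
    mean_y = sum(log_ld) / count
    sxx = sum((d - mean_d) ** 2 for d in distances_morgans)
    sxy = sum((d - mean_d) * (y - mean_y)
              for d, y in zip(distances_morgans, log_ld))
    slope = sxy / sxx
    amplitude = math.exp(mean_y - slope * mean_d)
    return -slope, amplitude   # decay rate per Morgan = generations


# Synthetic, noise-free curve generated with n = 25 (not real data).
distances = [0.005 * k for k in range(1, 41)]           # 0.5 cM to 20 cM
curve = [0.002 * math.exp(-25 * d) for d in distances]
generations, amplitude = fit_admixture_time(distances, curve)
```

A faster decay with distance corresponds to an older admixture, because more generations of recombination have broken down the ancestry blocks.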
In addition, we also applied GLOBETROTTER on our dataset. GLOBETROTTER assigns haplotype segments of the admixed populations to different populations and uses the co-distribution of such segments from different populations to detect and infer admixture. In comparison to ALDER, which focuses on a pair of putative ancestral populations, it considers all populations and assigns ancestry component to all of them simultaneously. Importantly, GLOBETROTTER found evidence for admixture in the similar time range suggested by ALDER – 25.8±2.25 generations ago. Admixture proportion estimation was 54% Indian and 46% Jewish. The Jewish cluster was mainly composed of Middle-Eastern Jewish populations while the Indian cluster was mainly composed of populations with a high ASI component (Figure 4C). This may suggest that the ancestral Indian population had high ASI component. However, some of the ANI component of the ancestral Indian population may have been captured by the Jewish cluster, resulting in the Indian side containing a higher ASI component. If the latter is true, GLOBETROTTER proportions estimation for the Jewish side (46%) is an overestimation of the true proportion as it also contains some of the ANI component in the Indian side. We note here that GLOBETROTTER’s original study analyzed, among 95 worldwide populations, “Indian Jews” (26). However, this group included both the Bene Israel and Cochin Jews (four samples from each population (13)), and none of the Jewish populations examined here was used for the analysis. Nevertheless and reassuringly, they reported an admixture event occurring approximately 20 generations ago, with one side being Indian while the other side related to the Middle-East (e.g., South Italians and Jordanians), which may partially reflect the admixture we report here for the Bene Israel.
High endogamy and founder event in Bene Israel
We now turn to examine the post-admixture population structure of Bene Israel. Both Jewish populations (14, 23) and Indian populations after the ANI-ASI admixture (17, 18) show high endogamy. We found that while Jewish populations showed higher within-population IBD sharing as compared to Indian populations (P-value=4.84e-4, Wilcoxon rank sum test), the Bene Israel population exhibited a level that was almost twice as high as that of any other of these populations (Figure 3D). Similarly, Bene Israel exhibit higher total length of homozygous segments (Figure 5A) and lower heterozygosity (Figure S8). These results suggest not only endogamy but also a genetic bottleneck or a founder event in which the contemporary Bene Israel population descended from a small number of ancestors. To directly examine this hypothesis, we used an allele-sharing statistic that measures the autocorrelation of allele sharing between individuals within a population and subtracts the cross-population autocorrelation to remove the ancestral autocorrelation effect. The decay of this statistic with genetic distance can verify if and when a founder event has happened (17, 28). We applied this method to our dataset, using either all Jewish or all Indian populations for the cross-population autocorrelation calculation, and fitted it to one or two founder events (Figure 5B and Figure 5C). When fitted to a single founder event, the analysis suggested it occurred 16 (using Jewish populations) or 14 generations ago (using Indian populations). Fitting to two founder events, which was slightly better, suggested a first founder event 30 generations ago followed by a second event 12 generations ago (using Jewish populations), or a founder event 26 generations ago followed by a second event 9 generations ago (using Indian populations). The first of these two events fits within the timescale of the admixture estimated above and may reflect the founding of this population in the admixing of Jews and Indians.
The estimated time of 14-16 generations ago of a single founder event may be the average of these two founder events. The founder event and the genetic drift associated with it are reflected in several other results: by the relatively high FST values between Bene Israel and other Indian and Jewish populations (Figure S2) and by ADMIXTURE analysis which revealed that at K=7 Bene Israel form their own distinct cluster, though this population sample encompasses only 18 individuals out of 746 (Figure 2).
Bene Israel admixture has been sex-biased
Lastly, we examined whether the ancestry of Bene Israel has been sex-biased using the Q ratio (29). In a population with equal size of males and females, there are three copies of X chromosomes for every four copies of each of the autosomes and therefore the expected genetic drift on the autosomes is 3/4 of the genetic drift on chromosome X, though this ratio is affected by many additional factors (29-31). We found a significantly (P-value=5.11e-6, Wilcoxon rank sum test) lower ratio between Bene Israel and Jewish populations (Figure 6 and Table S5; mean=0.58) than between Bene Israel and Indian populations (Figure 6 and Table S5; mean=0.73; See also Figure S4). This entails that the Jewish contribution to Bene Israel has been smaller than otherwise expected for the X chromosome, which points to more male than female Jewish ancestors contributing to the formation of Bene Israel, consistent with findings from previous studies based on Y chromosome analysis suggesting a paternal link between Bene Israel and Middle-Eastern populations (13). As most of the Bene Israel samples in our dataset were women, we did not have enough power to analyze the Y chromosome, but mtDNA analysis revealed common Indian haplogroups (M and R, (32, 33)), consistent with previous studies (8, 11, 13, 15) and the sex-bias we discovered above, while only a few samples had the H haplogroup which is common in Europe and in the Middle East (34) (Table S6).
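To make the 3/4 expectation concrete, the sketch below computes the expected ratio of autosomal to X-chromosome drift from the numbers of breeding males and females, using the standard effective-size formulas Ne(autosomes) = 4·Nm·Nf/(Nm+Nf) and Ne(X) = 9·Nm·Nf/(4·Nm+2·Nf). It also includes one common FST-based form of Q, ln(1 − 2·FST_auto)/ln(1 − 2·FST_X); that formula is an assumption taken from the wider literature, since the text above does not spell out how Q was computed. The simple breeding-sex-ratio model is not the admixture model of the study and only illustrates the direction of the effect.

```python
import math


def expected_drift_ratio(n_males, n_females):
    """Expected autosome-to-X drift ratio, i.e. Ne(X) / Ne(autosomes).

    Drift is inversely proportional to effective size, so the ratio of
    autosomal drift to X drift equals Ne_X / Ne_auto.
    """
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes must have a positive count")
    ne_auto = 4 * n_males * n_females / (n_males + n_females)
    ne_x = 9 * n_males * n_females / (4 * n_males + 2 * n_females)
    return ne_x / ne_auto


def q_ratio(fst_autosomes, fst_x):
    """FST-based Q ratio (assumed form; not stated in the text above)."""
    if not (0 < fst_autosomes < 0.5 and 0 < fst_x < 0.5):
        raise ValueError("FST values must lie strictly between 0 and 0.5")
    return math.log(1 - 2 * fst_autosomes) / math.log(1 - 2 * fst_x)
```

With equal numbers of males and females the expected ratio is exactly 0.75; as males increasingly outnumber females it falls towards 9/16, and as females outnumber males it rises towards 9/8. A value below 0.75 therefore points in the direction of a male excess, in line with the interpretation given above for the Jewish side.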
Discussion
Previous studies used genetic markers to investigate the history of worldwide Jewish populations. While most Jewish Diaspora groups could be linked together and traced back to the Middle East, in accordance with historical records (12-15), there were a few exceptions. Among these exceptions stood the Bene Israel community in India (15). Autosomal markers failed to distinguish between Bene Israel and other Indian populations, with some suggestive evidence, based on uniparental markers, for a non-Indian component and a possible paternal link to the Middle East (3, 8, 11, 13, 15). Furthermore, their history, beyond vague oral traditions, remains largely unknown also in the presence of other, non-genetic, studies, highlighting the importance of a comprehensive genetic study of this population to reveal their history. Indeed, a major advantage of the current work over previous works is the richness of the data. First, we had genome-wide genetic information for 18 Bene Israel samples as compared to only four samples in a previous genome-wide study (13). In addition, the Indian dataset (17) was much more comprehensive as compared to previous studies on Bene Israel, containing genome-wide genetic markers of 96 samples from 18 different Indian populations. This detailed representation of the complex genetic history of India has been crucial in our addressing the challenge of inferring a unique Jewish contribution rather than the ANI component in the ANI-ASI admixture (17, 18). Our results partly support the oral history of the Bene Israel by suggesting that the Bene Israel is an admixed population, with the ancestral populations being both Indian and Jewish. The relations with Jewish populations were highlighted via different analyses (e.g., ADMIXTURE, IBD) and also suggested Jewish specific, rather than Middle-Eastern, linkage. In addition, admixture was sex-biased, with more males in the Jewish side, as was also suggested previously (13). 
Similarly, mtDNA analyses in this and previous studies (8, 11, 11, 13, 15) show it is mainly of Indian origin. Notably, while a previous study also pointed to a suggestive paternal link to Middle-Eastern populations via analysis of the Y chromosome (13), this is the first study to show such a link based on genome-wide autosomal data, thus enabling us to examine the relation in much more depth. Specifically, we traced the admixture event and inferred its time and admixture proportions using genome-wide based approaches. Both ALDER and GLOBETROTTER suggest that the Bene Israel population is an admixed population of both Indian and Middle-Eastern, likely Jewish, ancestry. The fact that ALDER did not observe a similar trend to any of the other Indian population examined here suggests that the admixture is not related to the ANI-ASI admixture. Admixture estimated times were consistent between the methods: between 19 to 33 generations ago (~650-1050 years) ago (ALDER) and 26 generations (~850 years) ago (GLOBETROTTER). Finally, these analyses also point to substantial genetic contribution from both Jewish and Indian ancestral populations. Maimonides’ letter describing a Jewish population in India, which may be Bene Israel (4), was written ~800 years ago and is well within our estimated admixture time. This time is relatively recent as compared to Bene Israel oral tradition for the arrival of their Jewish ancestors to India (ranging from 8th century BCE to 6th century CE), yet none of these dates has any independent support (4). Importantly, the admixture estimated time captures the timing of the admixture event between Jewish and Indian populations, but it is plausible that Jews arrival to Indian predates this specific admixture. 
Similarly, our analysis assumes a single admixture event, but if several admixture events occurred, or if admixture has been more continuous, the estimated time of admixture may be intermediate between the different events, and biased towards the more recent admixture time (27). While the admixture time is well after the ANI-ASI admixture and the forming of Jewish Diasporas, our analyses cannot suggest a unique pair of Indian and Jewish populations that are most likely to be the ancestral populations of Bene Israel. However, they suggest that the Jewish forefathers of Bene Israel came to India from geographically close Middle-Eastern communities, perhaps through the Silk Road, and not from farther communities. Regarding the ancestors from the Indian side, initial results based on PCA with both Indian and Jewish populations (Figure 1B) position Bene Israel close to Indian population with highest ANI ancestry. However, several analyses then suggest that this is likely due to the Jewish contribution to the population, and that the closest ancestral Indian populations are not necessarily those with the highest ANI values
Post-admixture analysis of Bene Israel reveals isolation and high endogamy in this population. Although both Jewish (14, 15) and Indian populations (17) show high levels of endogamy, endogamy in Bene Israel is much higher as compared to these populations. Similar to many other Jewish Diasporas (15, 35), we find evidence for a founder event in Bene Israel. The estimated time of this event is ~14-16 generations ago, but it can also reflect an averaged time for two distinct founder events, the first occurring at time of admixture and the second more recently. If indeed two founder events occurred, the estimated time of the first event provides further support to the admixture event and its timing, as admixture between a small group of Jews that arrived to India and local Indians is likely to be accompanied with a founder event. The isolation and the genetic drift experienced by Bene Israel have an effect on other analyses as well. For example, the 3-population test fails to find evidence for admixture by exhibiting positive f3 values, likely due to the post-admixture genetic drift (17, 36) while ALDER, which is less sensitive to such drift (25), is able to detect it. The Bene Israel was traditionally divided into two groups that in previous generations did not marry each other: Gora (or the “White” Bene Israel), presumed to be descendants of the seven couples who landed in the Konkan shore, and Kala (or the “Black” Bene Israel), presumed to be descendants of admixture between Bene Israel men and non-Bene Israel women (1, 2). Our analysis did not provide evidence for two clear subgroups within Bene Israel samples. This may be a result of a too small sample size or biased cohort. Revealing the high endogamy and founder event(s) in Bene Israel is important not only from historical but also from medical perspective, as it predicts higher rates of recessive diseases within this population (37). 
Indeed, a recent study on isolated foveal hypoplasia, a rare eye disease leading to poor vision, found that unrelated Bene Israel patients share homozygous mutation (c.95T<G, p.Ile32Ser) in the SLC38A8 gene (38). Other recessive mutations in SLC38A8, a putative glutamine transporter, result in a similar medical condition (39). The high prevalence of this mutation in the Bene Israel (10% of Bene Israel individuals screened in the study (38)) which was completely absent from the entire set of individuals, including European and Indian, in the 1000 Genomes Project (40), is likely a result of the founder event and high endogamy this community has experienced.
In conclusion, our results, based on an ensemble of different approaches, combine to support the oral history of the Bene Israel as having both Jewish (likely Middle-Eastern) and Indian origin, whereas the estimated time of this admixture is more recent than their oral history suggests. This is an example where a genetic study not only confirms known history but also introduces novel historical insights.
Materials and Methods
Recruitment of Bene Israel individuals
Following a local ethics committee and an Israeli Ministry of Health Institutional Review Board approved protocol, Bene Israel samples were collected in two locations in Israel, with all subjects providing informed consent:
(1) Bene Israel synagogue in Ramla (8 samples).
(2) Sheba Medical Center, Tel Hashomer (20 samples; see below).
The 20 samples from Sheba Medical Center were taken from individuals (who came for prenatal or oncogenetic counseling) identifying themselves as Indian Jews, rather than necessarily Bene Israel and therefore were either Cochin Jews or Bene Israel, the two large Indian Jewish communities in Israel. Although all individuals were recruited in Israel, where the vast majority of Indian Jews now live, they did not mix with non-Indian Jewish populations: all individuals recruited reported that their four grandparents belonged to the same Jewish community, similar to other Jewish populations analyzed in the current and in previous works (12, 14).
In order to distinguish between Bene Israel and Cochin Jews in samples from Sheba Medical Center, we used the SMARTPCA tool from the EIGENSOFT software (41) and ADMIXTURE (22). Briefly, we added to the 20 Sheba samples additional samples with known population of origin: 8 Bene Israel samples (from Ramla, as described above), 20 Cochin Jews samples (taken from the National Laboratory for the Genetics of Israeli Populations) and 36 Ashkenazi Jews samples (12). Projection of these samples on the first two principal components (PCs) clearly showed three main clusters: Ashkenazi Jews, Cochin Jews and Bene Israel (Figure S9). Similarly, ADMIXTURE analysis (with K=3) suggested three main corresponding clusters (Figure S10). We labeled an Indian Jewish sample as Bene Israel or as Cochin Jewish if the fraction of the corresponding inferred cluster estimated by ADMIXTURE for this sample was at least 95% and if this assignment was also visibly reflected in the PC analysis.
Following this procedure, eleven Indian Jews were labeled as Bene Israel. One of the Bene Israel samples was later removed in the quality control (QC) steps performed on the dataset, as described below, resulting in a total of 18 Bene Israel samples used in the current analysis.
In addition to that, we repeated PCA (Figure S11) and ADMIXTURE (Figure S12) analyses presented in the main text while using only the samples collected in Ramla and not those collected in Sheba Medical Center, obtaining similar results as those described in the main text for all Bene Israel individuals.
Further support for our Bene Israel and Cochin Jews labeling was obtained from another independent source. Previously, Behar et al. (13) genotyped various worldwide Jewish populations, including Cochin Jews and Bene Israel (four members from each of these two populations). We merged our dataset with that dataset, which was genotyped on a different platform (Illumina Human610-Quad v1.0 BeadChip), resulting in 50,483 shared SNPs. Projection on the first two PCs showed that our labeling was in accordance with the labeling of their samples (Figure S13).
Cochin Jews have a different history (2), as exhibited both in the above analyses and in previous genetic studies (11), and therefore they were not included in further analyses in the current study.
Jewish dataset and genotyping
We included in this study a Jewish dataset containing samples from 14 additional Jewish populations, which were collected as described previously (Table S1) (12, 14). All samples, including the Bene Israel samples in this study, were genotyped on the Affymetrix 6.0 array (Affy v6) at the genomic facility at Albert Einstein College of Medicine. Compared to previous studies (12, 14), we used an updated version (1.5.5) of the Birdseed tool in the Birdsuite software (42) to re-call the genotypes. Samples with ambiguous gender (based on genotyping) were removed from the analysis. For the current study, where we focused on the Bene Israel, we did not include Ethiopian Jews, as previous studies have shown that they do not group tightly with other Jewish populations (13, 14).
Indian dataset
We also incorporated a dataset of Indian populations based on the study of Reich et al. (17), where the samples were genotyped on the same array (Affymetrix 6.0). We removed Indian populations living on islands rather than on the mainland (Great Andamanes and Onge), and those defined by Reich et al. (17) as genetic outliers (Aonaga, Nysha, Siddi and Chenchu). In addition, we removed Srivastava from our analysis, as only one sample from this population was left after the QC steps. This resulted in the inclusion of 18 Indian populations in our analysis (Table S1). The dataset generated by Reich et al. (17) contained, in addition to the Indian populations, samples from HapMap3 (43) populations, and these samples were also used (after QC, described below) for phasing and for some of the analyses (Table S1).
Human Genome Diversity Project dataset
For some of the analyses we also included data for non-Jewish Middle-Eastern populations (Bedouin, Druze and Palestinian), to distinguish between Middle-Eastern and Jewish specific genetic attributes. We incorporated these samples by merging our dataset with data from the Human Genome Diversity Project (HGDP) (44) genotyped on the Affymetrix GeneChip Human Mapping 500K (19). This dataset included five unrelated samples from each of these three populations (Table S1).
Dataset merging and Quality Control
After removing SNPs with low call rate (<95%) from the two datasets (Jewish and Indian), we merged them together (including the HapMap samples in the Indian dataset (17)) and removed individuals as follows:
(1) Relatives. Following Campbell et al. (14), which analyzed most of the Jewish populations present in the current study, two individuals were considered related if their total autosomal identity by descent (IBD) sharing was larger than 800 cM and if they shared at least 10 segments with length of at least 10 cM (see below how IBD sharing was calculated). To remove as few related individuals as possible while maintaining only unrelated individuals in our dataset, we constructed a graph whose vertices were individuals, and two individuals were connected if they were defined (according to the above criteria) as related. We then tried to find a maximal independent set (i.e., a maximal set of unrelated individuals) in this graph using a greedy algorithm (45).
(2) Genetic outliers. We used the SMARTPCA program (41) to detect genetic outliers, with default parameters for genetic outlier removal. We removed individuals further than six standard deviations from the mean in any of the top ten eigenvectors over five iterations. This analysis was done for each population alone, based on autosomal SNPs.
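The relative-removal step in (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source only states that a greedy algorithm (45) was used, so the specific heuristic here (repeatedly keeping the individual with the fewest remaining relatives) and the names `is_related` and `greedy_unrelated_set` are assumptions made for the example.

```python
from collections import defaultdict

def is_related(total_ibd_cm, n_segments_ge_10cm):
    """Relatedness criterion from the text: >800 cM total autosomal IBD
    and at least 10 shared segments of at least 10 cM."""
    return total_ibd_cm > 800 and n_segments_ge_10cm >= 10

def greedy_unrelated_set(individuals, related_pairs):
    """Return a maximal set of mutually unrelated individuals.

    Assumed heuristic: repeatedly keep the individual with the fewest
    relatives among those still under consideration, then discard all
    of that individual's relatives.
    """
    neighbours = defaultdict(set)
    for a, b in related_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    remaining = set(individuals)
    kept = []
    while remaining:
        # Ties are broken by identifier so the result is deterministic.
        pick = min(remaining, key=lambda v: (len(neighbours[v] & remaining), v))
        kept.append(pick)
        # Keeping `pick` excludes every individual related to it.
        remaining -= neighbours[pick] | {pick}
    return kept

# Toy example: A is related to both B and C; D is unrelated to everyone.
kept = greedy_unrelated_set(["A", "B", "C", "D"], [("A", "B"), ("A", "C")])
```

In the toy example, dropping A alone keeps three unrelated individuals, whereas keeping A would force the removal of both B and C. A greedy procedure guarantees a maximal set (no further individual can be added), not necessarily the largest possible one.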
The merged dataset following these QC steps included 513,581 and 25,379 autosomal and X chromosome (in the non-pseudoautosomal regions) single nucleotide polymorphisms (SNPs), respectively, for 456 individuals from 33 Jewish and Indian groups. An additional 842 samples from 11 HapMap3 populations were also available in this dataset, resulting in a total of 1298 samples. Further merging with the HGDP dataset (used for some analyses) resulted in 1313 samples with 304,973 shared autosomal SNPs. The number of samples from each population is shown in Table S1.
A set of SNPs filtered based on linkage disequilibrium (LD) was used in the following analyses: PCA, FST, ADMIXTURE, runs-of-homozygosity and heterozygosity. For each pair of SNPs showing LD of r^2 > 0.5 we considered only one representative (using SMARTPCA’s (41) r2thresh and killr2 flags). This filtering was done for each analysis separately, depending on LD in the specific set of populations used in the analysis. Other analyses were performed on the full datasets described above.
Identity-by-descent analysis
We phased the data with the BEAGLE software (version 3.3.2) (46) and extracted shared identity-by-descent (IBD) segments with GERMLINE (version 1.51) (24), using the same parameters described in Campbell et al. (14), which analyzed most of the Jewish populations examined here. To reduce the rate of false positive IBD segments, only segments with a length of at least 3 cM were considered for analysis. Similar to previous studies (12, 23), we ignored regions with low informative content. Specifically, using non-overlapping windows (of 1 Mb or 1 cM) we ignored all regions with SNP density of less than 100 SNPs per cM or per Mb. Genetic positions were obtained from the HapMap genetic map (downloaded from: ftp://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/).
For each pair of unrelated individuals we calculated the total length of autosomal IBD sharing. Given two populations, the average IBD sharing of these two populations was defined as the average IBD sharing of all pairs of individuals from these populations. Similarly, the average IBD sharing within a population was defined as the average IBD sharing between all pairs from this population. In addition, we also calculated the average IBD sharing between a population and the group of all Jewish populations, by averaging the IBD sharing between this population and each of the other Jewish populations. We repeated a similar procedure to calculate average IBD sharing between a population and the group of all Indian populations. As this analysis was done to compare other populations to Bene Israel, the Bene Israel population was not considered to be either Jewish or Indian in this analysis.
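The averaging scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout (`ibd`, keyed by unordered pairs of sample identifiers and holding total shared length in cM), the function names, and the convention that pairs with no detected segment contribute zero are all hypothetical choices for the example.

```python
from itertools import combinations, product

def mean_ibd_between(pop_a, pop_b, ibd):
    """Average total IBD sharing over all cross-population pairs."""
    pairs = list(product(pop_a, pop_b))
    # Assumption: pairs without any detected IBD segment contribute 0 cM.
    return sum(ibd.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def mean_ibd_within(pop, ibd):
    """Average total IBD sharing over all pairs within one population."""
    pairs = list(combinations(pop, 2))
    return sum(ibd.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def mean_ibd_to_group(pop, group, ibd):
    """Average, over the populations in `group`, of the between-population mean."""
    values = [mean_ibd_between(pop, other, ibd) for other in group]
    return sum(values) / len(values)

# Toy data (hypothetical sample identifiers, lengths in cM).
bene = ["b1", "b2"]
jewish_1 = ["j1", "j2"]
jewish_2 = ["k1"]
ibd = {
    frozenset(("b1", "b2")): 40.0,
    frozenset(("b1", "j1")): 10.0,
    frozenset(("b2", "j2")): 6.0,
    frozenset(("b1", "k1")): 8.0,
}
within = mean_ibd_within(bene, ibd)
to_jewish = mean_ibd_to_group(bene, [jewish_1, jewish_2], ibd)
```

Averaging per population first and only then across populations gives each population in the group equal weight regardless of its sample size, which matches the description of averaging over "each of the other Jewish populations".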
Principal Component Analysis
Principal component analysis (PCA) was performed using the SMARTPCA program (41) on the Jewish, Indian, and several additional worldwide populations. As the number of samples from each population can affect the results of PCA (20) and as there were more samples in each of the Jewish populations as compared to each of the Indian populations, we repeated PCA using not more than four samples from each population selected randomly (using the popsizelimit flag in SMARTPCA).
FST
We calculated population differentiation based on differences in allele frequencies between each pair of populations using the FST statistic, following the definitions described previously in Reich et al. (17). Specifically, let p_i be the frequency of a variant in a biallelic SNP in population i (i = 1, 2) and define q_i = 1 − p_i. We defined FST as N / D where
$$N = p_1(q_2 - q_1) + p_2(q_1 - q_2), \qquad D = p_1 q_2 + q_1 p_2 = N + p_1 q_1 + p_2 q_2.$$
To generalize this measure to more than a single SNP, we followed the “ratio of averages” approach (47), where N and D were averaged separately and only then their ratio was taken. Thus, let N_s, D_s be the above quantities for SNP s; then for a set of SNPs S, FST was defined as:
$$F_{ST} = \frac{\sum_{s \in S} N_s}{\sum_{s \in S} D_s}.$$
Similar to Reich et al. (17), the following estimators for N and D were used:
$$\hat{N} = \left(\frac{a_1}{n_1} - \frac{a_2}{n_2}\right)^2 - \frac{h_1}{n_1} - \frac{h_2}{n_2}, \qquad \hat{D} = \hat{N} + h_1 + h_2, \qquad h_i = \frac{a_i b_i}{n_i (n_i - 1)},$$
where a_i and b_i are the counts of the two alleles in population i, and n_i = a_i + b_i. We calculated FST separately for the autosomes and for the X chromosome.
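A minimal sketch of the "ratio of averages" computation is given below. The per-SNP estimators used in the code (a squared allele frequency difference with a sampling correction, written out in the comments) are an assumption about the estimators of Reich et al. (17) and should be checked against that reference; the function names are hypothetical.

```python
def fst_terms(a1, b1, a2, b2):
    """Per-SNP numerator and denominator estimates from allele counts.

    a_i, b_i are the counts of the two alleles in population i.
    Assumed estimators:
        h_i   = a_i * b_i / (n_i * (n_i - 1)),  n_i = a_i + b_i
        N_hat = (a1/n1 - a2/n2)^2 - h1/n1 - h2/n2
        D_hat = N_hat + h1 + h2
    """
    n1, n2 = a1 + b1, a2 + b2
    h1 = a1 * b1 / (n1 * (n1 - 1))
    h2 = a2 * b2 / (n2 * (n2 - 1))
    n_hat = (a1 / n1 - a2 / n2) ** 2 - h1 / n1 - h2 / n2
    d_hat = n_hat + h1 + h2
    return n_hat, d_hat

def fst_ratio_of_averages(snps):
    """FST over a set of SNPs: sum numerators and denominators separately,
    then take a single ratio (rather than averaging per-SNP ratios)."""
    num = den = 0.0
    for a1, b1, a2, b2 in snps:
        n_hat, d_hat = fst_terms(a1, b1, a2, b2)
        num += n_hat
        den += d_hat
    return num / den

# Toy allele counts: one fixed difference and one SNP with equal frequencies.
fixed = (10, 0, 0, 10)
equal = (5, 5, 5, 5)
fst_fixed = fst_ratio_of_averages([fixed])
fst_both = fst_ratio_of_averages([fixed, equal])
```

Summing before dividing keeps SNPs with a near-zero denominator from dominating the estimate, which is the motivation for the ratio-of-averages approach. Note that the sampling correction can make the estimate for a single SNP slightly negative when the two sample frequencies are equal.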
ADMIXTURE
ADMIXTURE (22), a STRUCTURE-like (48) algorithm, assigns to each individual its ancestry proportion in each of K hypothetical ancestral populations, and therefore can reveal relations between different populations. ADMIXTURE (version 1.2) analysis was performed with default parameters and varying values of K (from K=3 to K=10), with 250 bootstrap replicates. We ran ADMIXTURE with the dataset merged with the HGDP populations and included, in addition to the 33 Jewish and Indian populations, the following populations: CEU, YRI, CHB, Druze, Bedouin and Palestinians (a total of 746 unrelated samples). The dataset was filtered based on LD, resulting in 175,510 autosomal SNPs. ADMIXTURE’s cross-validation procedure was used to determine the K that fits the data best.
Inferring admixture proportions and time
We applied two tools to examine the hypothesis that the Bene Israel population was an admixed population and to infer admixture proportions and time: ALDER (version 1.03) (25) and GLOBETROTTER (downloaded in March 2015) (26). A detailed description is found in the original publications of these tools; we provide a brief description below.
(1) ALDER: Given a putative admixed population and two surrogate populations taken as a proxy for the presumed ancestral populations, ALDER uses an admixture LD statistic to look for evidence of admixture. For each pair of SNPs, ALDER calculates a statistic that is the covariance of these two SNPs in the admixed population, weighted by the allele frequency differences between the two reference populations. The behavior of this admixture LD statistic as a function of the genetic distance between the two SNPs indicates whether the population is admixed or not. ALDER fits the curve of this statistic to an exponential function $y = Ae^{-nd} + c$, where n is the number of generations since admixture and d is the genetic distance (in Morgans). In addition to the test of admixture using two reference populations, ALDER examines evidence for admixture using only one surrogate population as a reference, with the admixed population serving as a proxy for the second population.
We considered a pair of populations as a candidate for being the ancestral populations for a certain population if all three ALDER results (two one-reference admixture LD and a two-reference admixture LD analyses) were significant and the estimated time of decay was consistent between the three.
In addition to time of admixture, ALDER also estimates admixture proportions from the amplitude of the exponential curve. This is done both in the one-reference version of ALDER (estimating the lower bound of admixture proportion of that population) as well as in the two-reference version. As the populations examined here are taken as a proxy for the true mixing populations, the admixture proportions suggested are lower bounds (25). A caveat in the two-reference version admixture proportion estimation is that ALDER does not determine to which population to assign the admixture proportion estimation α (i.e., it does not distinguish between α and 1 – α).
Therefore, we used min(α, 1 – α) as a lower bound for the admixture proportion of the Jewish population in each significant pair. To determine the admixture proportions from the output of the ALDER two-reference test, f2 values (representing genetic drift between the two populations (17)) are needed (25), and these were calculated using MixMapper (49).
(2) GLOBETROTTER: Given a putative admixed population and a set of populations (some of which may be proxies for the presumed ancestral populations), GLOBETROTTER examines whether the putative admixed population is an admixture of some of the populations from that set (26). As GLOBETROTTER is based on haplotypes, we phased the data using BEAGLE (46). The GLOBETROTTER algorithm requires several steps. First, the chromosomes of each individual in the admixed population are broken into “chunks”, where each chunk is assigned, based on similarity, to a single individual from one of the other populations. This step, implemented in the CHROMOPAINTER (50) tool, results in a “coloring” of the chromosomes of admixed individuals by the different populations. Second, for each pair of populations a curve is produced that quantifies, for each genetic distance, how often a pair of haplotype chunks separated by this distance come from that pair of populations. Similar to ALDER, the decay rates of these curves are used to examine whether an admixture event happened and to infer its time, while the amplitudes of the curves are used to infer the contributing populations and their proportions. In case of evidence for admixture, GLOBETROTTER also examines whether the data better fit a single exponential decay (i.e., a single admixture event) or a mixture of exponential decays (i.e., several admixture events or continuous admixture over a longer period). In case of admixture, GLOBETROTTER suggests two main clusters of admixture, each of which may be composed of several populations, which together represent the genetic structure of the ancestral population.
We used CHROMOPAINTER (version 2) and ran GLOBETROTTER on Bene Israel and the Jewish and Indian populations, using 100 bootstrap replicates to obtain standard error estimates for the admixture time.
Time estimates
We converted the number of generations into years by assuming 29 years per generation for such recent history (18, 27) and that individuals genotyped in the current study were born circa 1950 CE. Thus, if n is the number of generations since admixture, we convert it to the year 1950 – 29(n + 1) (CE). Changes in the assumed generation length will scale the time estimates proportionally.
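The conversion above amounts to a one-line function. The sketch below simply restates the formula from the text; the function and parameter names are hypothetical.

```python
def generations_to_year(n, years_per_generation=29, birth_year=1950):
    """Calendar year (CE) corresponding to n generations before the sampled
    individuals, following the formula: birth_year - years_per_generation * (n + 1)."""
    return birth_year - years_per_generation * (n + 1)

# The admixture range reported in this study is about 19-33 generations ago.
most_recent = generations_to_year(19)
earliest = generations_to_year(33)
```

With the default assumptions, 19 and 33 generations correspond to roughly 1370 CE and 964 CE, respectively, which is consistent with admixture within the last millennium.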
Homozygosity and Heterozygosity estimations
We used PLINK (version 1.07) (51) to identify runs-of-homozygosity (ROH), i.e., autozygous segments in the genome. We used the following flags in PLINK: “--homozyg --homozyg-window-kb 1000 --homozyg-window-snp 100 --homozyg-window-het 1 --homozyg-window-missing 5 --homozyg-snp 100 --homozyg-kb 1000”.
The heterozygosity score of an individual was defined as the fraction of heterozygous SNPs among all autosomal SNPs.
Estimating founder event time
We used allele sharing autocorrelation for estimating the time of the founder event, along the lines suggested by recent studies (17, 28). Specifically, for each pair of individuals from the population, and for each autosomal SNP, we measure the number of alleles these individuals share: zero, one or two. When both individuals are heterozygous for the SNP, we consider them as sharing one allele (to account for haplotype phasing ambiguity). Thus, each SNP is represented by a vector where each entry corresponds to a pair of individuals and the value of that entry is the number of shared alleles between these two individuals. Next, a Pearson correlation coefficient is calculated between the vectors for each pair of SNPs (referred to as allele sharing autocorrelation). To remove the effect of ancestral allele sharing autocorrelation, we subtract the cross-population allele sharing autocorrelation, computed using this population and a different population. To infer the time of the founder event, we plot the autocorrelation vs. genetic distance and fit the curve to an exponential decay equation in which t represents the number of generations since the founder event and D is the genetic distance (in Morgans) between the two SNPs (17, 28).
We applied this method to the Bene Israel and calculated allele sharing autocorrelation between each pair of SNPs less than 30 cM apart. We partitioned the values into 0.1 cM bins and considered the mean of each bin. To consider two founder events, we fitted the decay to an equation with two exponential decay terms, where t1 and t2 were the times (in generations) since the two founder events. Fitting was done by non-linear least squares, using the nls function in R (52). The single and two founder event models were evaluated by comparing the sum of residuals of each model.
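The allele sharing vectors and their correlation, as described above, can be sketched as follows. This is an illustration only: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, the function names are hypothetical, and the cross-population subtraction, binning and curve fitting steps are not shown.

```python
from itertools import combinations
from math import sqrt

def shared_alleles(g1, g2):
    """Number of alleles two individuals share at one SNP (0, 1 or 2).

    Genotypes are assumed to be coded as copies of one allele (0, 1, 2).
    Two heterozygotes are counted as sharing one allele, as in the text.
    """
    if g1 == 1 and g2 == 1:
        return 1
    return 2 - abs(g1 - g2)

def sharing_vector(genotypes):
    """One entry per pair of individuals, for a single SNP."""
    return [shared_alleles(a, b) for a, b in combinations(genotypes, 2)]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors.

    Undefined (division by zero) if either vector is constant, so such
    SNPs would need to be skipped in a real analysis.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy genotypes of three individuals at two SNPs.
v1 = sharing_vector([0, 1, 2])
v2 = sharing_vector([2, 2, 0])
r = pearson(v1, v2)
```

In a full analysis, this correlation would be computed for every pair of SNPs within the chosen distance range and then averaged within genetic distance bins before fitting the decay curve.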
Sex-biased population differentiation
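To examine sex-biased demography, we calculated a statistic presented by Keinan et al. (29). It uses the differentiation in allele frequencies (measured by FST) between two populations, computed separately for the autosomes and for the X chromosome, to estimate the ratio
$$Q = \frac{\ln\left(1 - 2F_{ST}^{A}\right)}{\ln\left(1 - 2F_{ST}^{X}\right)},$$
where $F_{ST}^{A}$ and $F_{ST}^{X}$ denote FST on the autosomes and on the X chromosome, respectively.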
Q captures the relative genetic drift between the X chromosome and the autosomes. Under several assumptions (29), if the effective population size of males and females has been equal since the two populations split, Q is expected to be ¾, the ratio of effective population size of the X chromosome to the autosomes in this case. A significant deviation from ¾ may suggest sex-biased demography since population split.
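As a small numerical illustration, the sketch below computes Q from a pair of FST values, taking Q as ln(1 − 2·FST on the autosomes) divided by ln(1 − 2·FST on the X chromosome). This functional form is stated here as an assumption and should be verified against Keinan et al. (29); the function name and the toy values are hypothetical.

```python
from math import log

def drift_ratio_q(fst_autosomes, fst_x):
    """Ratio of autosomal to X-chromosome genetic drift.

    Assumed form: Q = ln(1 - 2*FST_A) / ln(1 - 2*FST_X).
    Requires both FST values to lie strictly between 0 and 0.5.
    """
    return log(1 - 2 * fst_autosomes) / log(1 - 2 * fst_x)

# Toy values: choose the X-chromosome FST that yields exactly Q = 3/4
# for an autosomal FST of 0.1, i.e. the expectation under equal male
# and female effective population sizes.
fst_a = 0.1
fst_x_neutral = (1 - 0.8 ** (4 / 3)) / 2
q_neutral = drift_ratio_q(fst_a, fst_x_neutral)

# A larger X-chromosome FST (more drift on the X) pushes Q below 3/4.
q_more_x_drift = drift_ratio_q(fst_a, 0.2)
```

Under this form, Q equals 1 when differentiation is identical on the X chromosome and the autosomes, and falls below ¾ when the X chromosome has drifted more than expected relative to the autosomes.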
mtDNA analysis
mtDNA genotypes were used to assign an mtDNA haplogroup to each of the Bene Israel samples, based on HaploGrep classification (53).
Acknowledgments
We thank all the individuals who contributed DNA for this study. We thank David Reich for providing us with the Indian dataset. We thank Po-Ru Loh and Mark Lipson for providing us with ALDER, advice on how to use it and interpret its results, and for comments on our results. We thank Garrett Hellenthal for providing us with GLOBETROTTER and advice on how to use it. We also thank Priyanka Vijay and Eyal Nitzany for preliminary analysis of related data, and Amy Williams and members of the Keinan and Halperin labs for helpful discussions.