Abstract
The function of many biological systems, such as embryos, liver lobules, intestinal villi, and tumors depends on the spatial organization of their cells. In the past decade high-throughput technologies have been developed to quantify gene expression in space, and computational methods have been developed that leverage spatial gene expression data to identify genes with spatial patterns and to delineate neighborhoods within tissues. To assess the ability and potential of spatial gene expression technologies to drive biological discovery, we present a curated database of literature on spatial transcriptomics dating back to 1987, along with a thorough analysis of trends in the field such as usage of experimental techniques, species, tissues studied and computational approaches used. Our analysis places current methods in historical context, and we derive insights about the field that can guide current research strategies. A companion supplement offers a more detailed look at the technologies and methods analyzed: https://pachterlab.github.io/LP_2021/.
1 Introduction
It has long been recognized that in biological systems ranging from the Drosophila embryo to the hepatic lobule, many genes need to be properly regulated in space for the system to function. In order to study the spatial patterns of gene expression, many different spatial transcriptomics methods, which produce spatially localized quantification of mRNA transcripts as proxies for gene expression, have been developed. Thanks to growing interest in the field, several reviews have been written in the past 5 years, providing overviews of experimental techniques for data collection [1, 2], and describing how such techniques can be applied to specific biological systems, e.g. tumors [3], brain [4], and liver [5]. These reviews typically begin with either laser capture microdissection (LCM) or single molecular fluorescent in situ hybridization (smFISH) in the late 1990s, although the quest to profile the transcriptome in space is much older. The methods underlying such “prequel” technologies are important because many of them are used, in updated forms, in modern “current era” high-throughput methods.
Some important technologies enabling spatial transcriptomics date back to the 1970s (Supplementary Material; Chapter 2). Various forms of in situ hybridization (ISH) have been used for a long time to visualize gene expression in space. Radioactive ISH was first introduced in 1969, visualizing ribosomal RNA [6] and DNA [7] in Xenopus laevis oocytes, and was first used to visualize transcripts of specific genes (globin) in 1973 [8] (Figure 1A). Non-radioactive fluorescent or colorimetric ISH was developed in the 1970s and the early 1980s, improving spatial resolution, enabling 3D staining, and shortening required exposure times [9, 10] (Figure 1A). Early ISH was performed in tissue sections, making it challenging to apply to blastrulas and to reconstruct 3D tissue structures; whole mount ISH (WMISH) was first introduced in Drosophila in 1989 [11] and was soon adapted to other species such as mice in the early 1990s [12].
Another strand of development in early spatial transcriptomics was the enhancer and gene trap screen which was developed in the 1980s when DNA sequencing throughput was increasing [54] and metazoan genomes were newly opened frontiers. The first screens in Drosophila [14] and mice [15] were performed in the late 1980s in order to visualize expression of untargeted, and often previously unknown, genes. These could be identified with 5’ rapid amplification of cDNA ends (RACE) polymerase chain reaction (PCR) followed by Sanger sequencing of the PCR products. A related technology, in situ reporter, in which a promoter of a known gene or a predefined genomic fragment drives the expression of a reporter, was first used in a gene expression screen in C. elegans in 1991 [16]. With increasing throughput, enhancer and gene traps became the technology of choice for spatial transcriptomics in the 1990s, until the rise of (WM)ISH in the late 1990s which leveraged automation. WM(ISH) also avoided the need for transgenic lines, and was facilitated by the widespread availability of reference genomes in the early 2000s, which could be used for computational probe design. Although now eclipsed by newer methods, enhancer trap, gene trap, and in situ reporter methods have been used to build reference databases of gene expression and enhancer usage patterns in transgenic lines throughout the 2000s and 2010s [55, 56].
The foundation for many current era technologies was built in the decades between the 1970s and the 2000s (Figure 1C). For example, UV laser was first used to cut tissue in 1976 [37]. Popular IR and UV LCM systems were first reported in 1996 [43, 44] and were soon commercialized. Microarray technology was first reported in 1995 [42], and was originally used to quantify transcripts hybridized to cDNAs printed on a slide, but it was soon adapted to quantify the transcriptome from LCM samples in 1999 [46]. Popular current era technologies such as Spatial Transcriptomics (ST) [53] and 10X Visium rely on such microarray technology, but to capture transcripts from the tissue mounted on the microarray slides rather than from solution. Some highly multiplexed sm-FISH technologies such as seqFISH [51] rely on combinatorial barcoding, i.e. encoding each gene with a combination of colors so transcripts of more genes than easily discernible colors (up to 5) can be quantified simultaneously. Combinatorial barcoding was first reported in immunological DNA FISH in 1989 [41] and was first used for transcripts in 2002 [47]. The first unequivocal demonstration of smFISH showing each mRNA molecule as a spot was reported in 1998 [45]. Highly multiplexed smFISH would not have been possible without the development of these technologies.
(WM)ISH was the technology of choice in the late 1990s and the 2000s before the rise of highly multiplexed, high resolution, and more quantitative technologies, and has been used to create gene expression atlases in embryos of several species such as Drosophila melanogaster [25], Mus musculus, and Gallus gallus [28], in various mouse organs such as the brain [30], genitourinary tract [34], and lung [35], and for specific types of genes such as miRNAs [29] (Figure 1B). For species other than mice and humans, organs other than the brain, and miR-NAs, the only spatial transcriptomics resources currently available are for the most part (WM)ISH atlases. Model organism databases collecting the proliferating gene expression patterns from various sources were also established in this period, such as gene expression database (GXD) [57] and Zebrafish Information Network (ZFIN) [58] (Figure 1B). The golden age of (WM)ISH seems to have ended in the 2010s (Figure 1B), perhaps due to some of the disadvantages of (WM)ISH, such as requiring stereotypical tissue structure, the need for thousands of animals to generate an atlas, and the largely qualitative nature of results.
Early motivating applications for spatial transcriptomics included identification of genes with restricted patterns which indicated function in development, identification of novel cell type markers, and identification of novel cell types not evident from tissue morphology [14, 15]. In the 1980s and 1990s, analyses were typically done manually, although more recently automated methods have been developed (Supplementary Material Chapter 3). Recent developments in machine learning, coupled to more powerful computing infrastructure and the availability of more quantitative data, have opened up new possibilities. However, the legacy of the prequel era still lives on; current era studies still frequently reference prequel atlases [59–61].
2 Data collection
Current era technologies broadly fall into five categories: Microdissection (Supplementary Material Section 5.1), smFISH (Supplementary Material Section 5.2), in situ sequencing (ISS) (Supplementary Material Section 5.3), array (Supplementary Material Section 5.4), and no imaging (Supplementary Material Section 5.5). Developers of such technologies often seek to enable a trifecta of transcriptome wide profiling, single cell resolution, and high gene detection efficiency. While this achievement appears to be increasingly within reach, current era technologies are characterized by trade offs between these goals.
2.1 Microdissection
Current era microdissection constitutes the compilation of spatial locations of samples during the course of microdissection, and the assaying of transcriptomes of samples by cDNA microarray or RNA-seq. Since 1999, by far the most widely used microdissection technology is LCM, which has been applied to various biological fields such as oncology, neuroscience, immunology, developmental biology, and botany (see Supplementary Material Chapter 6 for topic modeling of PubMed LCM literature). In UV LCM, sometimes also called laser microbeam microdissection (LMM) or laser microdissection (LMD), a UV laser ablates a narrow strand of tissue around the desired area, which is then collected into tubes by gravity (Leica LMD) or laser pressure catapult (LPC, as in Zeiss PALM microbeam) (Figure 2B, C). In IR LCM (Arcturus PixCell II), the tissue section is mounted on a plastic membrane on tube caps. The IR laser briefly heats the desired area, melting the membrane so the tissue in the area is fused to the cap and captured (Figure 2A). IR and UV can be combined; for example UV can be used to cut the tissue and IR to remove the desired area with the plastic membrane (recent versions of Arcturus).
Advantages of LCM include transcriptome wide profiling, precise cuts informed by histology, compatibility with formalin fixed paraffin embedded (FFPE) tissues [62], and the possibility of applying the method to single cells from both frozen [63] and FFPE sections [64]. LCM can also be applied to 3D tissues by microdissecting non-overlapping domains as in Geo-seq [65]. Disadvantages of LCM include difficulty to scale to larger number of samples, exclusion of whole mount samples, and potential RNA degradation [66]. While LCM is widely used to capture targeted areas according to histology, it can also cut tissue into an untargeted grid [67].
Other microdissection techniques generally fall into two categories: Mechanical (Supplementary Material 5.1.4) and optical (Supplementary Material 5.1.5). The former includes 2000s voxelation [68] (Figure 4), and Tomo-seq [69], which sections a tissue with a cryotome along an axis of interest, followed by RNA-seq on each section. Optical microdissection includes GeoMX DSP [70], which shines UV light on regions of interest (ROIs) to release photo-cleavable gene barcodes for quantification, and Niche-seq [71], which uses fluorescence activated cell sorting (FACS) to isolate cells expressing photoactivated GFP in transgenic mice for scRNA-seq.
2.2 Single Molecule FISH
Chronologically, the next technology developed in the current era is highly multiplexed single-molecule FISH (smFISH), which began with a 2012 prototype (seqFISH) that relied on super-resolution microscopy (SRM) to simultaneously profile 32 genes in yeast by hybridizing probes with different colors to transcripts, and then deducing relative locations of the colors present [72]. SRM is no longer needed; in 2014 seqFISH [51] was published, in which one color per gene is visualized per round of hybridization and the probes are stripped before the next round for the next color in the barcode. With 4 colors, 8 rounds of hybridization (48 = 65536) are more than enough to encode all genes in the human or mouse genome. In practice, an error correcting round of hybridization is performed, so that genes can still be distinguished if signal from one round of hybridization is missing [73] (Figure 2D). More recently in a version of seqFISH based on RNA SPOTS [74], the “colors” themselves are one hot encoded by a sequence of hybridizations, expanding the palette to 20 “colors” per channel and enabling the profiling of 10,000 genes [75].
Another smFISH technique is multiplexed error-robust FISH MERFISH [52], which uses a different barcoding strategy, in which each gene is encoded by a binary code. The color codes in each experiment must be separated by a Hamming distance (HD) of 4 to allow for correction of missing signal in one round, and by 2 to identify error without the facility for correcting it (Figure 2D). The length of barcodes can be increased to encode 10,000 genes [76]. As only the fluorophores are removed but the probes are not stripped, numerous rounds of hybridization in MERFISH are less time consuming than those in seqFISH. Most other smFISH based techniques, such as HybISS [77] and split-FISH [78], use either seqFISH-like or MERFISH-like barcoding.
Advantages of smFISH based techniques include high gene detection efficiency (~95% for Hamming distance 4 MERFISH [79] compared to smFISH), single cell resolution, and subcellular transcript localization, which can be biologically relevant [80, 81]. Single round smFISH has nearly 100% detection efficiency [72], and multiple rounds of hybridization tend to decrease the efficiency in part because barcodes with incorrigible errors are discarded. Disadvantages include requirement of pre-defined gene panel and probes, difficulty in probing shorter transcripts, lengthy imaging time, limited scalability to large area of tissues, possible challenges in cell segmentation, and the need to process terabytes of images.
Challenges with smFISH have been addressed by various methods: Signal to noise ratio can be improved with rolling circle amplification (RCA) [77], branched DNA (bDNA) [82], hybridization chain reaction (HCR) [73], and tissue clearing [83]. With increasing number of genes profiled, the transcript spots are increasing likely to overlap, causing optical crowding. This can be mitigated by expansion microscopy (ExM) [84], only imaging a subset of probes at a time and using computational super-resolution [75], imaging highly expressed genes without combinatorial barcoding [52], and computationally resolving overlapping spots [85]. While smFISH based techniques are typically designed for frozen sections, SCRINSHOT is designed for FFPE sections [86].
2.3 In Situ Sequencing
ISS methods yield spatial transcriptome information by sequencing, typically by ligation (SBL), gene barcodes (targeted) or short fragments of cDNAs (untargeted) in situ. Such methods rely on ligase only joining two pieces of DNA – a primer with known sequence and a probe – if they match the template, with non-matching probes washed away. The probes used are degenerate except for one or two query bases encoded by a color. The 2013 ISS [50], later commercialized by Cartana, uses one query base per probe as in cPAL [87] to sequence gene barcodes (Figure 2E). FISSEQ [88] and a later adaptation with ExM called ExSeq [89] use SOLiD, which uses two query bases per probe to sequence circularized and RCA amplified cDNAs. In STARmap [90], gene barcodes are sequenced by SEDAL, in which SOLiD-like two query bases are also used to reject error, but one base encoding can also be used [90]. BARseq also RCA amplifies probes with gene barcodes, but uses sequencing by synthesis (SBS) instead of SBL to sequence the barcodes [91].
Advantages of ISS include single cell resolution and subcellular transcript localization, as ISS displays each mRNA molecule as a spot. The ISS technologies mentioned above use RCA, which greatly amplifies signal, but misses many transcripts. This can be an advantage, as brighter and less crowded spots allow ISS to be applied to larger tissue areas such as whole mouse brain sections, with lower magnification (20x, while MERFISH uses 60x) [92, 93]. However, ISS techniques tend to have low detection efficiency. Whereas detection efficiency of scRNA-seq techniques is between 3% - 25% [94–98], the detection efficiencies of Cartana ISS and FISSEQ [99] are ~5% and ~1% respectively, with STARmap only marginally better. However, ExSeq claims up to 62% efficiency compared to smFISH, i.e. for genes of interest in the same cell type, ExSeq detects around 62% as many transcripts as smFISH. Untargeted ExSeq can be transcriptome wide, but the targeted ISS techniques have only been used for up to 1020 genes [90] though more typically fewer than 300 genes [89], perhaps due to limited read length of SBL and challenges of imaging and image processing as in smFISH.
2.4 Arrays
Spatial locations of transcripts can also be preserved by capturing the transcripts from tissue sections on in situ arrays. Such arrays can be manufactured by printing spot barcodes, UMIs, and poly-T oligos on commercial microarray slides to capture polyadenylated transcripts, as in the ST and Visium technologies (Figure 2G, H). They can also be Drop-seq-like beads [94] with split pool barcodes, UMIs, and poly-T oligos spread on slides in a single layer (e.g., Slide-Seq [100]) or confined in wells etched on the slides (e.g., HDST [101]), with bead barcodes subsequently located using in situ SBL. Alternatively, in DBiT-seq [102], an array is generated by microfluidic channels, which are used to deposit one type of barcode in one direction, and then another in a perpendicular direction, with the orthogonal barcodes ligated so each spot can be identified with a unique pairwise combination. While array based techniques are typically designed for frozen sections and 3’ end Illumina sequencing, Visium has recently been adapted to FFPE sections [103] and Nanopore long read sequencing [104].
Array based techniques have been applied to large areas of tissue [61], and their use is increasing (Figure 3A). Nevertheless, they do not have single cell resolution, and the spot diameter of ST is 100 μm with spots 200 μm apart (Figure 2G). Visium, which is an improved version of ST released after 10X Genomics acquired ST, has spots in a hexagonal array with diameter 55 μm (Figure 2H). Bead diameter is 10 μm in Slide-seq, and 2 μm in HDST. Slide-seq and HDST use bead size smaller than single cells, however they still don’t provide single cell resolution because one bead can span two or more cells. Resolution of DBiT-seq is determined by channel width (either 50, 25, or 10 μm). More recently, the spot size can be reduced to below 1 μm, with RCA amplified DNA nanoballs as small as 0.22 μm across with spot barcodes deposited in wells 0.5 or 0.715 μm apart in Stereo-seq [105], and in Seq-Scope polonies with spatial barcodes 0.5 μm in diameter on an Illumina flow cell re-purposed to capture transcripts from tissue sections [106]. Another polony based method PIXEL-seq achieves spot diameter of about 1.22 μm but unlike in the flow cell, PIXEL-seq does not have much spacing around each polony [107].
Array based techniques tend to have low detection efficiency. The efficiency of ST is estimated to be 6.9% compared to smFISH for select genes in the same tissue type, comparable to that of scRNA-seq. Visium’s efficiency seems to be moderately higher than that of ST, and DBiT-seq’s is even higher, at ~15.5% compared to smFISH. Slide-seq and HDST are much less efficient. Efficiency of Slide-seq1 is only ~2.7% of that of Drop-seq, while Slide-seq2 is on par with Drop-seq [108]. Efficiency of HDST is ~1.3% per bead compared to smFISH. Efficiencies of the submicron techniques, in number of UMIs per unit area in the same tissue, might be comparable to that of Visium [107].
2.5 Methods that don’t rely on imaging
Some technologies have been developed to preserve information necessary to computationally reconstruct spatial gene expression patterns without imaging. One such technology is DNA microscopy [109], which records proximity between cDNAs. This information can be used to reconstruct relative locations of transcripts. At the cellular level, gene expression in rare cell types can be reconstructed by deliberately assaying multiplets, and then mapping them to locations in a spatial reference based on gene expression of cells from common cell types attached to cells from the rare cell types [110, 111]. Variants of the term “spatial transcriptomics” have also been used to describe techniques localizing transcripts to organelles (e.g., APEX-seq [112]), although no spatial coordinates are recorded.
2.6 Multi-omics
Another direction of development is multi-omics (Supplementary Material 4.6). Oligo-tagged antibodies are used to detect proteins of interest, and the oligonucleotide signifying the protein species can be detected with smFISH based methods. Such antibody panels have been combined with a variant of ST as SM-Omics [113], GeoMX DSP [70], and MERFISH [90]. A disadvantage of antibody panels is that the number of proteins profiled is limited to at most a few dozens. MERFISH [114] and seqFISH+ [115] have been adapted to visualize chromatin structure, by targeting DNA genomic loci [114] or introns of nascent transcripts [114, 115]. Multiplexed transcript quantification can also be combined with neuron projection tracing. For instance, cholera toxin subunit b (CTb) retrograde tracing has been used in conjunction with MERFISH to visualize axons [116]. Also, BARseq was originally designed to use ISS for axon tracing by sequencing neuron specific barcodes introduced by a virus injected into the brain, but was later adapted to sequence gene barcodes as well.
3 Data analysis
The processing and analysis of high-throughput spatial transcriptomics data requires novel methods and tools, especially for problems such as image preprocessing, spatial reconstruction of scRNA-seq data, cell type deconvolution of array-based data, identification of spatially variable genes, and inference of cell-cell interactions.
For smFISH and ISS based data, the raw data consists of images of fluorescent spots, which must be processed to identify transcript spots, match spots to genes, and assign spots to cells (Section 5.1). SmFISH and ISS studies often use classical image processing tools such as top-hat filtering to remove background, translation to align images from different rounds of hybridization, and watershed for cell segmentation [73, 79, 90]. Machine learning in Ilastik, deep learning packages like DeepCell [117], and alternative tools incorporating scRNA-seq data [118], can also be used for cell segmentation. However, without visualizing the plasma membrane, accuracy of cell segmentation is limited. Some analyses, such as identification of tissue regions, can be performed without cell segmentation [118]. Until 2019, image processing was typically performed with poorly documented and technique specific code written in the proprietary language MATLAB, but more recently such code is increasingly written in the open source language Python. The package starfish [119] was developed as an attempt to provide a unified and well-documented user interface to process images from different techniques such as seqFISH, MERFISH, and ISS, but it has not yet been widely adopted.
Recently, improvements in scRNA-seq technology have inspired new methods for leveraging the complementary nature of high-resolution transcriptome quantification with spatial transcriptomics data. For smFISH and ISS data that is not transcriptome wide, expression patterns of genes not profiled in the spatial data can be imputed with scRNA-seq data, either by mapping dissociated scRNA-seq cells to the spatial reference or by directly imputing gene expression in space using expression profiles from scRNA-seq (Supplementary Material 7.3). Cells can be mapped to spatial locations on an existing spatial dataset with genes shared by the two datasets, with an ad hoc score favoring similarity between cell and location [60] or via optimal transport modeling [120]. While ad hoc scoring is simple to implement, the results tend to be qualitative. Alternatively, spatial locations of scRNA-seq cells can be reconstructed de novo, by optimal transport modeling, or projection into 2 or 3 dimensions corresponding to the spatial dimensions [121]. Optimal transport exploits spatial autocorrelation of gene expression to map cells to locations. However, spatial autocorrelation in tissues breaks down when different cell types are co-localized. When a spatial reference is not available, the de novo approach often does not yield results resembling the original tissue [121].
Gene expression in space can also be imputed from scRNA-seq without explicitly mapping scRNA-seq cells to locations. A common approach is to project the spatial and scRNA-seq data into a shared low dimensional and batch-free latent space, and to subsequently estimate gene expression by projecting the spatial cells into the latent space. Examples of this approach include Seurat3 [122] and gimVI [123]. These methods may also be used to add spatial context to single cell multi-omics data when spatial techniques for some of the multi-omics data are not available.
In spatial data without single cell resolution, such as those derived from ST and Visium, scRNA-seq data can inform cell type composition of the spots or voxels (Supplementary Material 7.4). A common approach is to explicitly model observed gene expression at the spots as a weighted sum of mean gene expression of each cell type from scRNA-seq. Gene counts can then be modeled with negative binomial (NB) or Poisson distributions, and cell type proportions in each spot can then be estimated from the parameters of the model. Examples of such methods include stereoscope [124], and RCTD [124].
Given the relevance of scRNA-seq to spatial data, popular scRNA-seq exploratory data analysis (EDA) ecosystems such as Seurat [122], SCANPY (Squidpy) [125], and SingleCellExperiment (SpatialExperiment) [126] have added functionalities for spatial data, such as updates to data containers and functions to facilitate visualization of gene expression and cell/spot metadata at spatial locations (Supplementary Material 7.2). EDA packages dedicated to spatial data with beautiful graphics and good documentation have also been written, such as Giotto [127], STUtility [128], and SPATA [129]. Seurat, Giotto, and SPATA also implement basic methods to identify spatially variable genes. In addition, Giotto implements methods to identify cell type enrichment in ST and Visium spots, identify gene coexpression and association between gene expression and cell type colocalization, and to identify spatial regions [130].
Spatially variable genes are genes whose expression is associated with spatial location (Supplementary Material 7.5). Two approaches are commonly used: Gaussian process regression (GPR) [131] and its generalization to Poisson [132] and NB [133], and Laplacian score [134]. The former models normalized gene expression or the rate parameter of Poisson or NB gene expression as a GPR and finds whether the model better describes the data with the spatial term than without. To speed up computation, the cells or spots can be aggregated with self-organizing map before the GPR approach is applied [135]. The latter approach identifies genes whose expression better reflects the structure of a spatial neighborhood graph. The locations of cells can also be modeled as a spatial point process with gene expression as marks; spatially variable genes can be identified as marks associated with locations [136]. GPR based methods approximate the p-values of genes from theory, while other methods use permutation testing, which makes them less scalable. However, the Gaussian kernel commonly used for GPR based methods doesn’t account for anisotropy observed in tissues.
Spatial information also enables identification of potential cell-cell interaction (Supplementary Material 7.8). This is commonly done with knowledge of ligand-receptor (L-R) pairs and testing of whether L-R pairs are more likely to be expressed in neighboring cells or spots [137]. Expression of genes of interest can also be modeled, including a term for cell-cell co-localization; the gene is considered associated with cell-cell co-localization if the model better describes the data with this term than without [138, 139].
There are many other types of analyses that are useful for spatial transcriptomics analysis, including identification of archetypal gene patterns (Supplementary Material 7.6), spatial regions defined by the transcriptome (Supplementary Material 7.7), inferring gene-gene interactions (Supplementary Material 7.9), subcellular transcript localization (Supplementary Material 7.10), and gene expression imputation from H&E (Supplementary Material 7.11).
4 Trends in the spatial transcriptomics field
The quality vs. quantity trade off inherent in existing technologies means that there is no single “best” solution currently available, and the difficulty in implementing methods has resulted in many technologies never spreading beyond their institutions of origin. LCM, ST, Visium, ISS, and Tomo-seq have been the most widely adopted (Figure 3A), and in almost all cases in the US and western Europe (Supplementary Figures 4.9, 5.25, 5.31). In terms of tissues analyzed, multiplexed current era techniques have been used widely to characterize human tissues [140], tumors [53] (especially breast tumors and squamous cell carcinoma), and pathological tissues that don’t necessarily have a stereotypical structure [141] (Figure 3B, C). In the SARS-CoV-2 pandemic, GeoMX DSP has been used for spatial transcriptomic profiling in lung autopsy of COVID victims [142, 143].
Some of the processed data, and associated spatially variable genes, can be downloaded and visualized from SpatialDB [144]. Excluding LCM, the vast majority of current era studies were performed on either humans or mice (Figure 3D), and the brain is the most studied organ (Figure 3B, E, F). In particular, the international project Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative - Cell Census Network (BICCN) is constructing a multi-modal atlas for the human, mouse, and non-human primate brain, including spatial data such as MERFISH and seqFISH [145]. While the smFISH techniques being utilized for this project can in principle scale to many genes, in practice they have for the most part been used for limited numbers of genes (Figure 3G) and cells (Figure 3H).
All packages mentioned in the Data analysis section are open source and written in languages such as R, Python, and Julia. Downstream analyses in studies primarily concerning new data and data analysis packages predominantly use open source programming languages such as R, Python, and C++ (Figure 3I, J). While MATLAB is still popular, its use appears to be declining with R and Python gaining in popularity (Supplementary Figure 5.9). While R is more popular for downstream analyses and EDA, Python and C++ are more popular for package development (Figure 3I, J), reflecting relative strengths of these languages and their surrounding cultures. Most of the packages are not hosted on standard repositories such as the Comprehensive R Archive Network (CRAN), Bioconductor, pip, and conda (Supplementary Figure 7.9). While most packages using R, Python, and C++ are well-documented, most MATLAB packages are not, perhaps reflecting the central role of documentation in open source culture (Supplementary Figure 7.9).
5 Future prospective
While technologies of the prequel are rapidly being deprecated, the ideas and methods that underlie them are fundamental to current era spatial transcriptomics. The field has dramatically expanded over the past 5 years (Figure 4A), with a plethora of new techniques and popularization of Visium driving growth (Figure 4B, Supplementary Figures 4.8, 5.34). The popularity of these techniques may be attributable to applicability to diverse tissues and the availability of commercial kits and core facilities translating to less work and cost to set up instruments and train personnel.
What lies ahead of the rising curves? First, more can be done to improve data collection techniques. For example, most current era techniques require tissue sections. Highly multiplexed whole mount smFISH and tissue clearing protocols, and more efficient computational tools for aligning multiple sections that may come from multiple individuals or even developmental stages, should be developed to extend current era techniques to 3D and to spatiotemporal analysis. Furthermore, smFISH and ISS techniques, with signal amplification to reduce the number of probes per transcript, can be adapted to target isoform specific exons or untranslated regions rather than all transcripts of a gene.
Second, current era data has not yet been integrated into comprehensive databases. Prequel databases such as GXD, e-Mouse Atlas and Gene Expression (EMAGE) [148], and FlyExpress [149] include data from multiple sources and can be queried by gene symbol and developmental and spatial ontologies. In addition, ABA [30], EMAGE, and FlyExpress aligned ISH images to common coordinates and can be queried with expression patterns. While some current era authors provide online interactive visualization of datasets from their studies [61], comprehensive databases integrating data from multiple sources as in the prequel era have not yet been developed. Furthermore, while prequel ontologies are still used in current era studies, such ontologies may be improved with the transcriptome wide quantitative data from the current era.
Third, outside of LCM, the current era is highly focused on the brain in human and mice, with potential spatial transcriptomics investigations of other organs such as the liver and the leaf lagging behind. Technological modernization of prequel consortia for organisms other than human and mice, and for organs other than the brain, holds much promise for the development of useful spatial transcriptomics atlases.
Fourth, an open source, well-documented, interoperable, and scalable workflow with an integrated easy-to-use interface would greatly simplify spatial transcriptomics data collection and analysis. At present, for tasks beyond EDA, users still often need to learn new syntax, convert object types, and even learn new languages to use some data analysis tools. Finally, our survey of methods shows that spatial transcriptomics methods need to be more open and accessible so that they become adopted around the world, and are not restricted to Western elite institutions.
6 Methods
6.1 Database
The criterion for including studies in the database was that the study used a method for quantification of transcripts while recording spatial context of samples within a tissue or cell. Methods were required to be able to quantify more transcripts and genes than possible with one round of FISH or immunofluorescence. Reviews and protocols were excluded. For publications on data analysis methods, the criterion for inclusion was that the proposed method went beyond the use of existing packages, and demonstrated a novel methodological approach or technique on a spatial transcriptomic dataset. In addition to collating publications through extensive reading and citation searching, keywords such as “spatial transcriptomics”, “digital spatial profiling”, “in situ sequencing”, as well as technology specific terms such as “visium”, “seqfish”, and “merfish”, were searched on PubMed and bioRxiv. The search results, as well as related papers and other papers citing the publication of interest on PubMed were manually screened and entered into the database. The sharing of the database is intended to foster crowd sourcing of literature in the future.
6.2 Analysis
All analyses were performed with R version 4.0.4. The timelines were generated with the R package ggtext [150]. The tissue maps in Figure 3B-C, E-F were generated with the R package gganatogram [151]. All trend lines were estimated by linear regression, and significance of the slope for each was computed by a t-test. The pictograms in Figure 3I-J were generated with the R package ggtextures [152].
In the supplement, the medium resolution world map was generated with the R package rnaturalearth using the Robinson projection option. Cities and institutions were geocoded with the Google geocoding API.
For LCM text mining, PubMed abstracts and metadata were downloaded by searching for the expression “((laser capture microdissection) OR (laser microdissection)) AND ((microarray) OR (transcriptome) OR (RNA-seq))” with the PubMed API, with R package easyPubMed [153], BioRxiv abstracts were downloaded by web scraping with Python package biorxiv-retriever [154] and metadata was downloaded with the bioRxiv API. The abstracts were tokenized into unigrams while preserving common phrases. Stopwords and punctuations were removed and the words were stemmed. With token counts in each abstract, the R package stm [155] was used for topic modeling of LCM abstracts from PubMed and bioRxiv. The date of publication, city of first authors (PubMed) or corresponding authors (bioRxiv), and journal (including bioRxiv) were used as covariates to model topic prevalence. Due to the large number of cities and journals, cities and journals with fewer than 5 publications were grouped into “Other” prior to modeling. The number of topics was chosen based on a trade off between hold out likelihood and residual, and between topic exclusivity and semantic coherence; 50 topics were used. Global vector (GloVe) embedding [156] of words from the abstracts was performed with the R package text2vec [157], using 125 dimensions. The word embeddings were then Louvain clustered [158] with the R package igraph [159] and projected to lower dimensions for visualization with principal component analysis (PCA) using the R function prcomp, and uniform manifold approximation and projection (UMAP) [160] with the R package uwot [161].
The paper and supplement can be continuously updated to reflect the contents of the database.
6.3 Code and data availability
All code for generating the figures is available at https://github.com/pachterlab/LP_2021, and the version as of submission is archived at https://doi.org/10.5281/zenodo.4795375. The curated database can be accessed at https://docs.google.com/spreadsheets/d/1sJDb9B7AtYmfKv4-m8XR7uc3XXw_k4kGSout8cqZ8bY/.
Footnotes
Figure 1 revised to include ligation mediated SNV detection and single cell cDNA amplification; Figure 3 revised to reduce confusion and correct misunderstandings per Long Cai's request; corresponding parts of the Supplement were updated reflecting the paper in a more detailed manner.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.
- 96.
- 97.
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵
- 105.↵
- 106.↵
- 107.↵
- 108.↵
- 109.↵
- 110.↵
- 111.↵
- 112.↵
- 113.↵
- 114.↵
- 115.↵
- 116.↵
- 117.↵
- 118.↵
- 119.↵
- 120.↵
- 121.↵
- 122.↵
- 123.↵
- 124.↵
- 125.↵
- 126.↵
- 127.↵
- 128.↵
- 129.↵
- 130.↵
- 131.↵
- 132.↵
- 133.↵
- 134.↵
- 135.↵
- 136.↵
- 137.↵
- 138.↵
- 139.↵
- 140.↵
- 141.↵
- 142.↵
- 143.↵
- 144.↵
- 145.↵
- 146.↵
- 147.↵
- 148.↵
- 149.↵
- 150.↵
- 151.↵
- 152.↵
- 153.↵
- 154.↵
- 155.↵
- 156.↵
- 157.↵
- 158.↵
- 159.↵
- 160.↵
- 161.↵