Abstract
First-order tactile neurons have spatially complex receptive fields. Here we use machine learning tools to show that such complexity arises for a wide range of training sets and network architectures, and benefits network performance, especially on more difficult tasks and in the presence of noise. Our work suggests that spatially complex receptive fields are normatively good given the biological constraints of the tactile periphery.
Results
First-order tactile neurons in the glabrous skin of the human hand have distal axons that branch in the skin and form many transduction sites1-3, yielding spatially complex receptive fields with many highly sensitive zones4,5 (Fig. 1a). We have recently shown that this arrangement permits first-order tactile neurons to signal high-level features of touched objects such as the orientation of a touched edge4, a capacity previously considered a hallmark of processing in the somatosensory cortex6-8. Here we leverage machine learning tools to investigate why complex receptive fields arise and what computational benefits they yield. We show that complex receptive fields arise under a wide range of training sets and biologically realistic network constraints. We also show that complex receptive fields benefit network performance, especially on more complex discrimination tasks and in the presence of noise.
We abstracted the tactile processing pathway with a four-layer feedforward neural network (Fig. 1b, c). The input layer of our network consisted of 784 units, representing mechanoreceptors distributed over a small patch of skin. In this arrangement, the weight matrix between the input and first hidden layer - which we call W(1) - represents the receptive fields of first-order tactile neurons. Our network was trained on a range of stimuli including single points, multiple points, as well as Roman and Braille characters (Fig. 1d). These stimuli were subjected to translation and rotation and were spatially filtered to crudely approximate skin mechanics. Importantly, we introduced three biologically-inspired constraints. First, non-negative regularization in W(1) to simulate the fact that first-order tactile neurons can only be excited when their transduction sites are stimulated9. Second, convergence from the input to the first hidden layer to simulate the many-to-one convergence from mechanoreceptors in the skin to first-order tactile neurons traveling in the nerve1-3. Third, two distinct unsupervised and supervised training phases, representing the encoding and interpreting aspects of the tactile processing pathway, respectively.
We first asked under what conditions, if any, our network learns spatially complex receptive fields. In our main analysis, the 784 units in the input layer converged to 81 units in the first hidden layer, estimating the fact that first-order tactile neurons innervate on the order of ten mechanoreceptors1-3. We reasoned that the complexity of the training set would influence the complexity of the receptive fields10. We tested this idea with four training sets: Gaussian single points, mixed one and two Gaussian-points, Roman letters, and a mixed set that included one and two Gaussian points, Roman letters and Braille characters in equal proportions (see Methods). These training sets represent different degrees of structural complexity, and consist of stimuli that have been used in tactile studies in both human and animal models11-15 but were not meant to represent the natural statistics of tactile stimuli, which are unknown.
We trained our network on each of these training sets in an unsupervised fashion and examined the resulting receptive fields (i.e. the W(1) matrix). All networks, even those trained with the simplest training set, exhibited receptive fields with multiple areas of high sensitivity (Fig. 2a). Overall, there was a clear effect of training set on receptive field complexity (F(3, 76) = 1642, P<0.01) where the number of highly sensitive zones increased with the complexity of the training set (Fig. 2b). A similar effect was evident when analyzing receptive fields in the spatial frequency domain, with more complex training sets yielding higher spatial frequency content.
We next asked how the degree of convergence between the input and first hidden layers influenced receptive fields. That is, how physical constraints placed on the number of first-order tactile neuron axons traveling within the peripheral nerve should affect connectivity to mechanoreceptors in the skin. We reasoned that increasing convergence would increase receptive field complexity, since this smaller set of units must still encode the same set of inputs. We tested this idea by decreasing the size of the first hidden layer from 81 to 36 units, closer to the lower limit of biologically relevant convergence1-3, and training the network on the same four training sets described above. Increasing convergence did result in more complex receptive fields for alphabet and mixed networks (Fig. 2b). On average, the 36-unit alphabet network had 3.0 more peaks than the 81-unit alphabet network (t(38) = 46.39, P<0.01), and the 36-unit mixed network had 4.0 more peaks than the 81-unit mixed network (t(38) = 56.93, P<0.01). Interestingly, however, the one point and the one and two point networks (our simplest training sets) did not show increased complexity with increased convergence (Fig. 2b). In fact, the 36-unit one point network had 0.3 fewer peaks than the 81-unit one point network (t(38) = −8.55, P<0.01), and the 36-unit one and two point network had 0.5 fewer peaks than the 81-unit one point two point network (t(38) = −10.00, P<0.01).
At this point we further abstracted our network constraints to examine how they influenced the learned receptive fields. First, we trained our network on the mixed stimulus set without non-negative regularization in W(1) and found qualitative changes in receptive field morphology such that they no longer had structural similarities to our previously documented empirical receptive fields4 (Fig. 3a). Second, we trained our network on the mixed stimulus set with extreme convergence (4 units in the first hidden layer) and, again, found the resulting receptive fields did not resemble our empirical receptive fields (Fig. 3b). Last, we trained our network on each of the four stimulus sets without convergence (i.e. 784 units in the first hidden layer). We reasoned that such a network may not develop complex receptive fields because it did not need to compress the input space, especially for the single dot training set given its simple spatial statistics. However, receptive fields with multiple highly sensitive zones emerged for all training sets to varying degrees (Fig. 3c).
Given that our networks developed complex receptive fields under all network constraints and training sets, we investigated the functional consequences that such an arrangement had on sensory processing. In these analyses, we trained the network on unlabelled Mixed stimuli, then fixed W(1) and trained the remaining layers as a classifier using labelled Mixed stimuli. In our approach, the unsupervised training phase represents the encoding function of the tactile processing pathway, while the supervised training phase abstracts the more interpretive functions of the central nervous system. We compared this learned network against a network engineered to have single-peaked Gaussian receptive fields in W(1) on discrimination and identification tasks. For the engineered network, we selected the width of the Gaussian receptive field (SD = 3.0 steps) that resulted in best performance.
We first asked whether complex receptive fields benefit spatial accuracy. We had the network perform two-point discrimination, a task central to many studies of tactile acuity11,16,17. Specifically, we used a two-alternative forced choice paradigm and defined the difference limen as the separation distance between stimuli at which the network classified 75% of the stimuli correctly. The learned network had a mean difference limen of 6.94 (SD = 1.36) steps on our input space, which corresponds to a modelled distance of ~1-3 mm, depending on assumptions about mechanoreceptor innervation density. Overall, performance of learned and engineered networks were not significantly different with 81 units in the first hidden layer (t(45) = −1.85, P = 0.071; Fig. 4a). Moreover, changing the degree of convergence from 81 to 36 units did not cause a statistically significant change in performance for either the learned or the engineered network (F(1, 82) = 0.31, P = 0.58; Fig. 4a).
We then asked whether complex receptive fields benefit network performance in a more difficult identification task. We assessed the network’s ability to correctly classify characters from the Roman alphabet, as has been previously done with human participants12. In this case, engineering W(1) to have single-peaked Gaussian receptive fields and increasing convergence both decreased network accuracy (F(1,79) = 103.78, P < 0.01, F(1, 39) = 107.23, P < 0.01, respectively), and the interaction between these factors was also significant (F(1, 79) = 7.05, P = 0.0096). That is, both learned and engineered networks performed well, but the learned networks outperformed engineered networks for both levels of convergence and the benefit of complex receptive fields increased with increased convergence (Fig. 4b).
Finally, we asked whether complex receptive fields benefit network performance in the presence of noise. We introduced varying levels of normally distributed additive and multiplicative noise to the training data during both unsupervised and supervised training phases and then tested the network’s performance on a noiseless dataset. The effect of training noise on the network’s ability to classify characters from the Roman alphabet was substantial (Fig. 4b). The learned network had an accuracy of 87.7% (SD = 1.1) with low levels of additive noise (see Methods) compared to 75.1% (SD = 2.5) for the fixed network with the same amount of noise, a statistically significant performance gap (t(41) = 20.65, P < 0.01). Convergence also significantly influenced classification accuracy under the different noise levels (F(6, 555) = 12.36, P < 0.01). The performance of the 36-unit network decreased by 1.4% compared to the 81-unit learned network with low levels of additive noise (t(38) = 4.25, P = 0.00013). In contrast, the performance of the 36-unit network with engineered Gaussian receptive fields decreased by 6.1% compared to the 81-unit engineered network (t(41) = 9.59, P < 0.01). The performance gap grew between learned and engineered networks with additional additive noise (Fig 4b). For all networks, multiplicative noise had a similar effect but much smaller effect size (Fig. 4b).
Discussion
A core feature of the tactile processing pathway is that there are many more mechanoreceptors in the skin of the hand than there are first order tactile neurons in the median and ulnar nerves. It is not surprising, therefore, that first order tactile neurons branch1-3 since this is the only way they can innervate all the available mechanoreceptors. What may be surprising is the spatial complexity and apparent heterogeneity of the innervation pattern4,5, a feature which has been overlooked or ignored in previous models of the tactile processing pathway11,18-20. Our work here leverages simple machine learning tools to provide two fundamental insights in this respect. First, we show that spatially complex receptive fields are a normatively good and, perhaps, biologically parsimonious, arising under a wide range of training sets and network architectures. Second, we show that spatially complex receptive fields benefit network performance, especially in relatively difficult tasks and in the presence of noise.
Heterogeneously sampling the input space is a good thing for the nervous system to do because the input space of sensory stimuli is inherently sparse. Neural networks like the one we use here implicitly learn the statistical regularities (and thus sparsity) of the stimuli to which they are exposed. Indeed, such a machine learning approach has been shown to reproduce biological receptive field properties of neurons at various levels of the visual processing pathway10,21. Another suggestion for a mechanism to exploit sparsity comes from the field of compressed sensing, which shows that randomly sampling the input space can, under reasonable assumptions, allow a system to fully reconstruct a sparse input signal with fewer measurements than that prescribed by the Shannon-Nyquist theorem22-25. Given an input with sparsity S (at most S non-zero terms), in many situations the input signal can be fully reconstructed by randomly sampling at a frequency greater than 2S with no noise or multiplicative noise, or 4S with additive noise22,24, consistent with our observation that networks with more spatially complex receptive fields are particularly immune to additive noise. Figure 5 illustrates a cartoon compressed sensing scenario in our experimental setting, showing that a network with fully randomized positive weights in the first hidden layer can perform strikingly well on the alphabet discrimination task relative to the learned and fixed networks we described above. That is, the random network performs only slightly worse than the learned network and equivalent to the fixed network with no noise and, as expected, is able to better maintain its performance as the amount of additive noise is increased. This is not to say that the heterogeneity of how first-order tactile neurons innervate mechanoreceptors is random – indeed random connectivity yields receptive fields that are qualitatively distinct from those we record from humans (Fig. 5b) – but, rather, that even random sampling can outperform pixel-like sampling with Gaussian receptive fields.
Methods
Feedforward Neural Network Architecture
We designed a four layer feedforward network model with layers L1 to L2) containing s1 to s4 units respectively. s1 = 784, s2 = 81 or 36, s3 = 784, and s4 = 26 or 2 depending on if the network is trained to perform alphabet classification or two-point discrimination. The general form of feedforward computation was as follows: where W(l) denotes the weights from layers Ll to Ll+1, z(l+1) is the weighted sum of outputs from layer Ll, and a(l) is the output of layer Ll, after the activation function ƒ. For unsupervised training (L1 to L3), we used a rectified linear function ƒ(x) = max(0, x) for W(1) and a softmax function for W(2). For supervised training (L1 to L4), we used a rectifier for W(1) and W(2) and softmax for W(3).
Two-Phased Training and Non-Negativity Constraint
We randomly initiated weights by drawing from distribution N(0, 0.01). The general learning algorithm was mini-batch gradient descent with mini-batches of size 256. We trained the network in two phases. In the unsupervised learning phase, we trained L1 to L3 as an autoencoder that reproduced the input. The goal of gradient descent was to minimize the categorical cross-entropy cost: where, for training instance x, p(x) is the true output (which is equivalent to input x in the unsupervised learning phase), q(x) is the predicted input, and R(W(1)) is the non-negativity constraint, leading to the learning rule
We incorporated the asymmetric regularization term26, R(W(1)), where for each unit j of L1 and unit i of L2). c denotes an arbitrarily large constant, which we picked as 1000, that harshly penalized the network for learning negative weights in W(1).
In the supervised phase, we froze W(1) and trained L1 to L4 as a classifier. We reinitiated W(2) between the two training phases. Depending on the discrimination task to be performed, the network may operate as a binary (for two-point discrimination) or multiclass (for alphabet) classifier. Gradient descent minimized the cross-entropy cost:
The learning rule in this phase was:
Network hyperparameters used during training varied among different network architectures and training sets. Networks that did not reach convergence in the number of iterations were removed from testing.
Training Stimuli
We generated all training inputs X such that xij ∈ ℝ28×28. We generated Gaussian-points stimuli by initializing one or two peaks |xij | = 10 where i, j are integers chosen independently from distribution U(0,27), then passed through a two-dimensional Gaussian filter with width σ = 3.0.
We generated Roman letters stimuli as Helvetica characters normalized to 17 steps in height. We used similar height scaling for Braille characters. The filled portions of characters were initiated as |xij | = 1. We subjected each character to a random rotational angle drawn from distribution N(0, 20) in degrees, followed by random horizontal and vertical translation drawn from distribution N(0, 5) in steps.
We generated 60,000 training stimuli of each class. For Roman letters and Braille characters, there was approximately equal proportion of each character. Gaussian-points were evenly split between one and two points (i.e. 30,000 of each). We used standard one-hot encoding for labelling in supervised training.
Receptive Field Complexity
We bootstrapped 1000 receptive fields from each network. First, we designed a peak counting algorithm that calculated the number of significant local maxima contained in each receptive field. For each receptive field R, we define rij as a peak if 1) it is a local maximum 2) |rij| > (maxkrk)/2, that is, the value of rij is greater than half of the global maximum, and 3) rij is at least 5 steps away from the next closest local maximum. These criteria prevent low amplitude noise from being counted as peaks. Second, we analyzed receptive fields in the frequency domain by performing discrete two-dimensional Fourier transformation using the Fast Fourier Transform algorithm. We performed Fourier transformation after normalizing sampled RFs by their peak values such that maxkrk = 1.0. Last, to compare information shared by each pair of networks, we used mutual information between pairs of bootstrapped RFs normalized by their respective entropies, such that 1.0 means perfect correlation and 0 means no mutual information. We binned weights into 10,000 bins before calculating mutual information so that the control group (learned versus learned) RFs has a normalized mutual information of close to 1.0.
Model Performance
We assessed network accuracy in two-point discrimination and alphabet classification. We implemented two-point discrimination using a two-alternative forced choice paradigm. We generated 2000 one and two Gaussian-points in equal proportions. Two Gaussian-points were spaced symmetrically about the center of the input space at distances 0 to 22 steps apart with increments of 2 steps. We subjected two Gaussian-points to a random integer rotational angle drawn from distribution U(0, 90) in degrees. We defined the difference limen, or just-noticeable difference, for two-point discrimination as the distance at which the network correctly classified 75% of test stimuli. We estimated difference limen using cubic spline interpolation on the full accuracy plot.
We assessed the network on alphabet classification by testing it on 7800 new characters with 300 instances of each letter, subjected to rotational and translational variability as described above.
To assess robustness against noise, we trained the networks, in both unsupervised and supervised phases, with noisy data before testing them on noiseless data. We implemented multiplicative noise on input X as εij = c · u · xij for each coordinate i, j in X, where u was randomly drawn from distribution N(0, 0.01). We implemented additive noise as θ = c · v · maxkxk, where v was randomly drawn from distribution N(0, 0.01). We designated c = 1.0 as low-level noise and c = 3.0 as high-level noise. Noise was re-instantiated at the beginning of each training epoch.
Acknowledgements
This work was supported by the Canadian Institutes of Health Research (Foundation Grant to JAP: 3531979). JAP received a salary award from the Canada Research Chairs Program.