Abstract
Social learning enables complex societies. However, it is largely unknown how insights obtained from observation compare with insights gained from trial-and-error, in particular in terms of their robustness. We use aversive reinforcement to train “experimenter” zebra finches to discriminate between auditory stimuli in the presence of an “observer” finch. We find that experimenters are slow to successfully discriminate the stimuli but immediately generalize their ability to a new set of similar stimuli. By contrast, observers subjected to the same task instantly discriminate the initial stimulus set, but require more time for successful generalization. Drawing upon machine learning insights, we suggest that observer learning has evolved to rapidly absorb sensory statistics without pressure to minimize neural resources, whereas learning from experience is endowed with a form of regularization that enables robust inference.
Introduction
Humans and animals have the remarkable ability to generalize their acquired knowledge to new examples and situations (Pavlov 1927; Markman and Hutchinson 1984; Bass and Hull 1934; Spierings and Ten Cate 2016). For example, they can learn to discriminate threatening from harmless stimuli and they can generalize this knowledge to new instances of a threat. They are also capable of learning from few examples (Cherkin 1969; Bitterman et al. 1983), presumably because brains have evolved under the pressure of fatal consequences when threats are not immediately recognized. Two ethologically relevant learning metrics are thus the acquisition time and the transferability of acquired information. Which forms of learning focus more on the former and which more on the latter of these metrics?
We propose a comparative approach towards disentangling rapid learning from robust generalization, exploiting the fact that many animals are not only capable of learning from aversive or appetitive cues through trial-and-error type processes (Thorndyke 1905; Skinner 1953), but also from observing cues produced by conspecifics and other animals involved in learning or doing the same task (Galef 1988; Zentall 2006; Byrne 2003). In what way does the retained sensory information depend on whether the learning cue is experienced or observed, keeping all other parameters fixed?
We study cue modulation in an auditory stimulus discrimination task involving pairs of zebra finches (Okanoya and Dooling 1990; Sturdy et al. 1999; Tokarev and Tchernichovski 2014; Canopoli et al. 2014). Zebra finches are useful models for sensory learning thanks to their ability to detect subtle differences among highly stereotyped song renditions (Woolley and Doupe 2008). Their learning in stimulus playback experiments is dependent on cues such as behavioral context (Tchernichovski et al. 2001; Derégnaucourt et al. 2013).
Using aversive air-puffs, we trained one of the two birds in a pair to discriminate short from long renditions of a zebra finch song syllable (Go-NoGo avoidance conditioning, Fig. 1A; spectrograms of stimuli in Fig. 1B, durations in Fig 1C). We refer to these birds as “experimenters”. Simultaneously, we allowed a paired zebra finch to observe the entire training phase of the experimenter, including the acoustic stimuli and the experimenter’s actions. These latter birds are referred to as “observers”; they could engage in unrestricted visual and auditory interactions with experimenters, but did not perform the task until after experimenters completed their training phase.
The acoustic stimuli in such experiments are fully predictive of whether an air-puff is imminent or not. Experimenters reveal their ability to discriminate the stimuli by escaping from the perch before they get struck by the air-puff (Canopoli et al.2014). We refer to this form of learning as experience learning because birds learn to discriminate based on experience of the air-puffs. By contrast, observers could not learn from air-puff experiences, but they could learn from observing the air-puffs’ direct and indirect effects on experimenters’. We refer to this form of learning as observation learning (which is not meant to imply that observers learn by imitating the actions of experimenters, which is commonly known as ‘observational learning’).
We expected that observers would be able to demonstrate their learned discrimination ability in a separate testing phase in which they were exposed to air-puffs. Here, we investigate the performance tradeoffs between experience learning and observation learning using two measures taken from the machine learning community: learning speed and generalization performance.
Results
Observation learning induces rapid expression of auditory discrimination
During a pre-training phase, experimenters (EXP) were accustomed to air-puffs that followed one of two auditory stimuli of different duration. Then a training phase followed, during which we exposed EXP to the full training set of 10 auditory stimuli (Fig. 2A left panel and Supplementary methods). Gradually, EXP learned to escape from the perch more often in puffed trials; their escape probabilities (cumulated from trial onset to trial end) became larger on puffed trials than on unpuffed trials (Fig. 2B left and right). We quantified the birds’ ability to discriminate stimulus class by the average difference in (cumulative) escape probabilities (dPesc) between puffed and unpuffed trials (Fig. 2, B, D and F). EXP attained a stringent performance criterion (see Methods and Fig. S1 D) after 4.8 ± 2.9*103 trials (mean ± std, n=10 birds). This criterion defined the end of the training phase, at which time EXP displayed a dPesc of 0.36 ± 0.06 (mean ± std, dPesc averaged over the last 3 blocks of training, including the criterion block). After the training phase (or observation phase for observers), the experimenter was replaced by the observer (OBS) and a naïve bird was placed in the observer’s cage. Then we began testing the OBS using the same pre-training and training paradigms it was previously allowed to observe. We refer to the training of observers as testing, Fig. 2C right panel.
At the beginning of the testing phase (first 3 testing blocks), OBS displayed a significantly higher discrimination performance than EXP at the beginning of their training phase, Fig. 2E. Surprisingly, OBS’ initial performance was no worse than that of EXP who had reached the learning criterion (average initial dPesc=0.40 in n=9 OBS vs average final dPesc=0.36 in n=10 EXP). OBS reached the performance criterion very rapidly, in only 1.82 ± 1.7*103 trials (mean ± std, n=9 birds), less than a third of the trials required by EXP (single-sided Wilcoxon rank sum test with alternative hypothesis: EXP > OBS, p = 0.0075 (not exact), test statistic = 75; Effect size: Cohen’s d = 1.224, 95% CI not computed because of ties), Fig. 2G. After reaching the criterion, OBS showed a significantly higher discrimination performance than EXP (average dPesc in OBS: 0.47 ± 0.11; average dPesc in EXP: 0.36 ± 0.06 Wilcoxon rank sum test, p = 0.0056, test statistic = 12; Effect size: Cohen’s d = 1.38, 95% CI = [−0.2 −0.031]), Fig. 2F.
Observers generalize poorly compared to experimenters
Experimenters could rapidly generalize their learned knowledge to novel instances of the stimuli (generalization set, spectrograms in Fig. 1B) but observers were not able to do so. To compare generalization in experimenters and observers, first, we allowed generalization observers (GENOBS) to watch generalization experimenters (GENEXP) learn to discriminate the stimuli in the training set, Fig 3A top. After training completion, GENEXP and GENOBS were separated and placed in different experimental chambers, each paired with a naïve observer. GENEXP were then exposed to the generalization set of stimuli, Fig. 3A bottom left. GENOBS were exposed to a pre-training phase with two stimuli from the training set, followed by a testing phase comprising the ten stimuli from the generalization set, Fig. 3A bottom right. Contrary to our findings on the training set, GENOBS initially showed significantly poorer discrimination on the generalization set (average dPesc over the first 3 blocks in GENEXP: 0.42 ± 0.08 and in GENOBS: 0.2 ± 0.18, p = 0.024, test statistic = 66, two tailed Wilcoxon rank sum test, Effect size: Cohen’s d= 1.11, 95% CI = [0.016, 0.36]), Fig 3 B, C. GENOBS also took more time than GENEXP to reach criterion (4.9 ± 4.0*103 trials in n=9 GENOBS versus 1.1 ± 0.5* 103 trials in n=9 GENEXP, p = 0.0064 (not exact), test statistic = 10, two tailed Wilcoxon rank sum test; Effect size: Cohen’s d = 1.3, 95% CI not computed because of ties), Fig 3D.
GENOBS needed significantly more trials to reach the learning criterion than did OBS (p = 0.044, test statistic = 17.5, Wilcoxon rank sum test), demonstrating that observers reacted to small differences between stimuli from the training and generalization sets. However, after reaching the criterion, OBS and GENOBS discriminated the stimuli equally well (similar dPesc at criterion, p = 0.077, test statistic = 20, Wilcoxon rank sum test). Thus, overall, observers seemed to associate the perch-escape behaviors by experimenters much more exclusively with the presented auditory stimuli than did the experimenters themselves, who associated the air puffs more inclusively with the stimuli (to include similar stimuli from the generalization set).
We inspected the escape behaviors of observers and experimenters. We found that after reaching the learning criterion, EXP and OBS displayed similar perch escape strategies. That is, they tended to abruptly increase their perch escape rates just before air-puff onsets, Fig. S3A, B. This similarity of behavior suggests that observers might learn from experimenters’ actions.
Observers do not learn through passive perceptual processes
We set out to characterize the requirements for observation learning. We hypothesized that OBS learned from experimenters' actions in response to the air-puffs. To test whether observers indeed learned from experimenters’ actions, we allowed experimenter and observer pairs to experience several thousand (7.5 ± 3.6*103) stimulus playbacks including the sound of air-puffs, but not the tactile sensation of the puffs. We realized this perceptual paradigm by directing the air outlet away from the experimenters, Fig. 4 A. Consequently, experimenters never experienced the air-puff as a force against their body. We refer to observers in such pairs as perceptual learners (PLs), because they could potentially learn from the pairing of stimuli with air-puff sounds.
Experimenters in this perceptual paradigm never produced dPesc values different from 0 (average dPesc after 5000 training trials in 3 experimenters: [−0.065, −0.002, 0.007], p=0.81, p=0.25, p=0.64, respectively; z-test of individual proportions), hence they did not show the discriminative behavior that we suspected would drive learning in observers. When we tested PLs (n=7 birds) with air-puffs directed at them, they needed significantly more trials to reach criterion than OBS (6.32 ± 6.3*103 trials in PLs versus 1.82 ± 1.7*103 in OBS; single-sided Wilcoxon rank sum test of alternative hypothesis PL > OBS, p = 0.008 (not exact), test statistic = 8.5; Effect size: Cohen’s d = 1.036, 95% CI not computed because of ties), Fig. 4E. PLs were slower than OBS even after removing an outlier bird (trials to criterion = 20300) in the PL group (p = 0.016, Cohen’s d: 1.29). PL performance at criterion was comparable to OBS performance (0.32 ± 0.2 in PL versus 0.47 ± 0.11 in OBS, p = 0.142, test statistic = 46, Wilcoxon rank sum test, Cohen’s d = 0.93, 95% CI = [−0.077 0.338]). The absence of rapid learning in PLs suggests that learning in OBS required an experimenter engaged in the task and responding to air puffs.
Experimenters needed to be expert models to induce learning in observers
We expected observation learning to be most effective when information is provided by an expert. To probe for sensitivity on experimenter performance, we tested a group of Valence Learners (VLs, n=5) that observed naïve experimenters who did not reach the performance criterion within (on average) 6.9 ± 3.0*103 trials. These naïve experimenters were hit by air puffs on average 539 times out of 1000 puffed trials, and escaped in unpuffed trials on average on 400/1000 trials. In addition, to give VLs direct experience of the reinforcer (its valence), 3/5 VLs were initially exposed to air puffs (approximately 500 strong 1-s air puffs, see Methods). When tested, VLs were much slower than OBS to reach the learning criterion (average number of trials to criterion in VL [n = 5]: 8.0 ± 2.5 *103 versus OBS [n = 9]: 1.82 ± 1.7*103, single sided Wilcoxon rank sum test of alternative hypothesis VL > OBS, p = 0.001 (not exact), test statistic = 0, Cohen’s d = 3.11, 95% CI not computed), Fig. 4E. The performance of VLs at criterion was lower than the performance of OBS (average dPesc for VL [n = 5]: 0.25 ± 0.11 versus for OBS [n = 9]: 0.47 ± 0.11, p= 0.007, test statistic = 42, Wilcoxon rank sum test, Cohen’s d = 1.87, 95% CI = [0.034 0.355]). The poor testing results in VLs suggest that OBS did not learn by predicting the reward value and by converting this prediction into an optimal action during testing. Instead, VL behavior suggests that OBS focus on experimenters’ discriminative actions, which must necessarily contain the information required for observation learning.
Overall, PL and VL behavior was closer to EXP than to OBS. That is, upon reaching the learning criterion, there was no statistically significant difference in performance (dPesc) between EXP and PLs (p = 0.41, test statistic = 26, 95% C.I = [−0.2 0.2], Wilcoxon rank sum test; Cohen’s d: 0.21) and a trend of higher performance in VL compared to EXP (p = 0.07, test statistic = 10, 95% C.I = [−0.21 0.05], Wilcoxon rank sum test; Cohen’s d: 1.14). In combination, PLs and VLs emphasize the importance of experimenters’ discriminative actions for observation learning.
Vocal exchanges are not required for observation learning
Given the importance of experimenter actions, we speculated that rapid learning in OBS was based on vocal exchanges between EXP and OBS through calls occurring during experimental trials, Fig 4 C. Indeed, on the last day of the training phase, when EXP had reached the learning criterion, we found a difference in calling behavior between puffed and unpuffed trials. In six OBS (on one day each), we inspected calling rates (defined as the probability of observing at least one call) in the stimulus period (from stimulus onset to stimulus offset) and in the delay period (defined from stimulus offset to air-puff onset), Fig. 4 D. In puffed trials, the calling rate was significantly lower in the delay period than in the stimulus period, whereas no such difference was seen in unpuffed trials (puffed trials: stim call probability 0.46 ± 0.26 vs delay call probability 0.27 ± 0.15, difference −0.19; unpuffed trials: stim call probability 0.39 ± 0.22 vs delay call probability 0.42 ± 0.26, difference 0.03; n=6 birds and 10 stimuli divided into two groups, p = 10−6, two t-sample t-test, t statistic = 6.29, df = 58). Hence, call reduction could signal the imminent arrival of an air puff.
To test whether observers used calls as a learning cue, we housed experimenters and observers (n=5 pairs) in separate soundproof boxes and gave them visual access to each other by virtue of two adjacent windows. Moreover, to trigger social interest, we allowed birds to vocally interact with each other using a custom digital communication system composed of two microphones and loudspeakers and an echo cancellation filter (Supplementary methods). We suppressed vocal exchanges during stimulus presentation by interrupting the communication system from stimulus onset to air-puff offset. We termed the observers in this paradigm no-trial-communication learners (-TCOM). Despite elimination of vocal interactions during the discrimination task, we found that - TCOM acquired stimulus-discriminative information in amounts comparable to OBS (trials to criterion: -TCOM (n = 5): 1.74 ± 1.78*103; OBS (n = 9): 1.82 ± 1.7*103, p = 0.776, test statistic = 25, Wilcoxon rank sum test, 95% CI = [−1200 2300], Cohen’s d: 0.05), Fig. 4G. Hence, it follows that OBS did not require immediate vocal interactions. They could learn from visual displays only or from vocal exchanges following trials.
Observation learning was not a simple form of stimulus enhancement
We did not find evidence that simple stimulus enhancement (Zentall 2006; Byrne 2003) could account for observers’ rapid discrimination learning. Here, stimulus enhancement is defined as learning in an observing animal through the increased interaction with a particular stimulus after a demonstrating animal has directed its attention to the stimulus. Hoppitt and Laland (Hoppitt and Laland 2013) provide a necessary condition for stimulus enhancement: observers must exhibit higher response rates to enhanced stimuli, which implies that enhancement could reveal itself as an increase in stimulus-contingent escape behavior even when there is no reinforcement.
To probe for stimulus enhancement, we quantified escape behavior during the first 100 pretesting trials during which two auditory stimuli were paired with air-puffs that were too weak to displace a bird from the perch. During these trials, OBS exhibited a dPesc of 0.09 ± 0.3 that was not significantly different from zero (p = 0.44, test statistic = 24, 95% CI = [−0.17 0.44]; Wilcoxon signed rank test), Fig. S4B. Hence, OBS were not initially drawn to either puffed or unpuffed stimuli. Rather, they expressed discriminative behavior only during later trials of the pre-testing phase, after we increased the strength of air-puffs. Thus, it seems that in addition to the stimuli and the actions of experimenters, observers needed also the aversive experience of air puffs to express their learned knowledge.
Logistic regression with L1 norm regularization differentiates observers and experimenters
We sought to identify possible explanations for the difference in generalization abilities between observers and experimenters. In the following, we draw upon insights from machine learning and statistical learning theory to suggest a simple mechanism that can account for the learning characteristics in experimenters and observers. In machine learning, the typical goal is to maximize generalization performance but not learning speed, which is why many state-of-art methods tend to learn slowly (LeCun et al. 2015). Typically, undesired overfitting arises when few training examples are classified using too many parameters. Alternatively, performance can be poor on both training and testing data when many examples are classified using too few parameters. These two performance shortcomings imply that there is a tradeoff between learning and generalization known as the ‘Bias-Variance dilemma’ (Geman et al. 1992). This tradeoff has led researchers to develop regularization methods such as lasso and ridge regression (Tibshirani 1996), maximal margin (Cortes and Vapnik 1995), and dropout (Srivastava et al. 2014). Essentially, regularization methods improve generalization performance by dynamically regulating the use of parameters and of training data (Srivastava et al. 2014; Tibshirani 1996).
In the context of our findings, these theoretical insights from statistical learning theory suggest that experience is associated with regularization whereas observation is not. We tested the hypothesis that regularization could set the divide between experimenter and observer performances, by training a simple artificial neuron with a logistic activation function to discriminate between the two stimulus sets, Fig. 5A. The neuron received input from a group of at least 22 input neurons tuned to diverse sound features such as amplitude, pitch, duration, and Wiener Entropy, collectively defining the feature set used in Sound Analysis Pro (SAP), a popular birdsong analysis software (Tchernichovski et al. 2000). To model observers, we trained the neuron to fire during puffed stimuli and to remain silent during unpuffed stimuli. We used a gradient descent learning rule that maximizes the likelihood of correct discrimination (Methods). We found that the discriminative performance of the ‘observer’ neuron increased rapidly to the theoretical limit on the training set, but when we interrupted the training at any time and evaluated the neuron’s performance on the testing set, we found poor generalization, Fig. 5B. The reason for poor generalization was that the neuron based its classification on exceedingly many sound features that by chance were slightly informative about the reinforcing air-puff, Fig 5D.
We then modeled experimenters by endowing the learning rule with L1 regularization. L1 regularization implements a conjunctive minimization of summed absolute synaptic weights (Tibshirani 1996). To explain why experimenters would have a non-zero positive value of the regularization penalty, we constructed a simple learning rule that dynamically regulates the regularization parameter λ in proportional to reward prediction error (Methods), which is known to be signaled by a class of dopaminergic neurons in the vertebrate brain (Schultz et al.1997; Hollerman et al. 1998; Gadagkar et al. 2016). In this way, regularization increases when the bird makes mistakes, whereas if the bird reaches a high rate of success, the reward prediction error reaches zero in expectation, which settles the value of λ. The observers’ brain would not modulate λ because observers do not directly experience rewards and punishments during the experimenter training phase.
We found that interrupting the training process of the regularized neuron at any time resulted in roughly equal performances on both training and testing stimulus sets, Fig. 5C, similar to experimenters’ behavior. However, the excellent generalization performance came at a cost: Because we implemented the L1 regularization as a small reduction of synaptic weights (Ng 2004), the synaptic weights of the experimenter neuron and with it the performance on the training set grew only slowly. The main effect of regularization was to concentrate the final synaptic weights on the duration feature, corresponding with our design of stimulus class, Fig. 5D.
To achieve robust generalization, the value of λ had to grow at a rate slower than the synaptic learning rate of the logistic neuron (α ≪ η). In numerical simulations, we found that the value of λ converged to a positive value, Fig 5 E. Training and generalization performance for the experimenter and observer were similar when λ was pre-fixed (to 0.013) or when it was dynamically altered.
We were unable to assert from our simulations whether the apparent learning cue in observers was the air puff’s auditory cue (e.g., the experimenter’s actions drew attention to the puff first, followed by a complex form of stimulus enhancement) or the experimenters’ escape behaviors (as in action imitation). That is, the simulation results matched well with experimental data both when the learning cues for the observer neuron were the air puff sounds (Fig 5F, solid lines) and when the cues were the modelled escape events (we modeled the escapes as binary random variables with a 30% chance of not representing the true class label, corresponding with the average false positive and false negative rates of EXP of about 30%, Fig. 5F dashed lines).
Most importantly, regularization was necessary to achieve a good match with experimental findings. Namely, when we endowed the observer neuron with the same regularization constant λ as the experimenter neuron, but let the neuron learn not from presence/absence of air puff sounds but from the experimenter’s noisy actions (see Methods), then training and generalization curves in the observer neuron were very similar and thus not representative of the data, (Fig. 5F, dot-dashed lines).
Discussion
We found that zebra finches can learn to discriminate auditory stimuli by observing expert discriminators. Experimenter and observers’ learning performance was subject to a tradeoff that depended on whether the learning cue was experienced or observed. We inferred this cue dependence thanks to our experiment design in which the stream of auditory stimuli was identical for experimenters and observers. Therefore, any differences in their abilities to learn and to generalize must have been entirely due to the learning cue, which was an aversive air-puff for experimenters and an observable action for observers. Our findings suggest that an experienced cue favors robust generalization, whereas an observed cue favors rapid learning.
Part of our findings are in line with social learning theories which suggest that to learn from others is a successful strategy with high payoff under a wide range of conditions (Axelrod and Hamilton 1981; Rendell et al. 2010). However, our findings also suggest a limitation to the ubiquitous success of social learning strategies. Namely, we find that social learning can lack robustness when environmental conditions even slightly change. In a sense, our findings evidence a sensory analogue to the common view that the best means to learn a (motor) skill is rigorous practice. As in the case of children who perform poorly in exams after neglecting their homework, insights gained through observation seem not to transfer well to new task instances. Currently, there is no reason to think that all forms of observation learning will be subject to lack of robustness. But our work raises the question as to whether there exist some forms of observation learning that promote robust transfer to new task instances.
Our work raises many interesting questions on the behavioral and neurobiological mechanisms used by observers to acquire stimulus-discriminative information. Behaviorally, observers could learn through social mechanisms of action imitation, of observational conditioning, and of stimulus enhancement, or a combination of these. Note that the definitions of these mechanisms are not strict enough to allow a discrete categorization of social learning in any one study (Hoppitt and Laland 2013). Our findings de-emphasize some known social learning mechanisms such as perceptual learning (evidenced by PL learners) and simple stimulus enhancement (evidenced by lack of discriminative behavior during pre-testing). Our experiments also de-emphasize vocal communication as a mechanism but reveal the importance of vision (-TCOM learners). Overall, the importance of a demonstrating expert suggests that experimenters signal statistical differences between puffed and unpuffed stimuli via their perching behavior such as their rates of leaving the perch. Possibly, observers focused their attention more on the actions of experimenters rather than the stimuli that elicited the actions, which is why they apparently failed to identify the simplest environmental signal that can explain experimenters’ behavior, which in our case was syllable duration.
Similar speed-robustness learning tradeoffs as we find exist in rapidly evolving artificial systems, in which high discrimination performance tends to be associated with slow learning as an unwanted side effect (Dauphin et al. 2014). The tradeoff we find between robustness in one learning paradigm and speed in another is most closely paralleled by regularization methods that control inference in artificial classifiers. Excellent generalization of experimenters agrees with strongly regularized classifiers whereas fast learning in observers agrees with weakly regularized classifiers. Our work suggests that the benefits of regularization may be inherent to experimenting but not to observing. Furthermore, subtractive weight depression through heterosynaptic competition has been observed in the amygdala (Royer and Paré 2003), which provides biological plausibility to our notion of L1 regularization. We hypothesize that such a form of weight subtraction is also seen in zebra finches when they are experimenting, but not when they are observing.
A common problem in machine learning is to set the degree of L1 penalty defined by the regularization parameter λ. This parameter is most often selected using grid search or random search methods, to localize the value that minimizes a cross-validation or held-out validation set error (Bergstra and Yoshua Bengio 2012). More sophisticated techniques, such as estimating a Gaussian process regression model between the hyperparameter (such as λ) and the validation error have recently been developed (Snoek et al. 2012). However, all these techniques require an evaluation of the validation error for optimization, for which there is currently no support in animals and their brains. Therefore, it is far from clear how a brain could implement dynamic regularization. Our speculative proposal is that the balance between learning and regularizing is controlled by a neuromodulatory signal. Such signals are ubiquitous in the animal kingdom and are well suited to convey the amount of regularization, given that they respond sensitively to external reinforcements and their prediction errors (Pawlak and Kerr 2008; Hangya et al. 2015; Yu and Dayan 2005; Iglesias et al. 2013; Wolfram Schultz 1998). One possibility is that air-puff reinforcers drive changes in regularization via experimenters’ escape actions, which is supported by the representation of action-specific reward values in brain areas innervated by neuromdulatory neurons (Samejima et al. 2005). This proposal delineates a possible neural system for comparative studies of learning from experience and from observation. It has been shown that reward prediction error and reinforcement learning algorithms in general, may be utilized by humans in order to understand the social value of others’ behavior (Behrens et al. 2008; Joiner et al. 2017), to feel vicarious rewards from their success or failure (Mobbs et al. 2009) or from their approval (Izuma et al. 2008). We believe that the computational role of reward prediction error can be extended to that of regularization of learning, mediated by neuromodulator systems such as acetylcholine.
The speculative implications of our simulations are that a prerequisite for the evolution of observation learning was a sufficiently large brain capacity that provided rich sensory representations and put few constraints on usable neural resources for sensory processing. Evolution might have chosen traits in observers that are complementary to those associated with experimenting, explaining the apparent differences in what these learning strategies extract from the sensory environment.
Materials and Methods
Experimental animals
We used adult (older than 90 days post hatch, dph) female zebra finches (Taeniopygia guttata, N = 46 females) raised in our colony. Reasons for choosing females are outlined in Supplemental Information. All experiments were licensed by the Veterinary Office of the Kanton of Zurich.
Experimental setup
We adapted an operant conditioning paradigm using social reinforcement (Tokarev and Tchernichovski 2014; Canopoli, Herbst, and Hahnloser 2014). An experimenter and observer pair were placed adjacent to each other in separate cages. The birds could interact from a restricted window in one corner of the cage, forcing them to sit on their respective perches, Fig 1A. The experimenter perch had a sensor to detect presence or absence and trigger stimulus playback through a speaker. We used strong air-puffs directed toward the experimenter as an aversive reinforcement agent. The puffs motivated the experimenter to escape from its perch during ‘puffed’ class trials. Details of housing, perch and nutritional requirements are detailed in the Supplemental Information.
Stimuli
We created a set of 10 stimuli from the songs of an adult male zebra finch (o7r14) from our colony, Fig 1C. We computed syllable durations via thresholding of sound amplitude traces. Each stimulus Si in this set (i=1,2, …,10) was made of a string of six syllable renditions, wherein each rendition was longer than the six renditions in stimulus Si-1, Fig. 1E. Based on the ten stimuli we defined two stimulus classes: the class ‘short’ was formed by stimuli S1 to S5, and the class ‘long’ was formed by stimuli S6 to S10. We use the terms ‘puffed’ and ‘unpuffed’ as class labels, irrespective of whether short or long stimuli were reinforced. We refer to the stimulus set {S1,…, S10} as the training set. To create a generalization set we formed another set of 10 stimuli {S’1…,S’10} from renditions of the same syllable recorded on the very next day, Fig. 1E.
Bird groups and experiment hypothesis
We used six different groups of experimenters and observers, as follows:
Experimenters (EXP, n=10 birds): These birds were trained to escape from the perch prior to arrival of air-puffs. The birds first underwent a pre-training phase in which they were accustomed to the setup, followed by a training phase (see Procedure in the Supplementary Information). Three out of nine experimenters were also tested on a generalization set of stimuli (Generalization phase) once the training phase was completed, Fig. 2A. Each phase ended when the bird’s performance reached a set criterion (see Performance measures and Statistical Criterion).
Observers (OBS, n = 9 birds): Observers were subjected to three phases: an observation phase in which they observed the entire pre-training and training phases of an experimenter, a pre-testing phase (identical to the experimenter’s pre-training phase), and a testing phase (identical to the experimenter’s training phase), Fig. 2C.
Generalizing experimenters (GENEXP, n = 9 birds): these birds were tested on the generalization set of stimuli after they had finished the pre-training and training phases on the training set, Fig. 3A. Note: During the training phase, the experimental group GENEXP is a biological replicate of the EXP group. As expected, there was no difference between EXP and GENEXP in learning time or discrimination accuracy on the training set (Trials to criterion: EXP = 4.8 ± 2.9*103, GENEXP = 6.5 ± 4.7*103, p = 0.45, test statistic = 90.5, Wilcoxon rank sum test; dPesc at criterion: EXP = 0.36 ± 0.06, GENEXP = 0.37 ± 0.09, p = 0.37, test statistic = 88.5, Wilcoxon rank sum test).
Generalizing observers (GENOBS, n = 9 birds): These birds underwent the same observation phase and pre-testing phase as OBS. Thereafter, during the testing phase, GENOBS were tested on the full generalization set, Fig. 3B.
Perceptual learners (PL, n = 7 birds): First, PL could watch an experimenter trigger several thousand stimuli and air-puffs. However, in their case the air-puffs were directed away from the experimenter (oriented downwards outside the cage, Fig 4 A) so PL never experienced or saw the effect of an air-puff against a bird prior to entering the pre-testing phase. Following this, PL then underwent the same pre-test and test phases as OBS.
Valence learners (VL, n = 5 birds): To test for sensitivity on demonstrator performance, we allowed n=5 VL birds to observe naive experimenters prior to their pre-testing phase. In addition, 3 of those VL were given several hundred stimulus-puff pairings (same protocol as training phase in EXP) prior to observing a naive experimenter. After completion of the pretesting phase, VL were subjected to the testing phase, Fig. 4B.
No Trial Communication learners (-TCOM, n = 5 birds): To test against learning in observers from vocal cues (or the lack thereof, Fig 4 B, C) we separated five observers from their experimenters (pre-training and training phases) into an adjacent, acoustically isolated box. These observers (-TCOM) could view the EXP bird through a window and communicate vocally through a custom (software controlled) communication channel (see Supplementary Methods) except during trial periods defined from stimulus onset to air-puff offset. During trials periods, the observers could only hear the stimuli and the sounds of air-puffs, but no sounds triggered by the experimenter. After completion of the EXP training phase, -TCOM observers were subjected to the pre-testing and testing phases as for OBS.
Performance measures and Statistical Criterion
For each bird, we partitioned the trials into non-overlapping bins of 100 trials. In each bin we computed the True Positive Rate (PT) as the probability of escaping on puffed trials and the False Positive Rate (PF) as the probability of escape during unpuffed trials. Our single measure of performance in each bin is the difference in escape probabilities dPesc = PT − PF. Within a bin, to decide whether a bird escaped significantly more on puffed trials than on unpuffed trials, we performed a z-test of independent proportions of the following null hypothesis H0 and alternative hypothesis Ha:
For the z-test of independent proportions, we computed in each bin the z-test statistics as follows (applying Yates’ continuity correction): where n is the number of puffed trials in that bin. The p-value Pr [z > zstat] was computed with the normcdf function in MATLAB (Mathworks Inc); a bin was “statistically significant” if the p-value in that bin was smaller than 0.01 (two-sided test).
Criterion and Trials to Criterion
We used two statistical measures to analyze performance: the first measure was used during the experiment to switch the experimental phase of the birds, the second was used to estimate, on a much finer scale, when a bird achieved high and stable discrimination accuracy.
Our criterion for determining when the pre-training/training phases of an EXP bird ends (pretesting/testing for OBS) was the following: we performed the z-test based significance test (with significance at p < 0.01) on dPesc over an entire day. If daily dPesc was significantly greater than zero for two consecutive days, we switched the phase. This “coarse” criterion allowed us to check performance in a logistically tractable manner and provided high power to the test because the sample size was large (average daily number of trials for n=10 EXP: 722.14 ± 317.5).
For analyzing the data post-hoc, we used the criterion mentioned in the article: z-tests in 8 (x100 trial) bin windows and checking for 7/8 bins significant. This latter criterion was used because we wanted to analyze the data on a finer temporal scale and still make sure that the performance was stable. Accordingly, we computed the fraction of 100 trial bins with significant dPesc in a sliding window of 8 bins, Fig. S1D. When this fraction crossed 90% (=0.875), we took the last bin in the window as the bin at which the performance criterion was reached (“criterion bin”). ‘Trials to criterion’ is then simply the number of all trials performed by the bird up to and including the criterion bin. Our conclusions of fast learning and poor generalization in observers were robust to changes in the definition of the learning criterion: For example, results were unchanged when we changed the criterion from 7/8 significant bins to 3/4 bins, or when we computed the criterion in 200-trial bins instead of 100-trial bins (supplementary methods, robustness of statistics).
Group level statistical Tests
No explicit power analysis was used for this study. Our main experimental groups (EXP vs OBS in the article) have a sample size of n = 9 (one extra EXP bird was included because its observer partner was not tested). We believed this to be an appropriate sample size considering the statistical test we planned on using (non-parametric Wilcoxon test, which is conservative but has higher power for low sample sizes than a parametric test) and the time it took for preliminary experiments to finish.
To compare birds between two groups, we used Wilcoxon rank sum tests (wilcox.test() in R), either one tailed (when there was a concrete alternative hypothesis, e.g trials to criterion in OBS vs EXP, alternative: EXP > OBS) or two tailed (when there was no concrete alternative hypothesis, e.g. trials to criterion in GENOBS vs GENEXP). We first checked (for all bird groups) whether the trials to criterion were significantly non-Gaussian using the Shapiro Wilk test of normality (shapiro.test in R). Because only the EXP and VL trials to criterion were sufficiently Gaussian, we chose to perform non-parametric Wilcoxon tests instead of t-tests. All group level statistical tests and effect size calculations were performed using the R package (R Studio, https://www.R-project.org/).
Logistic regression with L1 regularization
We modeled experimenter and observer behaviors using logistic regression, which is a simple machine learning classifier that learns linear decision boundaries. In this model, the bird’s behavior (leave or stay on the perch) is computed from the input to the logistic neuron, formed by 21 syllable features provided by Sound Analysis Pro (SAP) (Tchernichovski et al. 2000), a popular software tool for characterizing birdsong and its development (SAP features include mean Wiener Entropy, mean pitch goodness, mean frequency modulation, pitch variance, etc., where mean and variance are computed across syllable duration). Syllable duration formed the 21st feature. Feature 22 was formed by a vector of 1’s, endowing the logistic neuron with a bias term. Features 23 and beyond were formed by frozen noise that was randomly drawn from a Gaussian distribution and held fixed for a given syllable. In combination, the total dimensionality of sound-feature vectors x was n=22+nr, where nr is the dimensionality of frozen noise. The auditory input zi to the logistic neuron associated with syllable rendition i presented during trial t was the z-transformed feature vector , where mt = (1 − ɛ)mt-1+ɛ 〈xi〉i = 1…6 is the running mean feature vector and vt → (1 - ɛ)vt-1+ɛ 〈(xi − mt)2〉i= 1…6 is the running variance vector. Both the running vectors were updated after each trial (here ɛ is a small integration rate constant and 〈.〉 denotes averaging over the six syllables in a given trial).
The partial output of the logistic neuron in response to syllable i signals the probability f(zi) of an imminent air-puff, given by , where w is the synaptic weight vector that forms a scalar product with the auditory input. The bird decides to leave the perch (or to not return to it) if (majority vote).
The probability that all six syllables correctly (and independently) predict arrival (u = 1) or absence (u = 0) of an air puff (under a Binomial model) is given by . We train the synaptic weights by maximizing log PCorrect using gradient ascent (maximum likelihood), , where η is a small learning rate. Replacing the definition of f(zi) into this expression, we find for observers the simple perceptron-like learning rule that enforces after each trial the weight update . In simulations, we randomly picked a stimulus at each trial followed by the synaptic weight change. The only two parameters in this model are the integration rate ɛ and the learning rate η.
To model experimenters, we applied an additional constant weight subtraction ΔwEXP = ΔwOBS − λ on successful trials (leave if puffed and stay if non-puffed), provided the individual synaptic weight was of sufficient magnitude, |w| > λ (to prevent small synaptic weights from changing sign).
We dynamically regulated λ in the following manner: where the reward signal rt was given by and where is a running average estimate of past rewards obtained by the bird. Decisions are correct when birds leave the perch on puffed trials and stay on the perch on unpuffed trials. We set the learning rate α = 0.00025, γ = 0.99 and the initial value λt=0 = 0.0005.
For the experimenter, the learning cue u is the occurrence (u = 1) or absence of an air-puff (u = 0). For the observer, we hypothesized that u could be either a) the air-puff cue (u = 1 during air-puffs and u=0 otherwise) to which the observer’s attention is drawn through the experimenter’s behavior (Fig 5 B and Fig 5F solid lines), or b) the action of the experimenter (u = 1 during escapes and u = 0 otherwise). In this latter scenario, the observer is provided a noisy supervisory signal due to false positive and false negative decisions of the experimenter (average EXP false positive rate ~ 30%, average false negative rate ~ 35%, n= 10 EXP birds). To test hypothesis b, we simulated an observer neuron that on randomly chosen 30% of learning (Training set) trials was driven by erroneous learning cues (i.e. escapes on unpuffed trials and no-escapes on puffed trials), with and without regularization (Fig 5F dashed and dot-dashed curves, respectively).
Author contributions
G.N. designed the experiment, carried out the experiment, analyzed the data and wrote the manuscript. J.H contributed to experiment by developing software. R.H designed the experiment and wrote the manuscript.
Competing interests
The authors declare no financial or non-financial competing interests.
Acknowledgements
We thank Rodney Douglas, Fatih Yanik, Andreas Nieder, and Klaas-Enno Stephan for help with the manuscript. We also acknowledge help with experiments from Heiko Hörster and Aleksander Jovalekic. This work was supported by the Swiss National Science Foundation (grant 31003A_127024) and the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013 / ERC Grant AdG 268911) to R.H.R