Fly brain inspires computing algorithm
Flies use an algorithmic neuronal strategy to sense and categorize odors. Dasguptaet al.applied insights from the fly system to come up with a solution to a computer science problem. On the basis of the algorithm that flies use to tag an odor and categorize similar ones, the authors generated a new solution to the nearestneighbor search problem that underlies tasks such as searching for similar images on the web.
Science, this issue p.793
Abstract
Similarity search — for example, identifying similar images in a database or similar documents on the web — is a fundamental computing problem faced by largescale information retrieval systems . We discovered that the fruit fly olfactory circuit solves this problem with a variant of a computer science algorithm (called localitysensitive hashing). The fly circuit assigns similar neural activity patterns to similar odors, so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly algorithm, however, uses three computational strategies that depart from traditional approaches. These strategies can be translated to improve the performance of computational similarity searches. This perspective helps illuminate the logic supporting an important sensory function and provides a conceptually new algorithm for solving a fundamental computational problem.
An essential task of many neural circuits is to generate neural activity patterns in response to input stimuli, so that different inputs can be specifically identified. We studied the circuit used to process odors in the fruit fly olfactory system and uncovered computational strategies for solving a fundamental machine learning problem: approximate similarity (or nearestneighbors) search.
The fly olfactory circuit generates a “tag” for each odor, which is a set of neurons that fire when that odor is presented (1). This tag is critical for learning behavioral responses to different odors () 2). For example, if a reward (eg, sugar water) or a punishment (eg, electric shock) is associated with an odor, that odor becomes attractive (a fly will approach the odor) or repulsive (a fly will avoid the odor), respectively. The tags assigned to odors are sparse — only a small fraction of the neurons that receive odor information respond to each odor (3–5) – and nonoverlapping : Tags for two randomly selected odors share few, if any, active neurons, so that different odors can be easily distinguished (1).
The tag for an odor is computed by a threestep procedure (Fig. 1A). The first step involves feedforward connections from odorant receptor neurons (ORNs) in the fly’s nose to projection neurons (PNs) in structures called glomeruli. There are 50 ORN types, each with a different sensitivity and selectivity for different odors. Thus, each input odor has a location in a 50 – dimensional space determined by the 50 ORN firing rates. For each odor, the distribution of ORN firing rates across the 50 ORN types is exponential, with a mean that depends on the concentration of the odor () ************** (6),7). For the PNs, this concentration dependence is removed () ************ (7),8); that is, the distribution of firing rates across the 50 PN types is exponential, with close to the same mean for all odors and all odor concentrations1). Thus, the first step in the circuit essentially “centers the mean” —a standard preprocessing step in many computational pipelines — using a technique called divisive normalization (8). This step is important so that the fly does not mix up odor intensity with odor type.
( (A) ) Schematic of the fly olfactory circuit. In step 1, 50 ORNs in the fly’s nose send axons to 50 PNs in the glomeruli; as a result of this projection, each odor is represented by an exponential distribution of firing rates, with the same mean for all odors and all odor concentrations. In step 2, the PNs expand the dimensionality, projecting to 2000 KCs connected by a sparse, binary random projection matrix. In step 3, the KCs receive feedback inhibition from the anterior paired lateral (APL) neuron, which leaves only the top 5% of KCs to remain firing spikes for the odor. This 5% corresponds to the tag (hash) for the odor. (B) Illustrative odor responses. Similar pairs of odors (e.g., methanol and ethanol) are assigned more similar tags than are dissimilar odors. Darker shading denotes higher activity. (C) Differences between conventional LSH and the fly algorithm. In the example, the computational complexity for LSH and the fly are the same. The input dimensionalityd=5. LSH computesm=3 random projections, each of which requires 10 operations (five multiplications plus five additions). The fly computesm=15 random projections, each of which requires two addition operations. Thus, both require 30 total operations.x, input feature vector;r, Gaussian random variable;w, bin width constant for discretization.
The second step, where the main algorithmic insight begins, involves a 40 – fold expansion in the number of neurons: Fifty PNs project to 2000 Kenyon cells (KCs), connected by a sparse, binary random connection matrix (9). Each KC receives and sums the firing rates from about six randomly selected PNs (9). The third step involves a winnertakeall (WTA) circuit in which strong inhibitory feedback comes from a single inhibitory neuron, called APL (anterior paired lateral neuron). As a result, all but the highestfiring 5% of KCs are silenced (1,3,(4) ). The firing rates of these remaining 5% correspond to the tag assigned to the input odor.
From a computer science perspective, we view the fly’s circuit as a hash function, whose input is an odor and whose output is a tag (called a hash) for that odor. Although tags should discriminate odors, it is also to the fly’s advantage to associate very similar odors with similar tags ( (Fig. 1B) ), so that conditioned responses learned for one odor can be applied when a very similar odor, or a noisy version of the learned odor, is experienced. This led us to conjecture that the fly’s circuit produces tags that are localitysensitive; that is, the more similar a pair of odors (as defined by the 50 ORN firing rates for that odor), the more similar their assigned tags. Localitysensitive hash [LSH (10,11)] functions serve as the foundation for solving numerous similarity search problems in computer science. We translated insights from the fly’s circuit to develop a class of LSH algorithms for efficiently finding approximate nearest neighbors of highdimensional points.
Imagine that you are provided an image of an elephant and seek to find the 100 images — out of the billions of images on the web — that look most similar to your elephant image. This is called the nearestneighbors search problem, which is of fundamental importance in information retrieval, data compression, and machine learning (10). Each image is typically represented as ad– dimensional vector of feature values. (Each odor that a fly processes is a 50 – dimensional feature vector of firing rates.) A distance metric is used to compute the similarity between two images (feature vectors), and the goal is to efficiently find the nearest neighbors of any query image. If the web contained only a few images, then brute force linear search could easily be used to find the exact nearest neighbors. If the web contained many images, but each image was represented by a lowdimensional vector (eg, 10 or 20 features), then spacepartitioning methods (12) would similarly suffice. However, for large databases with highdimensional data, neither approach scales (11).
In many applications, returning an approximate set of nearest neighbors that are “ close enough ”to the query is adequate, so long as they can be found quickly. This has motivated an approach for finding approximate nearest neighbors by LSH (10). For the fly, as noted, the localitysensitive property states that two odors that generate similar ORN responses will be represented by two tags that are themselves similar (Fig. 1B). Likewise, for image search, the tag of an elephant image will be more similar to the tag of another elephant image than to the tag of a skyscraper image.
Unlike a traditional (nonLSH) hash function, where the input points are scattered randomly and uniformly over the range, a LSH function provides a distancepreserving embedding of points fromd– dimensional space intom– dimensional space (the latter corresponds to the tag). Thus, points that are closer to one another in input space have a higher probability of being assigned the same or a similar tag than points that are far apart. [A formal definition is given in (13).]
To design a LSH function, one common trick is to compute random projections of the input data (10,11) – that is, to multiply the input feature vector by a random matrix. The JohnsonLindenstrauss lemma14,15) and its many variants (16–18) provide strong theoretical bounds on how well locality is preserved when embedding data from (d) *************** (into (m) dimensions by using various types of random projections.
The fly also assigns tags to odors through random projections (step 2 inFig. 1A; 50 PNs→2000 KCs), which provides a key clue to the function of this part of the circuit. There are, however, three differences between the fly algorithm and conventional LSH algorithms. First, the fly uses sparse, binary random projections, whereas LSH functions typically use dense, Gaussian random projections that require many more mathematical operations to compute. Second, the fly expands the dimensionality of the input after projection (d) (m), whereas LSH reduces the dimensionality (d»m). Third, the fly sparsifies the higherdimensionality representation by a WTA mechanism, whereas LSH preserves a dense representation.
In the supplementary materials (13), we show analytically that sparse, binary random projections of the type in the fly olfactory circuit generate tags that preserve the neighborhood structure of input points. This proves that the fly’s circuit represents a previously unknown LSH family.
We then empirically evaluated the fly algorithm versus traditional LSH (10,11) on the basis of how precisely each algorithm could identify nearest neighbors of a given query point. To perform a fair comparison, we fixed the computational complexity of both algorithms to be the same (Fig. 1C). That is, the two approaches were fixed to use the same number of mathematical operations to generate a hash of lengthk(ie, a vector withknonzero values) for each input (13
We compared the two algorithms for finding nearest neighbors in three benchmark data sets: SIFT ( (d)=128), GLOVE (d=300), and MNIST (d=784) (13). SIFT and MNIST both contain vector representations of images used for image similarity search, whereas GLOVE contains vector representations of words used for semantic similarity search. We used a subset of each data set with 10, 000 inputs each, in which each input was represented as a feature vector ind– dimensional space. To test performance, we selected 1000 random query inputs from the 10, 000 and compared true versus predicted nearest neighbors. That is, for each query, we found the top 2% (200) of its true nearest neighbors in input space, determined on the basis of Euclidean distance between feature vectors. We then found the top 2% of predicted nearest neighbors inm– dimensional hash space , determined on the basis of the Euclidean distance between tags (hashes). We varied the length of the hash (k) and computed the overlap between the ranked lists of true and predicted nearest neighbors by using the mean average precision (19). We averaged the mean average precision over 50 trials, in which, for each trial, the random projection matrices and the queries changed. We isolated each of the three differences between the fly algorithm and LSH to test their individual effect on nearestneighbors retrieval performance.
Replacing the dense Gaussian random projection of LSH with a sparse binary random projection did not hurt how precisely nearest neighbors could be identified (Fig. 2A). These results support our theoretical calculations that the fly’s random projection is localitysensitive. Moreover, the sparse, binary random projection achieved a computational savings of a factor of 20 relative to the dense, Gaussian random projection (fig. S1) (13).
When expanding the dimensionality, sparsifying the tag using WTA resulted in better performance than using random tag selection ( (Fig. 2B) )). WTA selected the topkfiring KCs as the tag, unlike random tag selection, which selectedkrandom KCs. For both, we used 20krandom projections for the fly to equate the number of mathematical operations used by the fly and LSH (13)). For example, for the SIFT data set with hash lengthk=4, random selection yielded a 17 .7% mean average precision, versus roughly double that (32 .4%) using WTA. Thus, selecting the top firing neurons best preserves relative distances between inputs; the increased dimensionality also makes it easier to segregate dissimilar inputs. For random tag selection, we selectedkrandom (but fixed for all inputs) KCs for the tag; hence, its performance is effectively the same as doingkrandom projections, as in LSH .
With further expansion of the dimensionality (from 20kto 10DKCs, closer to the actual fly s circuitry), we obtained further gains relative to LSH in identifying nearest neighbors across all data sets and hash lengths (Fig. 3). The gains were highest for very short hash lengths, where there was an almost threefold improvement in mean average precision (eg, for MNIST withk=4, 16 .0% for LSH, versus 44 8 % for the fly algorithm).
We also found similar gains in performance when testing the fly algorithm in higher dimensions and for binary LSH (20) (figs. S2 to S3). Thus, the fly algorithm is scalable and may be useful across other LSH families.
Our work identified a synergy between strategies for similarity matching in the brain (21) and hashing algorithms for nearestneighbors search in largescale information retrieval systems. It may also have applications in duplicate detection, clustering, and energyefficient deep learning (22). There are numerous extensions to LSH (23), including the use of multiple hash tables (11) to boost precision (we used one for both algorithms), the use of multiprobe (24) so that similar tags can be grouped together (which may be easier to implement for the fly algorithm because tags are sparse), various quantization tricks for discretizing hashes (25), and learning [called datadependent hashing (13)]. There are also methods to speed up the random projection multiplication, both for LSH schemes by fast JohnsonLindenstrauss transforms (26,27) and for the fly by fast sparse matrix multiplication. Our goal was to fairly compare two conceptually different approaches for the nearestneighbors search problem; in practical applications, all of these extensions will need to be ported to the fly algorithm.
Some of the fly algorithm’s strategies have been used before. For example, MinHash (28) and winnertakeall hash (29) both use WTA like components, though neither propose expanding the dimensionality; similarly, random projections are used in many LSH families, but none, to our knowledge, use sparse, binary projections. The fly olfactory circuit appears to have evolved to use a distinctive combination of these computational ingredients. The three hallmarks of the fly’s circuit motif may also appear in other brain regions and species ( (Table 1) ). Thus, localitysensitive hashing may be a general principle of computation used in the brain (30).
(Table 1)The generality of localitysensitive hashing in the brain .
Shown are the steps used in the fly olfactory circuit and their potential analogs in vertebrate brain regions.
References and Notes
 ↵
 ↵
 ↵
 ***
 ↵
 ↵
 ↵
 ***
 ↵
 ↵
 ↵
A. Gionis, P. Indyk, R. Motwani, inVLDB ’99, Proceedings of the 25 th International Conference on Very Large Data Bases, MP Atkinsonet al., Eds. Morgan Kaufman, 1999), pp. 158 – 529.
 ↵
H. Samet,Foundations of Multidimensional and Metric Data Structures(Morgan Kaufmann Series in Computer Graphics and Geometric Modeling, Morgan Kaufmann, 2005).
 ↵Materials and methods are available as supplementa ry materials.
 ↵
W. Johnson, J. Lindenstrauss, inConference on Modern Analysis and Probability, R. Beals , A. Beck, A. Bellow, A. Hajian, Eds., Vol. 26 ofContemporary Mathematics(American Mathematical Society, 1984), pp. 189 – 206.
 (↵)
 ↵
 ↵
Y. Lin, R. Jin, D. Cai, S. Yan, X. Li, in2013 IEEE Conference on Computer Vision and Pattern Recognition(IEEE Computer Society, 2013), pp. 446 – 451.
 ↵
M. S. Charikar, inProceedings of the ThirtyFourth Annual ACM Symposium on Theory of Computing,STOC ’02[Association for Computing Machinery (ACM), 2002], pp. 380 – 388.
 (↵)
C. Pehlevan, DB Chklovskii, inNIPS ’15, Proceedings of the 28 th International Conference on Neural Information Processing Systems(MIT Press, 2015), pp. 2269 – 2277.
 ↵
R. Spring, A. Shrivastava, Scalable and sustainable deep learning via randomized hashing.arXiv: 1602. 08194[stat.ML] ( (February) ).
 ↵
 ↵
Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, inVLDB ’07, Proceedings of the 33 rd International Conference on Very Large Data Bases(ACM, 2007), pp. 950 – 961.
 ↵
P. Li, M. Mitzenmacher, A. Shrivastava, inProceedings of the 31 st International Conference on Machine Learning(Proceedings of Machine Learning Research, 2014), pp. 676 – 684.
 ↵
A. Dasgupta, R. Kumar, T. Sarlos, inKDD ’11, The 17 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(ACM, 2011), pp. 1073 – 1081.
 ↵
A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, L. Schmidt, inNIPS ’15, Proceedings of the 28 th International Conference on Neural Information Processing SystemsMIT Press, 2015), pp. 1225 – 1233.
 ↵
A. Broder, inProceedings of the Compression and Complexity of Sequences 1997(IEEE Computer Society, 1997), p. 21.
 ↵
J. Yagnik, D. Strelow, D. A. Ross, R.s. Lin, in(International Conference on Computer Vision) IEEE Computer Society, 2011), pp. 2431 – 2438.
 ↵
 ↵

J. Pennington, R. Socher, CD Manning, inEMNLP (****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************) Conference on Empirical Methods in Natural Language Processing(Association for Computational Linguistics, 2014), pp. 1532 – 1543.

J. Wang, H. T. Shen, J. Song, J. Ji, Hashing for similarity search: A survey.arXiv: 1408. 2927 [cs.DS] (13 August 2014).

N. A. Lynch, C. Musco, M. Parter, Computational tradeoffs in biological neural networks: Selfstabilizing winnertakeall networks.arXiv: 1610. 02084[cs.NE] (6 October 2016).

Y. Weiss, A. Torralba, R. Fergus, inAdvances in Neural Information Processing Systems 21, D. Koller, D Schuurmans, Y. Bengio, L. Bottou, Eds. Curran Associates, 2009), pp. 1753 – 1760.

H. Zhu, M. Long, J. Wang, Y. Cao, inProceedings of the Thirtieth AAAI Conference on Artificial IntelligenceAAAI Press, 2016)), pp. 2415 – 2421.

K. Zhao, H. Lu, J. Mei, inProceedings of the TwentyEighth AAAI Conference on Artificial IntelligenceAAAI Press, 2014)), pp. 2874 – 2880.

J. Wang, T. Zhang, J. Song, N. Sebe, H. T. Shen, A survey on learning to hash. arXiv: 1606. 00185[cs.CV] (1 June 2017).
 ↵
Acknowledgments:For funding support, CFS thanks the NSF (grant EAGER PHY – 1444273), and SN thanks the Army Research Office (grant DOD W 911 NF – 17 – 100 45). All authors thank A. Lang and J. Berkowitz for helpful comments on the manuscript. Code and data are available athttps : //bitbucket.org/navlakha/flylsh.