Indexing Billions of Text Vectors, Hacker News

nearest-neighbors, word-embedding, graph-algorithms, rust

Optimizing memory-usage for approximate nearest neighbor search

December 7th,

A frequently occurring problem within information retrieval is that of finding similar pieces of text. As described in our previous posts (A New Search Engine and Building a Search Engine from Scratch), queries are an important building block at Cliqz. A query in this context can either be a user-generated one (i.e. the piece of text that a user enters into a search engine) or a synthetic one generated by us. A common use case is that we want to match an input query with other queries already in our index. In this post we will see how we are able to build a system that solves this task at scale using billions of queries without spending a fortune (which we do not have) on server infrastructure.

Let us first formally define the problem:

Given a fixed set of queries $Q$, an input query $q$ and an integer $k$, find a subset of queries $R = \{q_0, q_1, \ldots, q_k\} \subset Q$, such that each query $q_i \in R$ is more similar to $q$ than every other query in $Q \setminus R$.

For example, with the following set of queries $Q$:

{tesla cybertruck, beginner bicycle gear, eggplant dishes, tesla new car, how expensive is cybertruck, vegetarian food, shimano vs ultegra, building a carbon bike, zucchini recipes}

and $k = 3$, we might expect the following results:

Input query $q$            Similar queries $R$
tesla pickup               {tesla cybertruck, tesla new car, how expensive is cybertruck}
best bike                  {shimano vs ultegra, are carbon bikes better, bicycle gearing}
cooking with vegetables    {eggplant dishes, zucchini recipes, vegetarian food}

Note that we have not yet defined similar. In this context, it can mean almost anything, but it usually boils down to some form of keyword or vector based similarity. With keyword based similarity we can consider two queries similar if they have enough words in common. For example, the queries opening a restaurant in munich and best restaurant of munich are similar because they share the words restaurant and munich, whereas best restaurant of munich and where to eat in munich are less similar because they only share a single word. Someone looking for a restaurant in Munich will however likely be better served by considering the second pair of queries to be similar. This is where vector based matching comes into play.
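To make the keyword based notion concrete, here is a minimal sketch that counts the words two queries share. The function names and the `min_shared` threshold of two words are assumptions made for illustration; the post does not specify how "enough words in common" is decided.

```python
def shared_words(query_a, query_b):
    """Return the set of words that occur in both queries."""
    return set(query_a.lower().split()) & set(query_b.lower().split())


def keyword_similar(query_a, query_b, min_shared=2):
    """Treat two queries as similar if they share at least `min_shared` words.

    The threshold is a hypothetical choice, not taken from the post.
    """
    return len(shared_words(query_a, query_b)) >= min_shared


# The two pairs of queries discussed in the text.
pair_1 = shared_words("opening a restaurant in munich", "best restaurant of munich")
pair_2 = shared_words("best restaurant of munich", "where to eat in munich")
```

With this threshold the first pair counts as similar and the second does not, which is exactly the shortcoming of pure keyword matching described above.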

Word embedding is a machine-learning technique in Natural Language Processing for mapping text or words to vectors. By moving the problem into a vector space we can use mathematical operations, such as summing or computing distances, on the vectors. We can even use conventional vector clustering techniques to link together similar words. It is not necessarily obvious what these operations mean in the original space of words, but the benefit is that we now have a rich set of mathematical tools available to us. The interested reader may want to have a look at e.g. word2vec [1] or GloVe [2] for more information about word vectors and their applications.
Once we have a way of generating vectors from words, the next step is to combine them into text vectors (also known as document or sentence vectors). A simple and common way of doing this is to sum (or average) the vectors for all the words in the text together.
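As a rough illustration of building a text vector from word vectors, the sketch below sums (or optionally averages) per-word vectors. The three-dimensional vectors are made-up toy values, and skipping unknown words is an assumption; real embeddings such as word2vec or GloVe have far more dimensions.

```python
# Toy 3-dimensional word vectors; the values are invented for illustration.
WORD_VECTORS = {
    "best": [0.1, 0.3, 0.0],
    "restaurant": [0.9, 0.1, 0.2],
    "of": [0.0, 0.0, 0.1],
    "munich": [0.2, 0.8, 0.1],
}


def text_vector(text, word_vectors, average=False):
    """Combine the vectors of all known words in `text` into one vector.

    Words without a vector are skipped (an assumption made for this sketch).
    """
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError("text contains no known words")
    # Sum the vectors component by component.
    summed = [sum(components) for components in zip(*vectors)]
    if average:
        return [c / len(vectors) for c in summed]
    return summed


query_vector = text_vector("best restaurant of munich", WORD_VECTORS)
query_vector_avg = text_vector("best restaurant of munich", WORD_VECTORS, average=True)
```

Summing and averaging differ only by a scale factor, so both variants point in the same direction in the vector space.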
Figure 1: Query vectors
We can decide how similar two snippets of text (or queries) are by mapping them both into a vector space and computing a distance between the vectors. A common choice is to use the angular distance.
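The angular distance is simply the angle between two vectors, which can be obtained from their cosine similarity. The following is a minimal sketch in plain Python; the function name and the choice to return the angle in radians are assumptions for illustration.

```python
import math


def angular_distance(u, v):
    """Return the angle in radians between vectors u and v (0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angular distance is undefined for zero vectors")
    # Clamp to [-1, 1] to guard against floating point rounding errors.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])
same_direction = angular_distance([1.0, 0.0], [2.0, 0.0])
opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])
```

Because only the direction matters, the length of a vector has no influence on the result, so a summed and an averaged text vector yield the same angular distance.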
All in all, word embedding allows us to do a different kind of text matching that complements the keyword-based matching mentioned above. We are able to explore the semantic similarity between queries (e.g. best restaurant of munich and where to eat in munich) in a way that was not possible before.

We are now ready to reduce our initial query matching problem into the following: Given a fixed set of query vectors $Q$, an input vector $q$ and an integer $k$, find a subset of vectors $R = \{q_0, \ldots\}$ [...]
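For small sets, the vector formulation of the problem can be solved exactly by scanning every vector and keeping the $k$ closest ones. The sketch below does this with angular distance as the measure; it is a brute-force baseline for illustration only, not the approximate method the post is about, and the two-dimensional vectors are invented toy values.

```python
import heapq
import math


def angular_distance(u, v):
    """Return the angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def nearest_neighbors(query_vector, vectors, k):
    """Return the names of the k vectors closest to `query_vector`.

    `vectors` maps a query string to its vector. This is an exact linear
    scan, so the cost grows with the number of stored vectors.
    """
    return heapq.nsmallest(
        k, vectors, key=lambda name: angular_distance(query_vector, vectors[name])
    )


# Toy 2-dimensional query vectors; the values are made up.
VECTORS = {
    "tesla cybertruck": [1.0, 0.1],
    "tesla new car": [0.9, 0.2],
    "eggplant dishes": [0.1, 1.0],
    "zucchini recipes": [0.2, 0.9],
}

result = nearest_neighbors([1.0, 0.0], VECTORS, k=2)
```

A linear scan like this touches every stored vector for every lookup, which is why it does not carry over to billions of vectors.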
[Table 1: A comparison of latency requirements for different setups. Setups: Quantization, Granne (RAM only), Granne (RAM + SSD). Rows: Memory (GB), SSD (GB), Latency (ms).]

We would like to point out that some of the optimizations mentioned in this post are, of course, not applicable to the generic nearest neighbor problem with non-decomposable vectors. However, any situation where the elements can be generated from a smaller number of pieces (as is the case with queries and words) can be accommodated. If this is not the case, then it is still possible to use Granne with the original vectors; it will just require more memory, like with other libraries.

Footnotes

[1] Efficient Estimation of Word Representations in Vector Space (paper)
[2] GloVe: Global Vectors for Word Representation (paper)
[3] Nearest Neighbor Search: Exact methods (wiki)
[4] Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
[5] 4 billion is large enough to make the problem interesting, while still making it possible to store node ids in regular 4 byte integers.
[6] Quantization (wiki)
[7] We tried a few other quantization techniques based on e.g. product-quantization, but did not manage to get them to work with sufficient quality at scale.
[8] ANN Benchmarks
