
The Architecture of a Large-Scale Web Search Engine, circa 2019

Tags: real-time-search, kubernetes, ml-systems, cloud-native, oss

Our Journey to Microservices, Kubernetes and beyond.

In previous posts of this advent series, we have described some of the technologies that power our private search products. It is about time that we introduce the systems that bring everything together. It is important to understand that a web-scale search engine is highly complex. It is a distributed system with strong constraints on performance and latency. On top of that, it can easily become extremely costly to operate, both in human resources and, of course, in money.

This article explores the technology stack we employ today and some of the choices and decisions we have made and iterated upon over the years to cater to both external and internal users.

The topic at hand is very broad and cannot be covered in a single sitting, but we hope to give you the gist of it. We use a combination of prominent open source and cloud-native technologies wrapped with homegrown tooling, all of which have been battle tested. Where we have not found a solution in the open source world or among commercial offerings, we have been prepared to dive deep and write core systems from scratch, which has worked well for us at our scale.

Disclaimer: We describe how our system looks today. Of course, we did not start like this. We have had multiple architectural overhauls throughout the years, always considering constraints like cost, traffic and data size. By no means do we suggest that this is a recipe for building a search engine; it is simply what works for us today. As wiser people have said:

“Premature optimization is the root of all evil” ~ Donald Knuth

And we agree wholeheartedly. As a matter of fact, we advise anyone to never throw all the ingredients into the pot at once, but instead to add them one by one, slowly and incrementally adding complexity one step at a time.

Given the nature of this post, we want to provide an ordered outline of all topics covered:

• Cliqz search as a product and its system requirements.
• Web Search Systems: A near real-time and truly automated search system.
• Data Processing Platform: Facilitating near real-time and batch indexing.
• How deployments were done in the past: the pros and cons of various approaches.
• Microservices Architecture: Orchestrating the services involved in delivering content for a search engine results page, and our need for containers and a container orchestration system (Kubernetes).
• Our Kubernetes stack: How we deploy, run and manage Kubernetes and various add-ons, and the problems they solve for us.
• Local development on Kubernetes: An end-to-end use case.
• Optimizing on costs.
• Machine Learning Systems.
Our Search Experience — Dropdown & SERP

The search engine at Cliqz has two consumers with different requirements:

• Search as you type, with results available in the dropdown. This type of search requires fewer results (typically 3) but is extremely latency sensitive; otherwise the user experience suffers.

• Search in SERP


Figure 2: Cliqz Search Engine Result Page (beta.cliqz.com)

Search on a web page, the typical search engine results page everybody knows. Here, the depth of the search is unbounded, but it is less demanding on latency than the dropdown version.

Fully Automated and Near Real-time Search

Consider a query like "bayern munich". This may seem a very generic query, but when issued, it touches several services within our system (a toy sketch of this fan-out follows the list below). If we try to interpret the intent behind the query, we will figure out that the user may be:

• Researching the club (in which case a Wikipedia snippet would be relevant)
• Interested in booking tickets, buying merchandise or registering as an official fan (official website)
• Interested in current news about the club: pre-match news about the game

• Interested in in-game information like live scores, live updates or commentary.
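Purely as an illustration (none of these function or service names are Cliqz's actual APIs), such a query is dispatched to several independent services in parallel and whatever comes back within the latency budget is merged. A minimal asyncio sketch of that fan-out pattern:

```python
# Illustrative only: a toy fan-out showing how a single query such as
# "bayern munich" might be dispatched to several independent verticals
# (Wikipedia snippets, official-site ranking, news, live scores) in parallel.
# Service names and functions are hypothetical, not Cliqz's actual APIs.
import asyncio

async def wiki_snippet(query: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a network call
    return {"source": "wikipedia", "query": query}

async def official_site(query: str) -> dict:
    await asyncio.sleep(0.01)
    return {"source": "official-site", "query": query}

async def news(query: str) -> dict:
    await asyncio.sleep(0.01)
    return {"source": "news", "query": query}

async def live_scores(query: str) -> dict:
    await asyncio.sleep(0.01)
    return {"source": "live-scores", "query": query}

async def search(query: str) -> list:
    # Fan out to all verticals concurrently and merge whatever comes back;
    # a failure in a single vertical does not fail the whole query.
    tasks = [wiki_snippet(query), official_site(query), news(query), live_scores(query)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

if __name__ == "__main__":
    print(asyncio.run(search("bayern munich")))
```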
• MapReduce and Spark based batch workflows, managed through Luigi, are used to train large-scale machine learning models over large data sets, e.g. query and word embeddings, approximate nearest neighbor models, language models, etc. Keyvi, Cassandra, qpick and Granne are used for serving (a minimal Luigi sketch follows below).

What is important to note here is that the near real-time and weekly indexes are responsible for a large portion of the search-related content served on the SERP. This is similar to other search engines, which promote recent content over historical content on a topic. The batch index handles time-independent queries, the long tail of queries, and content that is rare, historical or tricky in the context of understanding a search query. The combination of the three gives us the necessary ammunition to build Cliqz search in its current form. All systems are capable of answering all queries; the final result, however, is a mixture of the results from all indexes.
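For a flavour of what a Luigi-managed batch workflow looks like, here is a minimal, hypothetical pipeline sketch: one task stands in for an embedding-training job, a second builds an index from its output. Task names, paths and parameters are illustrative, not our production workflow.

```python
# Hypothetical two-step Luigi pipeline: train embeddings, then build an
# approximate-nearest-neighbor index from them. Paths and names are made up.
import luigi

class BuildEmbeddings(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/embeddings-{self.date}.txt")

    def run(self):
        # In production this would submit a Spark/MapReduce job; here we
        # just write a placeholder file so the pipeline is runnable.
        with self.output().open("w") as f:
            f.write("query_id\tembedding\n")

class BuildANNIndex(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return BuildEmbeddings(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/ann-index-{self.date}.bin")

    def run(self):
        with self.input().open("r") as src, self.output().open("w") as dst:
            dst.write("index built from: " + src.read().splitlines()[0] + "\n")

# Run with, e.g.:
#   luigi --module <this_module> BuildANNIndex --date 2019-12-01 --local-scheduler
```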

Deployments — A Historical Context

    You haven’t mastered a tool until you understand when it should not be used. ~ Kelsey Hightower

From the start, we have been focused on delivering our search services using a public cloud provider rather than managing infrastructure on-premises. In the last decade, that has become the norm across the industry, given the complexity and resources required to operate one's own data center(s) compared with the relative ease of hosted services and the easy-to-digest pay-as-you-go model for startups. Amazon Web Services (AWS) has been convenient for us, as it allowed us to abstract away from managing our own machines and infrastructure. If not for AWS, it would have taken us a lot more effort to reach this stage. (But they are as convenient as they are expensive. You will see in this article some tricks we came up with to reduce costs, but we advise you to be extremely careful when using cloud services at scale.) We typically strive to avoid managed offerings of services that might be useful to us, as the costs can be unbearably high at the scale we operate at. To bring in some context: a growing concern we met early on was reliably provisioning resources and deploying applications on AWS.

We started by putting some effort into building our own infrastructure and configuration management systems on top of AWS. We focused on shipping a solution which was native to Python to ease developer on-boarding. We wrapped the Fabric project and coupled it with Boto to provide nice interfaces for launching machines and configuring one's application with a few lines of code, driven through a deploy.py file co-located with the service source. This was then wrapped into project template generators for easy onboarding of new projects. Back then, it was the early days of Docker, and we traditionally shipped Python packages or plain Python code, which was challenging because of dependency management. Even though the project gained a lot of traction and was used by many services driving many products at Cliqz, there are certainly things a library-driven approach to infrastructure and configuration management lacks. Missing global state management, no central locking for infra changes, no central view of the cloud resources utilized by a project or developer, reliance on external tools to clean up orphaned resources, limited configuration management, limited observability into developer usage, and context seeping in from regular users of the tool are some of the things that created friction and, in turn, led to increased operational complexity.
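For illustration only, a library-driven flow in that spirit might have looked roughly like the following, launching an instance with Boto (boto3 here) and configuring it over SSH with Fabric. The AMI, key name and service commands are hypothetical placeholders, not our actual deploy.py.

```python
# Purely illustrative sketch of a library-driven deploy flow in the spirit of
# a Fabric + Boto wrapper; AMI id, key name and service commands are
# hypothetical placeholders, not real configuration.
import boto3
from fabric import Connection

def launch_instance(ami="ami-xxxxxxxx", instance_type="t3.medium", key_name="deploy-key"):
    ec2 = boto3.resource("ec2", region_name="us-east-1")
    instance, = ec2.create_instances(
        ImageId=ami, InstanceType=instance_type, KeyName=key_name,
        MinCount=1, MaxCount=1,
    )
    instance.wait_until_running()
    instance.reload()  # refresh to pick up the public IP
    return instance.public_ip_address

def configure(host, user="ubuntu"):
    # Configure the freshly launched machine over SSH.
    with Connection(host=host, user=user) as c:
        c.run("sudo apt-get update -y")
        c.run("pip install --user my-service")  # hypothetical service package
        c.run("my-service --daemon")            # hypothetical start command

if __name__ == "__main__":
    configure(launch_instance())
```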

This led us to explore external alternatives, as the homegrown effort had to be stopped due to limited resources. The alternative we eventually landed on is a combination of solutions from HashiCorp, including Consul, Terraform and Packer, and eventually configuration management tooling like Ansible and Salt. Terraform presented an excellent declarative approach to infrastructure management, which many of the current technologies in the cloud-native space have leveraged. So, after careful evaluation, we decided to retire our fab-based deploy library in favor of Terraform. Besides technical pros and cons, one always has to consider human factors. Some teams are slower to adopt changes than others, be it because of a lack of resources or because the cost of the transition is not uniform. For us, it took a long time, about one year, to migrate.

Terraform certainly brought us some out-of-the-box features which we were missing from our old deploy project, including:

• Central state management of infrastructure.
• Verbose plan, patch and apply support.
• Easy teardown of resources with minimal orphaned resources.
• Support for multiple clouds.

Meanwhile, we also faced some challenges in our journey with Terraform:

• Complex cloud-specific DSL which is typically not DRY.
• Difficult to wrap in other tools.
• Limited and sometimes complex templating support.
• No feedback on the health of services.
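To give an idea of what "wrapping it in other tools" tends to look like in practice, here is a small, hypothetical Python sketch that shells out to the Terraform CLI and summarizes a plan from its JSON output; the directory layout is an assumption, not part of our tooling.

```python
# Hedged sketch of wrapping Terraform from Python: shell out to the CLI and
# parse the machine-readable plan. The "./infra" directory is hypothetical.
import json
import subprocess

def terraform(workdir, *args):
    return subprocess.run(
        ["terraform", *args],
        cwd=workdir, check=True, capture_output=True, text=True,
    ).stdout

def plan_summary(workdir):
    terraform(workdir, "init", "-input=false")
    terraform(workdir, "plan", "-input=false", "-out=tfplan")
    # `terraform show -json` emits the saved plan as JSON.
    plan = json.loads(terraform(workdir, "show", "-json", "tfplan"))
    changes = plan.get("resource_changes", [])
    # Map each resource address to the planned actions, e.g. ["create"].
    return {c["address"]: c["change"]["actions"] for c in changes}

if __name__ == "__main__":
    print(plan_summary("./infra"))
```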