TL;DR: We define the problem of web query similarity and train a transformer model to embed queries using web data. Our model beats large OpenAI embedding models significantly while many orders of magnitude faster and cheaper.
Every day, billions of people look for information and answers to questions using search engines in the form of queries. The diversity of these queries is mind-boggling; even for Google, despite its dominant market share, 15% unique queries everyday are unique; that number is much higher for Neeva.
Unsurprisingly, search engines do a lot of query processing (synonymy, spelling, lemmatization, term weighting, bigrams and compounds, and contextual rewriting) to understand user queries and to generate various forms of these queries which are used to match to relevant web pages. These query operations are crafted by mining signals from a variety of sources such as web page texts, anchors, clicks, etc.
For existing search engines, it has taken billions of queries, millions in infrastructure dollars, and hundreds of thousands of engineering hours to extract these signals. This is not an option for us at Neeva — we are small, scrappy, and want to make a lot of progress very quickly. Recent advancements in deep learning provide us with a unique opportunity to leapfrog many of these steps.
Query similarity problem
We will focus on a particular subproblem of query understanding: query similarity. Two queries are equivalent if the user(s) searching with those queries are looking for the same set of results or information. E.g. the queries [what’s an echocardiogram] and [what heart echo] have equivalent search intents.
Why is this a useful problem to solve? There are many ways in which we can leverage query similarity in search. The most important signals for search engines are strong associations between queries and results. A click on a result for a query is a commonly used proxy for utility of a result for that query. So if you can solve the query similarity problem, you can use good results that you have observed for one query as candidate results for another query. This is the most direct way of making impact on user experience via query understanding in the retrieval and ranking steps.
While easily described, this problem is quite challenging and multi-faceted because solving query similarity involves understanding a mental model of search users. Often the queries are highly contextual and query similarity is a highly nonlinear function of the input terms — e.g. for the query [the rock], you probably mean the wrestler/actor Dwayne Johnson; if, on the other hand, you type in the query [rock], you probably mean geology. Here are a few different facets of the query similarity problem:
Notably, query similarity adds an added difficulty not present in the related semantic similarity task. While the queries [used Nintendo Switch game] and [used Nintendo Switch games for sale] are not identical from a natural language perspective, in the context of search queries over the web, the [for sale] intent is likely implied and hence they are equivalent from a web intent perspective.
At Neeva, we are tackling this problem using transformer-based language models. In particular, we start with a pre-trained BERT model and fine-tune it on a proprietary dataset that contains pairs of queries and their similarity scores.
LLMs for query similarity
Our approach for query similarity is simple. We start from a contextual encoder model like BERT and use it to embed each query in an n-dimensional space. Then for any two queries, the cosine similarity between their embeddings gives us a similarity score which is a measure of their similarity.
This style of symmetrically encoding raw text is called the bi-encoder scheme because the same model is applied to both the queries.
Training data for query similarity
The next key question is: how do you train these models? As is well-known within the machine learning community, pushing these models to the edge requires carefully solving the data problem.
Most systems for training matching embeddings (e.g.https://aclanthology.org/K19-1049.pdf, https://arxiv.org/abs/2004.04906) find that in addition to good quality positives examples (i.e. pairs of queries that are different but equivalent), what is normally even more crucial is finding “hard negative” examples (i.e. pairs of queries which are not obviously unrelated but are not identical in the context of the web either.) Hard negative mining is the most challenging and noisy aspect of training similarity models. Without such examples, these language models would not learn the margin between not similar enough vs. just about similar.
We start by joining query result logs across Neeva and for each query pair (q1, q2), compute the overlap between the top 5 results for these queries. We bucket the query pairs by the number of overlapping results. We resample the results to get roughly equal fractions across buckets 1, 2, 3 and 4 (bucket 5 is small). Here is the final fraction of results we have in our training data.
Using this data, we have a natural way of constructing examples. Query pairs with high result overlap (>= 4 or 5) can form positive examples while pairs of queries with non-zero but low (typically <= 1 or 2), overlap form hard negative examples. We labeled many of these pairs manually and found these heuristics to have roughly ~ 20% false negative (i.e. when this heuristic falsely claims two queries are not equivalent) and 20% false positive rate. Additionally, for each training batch, we can generate a set of “easy negative” examples by picking random query pairs within the training batch. Given sufficiently large batches, these random pairs of queries will be valid negative samples with very high probability.
Using the dataset above, we used two styles of loss minimization.
Soft label: In this strategy, we create soft labels as overlap (q1,q2) / 5. So for a pair with overlap 1, the label is 0.2, with overlap 2 label is 0.4, and for random pairs is 0.0, and so on. Then we minimize a cosine similarity loss (see this https://www.sbert.net/docs/package_reference/losses.html#cosinesimilarityloss): that is simply minimize the l2 distance between cosine similarity of query embeddings and their soft label. The intuition behind this strategy is that it allows for a more nuanced interpretation of similarity and in this process also produces more calibrated similarity estimates.
We found that the soft label strategy works best in our experiments.
That is not the only data problem that we must tackle, however. Typical user queries often contain spelling or typographical errors which the model must be robust against. Ideally, we would have access to a dataset of frequent spelling errors and their spell-corrected versions but in the absence of this data at scale, we inject artificial noise into our data to simulate such errors. At training time, we randomly sample 10% of input queries and inject noise in various ways including dropping characters, adding new characters and/or substituting characters. This ensures that queries that are separated only by spelling errors are considered similar queries while also implicitly acting as a spell-correct service.
Choice of transformer model
With all this training data, we fine-tune a 6-layer BERT model with 384 dimensions (starting with this checkpoint sentence-transformers/all-MiniLM-L6-v2). Through the use of the sentence-transformers library, the training is setup in a siamese style. We input pairs of queries that are embedded after a final layer computes the cosine similarity between the original 2 queries. We train the model for 1M steps with a batch size of 1024 at a learning rate of 2e-5 using a cosine scheduler with 10000 warmup steps. We are releasing the model on HuggingFace (https://huggingface.co/neeva/query2query).
We collected a dataset of 1000 challenging query pairs with result overlap between 1 and 5 (so no easy negatives). Then we manually labeled this dataset for query similarity with our labels being “equivalent”, “not equivalent”, and “ambiguous”. A large subset of query pairs labeled as “ambiguous” were of the kind where one of the queries was a narrower version of the other( eg.“best photo editor” vs. “best photo editor for windows”). While these queries often share top web results, they aren’t necessarily equivalent queries from a user perspective. For our final evaluation dataset, we filtered out these ambiguous query pairs, leaving us with 635 high confidence labeled pairs. We are releasing this dataset on HuggingFace (https://huggingface.co/datasets/neeva/query2query_evaluation).
Below, we report the performance of our fine-tuned models against some notable baselines on this evaluation set.
Fasttext is a well-known word embedding based model which embeds individual query words without taking the rest of the query into context.
Pretrained sentence BERT model.
State-of-the-art embeddings from OpenAI’s embedding models. We compare with curie because at the time of this experiment that was the largest released model. Note that this model is significantly larger than our BERT model, outputs significantly higher dimensional vectors(4096 vs 384), and is 100000x more expensive to run at scale.
Neeva Q2Q model: trained using the soft label technique.
The Neeva Q2Q model beats all baselines by a significant margin. While OpenAI does do reasonably well for this task (unlike what was observed here), our model still defeats it by a large margin.
Applying query similarity
Once we have a language model fine-tuned to identify queries that are similar to each other, we can use it in various ways. For example, we can pre-compute latent representations of all the queries we have seen in the past. Now, on each new user query, a nearest neighbors search provides us with the k most similar queries. Past ranking signals (eg. clicked documents) for these similar queries can now assist with ranking results for the new user query. We also use this model to create contextual synonyms. Or finally, given a user query, we can pick out query terms that can be dropped without changing the meaning of the query (called the optionalization problem) by computing the similarity between the original query and the query with terms dropped.
All of these different applications have the same end goal i.e. generate a set of alternative formulations of the user query to make it easier for our retrieval and ranking systems to understand user intent and retrieve the best documents.
While closing the engineering, data, and compute gap is still an incredibly difficult task, such techniques that utilize recent deep learning advancements and fine-tuned language models are promising and can help us leapfrog hundreds of thousands of software engineering hours.