The Trouble with Query Expansion
The hardest part of building a search engine is building an understanding of user intent, i.e., what the user expects to see when they type in a search query. Serving our customers well means accurately aligning user intent to an understanding of the central concepts of each web page. Often, user queries do not perfectly match the language used in relevant documents to describe the same concept, which makes this alignment challenging.
To bridge the gap between query and document, it is generally recommended to do query expansion: supplement customer queries with synonyms, spell-corrected words, and related phrases. These techniques help increase the number of relevant documents retrieved from the search index.
This post is about a novel synonymy relationship which we find can fix a severe flaw in query expansion.
Before we dive into details, let's consider an example query q = connect display to laptop and a relevant web page d which talks about connecting a monitor to a laptop but does not use the term display. To retrieve this page d and score it highly, we need to be able to infer that display and monitor are synonymous and interchangeable in q. A simple solution would be to expand q by all synonyms of display (show, exhibit, exhibition, monitor, …); then we should be able to retrieve and favorably score d. Not so fast! We need to make sure that the added synonyms don't change the meaning of the query; otherwise we will start retrieving irrelevant documents. E.g. when we swap display for show in q, we create the query connect show to laptop, which doesn't have the same meaning as q (perhaps it relates to streaming a TV show on screen). So indiscriminately adding all possible synonyms in query expansion is not an option for an effective search engine. With billions of pages in our search index matching an incomprehensible number of term combinations, we must avoid adding noisy terms in query expansion to prevent matching millions of incorrect pages.
We need a more fine-grained and precise tool than the blunt hammer of synonymy. Here we make a key observation: many synonymy relations that are useful for document retrieval hold only in the context of other query terms. We call this relationship Contextual Synonymy. Contextual Synonyms are crucial in query expansion for search engines: they provide a much better precision-recall tradeoff in retrieval and ranking than general synonyms.
The rest of this blog post will follow this structure:
- Provide details of search retrieval that explain how query expansion is implemented in the search stack.
- Formalize our intuition for Contextual Synonymy and justify its importance and novelty.
- Explain our process of data collection and describe our experiments for predicting contextual synonyms using machine learning.
We are releasing all annotation guidelines and data we created for contextual synonymy detection to encourage further investigation.
Index-based Retrieval and Contextual Synonymy
Neeva's index, like that of other search engines, is structured to make it efficient to find all the documents that match a query. Because there are billions of documents on the Internet, and this number is ever-growing, it is not realistic to scan through all the documents on every user query. As such, our search index is an inverted index mapping a term to the set of documents that contain the term. With this inverted index, the cost of finding documents that match a query is proportional to the number of matching documents, not the total number of documents on the Internet.
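To make the mechanics concrete, here is a minimal sketch of an inverted index (toy Python; the corpus and conjunctive matching rule are illustrative, not Neeva's implementation):

```python
from collections import defaultdict

# Toy corpus: doc id -> text. A real index holds billions of documents.
docs = {
    1: "connect monitor to laptop",
    2: "best laptop for streaming a show",
    3: "connect display to projector",
}

# Inverted index: term -> set of doc ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def retrieve(query):
    """Return ids of docs containing every query term (conjunctive match)."""
    terms = query.split()
    if not terms:
        return set()
    result = set(index[terms[0]])
    for term in terms[1:]:
        result &= index[term]
    return result

print(retrieve("connect laptop"))             # doc 1 contains both terms
print(retrieve("connect display to laptop"))  # empty: doc 1 says "monitor"
```

Note that the relevant document about a monitor is not retrievable for the query connect display to laptop, because the index only matches literal terms; this is exactly the gap query expansion fills.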
To increase the number of documents that are retrievable for a user query, we need to add more terms that can be matched — as the name for query expansion not so subtly suggests, we need to expand the query with more words.
The following query illustrates the power of query expansion:
"leading browser for mac besides safari"
By adding the word “top”, a synonym for “leading”, a result that does not contain the original query term “leading” but does contain “top” (and at least one variation of each of the other query terms) becomes retrievable from the search index.
When analyzing retrieval losses on our internal dataset, we found that 50% of our retrieval losses (i.e. cases where we fail to retrieve the most relevant documents) were due to shortcomings in our query expansion. As argued earlier, it is a bad idea to indiscriminately add a lot of synonyms (e.g. expand display by all possible synonyms). We found that the missing words can be determined in a more clever way: these words were semantically equivalent to original query terms, conditioned on the presence of the other query terms. E.g. “display” and “monitor” are equivalent in the context of “laptop”; so they can be used interchangeably and maintain semantically similar queries: [("how to connect my laptop to display", "how to connect my laptop to monitor"), ("multiple displays for laptop", "multiple monitors for laptop"), …] etc. However, display and monitor are different in the context of food: the queries "how to best display food" and "how to best monitor food" have very different user intent that is best not to conflate, to avoid angering hungry dieters.
Contextual Synonymy (CS) aims to quantify the degree of semantic equivalence of two words in a shared context. Concretely, our goal is to determine whether two words w1 and w2 are synonymous with each other in the context of some other words c. The context c can be a phrase, a word, or any other metadata. A special case of Contextual Synonymy is General Synonymy (GS): two words are general synonyms if they are contextual synonyms without the support of other words (this is just a fancy way of saying words like “buy” and “purchase” are plain synonyms).
Why is Contextual Synonymy not the same as Semantic Textual Similarity (STS)?
Semantic Textual Similarity (e.g. as proposed here) seeks to measure the semantic equivalence of two pieces of text. However it does not imply anything — at least not directly — about the terms in those texts. Contextual Synonymy is more fine-grained because it is applicable at the term level. This makes it much more generalizable and effective for search engines.
As a concrete example, semantic textual similarity simply says how to connect laptop to display and how to connect laptop to monitor are identical, and nothing more; learning this relationship allows us to improve exactly 2 queries (the texts themselves). Contextual synonymy, in contrast, implies display and monitor are synonymous in the context of laptop. This fact enables us to generate a lot of STS relationships, because now we can say [("multiple displays for laptop", "multiple monitors for laptop"), ("laptop not recognizing display", "laptop not recognizing monitor"), …] are all identical, and thereby cover a lot more customer queries!
Due to its importance and novelty, we treat Contextual Synonymy as a task in and of itself. We use the word “task” the way NLP papers describe new problems, e.g. the STS task, which released guidelines and datasets (both of which we release as well). We will show techniques for CS that use semantic textual similarity as well as search result overlap to accurately align our predictions of good contextual synonyms with human judgement.
The following table highlights more such examples of contextual synonyms and their contexts.
Before we proceed, something to note: the notion of synonymy is very nuanced. Sometimes even the obvious synonyms, like stems (e.g. share, shared, shares, sharing), have semantic differences in certain contexts. This nuance helps crystallize our problem space.
Consider the following two queries:
- how to increase public good in the world?
- how to increase public goods in the world?
Although both queries are almost identical, it is plausible that the intent of the first query is “how to increase public wellbeing and health,” whereas the intent of the second query is “how to increase commodities or services to the public.” The terms “good” and “goods” take on different meanings in the context of these queries, and notably in the context of the word “public.”
To date, even other search engines have a hard time capturing the subtle differences in intents of these queries:
Google captures a single dominant user intent.
To provide a more holistic search experience, Neeva captures several variations of user intent.
Experiments to generate contextual synonyms
This section describes our method of synonym candidate generation, data collection, and machine-learning classification of contextual synonyms.
We first generate a large number of contextual synonym candidates using data from web search sessions. The process uses a simple insight: for two semantically similar queries that share a set of terms, the remaining terms (those that differ) are good candidates for contextual synonymy. We identify semantically similar pairs of queries as two queries that have high overlap in their top 5 results. Contextual synonym candidates are extracted from pairs of semantically similar queries using the following simple algorithm:
1. Consider a query pair (query1, query2)
query1: .mp3 app replacement for itunes
query2: iphone app alternative to itunes
2. Tokenize and remove stopwords
query1: [.mp3, app, replacement, itunes]
query2: [iphone, app, alternative, itunes]
3. Compute consecutive token pairs
query1: [[.mp3, app], [app, replacement], [replacement, itunes]]
query2: [[iphone, app], [app, alternative], [alternative, itunes]]
4. If a token in query1 matches a token in query2 (in the same position), the two other tokens are considered synonym candidates.
[term, rewrite | context]
([.mp3, app], [iphone, app]) -> [.mp3, iphone | app]
([app, replacement], [app, alternative]) -> [replacement, alternative | app]
([replacement, itunes], [alternative, itunes]) -> [replacement, alternative | itunes]
([.mp3, app], [app, alternative]) -> Does not yield a synonym candidate because "app" appears in different positions in both pairs.
5. The output of the above algorithm is a set of synonym candidates and their contexts: [w1, w2 | context-word]
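The five steps above can be sketched as follows (a toy Python implementation; the stopword list is a stand-in for whatever list the production pipeline uses):

```python
# Hypothetical stopword list; the real pipeline would use a fuller one.
STOPWORDS = {"for", "to", "the", "a", "of"}

def token_pairs(query):
    """Steps 2-3: tokenize, drop stopwords, form consecutive token pairs."""
    toks = [t for t in query.lower().split() if t not in STOPWORDS]
    return [(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]

def candidates(query1, query2):
    """Step 4: if two pairs share a token in the same position, the other
    two tokens become a candidate (w1, w2 | context)."""
    out = set()
    for a1, a2 in token_pairs(query1):
        for b1, b2 in token_pairs(query2):
            if a1 == b1 and a2 != b2:    # shared left token is the context
                out.add((a2, b2, a1))
            elif a2 == b2 and a1 != b1:  # shared right token is the context
                out.add((a1, b1, a2))
    return out

print(candidates(".mp3 app replacement for itunes",
                 "iphone app alternative to itunes"))
```

Running this on the example query pair yields exactly the three candidates listed above: [.mp3, iphone | app], [replacement, alternative | app], and [replacement, alternative | itunes].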
Machine learning to filter contextual synonyms
This algorithm for mining synonym candidates yields a set that contains a lot of false positives. For example, [temperature, time | oven] is generated from the query pair ["what temperature to bake chicken breast", "time to bake chocolate chip cookies"].
We will use machine learning to improve the precision of the generated candidates while maintaining high recall. The input to our ML predictor at inference time will be a query q and a set of contextual synonyms (w1, w2 | c), and the predictor will decide which contextual synonym expansion rules can apply to the query.
Data Collection and Labeling
To train our machine learning predictor, we took a subset of candidate contextual synonyms and labeled them using our in-house rating team. Our guidelines for rating CS can be found here.
At the end of the process, we collected about 8M contextual synonym pairs. We are also releasing these pairs which you can find here. Our engineering team rated a subset of these and found about 80% agreement with the raters.
BERT-based semantic similarity feature: this feature is a BERT-based similarity measure between a user query and a reformulated query using contextual synonyms. For the query q = w1 w2 w3 w4 and synonym candidate (w1, w5 | w2), we can generate a possible query variation q' = w5 w2 w3 w4. Now we can measure the semantic similarity between q and q' and use that as a feature to determine if w1 and w5 are synonyms in the context of this query. Let's call this feature the BERT similarity score (BSC).
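A sketch of the BSC feature's shape, with a toy bag-of-words embedding standing in for the BERT sentence encoder (which we do not reproduce here):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding. The real BSC feature embeds queries
    with a BERT sentence encoder instead."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def bsc(query, w_old, w_new):
    """Similarity between q and the variation q' with w_old swapped for w_new."""
    q_prime = " ".join(w_new if t == w_old else t for t in query.split())
    return cosine(embed(query), embed(q_prime))

print(round(bsc("how to connect laptop to display", "display", "monitor"), 3))
```

Note the limitation of the toy embedding: it scores swapping in "monitor" and swapping in "show" identically, since any single-word substitution moves the bag-of-words vector by the same amount. That is precisely why a contextual encoder like BERT is used here: it can tell the good rewrite from the bad one.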
In addition to the BSC feature, we added a few other features, e.g. NPMI based on co-occurrence of w1 and w5 over a set of queries (words that appear in the same query are less likely to be contextual synonyms), the count of query pairs that shared >= 4 results in the top 5, etc.
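A sketch of the NPMI computation over query-log counts (the count arguments below are hypothetical; NPMI normalizes PMI into the range [-1, 1]):

```python
import math

def npmi(n_a, n_b, n_ab, n_total):
    """Normalized PMI of words a and b over a query log.
    n_a / n_b: queries containing a (resp. b); n_ab: queries with both."""
    if n_ab == 0:
        return -1.0  # never co-occur: NPMI's lower bound
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

# Words that co-occur in the same query far more often than chance get
# NPMI near 1; per the intuition above, such words make poor synonym
# candidates (you rarely type both "display" and its synonym together).
print(npmi(n_a=100, n_b=100, n_ab=90, n_total=10_000))
```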
The intuition behind some of the added statistical features is as follows. Consider synonym candidates a, b and a context word c:
- If a and b are synonym candidates for many different c, then a and b are likely general synonyms, because they are interchangeable in many different contexts.
E.g. normalized count of context | (w1, w2).
- If a has many different synonym candidates b for the same context c, then a and b are likely not contextual synonyms. The distribution of terms in the original query set is uneven, so a and b likely appear frequently in the dataset and therefore appear in many synonym candidates.
E.g. normalized count of [w1 | (w2, context)], [w2 | (w1, context)].
- If b appears as a context for a frequently, and vice versa, then a and b are likely part of a bigram; the same applies for a word and its context.
E.g. normalized count of queries that contain the bigrams (w1, context), (w2, context), (w1, w2).
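The first two normalized-count features can be sketched as follows (the candidate triples below are hypothetical data, not from our mined set):

```python
from collections import defaultdict

# Mined candidate triples (w1, w2, context); values here are hypothetical.
cands = [
    ("display", "monitor", "laptop"),
    ("display", "monitor", "external"),
    ("display", "monitor", "hdmi"),
    ("display", "show", "food"),
    ("display", "exhibit", "food"),
]

contexts_per_pair = defaultdict(set)     # many contexts -> general synonyms
partners_per_context = defaultdict(set)  # many partners -> noisy candidate

for w1, w2, c in cands:
    contexts_per_pair[frozenset((w1, w2))].add(c)
    partners_per_context[(w1, c)].add(w2)
    partners_per_context[(w2, c)].add(w1)

n = len(cands)
# Normalized count of context | (w1, w2): high for (display, monitor),
# suggesting they are close to general synonyms.
print(len(contexts_per_pair[frozenset(("display", "monitor"))]) / n)
# Normalized count of [w2 | (w1, context)]: "display" has two candidate
# partners for the context "food", a hint those pairs are not synonyms.
print(len(partners_per_context[("display", "food")]) / n)
```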
Contextual Synonym Linear Classifier (CSLC): A linear model trained using logistic regression with all the features.
Contextual Synonym Random Forest Classifier (CSRFC): Random forest classifier with the same features as CSLC.
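A minimal scikit-learn sketch of the two classifiers (the feature matrix and labels are synthetic stand-ins for the labeled data described above, and the feature names in the comment are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical feature matrix: columns could be BSC, NPMI, and the
# normalized-count features; labels follow a made-up rule for the demo.
X = rng.random((200, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0.9).astype(int)

cslc = LogisticRegression().fit(X, y)                     # CSLC: linear model
csrfc = RandomForestClassifier(random_state=0).fit(X, y)  # CSRFC

# Model scores are thresholded (the post uses 0.3 for CSRFC) to trade
# precision against recall when accepting expansion rules.
scores = csrfc.predict_proba(X)[:, 1]
print((scores > 0.3).mean())
```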
We generate precision-recall curves using a held out test set from our labeled data using 3 of the techniques above.
BSC was treated as the baseline model because it used semantic similarity of one query pair, but had no notion of contextual synonymy.
CSLC > BLC: CSLC performed better than BLC, confirming that modeling contextual synonymy yields predictions of semantic equivalence that align more closely with human judgement.
CSRFC > CSLC > BLC: CSRFC performed better than both CSLC and BLC, again confirming the benefit added by contextual synonymy, and showing that random forests are well suited to learning complex feature relationships without overfitting.
To further analyze and explain the performance of CSRFC, we conducted experiments to identify the classes of synonyms on which CSRFC outperforms BLC and CSLC. To that end, we introduced two new measures, Interesting Recall and Non-interesting Recall.
Interesting Recall: Recall over the set of candidates that are not general synonyms, but are contextual synonyms.
Examples of interesting contextual synonym candidates:
Non-interesting Recall: Recall over candidates that are general synonyms (and are by definition also contextual synonyms).
Examples of non-interesting contextual synonym candidates:
To our delight, CSRFC with a low model score threshold (0.3) outperformed BLC on all fronts: achieving simultaneously higher precision, higher interesting recall and higher non-interesting recall.
We thank Nicole Filippelli and all of our rating teams for helping us generate data for this task.
We thank Andrei Damian for contributions with defining the task, user labeling guidelines, and final implementation.
Dataset and guidelines
Labeling guidelines can be found here: https://docs.google.com/document/d/1_PldhHh2djTi2xbaQj_MVk_KXdTvdh1l1Bpmr7qio9g/edit
Labeled dataset can be found here: