Human raters play a critical role in evaluating the accuracy of search results. It’s essential to use data generated by human raters to train and evaluate search ranking models to better serve results at scale. Rating search queries is a complex task (more so than e.g. labeling bounding boxes for CV) because search is highly contextual. Having a tight set of guidelines is imperative to get quality data and high rater agreement.
In a continued effort to promote transparency around Neeva’s search, we are releasing our query result rating instructions that guide our human evaluations. These are focused on Technical Queries as this vertical contains less ambiguous queries. We will release broader guidelines in the future.
Query Result Rating Instructions
10 point scale
Decide whether the web result shown for a query issued on a search engine is relevant and of good quality.
Contents
Goal
Components of Neeva Search Query Relevance tasks
How to complete the task
Query understanding
Page Quality
Low Quality
Dead Pages
Foreign Language
Malware
Clones
Paywall
Porn/NSFW
Medium Quality
High ad load
5+ years old pages
10+ years old Stackoverflow pages
Page loads slowly
Format of page makes it difficult to extract information
Forked github repo
Question page with no response
High Quality
Page Match
1: Significantly Poor Match
2: Especially Poor Match
3: Poor Match
4: Soft Match
5: On Topic but Incomplete Match
6: Non-Dominant Match
7: Satisfactory Match
8: Solid Match
9: Wonderful Match
10: Vital Match
Goal
In this task you will help a search engine user (think someone using Google/Bing/DuckDuckGo/Neeva) find the best websites/images/videos (the Search Result) to answer their question (the Query).
Please click through each of the Search Results presented and rate the relevance of all the websites/images/videos.
For each Query, first try to assess what the user’s intent is and how specific the answer to the query should be. Read all the results for the Query, look the query up on other search engines, and estimate the best answer.
For each Search Result, make a decision about how well it answers the user’s Query.
In making your evaluation, you should consider two things about the result:
Page Quality: The quality and trustworthiness of the Search Result
Page Match: How good the Search Result is at answering the Query
Components of Search Query Relevance tasks
There are 3 parts to the Neeva search query relevance task:
Query understanding: This is only about the query but you may use results to help understand. Indicate the intent of the user query and classify the query by intent.
Page quality rating: Evaluate the page only. Rate the quality of a web page independent of the query, based on the quality of its content and the amount of ads on the page.
Page match rating: Evaluate the content on the page with respect to the query. Determine how likely users are to find a web page relevant to a query.
How to complete a Search Relevance Task
1. Open the task or navigate to the assigned query.
2. Evaluate the query without looking at the provided results. If you already know what the query is asking and what the answer is, great! More likely, you’ll need to investigate a bit; use these steps:
Be aware of anything you can glean from the query, for example language, type of request, etc
Search the query on other search engines.
For the sake of this rating task, when we mention “other search engines”, please look at Google, Bing, and DuckDuckGo.
Looking at the search results page, note the type of results you’re receiving.
If the query does not mention any Devices/OSs/Tools/Languages, please note any dominant Devices/OSs/Tools/Languages.
To determine dominant intent, feel free to search around and judge whether something is a dominant intent for a query based on the number of mentions of that intent in the top results. Any Device/OS/Tool/Language mentioned in these results would be considered Dominant. There can be more than one.
3. Once you have finished evaluating the query, mark the Query Intent in the appropriate field.
4. Open the first link in an incognito window of a browser with all ad/tracker blocking turned off. Use the guide to determine the page quality.
5. Mark down the Page Quality in the appropriate field. Note your reason in the appropriate field.
Please note, there might be many reasons a page is low or medium quality; you only need to mark one.
If the page is not in English, it should be marked Foreign Language.
After that, if the page is a clone, it should be marked Clone.
After that, any matching reason for that category can be marked.
6. Mark down the Page Match in the appropriate field. Note your reason in the appropriate field.
Please note, there might be many reasons a page match falls into a particular category, you will only need to mark one.
7. Repeat steps 4 through 6 until you’re done with all of the results for the query.
8. Move to the next query.
Query understanding
At the beginning of each task, you should first try to understand the intent of the query. You should think about questions like:
Does the query mention any specific device, OS, tool, or programming language or can you easily infer it from the query?
Do the other search engines include results for multiple devices/OSs/tools/languages?
We suggest that the rater first understands how specific the answer to the query should be by looking at all results in the session as well as looking at other search engines.
You’ll then be asked “Query Type”. You’ll need to identify the type of query presented which will help you better answer the relevance of results after this. Here are the types to choose from:
The “How to” category covers query intent where the person is seeking instructions on how to complete a task. This could be step by step instructions or code. The relevant results will return steps, instructions, or code snippets. The user knows the solution they want, but needs to know how to do it.
This user does not know the exact solution they’re looking for, but is looking for a problem definition. They want to know what went wrong. The relevant pages will also often have solutions if possible.
This user is looking to learn or define the query.
This user is looking to discover or purchase a product, either physical or virtual.
This user is looking to discover a single page or person.
This user has not given enough information to determine a clear intent.
Page Quality
Page quality ratings are query-independent. The goal is to determine if there are aspects of a page that make useful content difficult to digest.
You must open each page or linked PDF in incognito mode and without any ad blocker to determine the appropriate page quality rating. If you are not located in the US please use a VPN to access these sites, as this task is evaluating for US based users.
There are 3 levels of quality:
High quality
Medium quality
Low quality
We will review this section from low to high. Low quality pages are egregious and easy to identify. Medium page quality has the most specific rules, covering ads, age, and formatting. High page quality is generally anything that is not low or medium.
When you review a site, first check if it’s a low quality page; if it is, do not continue. If it is not a low quality page, continue to medium, then to high.
Low Quality pages
Low quality pages fall into a few clear-cut categories. Only one of these categories has to be met for a page to be categorized as low. The categories are:
Dead pages
Malware pages
Porn/NSFW pages
Foreign Language
Pages behind a paywall
Clones
[Dead pages] This covers cases when
404 errors - the page does not load
the page loads, but there are messages that the content was moved or could not be found anymore
[Malware pages] If your computer warns you that the website is not safe, it should fall under this category. This includes, for example, pages served through HTTP.
[Porn/NSFW pages] Pages with pornographic/NSFW content should be marked as low quality.
[Foreign Language] If a page is not usable by an English language speaker because of non-English foreign language content, it is low quality. A browser translation is not sufficient to mark the page as English.
[Pages behind a paywall] Paywalls are the case when the page asks you to pay money or to subscribe in order to see the content of the page.
If you hit a paywall, try to load the page in “Incognito mode” in Chrome or “Private Browsing” in a second browser.
If it still does not load and the content is severely truncated, do not assume that it answers the user query and consider it Low Quality.
Otherwise, if it loads without the paywall then the page is not low quality; you may now judge the utility based on the content
[Clones] Clones are pages that copy their content from another page. Clones are the exception among low quality pages in that their page match will generally be the same as their counterparts’. Pages where paragraphs are cloned but then enriched with unique information are NOT clones (e.g. Cliff Notes, a quote in a larger paper or article).
To find clones, you may need to pay attention to other results in the same task or to top results on other search engines. Here are a few examples:
Medium Quality pages
Medium quality pages are defined by ads, age, and formatting. Only one of these categories has to be met for a page to be categorized as medium. The categories are:
[3+ ads when scrolling / 1 large banner ad / pop-up interstitial or video ads] Ads are the most common reason a page gets marked medium quality. In order to detect this, be sure to
Open the page in incognito mode
Let the page load
Move around the cursor
Scroll down
Ads on the bottom of the page generally do not count towards this. Ads in videos do not count. The ads should be an intrusive or distracting experience and take away from the actual content.
[Page is 5+ years old (excluding stackoverflow/stackexchange articles or academic papers)] Old content for technical queries loses relevance because new software versions and new products make websites outdated. Please note that page age is determined by the last answer for forums, and by the most recent update for blogs.
Use the following guidelines for age:
Old applies to technical articles more than 5 years old
Old applies to technical articles more than 2 software versions old
Oldness does not apply to academic papers like PDFs
For stackoverflow/stackexchange articles, old applies for 10 years since last answer
This result is from 07 May 2014. That’s more than 5 years ago, so the result is rated as Medium Quality.
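The age guidelines above can be read as a small decision rule. The sketch below is illustrative only; the page types and thresholds mirror the bullets above, and `versions_behind` is a hypothetical parameter for "more than 2 software versions old".

```python
def is_old(page_type: str, years_old: float, versions_behind: int = 0) -> bool:
    """Apply the age guidelines: academic papers never age out,
    stackoverflow/stackexchange ages out 10 years after the last answer,
    and other technical articles age out after more than 5 years or
    more than 2 software versions behind."""
    if page_type == "academic_paper":
        return False  # oldness does not apply to academic papers
    if page_type in ("stackoverflow", "stackexchange"):
        return years_old >= 10  # measured since the last answer
    return years_old > 5 or versions_behind > 2
```

For the 07 May 2014 example above, `is_old("article", 8)` would be true, so (absent any low quality reason) the page lands in Medium Quality.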
[Page loads slowly] Some websites do not load quickly, which lowers the quality of the Search Result. These cases are considered Medium Quality. If you are unsure, you can always verify the loading speed in a second browser.
[Format of page makes it difficult to extract information (e.g. many random and distracting videos and images)] Some sites have text formatted in a way that makes it hard to digest. They may have random / distracting video and image links as well. Here’s an example:
This is a screenshot of 3 steps of a WikiHow article.
Each of these steps includes a screenshot of another website, making it hard to tell what content belongs to this site versus what is an example.
Many links are not clickable.
This all makes it generally difficult to find the content.
[Forked github repo] Forks of github repos are considered medium quality, unless the query is specifically searching for the fork. For example, https://github.com/abhinav700/finta is a fork of the main finta repo at https://github.com/peerchemist/finta. You can see that this repo is forked here:
[Pages behind a login or non-dismissable email capture] Pages that require a login or email entry to view (and that cannot be dismissed in a second browser) are medium quality. In these cases payment information is not required, but an exchange of data is.
Please note: There are sites like Medium or Quora with daily article limits, and you might hit the daily limit of views. Websites with daily limits can still be high page quality sites.
Please note: If the site suspects you of being a robot it does not fall into this category.
[Question page with no response] Forum questions, e.g. on Stackoverflow, that have no answer are considered medium quality, e.g. this url, which has no correct answer as of the writing of these instructions.
High Quality pages
High quality pages are defined as not low or medium quality.
These pages should:
Meet the age criteria
Meet the ads criteria
Be well formatted
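The low → medium → high check order described in this section can be summarized as a simple cascade. This is an illustrative sketch, not part of any actual rating tool; the reason labels are hypothetical, and the order of the low quality reasons reflects the marking priority above (Foreign Language first, then Clone, then any other matching reason).

```python
# Hypothetical reason labels, ordered by the marking priority in this guide.
LOW_REASONS = ["foreign_language", "clone", "dead_page",
               "malware", "porn_nsfw", "paywall"]
MEDIUM_REASONS = ["heavy_ads", "old_content", "slow_load", "bad_format",
                  "forked_repo", "login_wall", "unanswered_question"]

def rate_quality(reasons_observed):
    """Check low quality reasons first and stop at the first match;
    only if none apply, check medium; otherwise the page is high quality."""
    for reason in LOW_REASONS:
        if reason in reasons_observed:
            return ("low", reason)
    for reason in MEDIUM_REASONS:
        if reason in reasons_observed:
            return ("medium", reason)
    return ("high", None)  # high quality = neither low nor medium
```

Note that a page with both a low and a medium reason (e.g. a clone with heavy ads) is marked low, since the low check runs first and the cascade stops there.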
Page Match
This section is describing how to judge the match between the query and a webpage. Before measuring the match of each page for the query, it is important to understand what the search engine user is looking for. We suggest that you first understand what the user is looking for by checking other search engines for the query intent.
Roughly speaking, you can think of 10 levels of match going from 1 (a dead page) to 10 (a bullseye match).
1: Significantly Poor Match
Does not load, page is inaccessible.
2: Especially Poor Match
Page is wholly unrelated to the query. Missing key terms.
3: Poor Match
Page may have some query phrases, but not related to the query.
4: Soft Match
Page is related to query, but broad, overly specific, or tangential.
5: On Topic but Incomplete Match
Page is on topic for the query, but not useful in a wide scope, potentially due to incomplete answers or older versions.
6: Non-Dominant Match
Page is related to the query and useful, but not for the dominant intent shown.
7: Satisfactory Match
This page satisfies the query, but the user may have to look elsewhere to round out the information.
8: Solid Match
This page satisfies the query in a strict sense. There is not much extra, or beyond what is asked for.
9: Wonderful Match
This page satisfies the query in a robust, detailed sense. It anticipates questions/pitfalls that might come up and/or adds appropriate framing to the query.
10: Vital Match
This is a bullseye match. It is not available on all queries. The user has found exactly what they were looking for.
1: Significantly poor match.
An inaccessible or unreadable (foreign language) low quality site will always be a significantly poor page match, as the page is not accessible.
2: Especially poor match.
This result may include some of the query terms, but not in the correct combinations, and some terms might be missing. While the best alternative might also be missing terms, in this case the result does not answer the query. It is completely off topic.
3: Poor match.
Sometimes, the result may have sufficient coverage of important query keywords. However, the result may take an incorrect interpretation of those keywords or may not have sufficient information to satisfy the query intent.
4: Soft match.
A soft match is related to the query, but in a minimal way. It could be an especially narrow intent, an adjacent intent, incorrect information, or underdeveloped information. We know this result will not satisfy the query, but it is somewhat related. Soft matches generally fall into a few categories:
Result is related but not direct answer - A result could address an intent adjacent to the query, only partially addressing the query intent itself. In other words, the result is addressing a specific thing about a concept while the query is asking for another thing about the concept.
This is a rare occurrence; it is far more likely that something is relevant or irrelevant, and this case will end up rated higher or lower.
If the topic of the result is far enough apart from the query, it may qualify as an especially poor match.
Simple Content Aggregator - As one case of relevant information being insufficient, if a page just aggregates a small list of urls addressing the user intent, but does not add additional information, it is considered a soft match.
Information is Too Outdated to be Relevant - A result may be too old for a query. If the result is sufficiently outdated and no longer applicable, it can be a soft match.
Context Too Narrow to be Relevant - Query language exists on the page but not in the correct/relevant context. This can happen if a query is narrow in its ask and the page is more general, or vice versa.
5: On Topic but Incomplete match.
Query language exists on the page, but it is ambiguous whether the result is helpful, and hence it may only satisfy a very small number of users. The reason these results are not a 4 is that for a 4 you know the page is not going to be relevant to the query, while for a 5 it is not immediately obvious that the results are useless. But there is no clear way of inferring how useful the results are, which could be because the page is difficult to fully understand or has sparse information. These are rare cases, but they fall into several categories:
Other matches that are not so off topic as to be a Soft Match, but do not fit into a non-dominant category
Relevant Information in Result is Insufficient - A result could be about the query intent, but simply may not have enough content or information to be helpful to the user and is hence considered a on topic but incomplete match. A relevant question with no answer falls into this category.
Less relevant profile page for person / organization / group - A specific case of the narrow context designation is profile or social pages. As described in the higher match sections below, some profile pages have a relatively common intent, making them higher matches. However, in some cases the profile page is a rather unlikely intent.
It’s not just the site, but whether it’s relevant for the specific query. If it is not on the first page of other search engines, it is incomplete.
See Relevant Profile Page For Person / Organization / Group for information on when profile pages can be higher matches.
6: Non-Dominant match.
Some queries have multiple possible interpretations of varying likelihoods. For these queries, if a result takes on an uncommon interpretation that is useful only to a small number of people, it can be a non-dominant match.
Non-dominant match is not a default state. If a result is a reasonably common interpretation of the query, it could still be a higher match. If it is extremely uncommon, it would move lower. Research and weigh the likelihood of the interpretations to make the judgment. The following examples illustrate this point:
User reads the page for information but the result matches the intent of the query only partially and many people will not be satisfied with the result because it is not a dominant intent
7: Satisfactory match.
There are a few cases that will be considered satisfactory. The thought is that a good fraction of users will be satisfied with this result, but we cannot be sure that a large number of users will.
Sometimes user queries are overspecified, and there are no good results that match all the query words. In this case, we want to allow some results to be a satisfactory match: they may not match the query exactly, but still provide a reasonable interpretation of the query when no other result gives a better interpretation. Satisfactory matches fall into several categories:
Best alternative to overspecified search
Non-primary landing pages
Listing a concept
Good but less popular
Best alternative to overspecified search - We can determine which words in the query can be safely ignored by using the “strikethrough” pattern on other search engines, where the engine crosses out a few query words because it treats them as missing.
Non-primary landing pages - For people and organizations, their non-primary landing pages are satisfactory matches; the main landing pages will be a vital match. Here are some relevant social results:
Listing a concept - For queries that are just listing a concept, dictionary results are a 7.
Good but less popular - Results that are a good match but may be less popular (low installs, low likes) are a 7.
[glob symlinks node] The https://www.npmjs.com/package/symlinks package is much less popular compared to https://www.npmjs.com/package/glob, even though it is relevant. So it should be a 7.
8: Solid match.
Solid matches are good, relevant answers to a query. To qualify as a solid match, a result must sufficiently address fairly common intent(s) of the query. In general the user would not have to search again to find an answer, but may choose to. If a query is over- or under-specified, its results may fall into the satisfactory or non-dominant match categories. Here are some categories:
Inferred dominant intent
Information is buried in a larger context
Other solid matches
Inferred dominant intent. The result is about a specific intent (inferred device, tool, language, or os) that is not explicitly mentioned in the query; and this intent is dominant enough to be useful for most people.
Inferred dominance can be determined by checking whether the device/OS/tool/language is mentioned at least twice in the top 5 results on other search engines, in either the title or the snippet. If the device/OS/tool/language or other context is overly specific, the result can be a 6 or below.
The page is relevant for an inferred dominant intent (>=2 times on top 5 search results)
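As a sketch, the “mentioned at least twice in the top 5 results” heuristic could be counted like this. The candidate keyword list and the whitespace tokenization are simplifying assumptions for illustration; a rater would judge mentions by reading the titles and snippets.

```python
def inferred_dominant_intents(top5_results):
    """Count how many of the top 5 results mention each candidate
    device/OS/tool/language in their title or snippet; any candidate
    mentioned in at least 2 results is treated as a dominant intent.
    `top5_results` is a list of (title, snippet) pairs."""
    candidates = ["android", "ios", "windows", "macos", "linux"]  # illustrative
    counts = {c: 0 for c in candidates}
    for title, snippet in top5_results[:5]:
        tokens = set((title + " " + snippet).lower().split())
        for c in candidates:
            if c in tokens:
                counts[c] += 1  # each result counts at most once per intent
    return [c for c, n in counts.items() if n >= 2]
```

For a query like [podcast apps], both Android and iOS would likely come back as dominant, consistent with the note above that there can be more than one dominant intent.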
Information on the page answers the query but the information, while easily retrievable, is buried inside a bigger context.
9: Wonderful match.
Wonderful matches are complete, robust, relevant answers to a query. To qualify as a wonderful match, a result must sufficiently address common intent(s) of the query.
In general the user would not want to search again to find an answer.
Often these are related to discovery, so you might have many equally detailed, well presented lists. If the query is for product discovery there can be many 9s; if it’s not, these lists might be an 8 or a 7.
[Relevant Tools Solution / Products] Examples are listed below. A high quality tool or website providing a software solution that addresses the intent of a query looking for a solution is a wonderful match.
[Relevant List of Items] Examples are listed below. If there is a reasonable interpretation of the query which is satisfied by a list of items, then an aggregated (curated or user generated) list that directly addresses the intent of the query is a wonderful match. The page must be more than simply a short aggregated list; it should either add additional information/opinion about the list of items or contain more than 5 items. We essentially want to look for cases where someone has put in a good amount of work accumulating all the items.
However, there are some simple aggregators which simply do not add additional value on top of the items. Please see simple content aggregator for how lists of items without sufficient information can be a soft match.
10: Vital match.
Think Bullseye. This match is for cases when the query is expecting to directly navigate to a certain result. The result fully meets the query intent for most users.
For a result to be a vital match, the following must hold:
The result must be more than just relevant and current.
Almost every user issuing the query would not need additional results.
It will always be an official page or a long title match.
The result must be the #1 result on other search engines.
Note: being a top result on other search engines is a required condition to be a vital match, but NOT every top result on other search engines is a vital match.
Be conservative when applying a Vital Match rating. Not every query will be eligible for a vital match (for example [order by] with no other details). If in doubt, apply a lower rating.
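The vital match conditions above are conjunctive: failing any one of them rules out a vital match, and passing all of them only makes a result eligible. A minimal sketch of that logic (all argument names are hypothetical):

```python
def could_be_vital(official_or_long_title_match: bool,
                   users_need_nothing_else: bool,
                   top1_on_other_engines: bool) -> bool:
    """All conditions must hold for a result to even be eligible.
    Being #1 on other engines is necessary but not sufficient, and
    some queries have no vital match at all, so when in doubt rate lower."""
    return (official_or_long_title_match
            and users_need_nothing_else
            and top1_on_other_engines)
```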
Vital match scenarios include:
[Bullseye Official Website / Webpage / Documentation] Exact query response is the most prominent information presented to the user on the returned page. Official document required.
[Bullseye Long Text Or Title Match] The query is clearly seeking a specific webpage (e.g. a blog post), and you can be very confident when matching the query with the webpage that the webpage is the only intended result. For example, if the query is a long title of a specific blog post, that would be a vital match.
If one page has the same blog post text as the original but is not the original post, this is NOT considered a Vital match. For query [CatBoost vs. Light GBM vs. XGBoost], the original post is #1 result so it would be a Vital Match, but the #2 result has a copy of the original post so it is only a Wonderful or Solid Match.
NOTE: Two URLs may be slightly different or redirect to one another. If they point to the same exact page, they should be treated the same when rating vital match. For query [powerapps], both https://powerapps.microsoft.com/ and https://powerapps.microsoft.com/en-us/ would be marked as vital matches.
NOTE: If there is a snippet on other search result pages, both the snippet and the top full result could be marked as vital. That being said, it still has to be a long title match or bullseye. In general, pages with snippets are FAQs or help pages and are not going to be a vital match.
The following are cases that would NOT qualify as vital match:
Tips and Tricks
Aligning to the scale
You can generally figure out where on the scale an item belongs based off of three questions:
Is it related to the query?
Is it useful for the dominant intent?
Is it a bullseye?
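Those three questions can be read as a rough decision tree over the 10-point scale. The bands below are one illustrative reading of this document’s scale, not an official rubric:

```python
def score_band(related: bool, useful_for_dominant_intent: bool,
               bullseye: bool) -> tuple:
    """Map the three alignment questions to an approximate score band:
    a bullseye is a 10; unrelated results land around 1-3; results
    related but off the dominant intent land around 4-6; related and
    useful results land around 7-9."""
    if bullseye:
        return (10, 10)
    if not related:
        return (1, 3)
    if not useful_for_dominant_intent:
        return (4, 6)
    return (7, 9)
```

The exact score within a band then comes from the detailed level descriptions above (e.g. completeness and robustness separate a 7 from a 9).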
Dominant Intent
Dominant intent is what the user issuing the query is most likely looking for. If it is a company name, it’s information on the company. If it’s an error message, it’s a solution to the problem. Sometimes the query is not specific enough, or the rater is not sure what the dominant intent might be. In that case, use other search engines’ result pages to help find the dominant intent.
When determining dominant intent, feel free to search on other search engines and use your judgment to determine when something is a dominant enough intent.
The most likely use case for this task will be determining a programming language or OS when one is not specified. For example, [podcast apps] will return results for Android, iOS, and agnostic results multiple times in the top 5.
For an entity/person search, the goal is not to determine what information the user is looking for but who they are more likely looking for. For example, [Jane Seymour] can be the actor Jane Seymour OR the historical figure Jane Seymour. Once you have discovered the dominant intent, many different types of pages might be marked as high.
Dominant intent should be judged on the whole of the query, not on what someone might be looking for related to part of the query.
In rare cases there might be two dominant intents.
Entity and Person Matches, Social Pages
For entity and social matches there are no hard and fast rules for how things rate. You might see one Facebook page be a 5 while another is a 9. Some tech companies might have a LinkedIn page that’s a 7; others might have one that’s a 3.
There are many pages that can be relevant to a company and what is relevant can change even for companies that are nearly identical competitors. When making a judgment on these matches please consider:
How much information is on the page
The presentation of the page
The site hosting the information (Medium is more credible than Xanga)
How up to date is the information
How frequently is the page updated
There are no hard and fast rules for this section, except that maintained entity/personal websites are vital matches. For example, amazon.com for [amazon] and aliexpress.com for [alibaba] are both vital matches. Amazon does have a Facebook page that would be a 7; Alibaba does not.
Nora Ephron and Stephen King are both authors/producers/directors. He has a personal website that would be a 10. She does not have a 10 vital match. Both IMDB pages would be a 7.
List Matches
When considering whether or not a list related to the query is a 5, 7, or 9 there are a few things to keep in mind.
Is the query type best satisfied by a list? Are we looking for a product, or something that might have several different solutions?
Is the page we’re looking at a robust, unbiased page?
Are there many items on the page?
A query of [best podcast apps for iphone] might have many options that are 9, options at 8 or 7 if the list is for iOS/Android, and options at 5 if it’s a bulleted list.
A query of [salesforce] might have results for sites with “best CRMs.” With no intent in the query to look beyond Salesforce, this would clearly be a 5, 4, or 3. The same result for the query [salesforce alternative] could be a 7, 8, or 9.