Judgement Rating Best Practices

This document outlines how ratings should be used for each document when judging relevance for a query. This assumes you are using one of the communal graded scorers (CG@10, DCG@10, or NDCG@10).
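To make the three scorers concrete, here is a minimal sketch of how they can be computed from the ratings of one query's results, listed in the order the search engine returned them. It assumes the common formulation with linear gain and a log2 position discount, and it builds the ideal ordering for NDCG from the same rated list; the scorers in your tool may use a different gain, discount, or ideal list, so treat the function names and formulas here as illustrative only.

```python
import math

def cg_at_k(ratings, k=10):
    """Cumulative gain: plain sum of the top-k ratings."""
    return sum(ratings[:k])

def dcg_at_k(ratings, k=10):
    """Discounted cumulative gain: lower positions count for less.

    Position i (0-based) is discounted by log2(i + 2), so the first
    result is divided by 1, the second by log2(3), and so on.
    """
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k=10):
    """DCG divided by the DCG of the same ratings sorted best-first."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

# Hypothetical ratings for one query, in returned order.
ratings = [3, 2, 2, 1, 0]
print(cg_at_k(ratings), dcg_at_k(ratings), ndcg_at_k(ratings))
```

Because the discount shrinks with position, DCG and NDCG reward putting the highest ratings first, which is why consistent ratings matter more near the top of the results.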

What are we rating?

We are rating how relevant the customer would consider each result for a query. More relevant results are rated higher, and the search engine should return results in descending order of rating.

It is critical to behave as a customer and to think about how a customer would feel when rating! Don’t fall into the trap of taking into account how the search engine or the currently configured algorithm works, or of judging the results with your own knowledge of the system’s inner workings.

Customer feelings per rating

3 "This is what I am looking for! I’m going to buy one."
2 "These results are good! Now I can look more and decide which one to buy."
1 "These results aren’t what I’m looking for, but I can see why they are returned."
0 "These results are terrible! Maybe I’ll look somewhere else."

What’s a 3 (perfect)?

Customer feeling: "This is what I am looking for! I’m going to buy one."

A ‘3’ (perfect) is usually reserved for exact results in response to a targeted information need query. These should be one or two products that the customer is specifically looking for. A good search metric for ‘3’ results is Precision@1 or Precision@4.

This one result in the first position would be scored a ‘3’ for the query samsung 65" 8k flat screen
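As a rough illustration of the Precision@1 and Precision@4 idea, the sketch below counts how many of the top k results were rated a 3. The function name and the `threshold` parameter are assumptions made for this example and are not part of any particular scorer.

```python
def precision_at_k(ratings, k, threshold=3):
    """Fraction of the top-k positions rated at or above `threshold`.

    `ratings` lists the ratings in returned order. Missing positions
    (fewer than k results) count as misses, because we divide by k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for r in ratings[:k] if r >= threshold)
    return hits / k

# Hypothetical ratings for a targeted query, in returned order.
ratings = [3, 2, 3, 0]
print(precision_at_k(ratings, 1))  # only the first position
print(precision_at_k(ratings, 4))  # the first four positions
```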

What’s a 2 (good)?

Customer feeling: "These results are good! Now I can look more and decide which one to buy."

A ‘2’ (good) is for relevant results in response to an exploratory or survey information need query. 2’s are what the customer is looking for, but the query isn’t specific enough to pin down exact results.

Only the following documents would each score a ‘2’ for the query gaming keyboard

What’s a 1 (fair)?

Customer feeling: "These results aren’t what I’m looking for, but I can see why they are returned."

A ‘1’ (fair) is used for results that are likely not relevant for the query, and are usually considered to be noise. It can be OK when there are 3’s and 2’s at the top of the results, and 1’s are at the bottom. It’s not OK when 1’s are mixed in with 2’s and 3’s, and they should never be at the top.

For example, for the query gaming mouse, I would score the mouse pads as a ‘1’, but only because there are gaming mice in the result set!
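The placement rule for 1’s can be checked mechanically once a query has been rated. This is a minimal sketch, assuming the ratings are listed in returned order; the function name is hypothetical. It flags a result list when a 1 sits at the top or when any 1 appears above a 2 or 3.

```python
def fair_results_misplaced(ratings):
    """Return True if 1-rated results are at the top or mixed in
    above 2- or 3-rated results; False if they only trail them."""
    if not ratings:
        return False
    if ratings[0] == 1:
        return True  # a 1 should never be the first result
    seen_fair = False
    for r in ratings:
        if r == 1:
            seen_fair = True
        elif r >= 2 and seen_fair:
            return True  # a 2 or 3 appears below a 1
    return False

print(fair_results_misplaced([3, 2, 2, 1, 1]))  # 1's only at the bottom
print(fair_results_misplaced([3, 1, 2]))        # a 1 mixed in above a 2
```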

What’s a 0 (poor)?

Customer feeling: "These results are terrible! Maybe I’ll look somewhere else."

0’s are used when there is clearly something wrong. Most queries should never show 0’s; they should only show 1’s, 2’s and 3’s.

For example, the current results are poor for the query computer
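Since 0’s signal that something is clearly wrong, it can help to list the queries that show any. The sketch below assumes the ratings are held in a dictionary keyed by query text; the data and the function name are made up for illustration.

```python
def queries_with_poor_results(ratings_by_query, k=10):
    """Return, in alphabetical order, the queries that have at
    least one 0-rated result in their top-k positions."""
    return sorted(
        query
        for query, ratings in ratings_by_query.items()
        if 0 in ratings[:k]
    )

# Hypothetical ratings, in returned order, for two queries.
ratings_by_query = {
    "computer": [2, 0, 1],
    "gaming mouse": [3, 2, 1],
}
print(queries_with_poor_results(ratings_by_query))
```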
