Return raw outputs in TextClassificationPipeline #8328

Merged: 5 commits into master on Aug 4, 2021

Conversation

LysandreJik (Member)

Currently the TextClassificationPipeline does a softmax over the output values when num_labels > 1, and does a sigmoid over the output when num_labels == 1.

As seen in #8259, this may be problematic when systems depend on a different output range.

This PR adds a return_raw_outputs flag that, when set, skips the sigmoid or softmax and returns the raw outputs.

closes #8259
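
A rough usage sketch of the proposed flag (where exactly the flag is accepted is still evolving in this PR, so the call below is illustrative rather than final):

from transformers import pipeline

# Current behaviour: a sigmoid is applied because DialogRPT has num_labels == 1.
classifier = pipeline("sentiment-analysis", model="microsoft/DialogRPT-updown")

# With the proposed flag: the logits are returned untouched.
raw_classifier = pipeline("sentiment-analysis", model="microsoft/DialogRPT-updown", return_raw_outputs=True)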

@julien-c (Member) commented Nov 5, 2020

Will that flag be configurable on a model-level for the hosted Inference API, and if yes, how?

@LysandreJik (Member, Author)

Controlling it through the configuration, as specified in the original issue, would be the easiest way to expose it through the API. I haven't thought that far ahead yet, however.

The other option would be to disable the sigmoid over the output and let the user handle that instead, but this would break existing models on the inference API such as the DialogRPT model.

@LysandreJik (Member, Author) commented Nov 5, 2020

Here's a proposal @julien-c (see latest commit).

The current TextClassificationPipeline has its task set to "" because it is not initialized from pipeline(task).
This proposal sets the default task of TextClassificationPipeline to "text-classification". Since the pipeline fetches task_specific_params according to the given task, it can then pick up the following configuration option (which can of course already be set in the configuration on S3):

config = GPT2Config.from_pretrained("microsoft/DialogRPT-updown", task_specific_params={"text-classification": {"return_raw_outputs": True}})
model = GPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown", config=config)

This will be used as the default value for the pipeline, and allow users to specify pipeline-specific arguments directly in their model configuration.

The proposal was only implemented for the TextClassificationPipeline, but it would need to be extended to all other pipelines to stay consistent. Let me know if this is an approach you would be interested in.
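
A quick sketch of how the pipeline would then pick this default up (hedged; it assumes the config and model objects created just above):

from transformers import GPT2Tokenizer, TextClassificationPipeline

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialogRPT-updown")

# The pipeline's task now defaults to "text-classification", so it reads
# task_specific_params["text-classification"] from the config and uses
# return_raw_outputs=True as its default.
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
scores = classifier("How are you?")  # raw logits, no sigmoid applied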

@julien-c (Member) commented Nov 5, 2020

Pinging @Narsil and @mfuntowicz for their thoughts on this (we have ways to get configurable params on the API; I just wanted to make sure we could use them for this).

@Narsil (Contributor) left a comment

LGTM.

A nit, and more of a question about argument management.

@@ -516,7 +516,7 @@ def __init__(
         if framework is None:
             framework = get_framework(model)

-        self.task = task
+        self.task = task or self.task
Contributor:

This is not defined for Pipeline and most others, so we probably should change that to

getattr(self, 'task', "")

self.return_all_scores = return_all_scores
self.return_raw_outputs = return_raw_outputs if return_raw_outputs is not None else False
Contributor:

Looks good, but we should probably be more consistent in how we manage pipeline arguments vs. call arguments.

@laurahanu (Contributor)

I've also run into this issue recently when uploading some multi-label text classification models to the Hugging Face Hub, where it seems like the default activation for multiple classes is a softmax.

Having it return raw outputs is definitely useful; however, it still wouldn't show the scores I would expect in the Inference API, which in our case means applying a sigmoid over the results so that more than one label can be positive.

Would it also be useful to have an argument like output_nonlinearity for the desired activation function, e.g. sigmoid or softmax?
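
For reference, a minimal NumPy sketch of the difference between the two activations (independent of the pipeline code):

import numpy as np

def sigmoid(logits):
    # Independent per-label probabilities: several labels can be above 0.5 at once.
    return 1 / (1 + np.exp(-logits))

def softmax(logits):
    # Mutually exclusive probabilities: the scores sum to 1.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.5, -3.0])
print(sigmoid(logits))  # two labels above 0.5 -> suits multi-label classification
print(softmax(logits))  # sums to 1 -> suits mutually exclusive classes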

@LysandreJik (Member, Author)

Hi @laurahanu, thanks for your proposal. I'll try and see how to best integrate that in this PR.

@LysandreJik LysandreJik force-pushed the text-classification-pipeline-raw-output branch from 669e668 to 9a7ae2a Compare November 30, 2020 22:40
@LysandreJik (Member, Author) left a comment

@julien-c @Narsil @mfuntowicz @laurahanu please review and tell me if this implementation fits your needs.

Comment on lines 1005 to 1006
return_all_scores = return_all_scores if return_all_scores is not None else self.return_all_scores
function_to_apply = function_to_apply if function_to_apply is not None else self.function_to_apply
Member Author:

Per @mfuntowicz's request, the arguments can now be handled via the __call__ method of the pipeline, as well as the __init__ and the model configuration.
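
A sketch of the three entry points (the precedence shown here, call over __init__ over config, is my reading of the change; model and tokenizer are assumed to be a text-classification checkpoint):

from transformers import pipeline

# __init__ / pipeline-level default
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, function_to_apply="sigmoid")

classifier("I really liked it.")                            # uses the pipeline-level default
classifier("I really liked it.", function_to_apply="none")  # per-call override
# A default can also come from the model config, e.g. via task_specific_params as shown earlier.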

Contributor:

Minor NIT: I find the following slightly more readable. Feel free to ignore if you don't agree.

if return_all_scores is None:
    return_all_scores = self.return_all_scores

if function_to_apply is None:
    function_to_apply = self.function_to_apply

Comment on lines 1014 to 1028
if function_to_apply == "default":
if self.model.config.num_labels == 1:
scores = sigmoid(outputs)
else:
scores = softmax(outputs)
elif function_to_apply == "sigmoid":
scores = sigmoid(outputs)
elif function_to_apply == "softmax":
scores = softmax(outputs)
elif function_to_apply.lower() == "none":
scores = outputs
else:
scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
if self.return_all_scores:
raise ValueError(f"Unrecognized `function_to_apply` argument: {function_to_apply}")

if return_all_scores:
Member Author:

Handles several cases (see the short sketch below):

  • Default case: sigmoid for a single label and softmax for multiple labels
  • Sigmoid for all (@laurahanu)
  • Softmax for all
  • No function applied
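
Illustrative calls covering each case (classifier and text are placeholders; the function_to_apply values follow the list above):

classifier(text)                               # default: sigmoid if num_labels == 1, else softmax
classifier(text, function_to_apply="sigmoid")  # sigmoid for every label
classifier(text, function_to_apply="softmax")  # softmax over all labels
classifier(text, function_to_apply="none")     # raw logits, nothing applied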

Contributor:

Yes, this suits my case, thank you!

Comment on lines 2357 to 2358
task = "text2text-generation"

Member Author:

All pipelines now have a default task, for consistency.

-    small_models = [
-        "sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english"
-    ]  # Default model - Models tested without the @slow decorator
+    small_models = ["distilbert-base-cased"]  # Default model - Models tested without the @slow decorator
Member Author:

This change was necessary as the previous model was not outputting valid results (0.0 on everything).

@Narsil (Contributor) left a comment

I think we're really missing a test on the actual SHAPE and values of the pipelines' outputs when called with different arguments.

The rest is more of a proposal of nits, if you agree with my remarks.

# Compare all outputs (call, init, model argument)
for example_result_0, example_result_1 in zip(pipeline_output_0, pipeline_output_1):
    # Iterate through the results
    print(np.allclose(example_result_0["score"], example_result_1["score"], atol=1e-6))
Contributor:

Shouldn't this be an assert instead of a print?

Member Author:

Indeed, good catch!

return _string_output, _string_list_output

string_output = classifier(string_input)
string_list_output = classifier(string_list_input)
Contributor:

I think these tests are missing an actual check of what string_output and string_list_output look like.

Ideally self.assertEqual(string_output, {'label': XX, 'score': XX}); realistically, because assertEqual won't work on floats, something like:

self.assertTrue(isinstance(string_output, dict))
self.assertEqual(set(string_output.keys()), {'label', 'score'})
self.assertEqual(string_output['score'].shape, [12, 1064])
# checking only one value is probably good enough, but at least we will know if the output changes
self.assertAlmostEqual(string_output['score'][0, 0], -12.125)

    scores = sigmoid(outputs)
elif function_to_apply == "softmax":
    scores = softmax(outputs)
elif function_to_apply.lower() == "none":
Contributor:

I personally would leave out .lower() and be more strict.

else:
-   scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
-   if self.return_all_scores:
    raise ValueError(f"Unrecognized `function_to_apply` argument: {function_to_apply}")
Contributor:

Could we add the valid choices to the error message? I don't like adding them manually; doing it programmatically is possible, but that would complicate the code overall, so I'm not sure about this.
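
A minimal sketch of the programmatic option, assuming the valid names end up collected in an enum (as the later ClassificationFunction iteration does):

valid_choices = ", ".join(f.name.lower() for f in ClassificationFunction)
raise ValueError(
    f"Unrecognized `function_to_apply` argument: {function_to_apply}. Valid choices are: {valid_choices}."
)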

Member Author:

Indeed, good idea.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Apr 24, 2021
@laurahanu (Contributor)

Hello, I was just wondering whether there are any updates on allowing the user to choose which function to run on the outputs? It seems like, per the current documentation, the pipeline would still only run a sigmoid over the result if there is one label:

If multiple classification labels are available (model.config.num_labels >= 2), the pipeline will run a softmax over the results. If there is a single label, the pipeline will run a sigmoid over the result.

@LysandreJik (Member, Author)

Sorry for taking a long time to merge this; I wonder if we can't re-use the recently introduced problem_type instead, wdyt @sgugger @abhi1thakur?

@sgugger (Collaborator) commented Jun 7, 2021

Yes, this flag is there for this reason specifically :-)
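
For reference, a short sketch of how a model author could declare the problem type in their config so a pipeline can infer the activation (the model id below is a placeholder):

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("your-org/your-multi-label-model", problem_type="multi_label_classification")
model = AutoModelForSequenceClassification.from_pretrained("your-org/your-multi-label-model", config=config)
# With problem_type set, the pipeline can pick sigmoid instead of the softmax default.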

@laurahanu (Contributor)

Thanks for reopening this and having another look at it! Do you have a timeline in mind for when this would be merged/available to use with the model hub?

@LysandreJik LysandreJik force-pushed the text-classification-pipeline-raw-output branch from 56b33cc to 9e07b95 Compare June 30, 2021 16:18
@LysandreJik LysandreJik requested review from Narsil and sgugger July 1, 2021 07:14
@LysandreJik (Member, Author)

Hey @laurahanu, the PR should now be ready for merging. I'm pinging two team members for review; this should be merged in the coming days and deployed on the API shortly after.

@laurahanu (Contributor)

Great, thank you @LysandreJik!

@sgugger (Collaborator) left a comment

Thanks for working on this!

@Narsil (Contributor) left a comment

I approve this PR because I think it can go in as-is; however, I think a few changes should be made (either in this PR or later):

  • Remove the "default" option (those are usually confusing; None should be the default instead)
  • Change the tests to a more linear approach that includes actual scores (important to make sure we don't break the scores later)

elif self.model.config.problem_type == "single_label_classification":
    self.function_to_apply = "softmax"

def sigmoid(_outputs):
Contributor:

NIT: functions that can live at the file level should, in general, be defined there (it prevents accidental state leaks through variable names).

    shifted_exp = np.exp(_outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)

if function_to_apply == "default":
Contributor:

I think we should stay away from "default" in that case.

IMO this should be taken care of when inferring function_to_apply; that will make the logic a bit simpler to understand.

        scores = sigmoid(outputs)
    else:
        scores = softmax(outputs)
elif function_to_apply == "sigmoid":
Contributor:

NIT: we could use an enum here. It would be a bit more consistent with other pipelines (here function_to_apply is the equivalent of aggregation_strategy for token classification, for instance).

It does make the code a bit more verbose. We would still need to support raw strings as arguments, as they are much simpler to use than importing the enum from somewhere.

(I'm not sure it's worth it, feel free to ignore.)
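
A sketch of what the enum could look like (the diff further below does use a ClassificationFunction enum; the members listed here mirror the string options discussed above and are otherwise an assumption):

from enum import Enum

class ClassificationFunction(Enum):
    SIGMOID = "sigmoid"
    SOFTMAX = "softmax"
    NONE = "none"

# Raw strings stay supported because they can be mapped onto the enum by name:
function_to_apply = ClassificationFunction["sigmoid".upper()]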

# Iterate through the results
self.assertTrue(
    np.allclose(example_result_0["score"], example_result_1["score"], atol=1e-6)
)
Contributor:

I think those tests are super hard to read and to maintain. Because they contain a lot of logic, determining whether a test is correct or not is kind of hard.
Also, the expected values are not hardcoded in the tests, meaning drift in the output values would not be caught by them.

I am a proponent of tests that are much simpler in logic, even if much more verbose.

classifier = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)

self.assertEqual(nested_simplify(classifier("I really disagree with what you've said.")), [{'score': 0.1, 'label': "LABEL_0"}])

# other function to apply
self.assertEqual(nested_simplify(classifier("I really disagree with what you've said.", function_to_apply="sigmoid")), [{'score': 0.2, 'label': "LABEL_0"}])

# function_to_apply as a pipeline arg
classifier2 = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer, function_to_apply="sigmoid")
self.assertEqual(nested_simplify(classifier2("I really disagree with what you've said.")), [{'score': 0.2, 'label': "LABEL_0"}])

# from config
# check that the default strategy depends on num_labels
# ...

I would be happy to make a separate PR to suggest this and stay compliant with the currently written tests.

@LysandreJik Do you agree?

@LysandreJik LysandreJik force-pushed the text-classification-pipeline-raw-output branch from f905253 to d6ea1c1 Compare July 5, 2021 13:23
    function_to_apply = ClassificationFunction.SOFTMAX

if isinstance(function_to_apply, str):
    function_to_apply = ClassificationFunction[function_to_apply.upper()]
Contributor:

Should we try/except with a clean error message, or is the current exception good enough?
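
A sketch of the try/except alternative being floated here (illustrative only):

try:
    function_to_apply = ClassificationFunction[function_to_apply.upper()]
except KeyError:
    valid = ", ".join(f.name.lower() for f in ClassificationFunction)
    raise ValueError(f"Unrecognized `function_to_apply`: {function_to_apply}. Valid choices are: {valid}.")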

Member Author:

I'm thinking the exception is good enough, but happy to update if you feel strongly

Contributor:

No, no, I think it's fine. I didn't check what the actual message was; in the previous iteration it was explicit, so I was wondering whether it was OK here. I trust you on this.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@laurahanu (Contributor)

@LysandreJik is this ready to be merged?

@LysandreJik LysandreJik force-pushed the text-classification-pipeline-raw-output branch from d6ea1c1 to 40e163f Compare August 4, 2021 12:38
@LysandreJik LysandreJik merged commit 3f44a66 into master Aug 4, 2021
@LysandreJik LysandreJik deleted the text-classification-pipeline-raw-output branch August 4, 2021 12:42
Successfully merging this pull request may close these issues:

Disable default sigmoid function for single label classification Inference API