Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Carbon fuzzing 3/3: added actual fuzzer implementation and a fuzzverter utility for investigating crashing protos #1156

Merged
merged 46 commits into from
Apr 11, 2022

Conversation

pk19604014
Copy link
Contributor

Added new dependency to WORKSPACE - libprotobuf_mutator. The library does not come with a bazel BUILD file, but it's simple to craft one.

Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto.

fuzzverter utility can be used to convert crashing inputs in binary proto format to carbon source or text proto:
fuzzverter --from=binary_proto --input /tmp/crash.binaryproto --to=carbon_source

or to generate new binary protos from carbon source for seeding the fuzzer corpus:
fuzzverter --from=carbon_source --input testdata/simple.carbon --to=binary_proto

@pk19604014 pk19604014 added the explorer Action items related to Carbon explorer code label Mar 29, 2022
@pk19604014 pk19604014 marked this pull request as ready for review March 30, 2022 18:42
@pk19604014 pk19604014 requested a review from a team as a code owner March 30, 2022 18:42
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not getting too far here -- I think the ASAN issues needs significant work.

WORKSPACE Outdated Show resolved Hide resolved
WORKSPACE Outdated Show resolved Hide resolved
WORKSPACE Outdated Show resolved Hide resolved
bazel/cc_toolchains/clang_cc_toolchain_config.bzl Outdated Show resolved Hide resolved
@pk19604014 pk19604014 requested a review from jonmeow April 1, 2022 14:45
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back a moment, looking over the tooling, can you please add a README.md to executable_semantics/fuzzing? Think about usage in terms of workflows: what would an engineer be trying to do when running these tools?

I think documentation would help clarify the big-picture view and make finishing this review easier.

A few related things I'm looking at:

  • "Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto."

    • How do binary protos support frequent changes?
    • Wouldn't plain source code be more resilient to proto changes?
    • Why would text protos be a problem?
    • (text protos and source code would both be more readable and easily edited than binary protos)
  • "or to generate new binary protos from carbon source for seeding the fuzzer corpus:"

    • Why wouldn't someone just add a file to executable_semantics/testdata in this case?
  • What's your intent for a fuzzer corpus?

    • It's empty -- is it going to be generated?
    • Do we need actually need one, given executable_semantics/testdata?
  • Are all the conversion modes supported by fuzzverter necessary?

    • What would be a minimum set? It feels a bit like proto -> text is the only conversion path needed, and the fuzzer itself could be printing source code equivalents, so I'm not too sure of fuzzverter's need as a distinct tool. But some documentation on typical fuzzer-related workflows might help clarify your intent here.

    • If fuzzverter is providing distinct value, would it make sense to put it with the fuzzer proto (//common/fuzzing) instead of making it executable-semantics-specific? There, it could still convert between proto modes and print source code; the only thing it would avoid is parsing source code, but it's not obvious to me that that's required here.

    • I'll review fuzzverter more carefully once I'm clear on the approach being taken.

  • AddPrelude already has a way to add code to an AST; MaybeAddMain feels like a different approach to the same goal. Could MaybeAddMain mimic AddPrelude instead?

WORKSPACE Outdated Show resolved Hide resolved
executable_semantics/prelude.h Outdated Show resolved Hide resolved
executable_semantics/BUILD Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzer_util.h Outdated Show resolved Hide resolved
executable_semantics/fuzzing/BUILD Show resolved Hide resolved
bazel/cc_toolchains/clang_cc_toolchain_config.bzl Outdated Show resolved Hide resolved
@pk19604014
Copy link
Contributor Author

Stepping back a moment, looking over the tooling, can you please add a README.md to executable_semantics/fuzzing? Think about usage in terms of workflows: what would an engineer be trying to do when running these tools?

Added a readme.

I think documentation would help clarify the big-picture view and make finishing this review easier.

A few related things I'm looking at:

  • "Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto."

    • How do binary protos support frequent changes?

With binary protos, it's easier to make changes to proto definition without breaking existing serialized instances of the protos and the code compiled with the old definition (https://developers.google.com/protocol-buffers/docs/proto#updating). For example, renaming a field is essentially a no-op ( as fields are keyed by tag ids not strings), and text proto using an old field name either fails to parse or ignores the field and its contents.

  • Wouldn't plain source code be more resilient to proto changes?
  • Why would text protos be a problem?
  • (text protos and source code would both be more readable and easily edited than binary protos)
  • "or to generate new binary protos from carbon source for seeding the fuzzer corpus:"

    • Why wouldn't someone just add a file to executable_semantics/testdata in this case?

libprotobuf_mutator requires either a binary or text proto as its input format. The fuzzing framework operates by applying mutations to protobuf instances, which can be done generically using proto reflection so works for any concrete message type.

  • What's your intent for a fuzzer corpus?

    • It's empty -- is it going to be generated?
    • Do we need actually need one, given executable_semantics/testdata?

I was going to populate the corpus using fuzzverter --from=carbon_source --to=binary_proto on select (or all parseable) files in executable_semantics/testdata, in a separate PR. A corpus is needed because lib_protobuf_mutator needs inputs in proto format (either binary or text).

  • Are all the conversion modes supported by fuzzverter necessary?

    • What would be a minimum set? It feels a bit like proto -> text is the only conversion path needed, and the fuzzer itself could be printing source code equivalents, so I'm not too sure of fuzzverter's need as a distinct tool. But some documentation on typical fuzzer-related workflows might help clarify your intent here.

carbon_source -> binary_proto for generating corpus entries, and binary_proto to carbon_source as a convenient way to see the source for a crash, and be able to run executable_semantics on it directly, and maybe binary proto <-> text proto for some finer-grained experimentation and debugging proto-related logic.

  • If fuzzverter is providing distinct value, would it make sense to put it with the fuzzer proto (//common/fuzzing) instead of making it executable-semantics-specific? There, it could still convert between proto modes and print source code; the only thing it would avoid is parsing source code, but it's not obvious to me that that's required here.

Right but it needs to be able to support parsing the source for carbon_source -> proto conversion for seeding the corpus, and needs a dependency on executable_semantics.

  • I'll review fuzzverter more carefully once I'm clear on the approach being taken.
  • AddPrelude already has a way to add code to an AST; MaybeAddMain feels like a different approach to the same goal. Could MaybeAddMain mimic AddPrelude instead?

In the current implementation, MaybeAddMain() needs to be able to add to the proto representation, so that source code can then be generated from it. This is reused in fuzzverter as well, so that it can print full Carbon source executable by executable_semantics. Alternatively I guess I could just append a predefined string with Carbon source for dummy Main() definition after running the proto -> Carbon conversion code.

@pk19604014 pk19604014 requested a review from jonmeow April 2, 2022 22:25
@pk19604014
Copy link
Contributor Author

Stepping back a moment, looking over the tooling, can you please add a README.md to executable_semantics/fuzzing? Think about usage in terms of workflows: what would an engineer be trying to do when running these tools?

Added a readme.

I think documentation would help clarify the big-picture view and make finishing this review easier.
A few related things I'm looking at:

  • "Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto."

    • How do binary protos support frequent changes?

With binary protos, it's easier to make changes to proto definition without breaking existing serialized instances of the protos and the code compiled with the old definition (https://developers.google.com/protocol-buffers/docs/proto#updating). For example, renaming a field is essentially a no-op ( as fields are keyed by tag ids not strings), and text proto using an old field name either fails to parse or ignores the field and its contents.

  • Wouldn't plain source code be more resilient to proto changes?
  • Why would text protos be a problem?
  • (text protos and source code would both be more readable and easily edited than binary protos)
  • "or to generate new binary protos from carbon source for seeding the fuzzer corpus:"

    • Why wouldn't someone just add a file to executable_semantics/testdata in this case?

libprotobuf_mutator requires either a binary or text proto as its input format. The fuzzing framework operates by applying mutations to protobuf instances, which can be done generically using proto reflection so works for any concrete message type.

  • What's your intent for a fuzzer corpus?

    • It's empty -- is it going to be generated?
    • Do we need actually need one, given executable_semantics/testdata?

I was going to populate the corpus using fuzzverter --from=carbon_source --to=binary_proto on select (or all parseable) files in executable_semantics/testdata, in a separate PR. A corpus is needed because lib_protobuf_mutator needs inputs in proto format (either binary or text).

  • Are all the conversion modes supported by fuzzverter necessary?

    • What would be a minimum set? It feels a bit like proto -> text is the only conversion path needed, and the fuzzer itself could be printing source code equivalents, so I'm not too sure of fuzzverter's need as a distinct tool. But some documentation on typical fuzzer-related workflows might help clarify your intent here.

carbon_source -> binary_proto for generating corpus entries, and binary_proto to carbon_source as a convenient way to see the source for a crash, and be able to run executable_semantics on it directly, and maybe binary proto <-> text proto for some finer-grained experimentation and debugging proto-related logic.

  • If fuzzverter is providing distinct value, would it make sense to put it with the fuzzer proto (//common/fuzzing) instead of making it executable-semantics-specific? There, it could still convert between proto modes and print source code; the only thing it would avoid is parsing source code, but it's not obvious to me that that's required here.

Right but it needs to be able to support parsing the source for carbon_source -> proto conversion for seeding the corpus, and needs a dependency on executable_semantics.

If I understand correctly, you prefer binary protos becomes it makes renaming fields cheap. I think that then leads to a lot of the other decisions, e.g. that fuzzverter needs to support more formats because binary protos aren't human readable, and human readability is critical. Even with your explantion, I'm still leaning that source code would be the best format -- I'm suggesting meeting to discuss approaches, it may work better than review comments.

Just a quick note, for structure fuzzing with libprotobuf_mutator, one has to use protocol buffer format, either text or binary, as this is what the fuzzing framework understands, can mutate and generate new instances of inputs, etc.

clang's proto fuzzers use binary proto format - e.g. https://github.com/llvm/llvm-project/blob/main/clang/tools/clang-fuzzer/ExampleClangProtoFuzzer.cpp

Updated per the discussion during the meeting: switched to text proto format for the fuzzer, and removed binary_proto from fuzzverter.

Also rewrote MaybeAddMain logic to use a string of Carbon source instead of a string of text proto representation of Carbon proto message. Not using AddPrelude() logic there so that the full Carbon source can be easily printed by e.g. fuzzverter --to=carbon_source.

PTAL

@pk19604014 pk19604014 requested a review from jonmeow April 7, 2022 21:23
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're both on the same page about direction here, and thank you for the updates.

I've gone through with comments, but I view these as fairly narrow at this point, I think collapsing the --from/--to flags would be the most substantial change.

executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/prelude.h Outdated Show resolved Hide resolved
executable_semantics/prelude.cpp Outdated Show resolved Hide resolved
executable_semantics/prelude.h Outdated Show resolved Hide resolved
executable_semantics/prelude.h Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzverter.cpp Outdated Show resolved Hide resolved
Copy link
Contributor Author

@pk19604014 pk19604014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick review!

executable_semantics/fuzzing/fuzzverter.cpp Outdated Show resolved Hide resolved
executable_semantics/prelude.h Outdated Show resolved Hide resolved
executable_semantics/prelude.h Outdated Show resolved Hide resolved
executable_semantics/prelude.cpp Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzverter.cpp Outdated Show resolved Hide resolved
bazel/cc_toolchains/clang_cc_toolchain_config.bzl Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzverter.cpp Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzverter.cpp Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzverter.cpp Outdated Show resolved Hide resolved
@pk19604014 pk19604014 requested a review from jonmeow April 8, 2022 14:26
executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/fuzzing/README.md Outdated Show resolved Hide resolved
executable_semantics/fuzzing/fuzzer_util.h Outdated Show resolved Hide resolved
Comment on lines 98 to 129
} // namespace Carbon

auto main(int argc, char* argv[]) -> int {
llvm::InitLLVM init_llvm(argc, argv);

llvm::cl::opt<Carbon::ConversionMode> mode(
"mode", llvm::cl::desc("Conversion mode"),
llvm::cl::values(
clEnumValN(Carbon::ConversionMode::TextProtoToCarbon,
"proto_to_carbon", "Convert text proto to Carbon source"),
clEnumValN(Carbon::ConversionMode::CarbonToTextProto,
"carbon_to_proto",
"Convert Carbon source to text proto")));
llvm::cl::opt<std::string> input_file_name(
"input", llvm::cl::desc("<input file>"), llvm::cl::init("/dev/stdin"));
llvm::cl::opt<std::string> output_file_name(
"output", llvm::cl::desc("<output file>"), llvm::cl::init("/dev/stdout"));
llvm::cl::ParseCommandLineOptions(argc, argv);

const Carbon::ErrorOr<Carbon::Success> result = [&]() {
switch (mode) {
case Carbon::ConversionMode::TextProtoToCarbon:
return Carbon::TextProtoToCarbon(input_file_name, output_file_name);
case Carbon::ConversionMode::CarbonToTextProto:
return Carbon::CarbonToTextProto(input_file_name, output_file_name);
}
}();
if (!result.ok()) {
llvm::errs() << GetEnumString(mode)
<< " conversion failed: " << result.error().message() << "\n";
return EXIT_FAILURE;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} // namespace Carbon
auto main(int argc, char* argv[]) -> int {
llvm::InitLLVM init_llvm(argc, argv);
llvm::cl::opt<Carbon::ConversionMode> mode(
"mode", llvm::cl::desc("Conversion mode"),
llvm::cl::values(
clEnumValN(Carbon::ConversionMode::TextProtoToCarbon,
"proto_to_carbon", "Convert text proto to Carbon source"),
clEnumValN(Carbon::ConversionMode::CarbonToTextProto,
"carbon_to_proto",
"Convert Carbon source to text proto")));
llvm::cl::opt<std::string> input_file_name(
"input", llvm::cl::desc("<input file>"), llvm::cl::init("/dev/stdin"));
llvm::cl::opt<std::string> output_file_name(
"output", llvm::cl::desc("<output file>"), llvm::cl::init("/dev/stdout"));
llvm::cl::ParseCommandLineOptions(argc, argv);
const Carbon::ErrorOr<Carbon::Success> result = [&]() {
switch (mode) {
case Carbon::ConversionMode::TextProtoToCarbon:
return Carbon::TextProtoToCarbon(input_file_name, output_file_name);
case Carbon::ConversionMode::CarbonToTextProto:
return Carbon::CarbonToTextProto(input_file_name, output_file_name);
}
}();
if (!result.ok()) {
llvm::errs() << GetEnumString(mode)
<< " conversion failed: " << result.error().message() << "\n";
return EXIT_FAILURE;
}
int Main(int argc, char* argv[]) -> ErrorOr<Success> {
llvm::InitLLVM init_llvm(argc, argv);
llvm::cl::opt<ConversionMode> mode(
"mode", llvm::cl::desc("Conversion mode"),
llvm::cl::values(
clEnumValN(ConversionMode::TextProtoToCarbon,
"proto_to_carbon", "Convert text proto to Carbon source"),
clEnumValN(ConversionMode::CarbonToTextProto,
"carbon_to_proto",
"Convert Carbon source to text proto")));
llvm::cl::opt<std::string> input_file_name(
"input", llvm::cl::desc("<input file>"), llvm::cl::init("/dev/stdin"));
llvm::cl::opt<std::string> output_file_name(
"output", llvm::cl::desc("<output file>"), llvm::cl::init("/dev/stdout"));
llvm::cl::ParseCommandLineOptions(argc, argv);
switch (mode) {
case ConversionMode::TextProtoToCarbon:
return TextProtoToCarbon(input_file_name, output_file_name);
case ConversionMode::CarbonToTextProto:
return CarbonToTextProto(input_file_name, output_file_name);
}
}
} // namespace Carbon
auto main(int argc, char* argv[]) -> int {
auto result = Carbon::Main(argc, argv)
if (!result.ok()) {
llvm::errs() << GetEnumString(mode)
<< " conversion failed: " << result.error().message() << "\n";
return EXIT_FAILURE;
}

Note how moving the Main() into the Carbon namespace allows removing some of the repetition of Carbon::, but really I'm suggesting this because I'd like to eliminate the inline lambda call (which could also be done by a more targeted helper function, but I think this has side-benefits).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note how moving the Main() into the Carbon namespace allows removing some of the repetition of Carbon::, but really I'm suggesting this because I'd like to eliminate the inline lambda call (which could also be done by a more targeted helper function, but I think this has side-benefits).

With the 'full' Main in Carbon namespace, there is an issue that mode is not available in 'real' main for printing the final error message, so I went with a helper function in the end - PTAL

executable_semantics/BUILD Outdated Show resolved Hide resolved
@pk19604014 pk19604014 requested a review from jonmeow April 8, 2022 17:25
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes! I may take a closer look at the main, but this seems good for now.

executable_semantics/syntax/BUILD Outdated Show resolved Hide resolved
@pk19604014 pk19604014 merged commit 3e17e8d into carbon-language:trunk Apr 11, 2022
chandlerc pushed a commit that referenced this pull request Jun 28, 2022
…er utility for investigating crashing protos (#1156)

* finished fuzzer and added fuzzverter util

* fixed typo

* renamed cmd line params

* fixed libproto_mutator download path

* small fixes

* small fixes

* small fixes

* renamed sample corpus proto

* small fixes

* try building on github with LIBCPP_DEBUG enabled

* temporarily marked proto fuzzer as a manual test

* code review

* use a dedicated proto-fuzzer feature to work around LIBCPP_DEBUG=1 crash in proto code

* code review comments, added README.md

* minor fixes to the text

* Update bazel/cc_toolchains/clang_cc_toolchain_config.bzl

Co-authored-by: Jon Meow <[email protected]>

* use Carbon source representation for "empty Main()" instead of text format proto representation

* fixed typo

* made FuzzerUtil produce the full carbon source (proto converted + Main if needed) to decrease code duplication a bit

* typo

* switched to text proto format per code review

* Update executable_semantics/prelude.h

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* review comments

* removed unnecessary file mode variables

* Update executable_semantics/fuzzing/fuzzverter.cpp

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* Update executable_semantics/fuzzing/README.md

Co-authored-by: Jon Meow <[email protected]>

* code review comments

* Update executable_semantics/syntax/BUILD

Co-authored-by: Jon Meow <[email protected]>

* buildifier

Co-authored-by: Jon Meow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
explorer Action items related to Carbon explorer code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants