Carbon fuzzing 3/3: added actual fuzzer implementation and a fuzzverter utility for investigating crashing protos #1156

pk19604014 · 2022-03-29T12:35:38Z

Added new dependency to WORKSPACE - libprotobuf_mutator. The library does not come with a bazel BUILD file, but it's simple to craft one.

Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto.

fuzzverter utility can be used to convert crashing inputs in binary proto format to carbon source or text proto:
fuzzverter --from=binary_proto --input /tmp/crash.binaryproto --to=carbon_source

or to generate new binary protos from carbon source for seeding the fuzzer corpus:
fuzzverter --from=carbon_source --input testdata/simple.carbon --to=binary_proto

…into executable_semantics_fuzzer

jonmeow

Sorry, not getting too far here -- I think the ASAN issues need significant work.

WORKSPACE

bazel/cc_toolchains/clang_cc_toolchain_config.bzl

…ash in proto code

jonmeow

Stepping back a moment, looking over the tooling, can you please add a README.md to executable_semantics/fuzzing? Think about usage in terms of workflows: what would an engineer be trying to do when running these tools?

I think documentation would help clarify the big-picture view and make finishing this review easier.

A few related things I'm looking at:

"Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto."
- How do binary protos support frequent changes?
- Wouldn't plain source code be more resilient to proto changes?
- Why would text protos be a problem?
- (text protos and source code would both be more readable and easily edited than binary protos)
"or to generate new binary protos from carbon source for seeding the fuzzer corpus:"
- Why wouldn't someone just add a file to executable_semantics/testdata in this case?
What's your intent for a fuzzer corpus?
- It's empty -- is it going to be generated?
- Do we actually need one, given executable_semantics/testdata?
Are all the conversion modes supported by fuzzverter necessary?
- What would be a minimum set? It feels a bit like proto -> text is the only conversion path needed, and the fuzzer itself could be printing source code equivalents, so I'm not too sure of fuzzverter's need as a distinct tool. But some documentation on typical fuzzer-related workflows might help clarify your intent here.
- If fuzzverter is providing distinct value, would it make sense to put it with the fuzzer proto (//common/fuzzing) instead of making it executable-semantics-specific? There, it could still convert between proto modes and print source code; the only thing it would avoid is parsing source code, but it's not obvious to me that that's required here.
- I'll review fuzzverter more carefully once I'm clear on the approach being taken.
AddPrelude already has a way to add code to an AST; MaybeAddMain feels like a different approach to the same goal. Could MaybeAddMain mimic AddPrelude instead?

WORKSPACE

executable_semantics/prelude.h

executable_semantics/BUILD

executable_semantics/fuzzing/fuzzer_util.h

executable_semantics/fuzzing/executable_semantics_fuzzer.cpp

executable_semantics/fuzzing/BUILD

bazel/cc_toolchains/clang_cc_toolchain_config.bzl

…into executable_semantics_fuzzer

pk19604014 · 2022-04-02T22:25:15Z

Stepping back a moment, looking over the tooling, can you please add a README.md to executable_semantics/fuzzing? Think about usage in terms of workflows: what would an engineer be trying to do when running these tools?

Added a readme.

I think documentation would help clarify the big-picture view and make finishing this review easier.

A few related things I'm looking at:

"Using binary proto format for the fuzzer in anticipation of frequent changes to carbon.proto."

How do binary protos support frequent changes?

With binary protos, it's easier to change the proto definition without breaking existing serialized instances or code compiled against the old definition (https://developers.google.com/protocol-buffers/docs/proto#updating). For example, renaming a field is essentially a no-op for binary protos (fields are keyed by tag number, not by name), whereas a text proto that uses the old field name either fails to parse or has the field and its contents ignored.
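To make the rename argument concrete, here is a minimal sketch of the two encodings for a single varint field. It does not use the protobuf library; `FieldDef`, `SerializeBinary` and `SerializeText` are hypothetical names for illustration only, and the text form is a simplification of the real text format.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified field definition: a name plus a tag number.
struct FieldDef {
  std::string name;
  uint32_t number;
};

// Appends `value` as a base-128 varint, least significant group first.
void AppendVarint(uint64_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Binary form: key = (field number << 3) | wire type 0 (varint), then the
// value. The field name is never written.
std::vector<uint8_t> SerializeBinary(const FieldDef& field, uint64_t value) {
  std::vector<uint8_t> out;
  AppendVarint((static_cast<uint64_t>(field.number) << 3) | 0, out);
  AppendVarint(value, out);
  return out;
}

// Simplified text form: "name: value", so the field name is part of the data.
std::string SerializeText(const FieldDef& field, uint64_t value) {
  return field.name + ": " + std::to_string(value);
}
```

Serializing the value 150 with `FieldDef{"size", 1}` and with a renamed `FieldDef{"length", 1}` gives the same three bytes (`08 96 01`) in the binary form, while the text form changes from `size: 150` to `length: 150`, so text data written before the rename no longer matches the definition.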

Wouldn't plain source code be more resilient to proto changes?

Why would text protos be a problem?

(text protos and source code would both be more readable and easily edited than binary protos)

"or to generate new binary protos from carbon source for seeding the fuzzer corpus:"

Why wouldn't someone just add a file to executable_semantics/testdata in this case?

libprotobuf_mutator requires either a binary or text proto as its input format. The fuzzing framework operates by applying mutations to protobuf instances, which can be done generically using proto reflection, so it works for any concrete message type.

What's your intent for a fuzzer corpus?

It's empty -- is it going to be generated?

Do we actually need one, given executable_semantics/testdata?

I was going to populate the corpus using fuzzverter --from=carbon_source --to=binary_proto on select (or all parseable) files in executable_semantics/testdata, in a separate PR. A corpus is needed because libprotobuf_mutator needs inputs in proto format (either binary or text).

Are all the conversion modes supported by fuzzverter necessary?

What would be a minimum set? It feels a bit like proto -> text is the only conversion path needed, and the fuzzer itself could be printing source code equivalents, so I'm not too sure of fuzzverter's need as a distinct tool. But some documentation on typical fuzzer-related workflows might help clarify your intent here.

carbon_source -> binary_proto for generating corpus entries; binary_proto -> carbon_source as a convenient way to see the source for a crash and to run executable_semantics on it directly; and maybe binary proto <-> text proto for finer-grained experimentation and for debugging proto-related logic.

If fuzzverter is providing distinct value, would it make sense to put it with the fuzzer proto (//common/fuzzing) instead of making it executable-semantics-specific? There, it could still convert between proto modes and print source code; the only thing it would avoid is parsing source code, but it's not obvious to me that that's required here.

Right, but it needs to support parsing the source for the carbon_source -> proto conversion used to seed the corpus, so it needs a dependency on executable_semantics.

I'll review fuzzverter more carefully once I'm clear on the approach being taken.

AddPrelude already has a way to add code to an AST; MaybeAddMain feels like a different approach to the same goal. Could MaybeAddMain mimic AddPrelude instead?

In the current implementation, MaybeAddMain() needs to be able to add to the proto representation, so that source code can then be generated from it. This is reused in fuzzverter as well, so that it can print full Carbon source executable by executable_semantics. Alternatively, I guess I could just append a predefined string with Carbon source for a dummy Main() definition after running the proto -> Carbon conversion code.

…into executable_semantics_fuzzer

pk19604014 · 2022-04-07T21:23:50Z

If I understand correctly, you prefer binary protos because they make renaming fields cheap. I think that then leads to a lot of the other decisions, e.g. that fuzzverter needs to support more formats because binary protos aren't human-readable, and human readability is critical. Even with your explanation, I'm still leaning towards source code being the best format. I'm suggesting we meet to discuss approaches; it may work better than review comments.

Just a quick note: for structured fuzzing with libprotobuf_mutator, the input has to be in protocol buffer format, either text or binary, because that is what the fuzzing framework understands and can mutate to generate new inputs.

clang's proto fuzzers use binary proto format - e.g. https://github.com/llvm/llvm-project/blob/main/clang/tools/clang-fuzzer/ExampleClangProtoFuzzer.cpp

Updated per the discussion during the meeting: switched to text proto format for the fuzzer, and removed binary_proto from fuzzverter.

Also rewrote the MaybeAddMain logic to use a string of Carbon source instead of a string holding the text proto representation of a Carbon proto message. Not using the AddPrelude() logic there, so that the full Carbon source can be easily printed by e.g. fuzzverter --to=carbon_source.
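For readers following along, the string-based approach could look roughly like the sketch below. This is not the code in the PR: `MaybeAddMainSketch`, `kDummyMain` and the `has_main` parameter are hypothetical, and the exact dummy Main() text used by the fuzzer is not shown in this thread.

```cpp
#include <string>

// Hypothetical dummy entry point, written as plain Carbon source.
constexpr char kDummyMain[] = "fn Main() -> i32 { return 0; }\n";

// Appends the dummy Main() when the generated source has none.
// `has_main` is assumed to be determined by the caller, for example while
// walking the proto before converting it to source.
std::string MaybeAddMainSketch(std::string source, bool has_main) {
  if (!has_main) {
    // Keep the appended definition on its own line.
    if (!source.empty() && source.back() != '\n') {
      source += '\n';
    }
    source += kDummyMain;
  }
  return source;
}
```

Because the result is an ordinary source string, a tool such as fuzzverter can print it as is, and the same text can be passed to executable_semantics.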

PTAL

jonmeow

I think we're both on the same page about direction here, and thank you for the updates.

I've gone through with comments, but I view these as fairly narrow at this point. I think collapsing the --from/--to flags would be the most substantial change.

executable_semantics/fuzzing/README.md

executable_semantics/prelude.h

executable_semantics/prelude.cpp

executable_semantics/prelude.h