
Performance #34

Open
mewmew opened this issue Nov 9, 2018 · 4 comments


mewmew commented Nov 9, 2018

This issue is intended to profile the performance of the llir/llvm library, measure it against the official LLVM distribution and evaluate different methods for improving the performance.

This is a continuation of mewspring/mewmew-l#6

The benchmark suite is at https://github.com/decomp/testdata. Specifically, the LLVM IR assembly of the projects below (Coreutils and SQLite) is used in the benchmark.

Below follows a first evaluation of using concurrency to speed up parsing. The evaluation is based on a very naive implementation of concurrency, just to get some initial runtime numbers. It is based on commit 3011396 of the development branch, with subsets of the following patch applied: https://gist.github.com/mewmew/d127b562fdd8f560222b4ded739861a7

Official LLVM results

For comparison, below are the runtime results of the opt tool from the official LLVM distribution (using opt -verify foo.ll).

Coreutils

real 8.18
user 7.22
sys 0.88

SQLite

real 1.90
user 1.73
sys 0.13

llir/llvm results

Coreutils

no concurrency

total time for file "testdata/coreutils/testdata/yes.ll": 55.744113ms
real 11.54
user 14.70
sys 0.16

concurrent translateTopLevelEntities

  • 4 go-routines with waitgroup in translateTopLevelEntities
total time for file "testdata/coreutils/testdata/yes.ll": 53.49785ms
real 10.28
user 16.06
sys 0.15

concurrent translateGlobals

  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "testdata/coreutils/testdata/yes.ll": 55.567134ms
real 9.83
user 17.18
sys 0.17

concurrent translateTopLevelEntities and translateGlobals

  • 4 go-routines with waitgroup in translateTopLevelEntities
  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "testdata/coreutils/testdata/yes.ll": 58.474581ms
real 9.23
user 18.08
sys 0.16

SQLite3

no concurrency

total time for file "shell.ll": 3.147106433s
real 3.18
user 3.86
sys 0.32

concurrent translateTopLevelEntities

  • 4 go-routines with waitgroup in translateTopLevelEntities
total time for file "shell.ll": 2.848574349s
real 2.88
user 4.67
sys 0.32

concurrent translateGlobals

  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "testdata/sqlite/testdata/shell.ll": 2.86919391s
real 2.90
user 4.90
sys 0.32

concurrent translateTopLevelEntities and translateGlobals

  • 4 go-routines with waitgroup in translateTopLevelEntities
  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "shell.ll": 2.897873366s
real 2.93
user 4.79
sys 0.33
@mewmew mewmew modified the milestones: v0.3, future Dec 7, 2018

mewmew commented Dec 7, 2018

As we've reached the hand-wavy ~2x slowdown target for the v0.3.0 release, I've changed the milestone to future, as we will continually seek to improve the performance of llir. It would also be very interesting to explore using concurrency to make use of all cores on multicore machines. I tried a naive approach, and it seems to be a valid way to improve wall-clock run time. We just need to figure out how to do it in a less naive way, and to find the trade-off points for when we should spin up Go-routines and how many to run. We can define threshold limits for these.

Challenge: beat LLVM wall time performance using concurrency

For anyone interested in experimenting a bit, have a look at asm/translate.go, which defines the high-level AST to IR translation order. In it, a few good candidates for concurrent execution have been identified:

Order of translation

Note: step 3 and the substeps of 4a can be done concurrently.
Note: the substeps of 4a can be done concurrently.
Note: the substeps of 4b can be done concurrently.
Note: steps 5-7 can be done concurrently.
Note: the substeps of 8 can be done concurrently.

  1. Index AST top-level entities.

  2. Resolve IR type definitions.

    a. Index type identifiers and create scaffolding IR type definitions
    (without bodies).

    b. Translate AST type definitions to IR.

  3. Translate AST comdat definitions to IR.

Note: step 3 and the substeps of 4a can be done concurrently.

  4. Resolve remaining IR top-level entities.

    a. Index top-level identifiers and create scaffolding IR top-level
    declarations and definitions (without bodies but with types).

    Note: the substeps of 4a can be done concurrently.

    1. Index global identifiers and create scaffolding IR global
      declarations and definitions, indirect symbol definitions, and
      function declarations and definitions (without bodies but with
      types).

    2. Index attribute group IDs and create scaffolding IR attribute group
      definitions (without bodies).

    3. Index metadata names and create scaffolding IR named metadata
      definitions (without bodies).

    4. Index metadata IDs and create scaffolding IR metadata definitions
      (without bodies).

    b. Translate AST top-level declarations and definitions to IR.

    Note: the substeps of 4b can be done concurrently.

    1. Translate AST global declarations and definitions, indirect symbol
      definitions, and function declarations and definitions to IR.

    2. Translate AST attribute group definitions to IR.

    3. Translate AST named metadata definitions to IR.

    4. Translate AST metadata definitions to IR.

Note: steps 5-7 can be done concurrently.

  5. Translate use-list orders.

  6. Translate basic block specific use-list orders.

  7. Fix basic block references in blockaddress constants.

  8. Add IR top-level declarations and definitions to the IR module in order of
    occurrence in the input.

    Note: the substeps of 8 can be done concurrently.

    a. Add IR type definitions to the IR module in alphabetical order.

    b. Add IR comdat definitions to the IR module in alphabetical order.

    c. Add IR global variable declarations and definitions, indirect symbol
    definitions, and function declarations and definitions to the IR module
    in order of occurrence in the input.

    d. Add IR attribute group definitions to the IR module in numeric order.

    e. Add IR named metadata definitions to the IR module in order of
    occurrence in the input.

    f. Add IR metadata definitions to the IR module in numeric order.


pwaller commented Dec 9, 2018

It would be great to improve this in the future! And there is definitely some low-hanging fruit to be found. @pwaller has gotten a ~10% speed increase by making changes to Textmapper so that the parser uses values instead of pointers for subtype (%interface) AST nodes, which puts less pressure on the Go garbage collector.

Filed: inspirer/textmapper#29

@mewmew mewmew added the meta label Mar 14, 2019
@mewmew
Copy link
Member Author

mewmew commented Dec 29, 2019

Moving performance-related notes from TODO.md to this issue.
