
Recursive and dotted replacement with muladd #57

Merged · merged 4 commits from devmotion:feature/muladd into master on Jul 30, 2017

Conversation

devmotion (Member)

Hi!

As a reaction to SciML/OrdinaryDiffEq.jl#86 (comment), this PR changes the current implementation of @muladd in the following ways:

  • Conversion of combined multiplication and addition into a call of muladd is applied recursively
  • If one of the operands or operators (i.e. one of x, y, z, +, * in the expression x*y+z) is a dot call, muladd is also applied as a dot call, resulting in muladd.(x,y,z) (see the sketch right after this list)
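
A minimal sketch of both rules (the expansions mirror the examples below; the exact printed form may differ between Julia versions):

julia> macroexpand(:(@muladd a*b + c*d + e))
:((muladd)(a, b, (muladd)(c, d, e)))

julia> macroexpand(:(@muladd x .* y .+ z))
:((muladd).(x, y, z))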

This is a first step towards fixing SciML/OrdinaryDiffEq.jl#86 (comment), since with this new implementation use of @muladd in https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/integrators/verner_rk_integrators.jl leads to the following expressions:

julia> macroexpand(:(@inbounds atmp[i] = (@muladd(update[i] - dt*(bhat1*k1[i] + bhat4*k4[i] + bhat5*k5[i] + bhat6*k6[i] + bhat7*k7[i] + bhat10*k10[i]))./@muladd(atol+max(abs(uprev[i]),abs(u[i])).*rtol))))
quote 
    $(Expr(:inbounds, true))
    atmp[i] = (update[i] - dt * (muladd)(bhat1, k1[i], (muladd)(bhat4, k4[i], (muladd)(bhat5, k5[i], (muladd)(bhat6, k6[i], (muladd)(bhat7, k7[i], bhat10 * k10[i])))))) ./ (muladd).(max(abs(uprev[i]), abs(u[i])), rtol, atol)
    $(Expr(:inbounds, :pop))
end
julia> macroexpand(:(integrator.EEst = integrator.opts.internalnorm( ((update - dt*(@muladd(bhat1*k1 + bhat4*k4 + bhat5*k5 + bhat6*k6 + bhat7*k7 + bhat10*k10)))./@muladd(integrator.opts.abstol+max.(abs.(uprev),abs.(u)).*integrator.opts.reltol)))))
:(integrator.EEst = integrator.opts.internalnorm((update - dt * (muladd)(bhat1, k1, (muladd)(bhat4, k4, (muladd)(bhat5, k5, (muladd)(bhat6, k6, (muladd)(bhat7, k7, bhat10 * k10)))))) ./ (muladd).(max.(abs.(uprev), abs.(u)), integrator.opts.reltol, integrator.opts.abstol)))

Moreover, it enables a shorter notation. The last example could also be written as

@muladd integrator.EEst = integrator.opts.internalnorm((update - dt*(bhat1*k1 + bhat4*k4 + bhat5*k5 + bhat6*k6 + bhat7*k7 + bhat10*k10))./ @. (integrator.opts.abstol+max(abs(uprev),abs(u))*integrator.opts.reltol))

which is transformed to the same expression

julia> macroexpand(:(@muladd integrator.EEst = integrator.opts.internalnorm((update - dt*(bhat1*k1 + bhat4*k4 + bhat5*k5 + bhat6*k6 + bhat7*k7 + bhat10*k10))./ @. (integrator.opts.abstol+max(abs(uprev),abs(u))*integrator.opts.reltol))))
:(integrator.EEst = integrator.opts.internalnorm((update - dt * (muladd)(bhat1, k1, (muladd)(bhat4, k4, (muladd)(bhat5, k5, (muladd)(bhat6, k6, (muladd)(bhat7, k7, bhat10 * k10)))))) ./ (muladd).(max.(abs.(uprev), abs.(u)), integrator.opts.reltol, integrator.opts.abstol)))

Or, e.g., the line at https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/integrators/verner_rk_integrators.jl#L207

 k3 = f(@muladd(t + c3*dt),@muladd(uprev+dt*(a031*k1+a032*k2)))

can be replaced by

 @muladd k3 = f(t + c3*dt, @. uprev+dt*(a031*k1+a032*k2))

which leads to

julia> macroexpand(:(@muladd k3 = f(t + c3*dt, @. uprev+dt*(a031*k1+a032*k2))))
:(k3 = f((muladd)(c3, dt, t), (muladd).(dt, (muladd).(a031, k1, *.(a032, k2)), uprev)))

@ChrisRackauckas (Member)

This is absolutely beautiful. The code is very clean, so even though it's complicated, it's still easy to follow. This is probably the best way to fix SciML/OrdinaryDiffEq.jl#86 too. I like that you got the macro in front of the equals sign working, and I never thought about using macroexpand in a macro.

IMO it's looking complicated enough yet widely useful enough that we might want to spin it off as a separate package. Other people might want to use this if a README is stuck on it. I'll merge when tests pass, or I can quickly generate a MuladdMacro.jl in DiffEq and give you write privileges, transfer the tests over, get that registered, and then replace its usage in OrdinaryDiffEq.jl/StochasticDiffEq.jl.

@devmotion (Member, Author) commented Jul 26, 2017

The main work was already done; these are just some adjustments. If you think this is worth a separate package, I can transfer it and also add a README (basically a somewhat longer copy of the existing documentation plus some examples, I guess?).

Of course, one still has to apply some fixes to OrdinaryDiffEq.jl - e.g. the last example @muladd(uprev+dt*(a031*k1+a032*k2)), and hence the expression :((muladd)(dt, (muladd)(a031, k1, a032 * k2), uprev)), still has to be replaced by a dot call (as shown), I guess, since there is no vectorized version of muladd and it just falls back to x*y+z
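
For reference, the generic fallback in Base is essentially just the following one-liner (a sketch of the definition in base/math.jl):

muladd(x, y, z) = x*y + z

so calling it on arrays simply dispatches to the generic * and +: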

julia> muladd([1.], [1.], [1.])
ERROR: DimensionMismatch("Cannot multiply two vectors")
Stacktrace:
 [1] muladd(::Array{Float64,1}, ::Array{Float64,1}, ::Array{Float64,1}) at ./math.jl:943

and even in the use case above, where one of the factors is a scalar, it leads to

julia> @code_llvm muladd([1.], 1., [1.])

define i8** @julia_muladd_60850(i8** dereferenceable(40), double, i8** dereferenceable(40)) #0 !dbg !5 {
top:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
  %ptls = bitcast i8* %ptls_i8 to i8****
  %3 = alloca [3 x i8**], align 8
  %.sub = getelementptr inbounds [3 x i8**], [3 x i8**]* %3, i64 0, i64 0
  %4 = getelementptr [3 x i8**], [3 x i8**]* %3, i64 0, i64 2
  store i8** null, i8*** %4, align 8
  %5 = bitcast [3 x i8**]* %3 to i64*
  store i64 2, i64* %5, align 8
  %6 = getelementptr [3 x i8**], [3 x i8**]* %3, i64 0, i64 1
  %7 = bitcast i8* %ptls_i8 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast i8*** %6 to i64*
  store i64 %8, i64* %9, align 8
  store i8*** %.sub, i8**** %ptls, align 8
  %10 = call i8** @"julia_*_60853"(i8** nonnull %0, double %1)
  store i8** %10, i8*** %4, align 8
  %11 = call i8** @"julia_+_60851"(i8** %10, i8** nonnull %2)
  %12 = load i64, i64* %9, align 8
  store i64 %12, i64* %7, align 8
  ret i8** %11
}

which does not call @llvm.fmuladd in contrast to

julia> @code_llvm muladd(1., 1., 1.)

define double @jlsys_muladd_59756(double, double, double) #0 !dbg !5 {
top:
  %3 = call double @llvm.fmuladd.f64(double %0, double %1, double %2)
  ret double %3
}

and the huge expression

julia> @code_llvm muladd.([1.], [1.], [1.])

define i8** @japi1_broadcast_60847(i8**, i8***, i32) #0 !dbg !5 {
top:
  %3 = alloca %OneTo, align 8
  %4 = alloca %OneTo, align 8
  %5 = alloca %OneTo, align 8
  %6 = alloca i8***, align 8
  store volatile i8*** %1, i8**** %6, align 8
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #5
  %ptls = bitcast i8* %ptls_i8 to i8****
  %7 = alloca [4 x i8**], align 8
  %.sub = getelementptr inbounds [4 x i8**], [4 x i8**]* %7, i64 0, i64 0
  %8 = getelementptr [4 x i8**], [4 x i8**]* %7, i64 0, i64 2
  %9 = bitcast [4 x i8**]* %7 to i64*
  %10 = bitcast i8*** %8 to i8*
  call void @llvm.memset.p0i8.i64(i8* %10, i8 0, i64 16, i32 8, i1 false)
  store i64 4, i64* %9, align 8
  %11 = getelementptr [4 x i8**], [4 x i8**]* %7, i64 0, i64 1
  %12 = bitcast i8* %ptls_i8 to i64*
  %13 = load i64, i64* %12, align 8
  %14 = bitcast i8*** %11 to i64*
  store i64 %13, i64* %14, align 8
  store i8*** %.sub, i8**** %ptls, align 8
  %newout = alloca %OneTo, align 8
  %15 = getelementptr i8**, i8*** %1, i64 2
  switch i32 %2, label %pass26 [
    i32 2, label %fail
    i32 3, label %fail25
  ]

fail:                                             ; preds = %top
  call void @jl_bounds_error_tuple_int(i8*** %15, i64 0, i64 1)
  unreachable

fail25:                                           ; preds = %top
  call void @jl_bounds_error_tuple_int(i8*** %15, i64 1, i64 2)
  unreachable

pass26:                                           ; preds = %top
  %16 = getelementptr i8**, i8*** %1, i64 1
  %17 = load i8**, i8*** %16, align 8
  %18 = load i8**, i8*** %15, align 8
  %19 = getelementptr i8**, i8*** %1, i64 3
  %20 = load i8**, i8*** %19, align 8
  %21 = getelementptr i8*, i8** %17, i64 3
  %22 = bitcast i8** %21 to i64*
  %23 = load i64, i64* %22, align 8
  %24 = getelementptr i8*, i8** %18, i64 3
  %25 = bitcast i8** %24 to i64*
  %26 = load i64, i64* %25, align 8
  %27 = getelementptr i8*, i8** %20, i64 3
  %28 = bitcast i8** %27 to i64*
  %29 = load i64, i64* %28, align 8
  %30 = icmp slt i64 %26, 0
  %31 = select i1 %30, i64 0, i64 %26
  %32 = getelementptr inbounds %OneTo, %OneTo* %5, i64 0, i32 0
  store i64 %31, i64* %32, align 8
  %33 = icmp slt i64 %29, 0
  %34 = select i1 %33, i64 0, i64 %29
  %35 = getelementptr inbounds %OneTo, %OneTo* %4, i64 0, i32 0
  store i64 %34, i64* %35, align 8
  %36 = icmp slt i64 %23, 0
  %37 = select i1 %36, i64 0, i64 %23
  %38 = getelementptr inbounds %OneTo, %OneTo* %3, i64 0, i32 0
  store i64 %37, i64* %38, align 8
  %39 = call %OneTo @julia__bcs1_60823(%OneTo* nocapture nonnull readonly %3, %OneTo* nocapture nonnull readonly %5)
  %40 = extractvalue %OneTo %39, 0
  %41 = getelementptr inbounds %OneTo, %OneTo* %newout, i64 0, i32 0
  store i64 %40, i64* %41, align 8
  %42 = call %OneTo @julia__bcs1_60823(%OneTo* nocapture nonnull readonly %newout, %OneTo* nocapture nonnull readonly %4)
  %43 = extractvalue %OneTo %42, 0
  %44 = call i8** inttoptr (i64 139923554842592 to i8** (i8**, i64)*)(i8** inttoptr (i64 139923408648656 to i8**), i64 %43)
  store i8** %44, i8*** %8, align 8
  %45 = icmp slt i64 %43, 1
  br i1 %45, label %L139.backedge, label %if28.lr.ph

L139.loopexit.loopexit:                           ; preds = %middle.block206, %if28.us.us.us
  br label %L139.loopexit

L139.loopexit.loopexit64:                         ; preds = %middle.block183, %if28.us.us
  br label %L139.loopexit

L139.loopexit.loopexit65:                         ; preds = %middle.block160, %if28.us.us48
  br label %L139.loopexit

L139.loopexit.loopexit66:                         ; preds = %middle.block139, %if28.us
  br label %L139.loopexit

L139.loopexit.loopexit67:                         ; preds = %middle.block116, %if28.us44.us
  br label %L139.loopexit

L139.loopexit.loopexit68:                         ; preds = %middle.block95, %if28.us44
  br label %L139.loopexit

L139.loopexit.loopexit69:                         ; preds = %middle.block75, %if28.us46
  br label %L139.loopexit

L139.loopexit.loopexit70:                         ; preds = %middle.block, %if28
  br label %L139.loopexit

L139.loopexit:                                    ; preds = %L139.loopexit.loopexit70, %L139.loopexit.loopexit69, %L139.loopexit.loopexit68, %L139.loopexit.loopexit67, %L139.loopexit.loopexit66, %L139.loopexit.loopexit65, %L139.loopexit.loopexit64, %L139.loopexit.loopexit
  %46 = getelementptr [4 x i8**], [4 x i8**]* %7, i64 0, i64 3
  store i8** %44, i8*** %46, align 8
  br label %L139.backedge

L139.backedge:                                    ; preds = %pass26, %L139.loopexit
  %47 = load i64, i64* %14, align 8
  store i64 %47, i64* %12, align 8
  ret i8** %44

if28.lr.ph:                                       ; preds = %pass26
  %48 = load i64, i64* %28, align 8
  %49 = icmp slt i64 %48, 0
  %50 = select i1 %49, i64 0, i64 %48
  %51 = icmp eq i64 %43, %50
  %52 = load i64, i64* %25, align 8
  %53 = icmp slt i64 %52, 0
  %54 = select i1 %53, i64 0, i64 %52
  %55 = icmp eq i64 %43, %54
  %56 = load i64, i64* %22, align 8
  %57 = icmp slt i64 %56, 0
  %58 = select i1 %57, i64 0, i64 %56
  %59 = icmp eq i64 %43, %58
  %60 = bitcast i8** %17 to double**
  %61 = bitcast i8** %18 to double**
  %62 = bitcast i8** %20 to double**
  %63 = bitcast i8** %44 to double**
  %64 = load double*, double** %60, align 8
  %65 = load double*, double** %61, align 8
  %66 = load double*, double** %62, align 8
  %67 = load double*, double** %63, align 8
  br i1 %59, label %if28.lr.ph.split.us, label %if28.lr.ph.if28.lr.ph.split_crit_edge

if28.lr.ph.if28.lr.ph.split_crit_edge:            ; preds = %if28.lr.ph
  br i1 %55, label %if28.lr.ph.split.split.us, label %if28.lr.ph.split.if28.lr.ph.split.split_crit_edge

if28.lr.ph.split.us:                              ; preds = %if28.lr.ph
  br i1 %55, label %if28.lr.ph.split.us.split.us, label %if28.lr.ph.split.us.if28.lr.ph.split.us.split_crit_edge

if28.lr.ph.split.us.if28.lr.ph.split.us.split_crit_edge: ; preds = %if28.lr.ph.split.us
  br i1 %51, label %if28.us.us48.preheader, label %if28.us.preheader

if28.us.preheader:                                ; preds = %if28.lr.ph.split.us.if28.lr.ph.split.us.split_crit_edge
  %min.iters.check141 = icmp ult i64 %43, 4
  br i1 %min.iters.check141, label %scalar.ph140, label %min.iters.checked142

min.iters.checked142:                             ; preds = %if28.us.preheader
  %n.vec144 = and i64 %43, -4
  %cmp.zero145 = icmp eq i64 %n.vec144, 0
  br i1 %cmp.zero145, label %scalar.ph140, label %vector.ph146

vector.ph146:                                     ; preds = %min.iters.checked142
  br label %vector.body138

vector.body138:                                   ; preds = %vector.body138, %vector.ph146
  %index147 = phi i64 [ 0, %vector.ph146 ], [ %index.next148, %vector.body138 ]
  %68 = getelementptr double, double* %64, i64 %index147
  %69 = bitcast double* %68 to <2 x double>*
  %wide.load155 = load <2 x double>, <2 x double>* %69, align 8
  %70 = getelementptr double, double* %68, i64 2
  %71 = bitcast double* %70 to <2 x double>*
  %wide.load156 = load <2 x double>, <2 x double>* %71, align 8
  %72 = load double, double* %65, align 8
  %73 = insertelement <2 x double> undef, double %72, i32 0
  %74 = insertelement <2 x double> %73, double %72, i32 1
  %75 = insertelement <2 x double> undef, double %72, i32 0
  %76 = insertelement <2 x double> %75, double %72, i32 1
  %77 = load double, double* %66, align 8
  %78 = insertelement <2 x double> undef, double %77, i32 0
  %79 = insertelement <2 x double> %78, double %77, i32 1
  %80 = insertelement <2 x double> undef, double %77, i32 0
  %81 = insertelement <2 x double> %80, double %77, i32 1
  %82 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load155, <2 x double> %74, <2 x double> %79)
  %83 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load156, <2 x double> %76, <2 x double> %81)
  %84 = getelementptr double, double* %67, i64 %index147
  %85 = bitcast double* %84 to <2 x double>*
  store <2 x double> %82, <2 x double>* %85, align 8
  %86 = getelementptr double, double* %84, i64 2
  %87 = bitcast double* %86 to <2 x double>*
  store <2 x double> %83, <2 x double>* %87, align 8
  %index.next148 = add i64 %index147, 4
  %88 = icmp eq i64 %index.next148, %n.vec144
  br i1 %88, label %middle.block139, label %vector.body138

middle.block139:                                  ; preds = %vector.body138
  %cmp.n150 = icmp eq i64 %43, %n.vec144
  br i1 %cmp.n150, label %L139.loopexit.loopexit66, label %scalar.ph140

scalar.ph140:                                     ; preds = %middle.block139, %min.iters.checked142, %if28.us.preheader
  %bc.resume.val149 = phi i64 [ %n.vec144, %middle.block139 ], [ 0, %if28.us.preheader ], [ 0, %min.iters.checked142 ]
  br label %if28.us

if28.us.us48.preheader:                           ; preds = %if28.lr.ph.split.us.if28.lr.ph.split.us.split_crit_edge
  %min.iters.check162 = icmp ult i64 %43, 4
  br i1 %min.iters.check162, label %scalar.ph161, label %min.iters.checked163

min.iters.checked163:                             ; preds = %if28.us.us48.preheader
  %n.vec165 = and i64 %43, -4
  %cmp.zero166 = icmp eq i64 %n.vec165, 0
  br i1 %cmp.zero166, label %scalar.ph161, label %vector.ph167

vector.ph167:                                     ; preds = %min.iters.checked163
  br label %vector.body159

vector.body159:                                   ; preds = %vector.body159, %vector.ph167
  %index168 = phi i64 [ 0, %vector.ph167 ], [ %index.next169, %vector.body159 ]
  %89 = getelementptr double, double* %64, i64 %index168
  %90 = bitcast double* %89 to <2 x double>*
  %wide.load176 = load <2 x double>, <2 x double>* %90, align 8
  %91 = getelementptr double, double* %89, i64 2
  %92 = bitcast double* %91 to <2 x double>*
  %wide.load177 = load <2 x double>, <2 x double>* %92, align 8
  %93 = load double, double* %65, align 8
  %94 = insertelement <2 x double> undef, double %93, i32 0
  %95 = insertelement <2 x double> %94, double %93, i32 1
  %96 = insertelement <2 x double> undef, double %93, i32 0
  %97 = insertelement <2 x double> %96, double %93, i32 1
  %98 = getelementptr double, double* %66, i64 %index168
  %99 = bitcast double* %98 to <2 x double>*
  %wide.load178 = load <2 x double>, <2 x double>* %99, align 8
  %100 = getelementptr double, double* %98, i64 2
  %101 = bitcast double* %100 to <2 x double>*
  %wide.load179 = load <2 x double>, <2 x double>* %101, align 8
  %102 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load176, <2 x double> %95, <2 x double> %wide.load178)
  %103 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load177, <2 x double> %97, <2 x double> %wide.load179)
  %104 = getelementptr double, double* %67, i64 %index168
  %105 = bitcast double* %104 to <2 x double>*
  store <2 x double> %102, <2 x double>* %105, align 8
  %106 = getelementptr double, double* %104, i64 2
  %107 = bitcast double* %106 to <2 x double>*
  store <2 x double> %103, <2 x double>* %107, align 8
  %index.next169 = add i64 %index168, 4
  %108 = icmp eq i64 %index.next169, %n.vec165
  br i1 %108, label %middle.block160, label %vector.body159

middle.block160:                                  ; preds = %vector.body159
  %cmp.n171 = icmp eq i64 %43, %n.vec165
  br i1 %cmp.n171, label %L139.loopexit.loopexit65, label %scalar.ph161

scalar.ph161:                                     ; preds = %middle.block160, %min.iters.checked163, %if28.us.us48.preheader
  %bc.resume.val170 = phi i64 [ %n.vec165, %middle.block160 ], [ 0, %if28.us.us48.preheader ], [ 0, %min.iters.checked163 ]
  br label %if28.us.us48

if28.lr.ph.split.us.split.us:                     ; preds = %if28.lr.ph.split.us
  br i1 %51, label %if28.us.us.us.preheader, label %if28.us.us.preheader

if28.us.us.preheader:                             ; preds = %if28.lr.ph.split.us.split.us
  %min.iters.check185 = icmp ult i64 %43, 4
  br i1 %min.iters.check185, label %scalar.ph184, label %min.iters.checked186

min.iters.checked186:                             ; preds = %if28.us.us.preheader
  %n.vec188 = and i64 %43, -4
  %cmp.zero189 = icmp eq i64 %n.vec188, 0
  br i1 %cmp.zero189, label %scalar.ph184, label %vector.ph190

vector.ph190:                                     ; preds = %min.iters.checked186
  br label %vector.body182

vector.body182:                                   ; preds = %vector.body182, %vector.ph190
  %index191 = phi i64 [ 0, %vector.ph190 ], [ %index.next192, %vector.body182 ]
  %109 = getelementptr double, double* %64, i64 %index191
  %110 = bitcast double* %109 to <2 x double>*
  %wide.load199 = load <2 x double>, <2 x double>* %110, align 8
  %111 = getelementptr double, double* %109, i64 2
  %112 = bitcast double* %111 to <2 x double>*
  %wide.load200 = load <2 x double>, <2 x double>* %112, align 8
  %113 = getelementptr double, double* %65, i64 %index191
  %114 = bitcast double* %113 to <2 x double>*
  %wide.load201 = load <2 x double>, <2 x double>* %114, align 8
  %115 = getelementptr double, double* %113, i64 2
  %116 = bitcast double* %115 to <2 x double>*
  %wide.load202 = load <2 x double>, <2 x double>* %116, align 8
  %117 = load double, double* %66, align 8
  %118 = insertelement <2 x double> undef, double %117, i32 0
  %119 = insertelement <2 x double> %118, double %117, i32 1
  %120 = insertelement <2 x double> undef, double %117, i32 0
  %121 = insertelement <2 x double> %120, double %117, i32 1
  %122 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load199, <2 x double> %wide.load201, <2 x double> %119)
  %123 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load200, <2 x double> %wide.load202, <2 x double> %121)
  %124 = getelementptr double, double* %67, i64 %index191
  %125 = bitcast double* %124 to <2 x double>*
  store <2 x double> %122, <2 x double>* %125, align 8
  %126 = getelementptr double, double* %124, i64 2
  %127 = bitcast double* %126 to <2 x double>*
  store <2 x double> %123, <2 x double>* %127, align 8
  %index.next192 = add i64 %index191, 4
  %128 = icmp eq i64 %index.next192, %n.vec188
  br i1 %128, label %middle.block183, label %vector.body182

middle.block183:                                  ; preds = %vector.body182
  %cmp.n194 = icmp eq i64 %43, %n.vec188
  br i1 %cmp.n194, label %L139.loopexit.loopexit64, label %scalar.ph184

scalar.ph184:                                     ; preds = %middle.block183, %min.iters.checked186, %if28.us.us.preheader
  %bc.resume.val193 = phi i64 [ %n.vec188, %middle.block183 ], [ 0, %if28.us.us.preheader ], [ 0, %min.iters.checked186 ]
  br label %if28.us.us

if28.us.us.us.preheader:                          ; preds = %if28.lr.ph.split.us.split.us
  %min.iters.check208 = icmp ult i64 %43, 4
  br i1 %min.iters.check208, label %scalar.ph207, label %min.iters.checked209

min.iters.checked209:                             ; preds = %if28.us.us.us.preheader
  %n.vec211 = and i64 %43, -4
  %cmp.zero212 = icmp eq i64 %n.vec211, 0
  br i1 %cmp.zero212, label %scalar.ph207, label %vector.ph213

vector.ph213:                                     ; preds = %min.iters.checked209
  br label %vector.body205

vector.body205:                                   ; preds = %vector.body205, %vector.ph213
  %index214 = phi i64 [ 0, %vector.ph213 ], [ %index.next215, %vector.body205 ]
  %129 = getelementptr double, double* %64, i64 %index214
  %130 = bitcast double* %129 to <2 x double>*
  %wide.load222 = load <2 x double>, <2 x double>* %130, align 8
  %131 = getelementptr double, double* %129, i64 2
  %132 = bitcast double* %131 to <2 x double>*
  %wide.load223 = load <2 x double>, <2 x double>* %132, align 8
  %133 = getelementptr double, double* %65, i64 %index214
  %134 = bitcast double* %133 to <2 x double>*
  %wide.load224 = load <2 x double>, <2 x double>* %134, align 8
  %135 = getelementptr double, double* %133, i64 2
  %136 = bitcast double* %135 to <2 x double>*
  %wide.load225 = load <2 x double>, <2 x double>* %136, align 8
  %137 = getelementptr double, double* %66, i64 %index214
  %138 = bitcast double* %137 to <2 x double>*
  %wide.load226 = load <2 x double>, <2 x double>* %138, align 8
  %139 = getelementptr double, double* %137, i64 2
  %140 = bitcast double* %139 to <2 x double>*
  %wide.load227 = load <2 x double>, <2 x double>* %140, align 8
  %141 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load222, <2 x double> %wide.load224, <2 x double> %wide.load226)
  %142 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %wide.load223, <2 x double> %wide.load225, <2 x double> %wide.load227)
  %143 = getelementptr double, double* %67, i64 %index214
  %144 = bitcast double* %143 to <2 x double>*
  store <2 x double> %141, <2 x double>* %144, align 8
  %145 = getelementptr double, double* %143, i64 2
  %146 = bitcast double* %145 to <2 x double>*
  store <2 x double> %142, <2 x double>* %146, align 8
  %index.next215 = add i64 %index214, 4
  %147 = icmp eq i64 %index.next215, %n.vec211
  br i1 %147, label %middle.block206, label %vector.body205

middle.block206:                                  ; preds = %vector.body205
  %cmp.n217 = icmp eq i64 %43, %n.vec211
  br i1 %cmp.n217, label %L139.loopexit.loopexit, label %scalar.ph207

scalar.ph207:                                     ; preds = %middle.block206, %min.iters.checked209, %if28.us.us.us.preheader
  %bc.resume.val216 = phi i64 [ %n.vec211, %middle.block206 ], [ 0, %if28.us.us.us.preheader ], [ 0, %min.iters.checked209 ]
  br label %if28.us.us.us

if28.us.us.us:                                    ; preds = %scalar.ph207, %if28.us.us.us
  %"i#668.043.us.us.us" = phi i64 [ %156, %if28.us.us.us ], [ %bc.resume.val216, %scalar.ph207 ]
  %148 = getelementptr double, double* %64, i64 %"i#668.043.us.us.us"
  %149 = load double, double* %148, align 8
  %150 = getelementptr double, double* %65, i64 %"i#668.043.us.us.us"
  %151 = load double, double* %150, align 8
  %152 = getelementptr double, double* %66, i64 %"i#668.043.us.us.us"
  %153 = load double, double* %152, align 8
  %154 = call double @llvm.fmuladd.f64(double %149, double %151, double %153)
  %155 = getelementptr double, double* %67, i64 %"i#668.043.us.us.us"
  store double %154, double* %155, align 8
  %156 = add nuw nsw i64 %"i#668.043.us.us.us", 1
  %exitcond56 = icmp eq i64 %156, %43
  br i1 %exitcond56, label %L139.loopexit.loopexit, label %if28.us.us.us

if28.us.us:                                       ; preds = %scalar.ph184, %if28.us.us
  %"i#668.043.us.us" = phi i64 [ %164, %if28.us.us ], [ %bc.resume.val193, %scalar.ph184 ]
  %157 = getelementptr double, double* %64, i64 %"i#668.043.us.us"
  %158 = load double, double* %157, align 8
  %159 = getelementptr double, double* %65, i64 %"i#668.043.us.us"
  %160 = load double, double* %159, align 8
  %161 = load double, double* %66, align 8
  %162 = call double @llvm.fmuladd.f64(double %158, double %160, double %161)
  %163 = getelementptr double, double* %67, i64 %"i#668.043.us.us"
  store double %162, double* %163, align 8
  %164 = add nuw nsw i64 %"i#668.043.us.us", 1
  %exitcond53 = icmp eq i64 %164, %43
  br i1 %exitcond53, label %L139.loopexit.loopexit64, label %if28.us.us

if28.us.us48:                                     ; preds = %scalar.ph161, %if28.us.us48
  %"i#668.043.us.us49" = phi i64 [ %172, %if28.us.us48 ], [ %bc.resume.val170, %scalar.ph161 ]
  %165 = getelementptr double, double* %64, i64 %"i#668.043.us.us49"
  %166 = load double, double* %165, align 8
  %167 = load double, double* %65, align 8
  %168 = getelementptr double, double* %66, i64 %"i#668.043.us.us49"
  %169 = load double, double* %168, align 8
  %170 = call double @llvm.fmuladd.f64(double %166, double %167, double %169)
  %171 = getelementptr double, double* %67, i64 %"i#668.043.us.us49"
  store double %170, double* %171, align 8
  %172 = add nuw nsw i64 %"i#668.043.us.us49", 1
  %exitcond54 = icmp eq i64 %172, %43
  br i1 %exitcond54, label %L139.loopexit.loopexit65, label %if28.us.us48

if28.us:                                          ; preds = %scalar.ph140, %if28.us
  %"i#668.043.us" = phi i64 [ %179, %if28.us ], [ %bc.resume.val149, %scalar.ph140 ]
  %173 = getelementptr double, double* %64, i64 %"i#668.043.us"
  %174 = load double, double* %173, align 8
  %175 = load double, double* %65, align 8
  %176 = load double, double* %66, align 8
  %177 = call double @llvm.fmuladd.f64(double %174, double %175, double %176)
  %178 = getelementptr double, double* %67, i64 %"i#668.043.us"
  store double %177, double* %178, align 8
  %179 = add nuw nsw i64 %"i#668.043.us", 1
  %exitcond50 = icmp eq i64 %179, %43
  br i1 %exitcond50, label %L139.loopexit.loopexit66, label %if28.us

if28.lr.ph.split.if28.lr.ph.split.split_crit_edge: ; preds = %if28.lr.ph.if28.lr.ph.split_crit_edge
  br i1 %51, label %if28.us46.preheader, label %if28.preheader

if28.preheader:                                   ; preds = %if28.lr.ph.split.if28.lr.ph.split.split_crit_edge
  %min.iters.check = icmp ult i64 %43, 4
  br i1 %min.iters.check, label %scalar.ph, label %min.iters.checked

min.iters.checked:                                ; preds = %if28.preheader
  %n.vec = and i64 %43, -4
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %min.iters.checked
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %180 = load double, double* %64, align 8
  %181 = insertelement <2 x double> undef, double %180, i32 0
  %182 = insertelement <2 x double> %181, double %180, i32 1
  %183 = insertelement <2 x double> undef, double %180, i32 0
  %184 = insertelement <2 x double> %183, double %180, i32 1
  %185 = load double, double* %65, align 8
  %186 = insertelement <2 x double> undef, double %185, i32 0
  %187 = insertelement <2 x double> %186, double %185, i32 1
  %188 = insertelement <2 x double> undef, double %185, i32 0
  %189 = insertelement <2 x double> %188, double %185, i32 1
  %190 = load double, double* %66, align 8
  %191 = insertelement <2 x double> undef, double %190, i32 0
  %192 = insertelement <2 x double> %191, double %190, i32 1
  %193 = insertelement <2 x double> undef, double %190, i32 0
  %194 = insertelement <2 x double> %193, double %190, i32 1
  %195 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %182, <2 x double> %187, <2 x double> %192)
  %196 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %184, <2 x double> %189, <2 x double> %194)
  %197 = getelementptr double, double* %67, i64 %index
  %198 = bitcast double* %197 to <2 x double>*
  store <2 x double> %195, <2 x double>* %198, align 8
  %199 = getelementptr double, double* %197, i64 2
  %200 = bitcast double* %199 to <2 x double>*
  store <2 x double> %196, <2 x double>* %200, align 8
  %index.next = add i64 %index, 4
  %201 = icmp eq i64 %index.next, %n.vec
  br i1 %201, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %43, %n.vec
  br i1 %cmp.n, label %L139.loopexit.loopexit70, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %min.iters.checked, %if28.preheader
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %if28.preheader ], [ 0, %min.iters.checked ]
  br label %if28

if28.us46.preheader:                              ; preds = %if28.lr.ph.split.if28.lr.ph.split.split_crit_edge
  %min.iters.check77 = icmp ult i64 %43, 4
  br i1 %min.iters.check77, label %scalar.ph76, label %min.iters.checked78

min.iters.checked78:                              ; preds = %if28.us46.preheader
  %n.vec80 = and i64 %43, -4
  %cmp.zero81 = icmp eq i64 %n.vec80, 0
  br i1 %cmp.zero81, label %scalar.ph76, label %vector.ph82

vector.ph82:                                      ; preds = %min.iters.checked78
  br label %vector.body74

vector.body74:                                    ; preds = %vector.body74, %vector.ph82
  %index83 = phi i64 [ 0, %vector.ph82 ], [ %index.next84, %vector.body74 ]
  %202 = load double, double* %64, align 8
  %203 = insertelement <2 x double> undef, double %202, i32 0
  %204 = insertelement <2 x double> %203, double %202, i32 1
  %205 = insertelement <2 x double> undef, double %202, i32 0
  %206 = insertelement <2 x double> %205, double %202, i32 1
  %207 = load double, double* %65, align 8
  %208 = insertelement <2 x double> undef, double %207, i32 0
  %209 = insertelement <2 x double> %208, double %207, i32 1
  %210 = insertelement <2 x double> undef, double %207, i32 0
  %211 = insertelement <2 x double> %210, double %207, i32 1
  %212 = getelementptr double, double* %66, i64 %index83
  %213 = bitcast double* %212 to <2 x double>*
  %wide.load = load <2 x double>, <2 x double>* %213, align 8
  %214 = getelementptr double, double* %212, i64 2
  %215 = bitcast double* %214 to <2 x double>*
  %wide.load91 = load <2 x double>, <2 x double>* %215, align 8
  %216 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %204, <2 x double> %209, <2 x double> %wide.load)
  %217 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %206, <2 x double> %211, <2 x double> %wide.load91)
  %218 = getelementptr double, double* %67, i64 %index83
  %219 = bitcast double* %218 to <2 x double>*
  store <2 x double> %216, <2 x double>* %219, align 8
  %220 = getelementptr double, double* %218, i64 2
  %221 = bitcast double* %220 to <2 x double>*
  store <2 x double> %217, <2 x double>* %221, align 8
  %index.next84 = add i64 %index83, 4
  %222 = icmp eq i64 %index.next84, %n.vec80
  br i1 %222, label %middle.block75, label %vector.body74

middle.block75:                                   ; preds = %vector.body74
  %cmp.n86 = icmp eq i64 %43, %n.vec80
  br i1 %cmp.n86, label %L139.loopexit.loopexit69, label %scalar.ph76

scalar.ph76:                                      ; preds = %middle.block75, %min.iters.checked78, %if28.us46.preheader
  %bc.resume.val85 = phi i64 [ %n.vec80, %middle.block75 ], [ 0, %if28.us46.preheader ], [ 0, %min.iters.checked78 ]
  br label %if28.us46

if28.lr.ph.split.split.us:                        ; preds = %if28.lr.ph.if28.lr.ph.split_crit_edge
  br i1 %51, label %if28.us44.us.preheader, label %if28.us44.preheader

if28.us44.preheader:                              ; preds = %if28.lr.ph.split.split.us
  %min.iters.check97 = icmp ult i64 %43, 4
  br i1 %min.iters.check97, label %scalar.ph96, label %min.iters.checked98

min.iters.checked98:                              ; preds = %if28.us44.preheader
  %n.vec100 = and i64 %43, -4
  %cmp.zero101 = icmp eq i64 %n.vec100, 0
  br i1 %cmp.zero101, label %scalar.ph96, label %vector.ph102

vector.ph102:                                     ; preds = %min.iters.checked98
  br label %vector.body94

vector.body94:                                    ; preds = %vector.body94, %vector.ph102
  %index103 = phi i64 [ 0, %vector.ph102 ], [ %index.next104, %vector.body94 ]
  %223 = load double, double* %64, align 8
  %224 = insertelement <2 x double> undef, double %223, i32 0
  %225 = insertelement <2 x double> %224, double %223, i32 1
  %226 = insertelement <2 x double> undef, double %223, i32 0
  %227 = insertelement <2 x double> %226, double %223, i32 1
  %228 = getelementptr double, double* %65, i64 %index103
  %229 = bitcast double* %228 to <2 x double>*
  %wide.load111 = load <2 x double>, <2 x double>* %229, align 8
  %230 = getelementptr double, double* %228, i64 2
  %231 = bitcast double* %230 to <2 x double>*
  %wide.load112 = load <2 x double>, <2 x double>* %231, align 8
  %232 = load double, double* %66, align 8
  %233 = insertelement <2 x double> undef, double %232, i32 0
  %234 = insertelement <2 x double> %233, double %232, i32 1
  %235 = insertelement <2 x double> undef, double %232, i32 0
  %236 = insertelement <2 x double> %235, double %232, i32 1
  %237 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %225, <2 x double> %wide.load111, <2 x double> %234)
  %238 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %227, <2 x double> %wide.load112, <2 x double> %236)
  %239 = getelementptr double, double* %67, i64 %index103
  %240 = bitcast double* %239 to <2 x double>*
  store <2 x double> %237, <2 x double>* %240, align 8
  %241 = getelementptr double, double* %239, i64 2
  %242 = bitcast double* %241 to <2 x double>*
  store <2 x double> %238, <2 x double>* %242, align 8
  %index.next104 = add i64 %index103, 4
  %243 = icmp eq i64 %index.next104, %n.vec100
  br i1 %243, label %middle.block95, label %vector.body94

middle.block95:                                   ; preds = %vector.body94
  %cmp.n106 = icmp eq i64 %43, %n.vec100
  br i1 %cmp.n106, label %L139.loopexit.loopexit68, label %scalar.ph96

scalar.ph96:                                      ; preds = %middle.block95, %min.iters.checked98, %if28.us44.preheader
  %bc.resume.val105 = phi i64 [ %n.vec100, %middle.block95 ], [ 0, %if28.us44.preheader ], [ 0, %min.iters.checked98 ]
  br label %if28.us44

if28.us44.us.preheader:                           ; preds = %if28.lr.ph.split.split.us
  %min.iters.check118 = icmp ult i64 %43, 4
  br i1 %min.iters.check118, label %scalar.ph117, label %min.iters.checked119

min.iters.checked119:                             ; preds = %if28.us44.us.preheader
  %n.vec121 = and i64 %43, -4
  %cmp.zero122 = icmp eq i64 %n.vec121, 0
  br i1 %cmp.zero122, label %scalar.ph117, label %vector.ph123

vector.ph123:                                     ; preds = %min.iters.checked119
  br label %vector.body115

vector.body115:                                   ; preds = %vector.body115, %vector.ph123
  %index124 = phi i64 [ 0, %vector.ph123 ], [ %index.next125, %vector.body115 ]
  %244 = load double, double* %64, align 8
  %245 = insertelement <2 x double> undef, double %244, i32 0
  %246 = insertelement <2 x double> %245, double %244, i32 1
  %247 = insertelement <2 x double> undef, double %244, i32 0
  %248 = insertelement <2 x double> %247, double %244, i32 1
  %249 = getelementptr double, double* %65, i64 %index124
  %250 = bitcast double* %249 to <2 x double>*
  %wide.load132 = load <2 x double>, <2 x double>* %250, align 8
  %251 = getelementptr double, double* %249, i64 2
  %252 = bitcast double* %251 to <2 x double>*
  %wide.load133 = load <2 x double>, <2 x double>* %252, align 8
  %253 = getelementptr double, double* %66, i64 %index124
  %254 = bitcast double* %253 to <2 x double>*
  %wide.load134 = load <2 x double>, <2 x double>* %254, align 8
  %255 = getelementptr double, double* %253, i64 2
  %256 = bitcast double* %255 to <2 x double>*
  %wide.load135 = load <2 x double>, <2 x double>* %256, align 8
  %257 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %246, <2 x double> %wide.load132, <2 x double> %wide.load134)
  %258 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %248, <2 x double> %wide.load133, <2 x double> %wide.load135)
  %259 = getelementptr double, double* %67, i64 %index124
  %260 = bitcast double* %259 to <2 x double>*
  store <2 x double> %257, <2 x double>* %260, align 8
  %261 = getelementptr double, double* %259, i64 2
  %262 = bitcast double* %261 to <2 x double>*
  store <2 x double> %258, <2 x double>* %262, align 8
  %index.next125 = add i64 %index124, 4
  %263 = icmp eq i64 %index.next125, %n.vec121
  br i1 %263, label %middle.block116, label %vector.body115

middle.block116:                                  ; preds = %vector.body115
  %cmp.n127 = icmp eq i64 %43, %n.vec121
  br i1 %cmp.n127, label %L139.loopexit.loopexit67, label %scalar.ph117

scalar.ph117:                                     ; preds = %middle.block116, %min.iters.checked119, %if28.us44.us.preheader
  %bc.resume.val126 = phi i64 [ %n.vec121, %middle.block116 ], [ 0, %if28.us44.us.preheader ], [ 0, %min.iters.checked119 ]
  br label %if28.us44.us

if28.us44.us:                                     ; preds = %scalar.ph117, %if28.us44.us
  %"i#668.043.us45.us" = phi i64 [ %271, %if28.us44.us ], [ %bc.resume.val126, %scalar.ph117 ]
  %264 = load double, double* %64, align 8
  %265 = getelementptr double, double* %65, i64 %"i#668.043.us45.us"
  %266 = load double, double* %265, align 8
  %267 = getelementptr double, double* %66, i64 %"i#668.043.us45.us"
  %268 = load double, double* %267, align 8
  %269 = call double @llvm.fmuladd.f64(double %264, double %266, double %268)
  %270 = getelementptr double, double* %67, i64 %"i#668.043.us45.us"
  store double %269, double* %270, align 8
  %271 = add nuw nsw i64 %"i#668.043.us45.us", 1
  %exitcond55 = icmp eq i64 %271, %43
  br i1 %exitcond55, label %L139.loopexit.loopexit67, label %if28.us44.us

if28.us44:                                        ; preds = %scalar.ph96, %if28.us44
  %"i#668.043.us45" = phi i64 [ %278, %if28.us44 ], [ %bc.resume.val105, %scalar.ph96 ]
  %272 = load double, double* %64, align 8
  %273 = getelementptr double, double* %65, i64 %"i#668.043.us45"
  %274 = load double, double* %273, align 8
  %275 = load double, double* %66, align 8
  %276 = call double @llvm.fmuladd.f64(double %272, double %274, double %275)
  %277 = getelementptr double, double* %67, i64 %"i#668.043.us45"
  store double %276, double* %277, align 8
  %278 = add nuw nsw i64 %"i#668.043.us45", 1
  %exitcond51 = icmp eq i64 %278, %43
  br i1 %exitcond51, label %L139.loopexit.loopexit68, label %if28.us44

if28.us46:                                        ; preds = %scalar.ph76, %if28.us46
  %"i#668.043.us47" = phi i64 [ %285, %if28.us46 ], [ %bc.resume.val85, %scalar.ph76 ]
  %279 = load double, double* %64, align 8
  %280 = load double, double* %65, align 8
  %281 = getelementptr double, double* %66, i64 %"i#668.043.us47"
  %282 = load double, double* %281, align 8
  %283 = call double @llvm.fmuladd.f64(double %279, double %280, double %282)
  %284 = getelementptr double, double* %67, i64 %"i#668.043.us47"
  store double %283, double* %284, align 8
  %285 = add nuw nsw i64 %"i#668.043.us47", 1
  %exitcond52 = icmp eq i64 %285, %43
  br i1 %exitcond52, label %L139.loopexit.loopexit69, label %if28.us46

if28:                                             ; preds = %scalar.ph, %if28
  %"i#668.043" = phi i64 [ %291, %if28 ], [ %bc.resume.val, %scalar.ph ]
  %286 = load double, double* %64, align 8
  %287 = load double, double* %65, align 8
  %288 = load double, double* %66, align 8
  %289 = call double @llvm.fmuladd.f64(double %286, double %287, double %288)
  %290 = getelementptr double, double* %67, i64 %"i#668.043"
  store double %289, double* %290, align 8
  %291 = add nuw nsw i64 %"i#668.043", 1
  %exitcond = icmp eq i64 %291, %43
  br i1 %exitcond, label %L139.loopexit.loopexit70, label %if28
}

@coveralls commented Jul 27, 2017

Coverage Status

Coverage increased (+2.6%) to 18.539% when pulling 1bfa3de on devmotion:feature/muladd into f760365 on JuliaDiffEq:master.

@codecov-io commented Jul 27, 2017

Codecov Report

Merging #57 into master will increase coverage by 3.36%.
The diff coverage is 93.02%.


@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   15.89%   19.25%   +3.36%     
==========================================
  Files          30       30              
  Lines        1076     1101      +25     
==========================================
+ Hits          171      212      +41     
+ Misses        905      889      -16
Impacted Files   Coverage Δ
src/utils.jl     76.19% <93.02%> (+57.76%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f760365...8b05ba9.

@devmotion (Member, Author)

I'm not completely sure yet about when exactly muladd should be applied as a dot call. Currently any dotted operand or operator results in a dotted muladd. call, but this might not be the best approach. While correcting all occurrences of @muladd in OrdinaryDiffEq.jl I discovered that often it is sufficient to just change the function head to @muladd function ..., remove all other instances of @muladd in the function body, and add dots to any multiplication and addition that might act on vectors. However, in some cases one has to deal with matrix multiplications such as A*x .+ y, where A is a matrix and x, y are two vectors. Following the current rules, @muladd automatically replaces this with muladd.(A, x, y), which is wrong, as a simple example shows:

julia> A = reshape(1:4, 2, :)
2×2 Base.ReshapedArray{Int64,2,UnitRange{Int64},Tuple{}}:
 1  3
 2  4

julia> x = 1:2
1:2

julia> y = 3:4
3:4

julia> A*x.+y
2-element Array{Int64,1}:
 10
 14

julia> muladd.(A, x, y)
2×2 Array{Int64,2}:
 4   6
 8  12

So I would suggest only applying muladd. if both operators (* and +) are dot calls; see the sketch below. This would be a bit less "magical" and would also allow applying @muladd to complete functions that contain matrix multiplications. However, one then has to add dots to all operators that might work on vectors, and in the case of matrix multiplication one has to use a dotted addition (otherwise @muladd converts it to muladd(A, x, y), which is even worse). But in my opinion these requirements seem quite natural, and one can just use @. in the first case. Another option would be to just splice in function calls such as matrix multiplication (see https://github.com/JuliaLang/julia/blob/master/base/broadcast.jl#L590-L593).
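
Under that proposed rule the two cases would be kept apart roughly as follows (a sketch of the intended behavior, not of the merged implementation):

A .* x .+ y   # both operators are dot calls: rewritten to muladd.(A, x, y)
A * x .+ y    # matrix product with dotted addition: left untouched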

@devmotion (Member, Author)

It seems that because of JuliaLang/julia#22255 it is currently not possible to immediately apply @muladd to all functions in OrdinaryDiffEq completely correctly. Of course, it works for all code written as for loops and for small expressions, but any longer expression that needs to be dotted (often the case in non-in-place methods and interpolations) has to be converted to for loops with temporary arrays first, I guess. Otherwise the macro (with the changes in #57 (comment)) works very well; I could remove almost all occurrences of @muladd.

@coveralls

Coverage Status

Coverage increased (+3.2%) to 19.074% when pulling ae26b60 on devmotion:feature/muladd into f760365 on JuliaDiffEq:master.

(1 similar comment)

@coveralls commented Jul 28, 2017

Coverage Status

Coverage increased (+3.4%) to 19.255% when pulling 79b93ea on devmotion:feature/muladd into f760365 on JuliaDiffEq:master.

@coveralls commented Jul 28, 2017

Coverage Status

Coverage increased (+3.4%) to 19.255% when pulling 8b05ba9 on devmotion:feature/muladd into f760365 on JuliaDiffEq:master.

@ChrisRackauckas (Member)

> %289 = call double @llvm.fmuladd.f64(double %286, double %287, double %288)

That's catching all of the broadcast code, but in the end it's actually doing a muladd. So it looks like it's fine as long as it doesn't hit JuliaLang/julia#22255, right? If that's the case, then I think we're ready to merge.

@devmotion (Member, Author)

Yes, broadcasts work with @muladd, and they can be used (with a small overhead) as long as there are not too many of them (JuliaLang/julia#22255). The new macro can replace the old one, but probably without any immediate benefit. E.g. in OrdinaryDiffEq, one has to use broadcasts in every evaluation possibly involving vectors (or, before JuliaLang/julia#22255 is fixed, transform these evaluations into for loops to avoid performance issues; but I still have to do benchmarks) and add missing dots to already existing dot calls (@muladd only applies dot calls of muladd to constructs like x.*y.+z) to gain any benefit. However, even then vectors of arrays do not benefit from these changes without recursive broadcast; maybe this can be handled by broadcasting with VectorOfArrays.
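
A micro-benchmark along these lines could quantify that overhead (a hypothetical sketch using BenchmarkTools; the arrays are made up and the numbers will vary):

julia> using BenchmarkTools

julia> x, y, z = rand(1000), rand(1000), rand(1000);

julia> @benchmark muladd.($x, $y, $z)

julia> @benchmark $x .* $y .+ $z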

@ChrisRackauckas (Member)

> but probably without any immediate benefit. E.g. in OrdinaryDiffEq, one has to use broadcasts in every evaluation possibly involving vectors (or, before JuliaLang/julia#22255 is fixed, transform these evaluations into for loops to avoid performance issues

The other option is to split them into separate statements:

@. tmp = a1*k1 + a2*k2 + a3*k3 + a4*k4
@. tmp += a5*k5 + a6*k6 + a7*k7 + a8*k8

But that is a lot of code churn for what will hopefully get fixed soon.

@ChrisRackauckas (Member)

Sounds good. Looks like we're all set.

@ChrisRackauckas merged commit 8561242 into SciML:master on Jul 30, 2017
@devmotion deleted the feature/muladd branch on July 30, 2017 17:43