Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some ideas / notes #47

Open
ledvinap opened this issue Apr 16, 2019 · 8 comments
Open

Some ideas / notes #47

ledvinap opened this issue Apr 16, 2019 · 8 comments

Comments

@ledvinap
Copy link

There are some possible optimizations (at least for ARM CPU):

Only 4 arguments are passed in registers, remaining arguments has to be pushed on stack and stack frame may be created


out,buffer, maxlen are passed around a lot. This can be replaced by passing

struct out_object {
   void  (*fn)(out_object * out, char character, size_t idx);
}

struct out_object_buffer {
  struct out_object_base base;
  char* buffer;
}

static void out_buffer(out_object * out, char character, size_t idx) 
   struct out_object_buffer *self = (struct out_object_buffer*)out;
   self->buffer[idx] = character;
}

struct out_object_buffer_n {
   struct out_object_base base;
   char* buffer;
   size_t maxlen;
}

static void out_buffer_n(out_object * out, char character, size_t idx) 
   struct out_object_buffer_n *self = (struct out_object_buffer_n*)out;
   if(idx < self->maxlen)
      self->buffer[idx] = character;
}

In code, call out->fn(out, c, idx);. fn is only single memory fetch and first argument is single register move)

idx can be hidden in buffer object too (increment buffer in object, keep maxlen as pointer to end of buffer). This will avoid all idx handling in printf code, which must make no difference (idx must be incremented after each out call or _out_fct won't work)

BTW: inline makes no sense for _out_* functions, address for function is taken, so no inlining is possible unless _vsnprintf is inlined too


flags, base, prec, width, possibly negative (in flags?)may be packed into structure and passed by pointer. This will only need some care when this structure is modified.


There is subtle bug in _out_char - printf("X%cX", '\0'); will emit only 2 characters instead of three. Special-casing termination (possibly even as extra function pointer in out_object will handle this, it will improve code efficiency a bit (1 skipped test for each character) and as a bonus it will make it possible to implement buffering output function (flush on chunk size or end of format)


Another nice optimization would be to replace character output with write()-like function. This will considerably limit number of through-pointer calls and save some instructions (test buffer size only once per out call, then memcpy [possibly optimizing copy to use word access])
All that is necessary is to emit *format only when % is encountered and implement in-place reversing in out_rev and possibly use const char pad0[8]="00000000"; const char padSpc[8]=" "; for optimized padding.



I can (and probably will) make most of the changes, are you interested in merging them?

@ghost
Copy link

ghost commented Apr 16, 2019

(I am just a downstream observer, not a participant.)

I recognize those kinds of ARM optimizations, and they sound good. This is great.

However I’d also like to encourage you to optimize for code size as well. And I can help if you like.

@mpaland
Copy link
Owner

mpaland commented Apr 20, 2019

Hi @ledvinap , thanks a lot for your detailed investigation and the according results. Amazing!!
I´m on Easter holiday in the moment and have a look into all issues when returning home.
Happy Easter!

@vgrudenic
Copy link
Contributor

Hi @ledvinap, I am presuming this means that out_fct_type would be removed and only out_object.fn would remain? This makes sense because fctprintf is the most general overload and all other variants can pass through the same struct, so there should be no need to have two separate function pointers.

@legier
Copy link

legier commented Oct 5, 2019

I have another proposition of optimization.
In _ntoa_long and _ntoa_long_long you can make different processing path based on base of number. If its base is power of 2 you can replace division and modulo operation with simple bitwise right shift and and operation. It will improve performance a lot on most microcontrolers and even more on those without hardware division support.
Another optimization (in those functions) is to use look up table while obtaining if output should be a digit or a letter. It will reduce the if statement and all folowing calculations with single memory access.
I propose something like this:

const char LUT[] = "0123456789ABCDEF";
const char lut[] = "0123456789abcdef";
// internal itoa for 'long' type
static size_t _ntoa_long(out_fct_type out, char* buffer, size_t idx, size_t maxlen, unsigned long value, bool negative, unsigned long base, unsigned int prec, unsigned int width, unsigned int flags)
{
  char buf[PRINTF_NTOA_BUFFER_SIZE];
  size_t len = 0U;

  // no hash for 0 values
  if (!value) 
  {
    flags &= ~FLAGS_HASH;
  }

  // write if precision != 0 and value is != 0
  if (!(flags & FLAGS_PRECISION) || value) 
  {
    if(base == 10)
    {
      do
      {
        const char digit = (char) (value % base);
        buf[len++] = LUT[digit];
        value /= base;
      } while(value && (len < PRINTF_NTOA_BUFFER_SIZE));
    }
    else
    {
      char* lutUsed = (flags & FLAGS_UPPERCASE ? LUT : lut);
      unsigned long baseBits = (base == 16 ? 4 : (base == 2 ? 2 : 3));
      do
      {
        const char digit = (char) (value & (base - 1));
        buf[len++] = lutUsed[digit];
        value >>= baseBits;
      } while(value && (len < PRINTF_NTOA_BUFFER_SIZE));
    }
  }

  return _ntoa_format(out, buffer, idx, maxlen, buf, len, negative, (unsigned int)base, prec, width, flags);
}

@vgrudenic
Copy link
Contributor

That would actually be a good idea for the the base-10 branch also. I would perhaps explicitly write value % 10 instead of value % base, although I believe most optimizing compilers will understand that base is 10 at that point.

But gcc and clang will use multiplication and shifting instead of division whenever the divisor is a known integer (example here, the latter calls __aeabi_idivmod).

@ghost
Copy link

ghost commented Oct 5, 2019 via email

@eyalroz
Copy link

eyalroz commented Jun 30, 2021

@ledvinap : Hello Petr,

I think these different suggestions should be split into separate issues; and for each issue, a PR. I realize this repository hasn't seen much traction, but still.

@eyalroz
Copy link

eyalroz commented Aug 4, 2021

@legier : See also issue #116 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants