-
Notifications
You must be signed in to change notification settings - Fork 479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some ideas / notes #47
Comments
(I am just a downstream observer, not a participant.) I recognize those kinds of ARM optimizations, and they sound good. This is great. However I’d also like to encourage you to optimize for code size as well. And I can help if you like. |
Hi @ledvinap , thanks a lot for your detailed investigation and the according results. Amazing!! |
Hi @ledvinap, I am presuming this means that |
I have another proposition of optimization. const char LUT[] = "0123456789ABCDEF";
const char lut[] = "0123456789abcdef";
// internal itoa for 'long' type
static size_t _ntoa_long(out_fct_type out, char* buffer, size_t idx, size_t maxlen, unsigned long value, bool negative, unsigned long base, unsigned int prec, unsigned int width, unsigned int flags)
{
char buf[PRINTF_NTOA_BUFFER_SIZE];
size_t len = 0U;
// no hash for 0 values
if (!value)
{
flags &= ~FLAGS_HASH;
}
// write if precision != 0 and value is != 0
if (!(flags & FLAGS_PRECISION) || value)
{
if(base == 10)
{
do
{
const char digit = (char) (value % base);
buf[len++] = LUT[digit];
value /= base;
} while(value && (len < PRINTF_NTOA_BUFFER_SIZE));
}
else
{
char* lutUsed = (flags & FLAGS_UPPERCASE ? LUT : lut);
unsigned long baseBits = (base == 16 ? 4 : (base == 2 ? 2 : 3));
do
{
const char digit = (char) (value & (base - 1));
buf[len++] = lutUsed[digit];
value >>= baseBits;
} while(value && (len < PRINTF_NTOA_BUFFER_SIZE));
}
}
return _ntoa_format(out, buffer, idx, maxlen, buf, len, negative, (unsigned int)base, prec, width, flags);
} |
That would actually be a good idea for the the base-10 branch also. I would perhaps explicitly write But gcc and clang will use multiplication and shifting instead of division whenever the divisor is a known integer (example here, the latter calls |
For my part, I don't believe an embedded project should be using printf
enough for it to be a CPU bottleneck. In the unlikely event a human
readable screen is attached to the device, there will be a limited number
of strings that are actually presented to the user and their processing
will also be quite limited.
However every byte of RAM you add trying to optimize for a special case is
taken from every implementation that uses your code, whether or not the
optimization is relevant to that implementation. Which is my argument for
simple code that is reasonably efficient.
…On Sat, Oct 5, 2019 at 4:22 AM Vedran Grudenic ***@***.***> wrote:
That would actually be a good idea for the the base-10 branch also. I
would perhaps explicitly write value % 10 instead of value % base,
although I believe most optimizing compilers will understand that base is
10 at that point.
But gcc and clang will use multiplication and shifting instead of division
whenever the divisor is a known integer (example here
<https://godbolt.org/z/_Bl8Ia>, the latter calls __aeabi_idivmod).
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#47?email_source=notifications&email_token=AA2EBXW7XVIBDF3EAIIZF2LQNB2GFA5CNFSM4HGJU642YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEANQGJQ#issuecomment-538641190>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA2EBXSYK4NI3JJKAVL3KG3QNB2GFANCNFSM4HGJU64Q>
.
|
@ledvinap : Hello Petr, I think these different suggestions should be split into separate issues; and for each issue, a PR. I realize this repository hasn't seen much traction, but still. |
There are some possible optimizations (at least for ARM CPU):
Only 4 arguments are passed in registers, remaining arguments has to be pushed on stack and stack frame may be created
out
,buffer
,maxlen
are passed around a lot. This can be replaced by passingIn code, call
out->fn(out, c, idx);
.fn
is only single memory fetch and first argument is single register move)idx
can be hidden in buffer object too (incrementbuffer
in object, keep maxlen as pointer to end of buffer). This will avoid allidx
handling in printf code, which must make no difference (idx
must be incremented after each out call or_out_fct
won't work)BTW:
inline
makes no sense for_out_
* functions, address for function is taken, so no inlining is possible unless_vsnprintf
is inlined tooflags
,base
,prec
,width
, possiblynegative
(in flags?)may be packed into structure and passed by pointer. This will only need some care when this structure is modified.There is subtle bug in
_out_char
-printf("X%cX", '\0');
will emit only 2 characters instead of three. Special-casing termination (possibly even as extra function pointer inout_object
will handle this, it will improve code efficiency a bit (1 skipped test for each character) and as a bonus it will make it possible to implement buffering output function (flush on chunk size or end of format)Another nice optimization would be to replace character output with
write()
-like function. This will considerably limit number of through-pointer calls and save some instructions (test buffer size only once per out call, then memcpy [possibly optimizing copy to use word access])All that is necessary is to emit
*format
only when%
is encountered and implement in-place reversing inout_rev
and possibly useconst char pad0[8]="00000000"; const char padSpc[8]=" ";
for optimized padding.I can (and probably will) make most of the changes, are you interested in merging them?
The text was updated successfully, but these errors were encountered: