-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
buffer: optimize Buffer.byteLength #1713
buffer: optimize Buffer.byteLength #1713
Conversation
|
||
bench.start(); | ||
for (var i = 0; i < n; i++) { | ||
Buffer.byteLength(str, 'utf8'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utf8
is passed explicitly as the encoding because there isn't a default encoding in byteLength
. utf8
is picked as the default Buffer encoding here, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be modified so that str
is different for every iteration of the loop. Otherwise the JIT will most likely "overly optimize" execution because it's seeing the exact same input every time, which may not give reliable/accurate benchmark results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good tip - would a few strings (like 4) work? I think that if they were generated newly for each run of the loop it'd take up the majority of the runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably just cycling through each of the 4 strings you already have defined would probably work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sweet, I'll do that + the mixed one, writing a patch
Pushed up a patch @mscdex, the benchmark numbers now sit comfortably at 150-330%. Also, should I default the JS encoding to |
New findings: |
|
||
bench.start(); | ||
for (var i = 0; i < n; i++) { | ||
Buffer.byteLength(chars[0], 'utf8'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add guards to these? E.g. if (Buffer.byteLength(chars[0], 'utf8') !== 16) throw ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add a mini-test in the beginning, outside the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean inside the loop to avoid unrealistic optimizations, byteLength looks simplistic enough that v8 could determine it's a pure function if not now then in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'll do some assert()
s. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this new commit I'm pushing up, I've changed the benchmark a little bit, and it now has 8 different possibilities for each run in the loop. I'm unsure whether asserting in-loop is still the best choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of the guards is not to be a test but to make the benchmark less susceptible to irrelevant optimizations like licm and dce. There is no reason to use assert and the checks must be done in the loop, otherwise they are ineffective.
1278515
to
cb04c10
Compare
Force pushed a commit:
I added a note in Also, pending @petkaantonov's feedback on ensuring the functionality works correctly in the benchmark. I think the function is now polymorphic in the benchmark or something, because the UTF8 benchmark numbers have dropped somewhat significantly. The base64 is looking great though:
|
|
||
bench.start(); | ||
for (var i = 0; i < n; i++) { | ||
Buffer.byteLength(chars[n % 4], encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the utf8 and base64 objects to arrays, you can replace the hard-coded constant with n % inputs.length
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
cb04c10
to
f6299c4
Compare
Pushed another commit, removing the
Maybe someone has a tip for the UTF8 function? |
Previously, there were very few direct tests for Buffer.byteLength(). I realized I had introduced some breaking changes in nodejs#1713 that the tests didn't catch, so I've pretty much added almost every possible corner of the code here. It also takes any other byteLength tests from test-buffer and moves them.
Shoot, just realized the new UTF8 parser introduces some breaking changes not covered by the test suite. I'll PR / add a new commit to this PR with the new tests and figure out that darn function 😄 |
|
||
// Handle uppercased encodings | ||
if (encoding !== encoding.toLowerCase()) | ||
return byteLength(string, encoding.toLowerCase()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I suggest you cache the value of encoding.toLowerCase()
here, to avoid calling it twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: I'm not sure if we have guidelines on whether or not recursion is allowed. I made Buffer#write() iterative instead of recursive to avoid a (probably academic) stack overflow exception when it's getting called when the stack is almost full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've cached the output. Would you like me to switch to a loop, like you did?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nothing else, it'd be consistent with what Buffer#write() does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, sounds good, I'll make the change.
Buffer.byteLength is called whenever a new string Buffer is created. UTF8 is used as the default encoding, and base64 is also popular. These must be fast and take up a relatively significant part of Buffer instantiation. This commit moves the Buffer.byteLength calculations into only JS-land, moving it from C++ land for base64 and UTF8. It also removes the ByteLength function on the C++ Buffer. It also adds a benchmark for both encodings; the improvements hover for UTF8, change a lot, but base64 is about
f6299c4
to
7b5655c
Compare
case 'binary': | ||
// Deprecated | ||
case 'raw': | ||
case 'raws': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was supported by the C++ version, but isn't currently supported by Buffer.isEncoding
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I think 'raw' and 'raws' have been deprecated since v0.2 or v0.3. I've never seen them being used in the wild either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can follow this PR up with a deprecation commit (to keep it semver-patch) - technically speaking only raws
was deprecated, not raw
, for this function:
> Buffer.byteLength('abc', 'raw')
3
> Buffer.byteLength('abc', 'raws')
'raws' encoding has been renamed to 'binary'. Please update your code.
3
OKAY! I think I got it this time. I've added a relatively large test suite (because it was pretty much not tested before), improved the base64 a little bit, and completely ditched the UTF8 length function. (it is very hard to calculate the length when v8 uses UTF16 on strings and says a 4 byte character is 2 3 byte chars, all the while beating C++). I've instead opted to optimize out some calculations in the C++ side to only give the
|
Local<String> s = args[0]->ToString(env->isolate()); | ||
enum encoding e = ParseEncoding(env->isolate(), args[1], UTF8); | ||
// Fast case: avoid StringBytes on UTF8 string. Jump to v8. | ||
Local<String> str = args[0]->ToString(env->isolate()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's guaranteed to be a string so args[0].As<String>();
should be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I drop the Environment::GetCurrent
above then, or must that be kept?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the string check and env can be dropped, this is always called with a string from js.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave in a CHECK(args[0]->IsString())
to catch bugs.
|
||
uint32_t size = StringBytes::Size(env->isolate(), s, e); | ||
args.GetReturnValue().Set(size); | ||
args.GetReturnValue().Set(len); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can reduce this function to just:
void ByteLengthUtf8(const FunctionCallbackInfo<Value>& args) {
CHECK(args[0]->IsString());
args.GetReturnValue().Set(args[0].As<String>()->Utf8Length());
}
EDIT: s/UTF8/Utf8/ for consistency with other functions/types operating on UTF-8.
Looks alright to me. Will the new |
@trevnorris I don't think so. I used the old algorithm as reference, and the new one doesn't include the extra bytes like the old one did. Otherwise, they're almost the same algorithm. |
if (typeof(string) !== 'string') | ||
string = String(string); | ||
if (typeof string !== 'string') | ||
string = '' + string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a change in functionality? ByteLength()
would throw if a non string was passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, String(val)
also does coercion. We seem to prefer '' + val
though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnoordhuis do we care if node aborts if someone does:
process.binding('buffer').byteLengthUtf8(42);
Or do we want to continue throwing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Fooling around with process.binding() voids the warranty.
Had my last question confirmed by @bnoordhuis. LGTM. |
@trevnorris Thanks! The original CI is mostly out of date. Should I re-run it? |
@brendanashworth yes please. |
The failures on Windows / ARM / Centos 5 look unrelated, the CI is otherwise happy. |
As the PR stands, for the respective UTF8 and Base64 # this PR
utf8 x 2,208,160 ops/sec ±0.75% (93 runs sampled)
base64 x 1,881,927 ops/sec ±0.73% (93 runs sampled)
# v2.0.1
utf8 x 1,502,424 ops/sec ±1.10% (95 runs sampled)
base64 x 969,291 ops/sec ±0.75% (95 runs sampled)
# v0.10.38
utf8 x 659,515 ops/sec ±0.37% (99 runs sampled)
base64 x 408,662 ops/sec ±0.53% (97 runs sampled) |
assert.equal(Buffer.byteLength('aGVsbG8gd29ybGQ=', 'base64'), 11); | ||
assert.equal(Buffer.byteLength('bm9kZS5qcyByb2NrcyE=', 'base64'), 14); | ||
assert.equal(Buffer.byteLength('aGkk', 'base64'), 3); | ||
assert.equal(Buffer.byteLength('bHNrZGZsa3NqZmtsc2xrZmFqc2RsZmtqcw==', 'base64'), 25); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long line.
LGTM with a style nit. |
Thanks @bnoordhuis, I shall merge in the morning. |
Buffer.byteLength is important for speed because it is called whenever a new Buffer is created from a string. This commit optimizes Buffer.byteLength execution by: - moving base64 length calculation into JS-land, which is now much faster - remove redundant code and streamline the UTF8 length calculation It also adds a benchmark and better tests. PR-URL: #1713 Reviewed-By: Trevor Norris <[email protected]> Reviewed-By: Ben Noordhuis <[email protected]>
Thank you reviewers! Landed in 9da168b. (I had to push up a small commit to fix a new failure on HEAD). |
Buffer.byteLength is important for speed because it is called whenever a new Buffer is created from a string. This commit optimizes Buffer.byteLength execution by: - moving base64 length calculation into JS-land, which is now much faster - remove redundant code and streamline the UTF8 length calculation It also adds a benchmark and better tests. PR-URL: nodejs/node#1713 Reviewed-By: Trevor Norris <[email protected]> Reviewed-By: Ben Noordhuis <[email protected]>
Buffer.byteLength
is called whenever a new stringBuffer
is created.UTF8 is used as the default encoding, and base64 is also popular. These
must be fast and take up a relatively significant part of
Buffer
instantiation.
This commit moves the
Buffer.byteLength
calculations into only JS-land,moving it from C++ land for base64 and UTF8.
It also adds a benchmark for both encodings; the improvements hover
around 40-60% for UTF8 strings and 170% for base64.