Use Ryu algorithm for floating point to string conversion #8441
Numbers look great. Has a comparison with Grisu2 (no Grisu3, though) here, among others.
I will tackle an implementation based on the C one. It will take time though, and may bear no fruit. I am currently fighting the new overflow-safe Int a lot while doing all of that. I am copying the API of the Grisu3 class for now so that it can be easy to replace, and I'm writing the code in a separate library that can be merged later. Best regards
Excited to see what you can come up with @Zenohate
Oh my... what a rabbit hole. Why did I start following links for Ryu on GitHub 😆 Anyway... there is a discussion about implementing Ryu in Go. Lots of interesting stuff. This comment in particular might help with correctness checking: golang/go#15672 (comment)
Also, as mentioned in the paper
I guess we could always keep Grisu3 around for floats bigger than 64 bits, then.
There's also no good way to have static lookup tables in Crystal like they do in C, but once there's a Ryu implementation we'll see.
@asterite There is also a Java version https://github.com/ulfjack/ryu/tree/master/src/main/java/info/adams/ryu but I guess you mean Crystal needs static lookup tables closer to C's for such cases? Could that allow even more optimizations?
I guess C will dump those static lookup tables into the data segment of the program. In Java it seems they are initialized at runtime. In Crystal it'll probably have to be the same. Maybe it'll still be fast enough.
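To illustrate the point (a generic sketch, not stdlib code): a lookup-table constant in Crystal is populated by ordinary code that runs at program start (or on first use), rather than being emitted directly into the binary's data segment the way a C `static const` array is.

```crystal
# Generic illustration: a small power-of-five table as a Crystal constant.
# The initializer below is regular code executed at runtime, not data baked
# into the executable like a C `static const uint64_t[]` would be.
POW5_TABLE = Array(UInt64).new(26) { |i| 5_u64 ** i }

puts POW5_TABLE[10] # => 9765625
```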
Maybe Crystal can use something similar to Ruby's
The Ryu readme does mention Grisu3 in some comparisons, FWIW (Ryu is stated to be significantly faster).
There is also Dragonbox now.
Dragonbox implements only the shortest round-trip conversion; while this is good enough for
@HertzDevil Hello, thank you for porting Dragonbox, I'm the author of the algorithm. This is just some information in case you're interested. The biggest problem of Ryu printf is the enormous size (102 KB, IIRC) of its static table. I actually have an implementation of (almost) the same algorithm which only requires 39 KB, here. It's not as extensively tested as Dragonbox, though. Also, it is probably a little bit slower than the original implementation for

In fact, I'm working on an alternative algorithm (which works similarly to Ryu printf but uses a different formula for the core computation), which I expect to be faster than Ryu printf while requiring only 4-5 KB (or even half or a quarter of that if you can afford to sacrifice more performance). At this point I have an almost-working implementation, but I sort of stopped finishing it due to other things in my life having higher priority. Maybe early next year I'll publish the work and advertise it through Reddit.
We have other similarly sized lookup tables (Unicode properties), so that is probably not an issue for now. But I will definitely keep an eye out for that Ryu Printf replacement.
I made a working implementation of said algorithm (here is the repo, and here is the explanation) last December but forgot to mention it here. Currently I consider the implementation largely incomplete, though (I believe) it works fine for all

Roughly speaking, for the first few digits (up to 17~19) it performs the usual 128-bit x 64-bit multiplication that Dragonbox/Ryu/etc. do, re-using the same table, and for further digits it falls back to a slower mechanism relying on an additional table. For this slower fallback mechanism, there are several tunable parameters that determine the trade-off between the size of the additional static data and the required number of multiplications per digit. The most significant parameter is something I call the segment length, which, roughly speaking, is the number of digits that are generated "at once". I tried setting this segment length to 22 and to 252, resulting in data sizes of 3680 bytes and 580 bytes, respectively. In the first case, it performs several 192-bit x 64-bit multiplications to obtain 22 digits, while in the second case it performs 960-bit x 64-bit multiplications instead. It seems the performance of the first case is more or less equivalent to Ryu-printf; it wasn't a lot faster than Ryu-printf, unfortunately. You can refer to the benchmark graph I included in the repo.

Currently I'm thinking that it will probably perform better if I extend the Dragonbox table a little bit (which allows some simplification of the first phase of printing the first few digits) and also set the segment length to 18 instead of 22, at the expense of a slightly larger table, but I haven't run any experiments.
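As a rough illustration of the segment idea (a toy sketch added here for clarity, not code from the linked repo): digits of a fraction can be produced a fixed-size segment at a time, where each segment costs one wide multiplication whose width grows with the segment length, while the number of segments needed shrinks; in the real algorithm the segment length also determines how much extra static data is needed.

```crystal
require "big"

# Toy sketch (not the algorithm from the repo above): emit the decimal digits
# of numerator/denominator in fixed-size segments. Each segment needs one wide
# multiplication by 10^segment_length followed by a division, so a longer
# segment length means fewer, but wider, multiplications per digit printed.
def print_segments(numerator : BigInt, denominator : BigInt, segment_length : Int32, segments : Int32)
  step = BigInt.new(10) ** segment_length
  remainder = numerator % denominator
  segments.times do
    remainder *= step                             # one wide multiplication per segment
    digits = remainder // denominator
    remainder = remainder % denominator
    print digits.to_s.rjust(segment_length, '0')  # pad so leading zeros are kept
  end
  puts
end

# 22 digits of 1/7, produced in a single 22-digit segment...
print_segments(BigInt.new(1), BigInt.new(7), 22, 1)  # => 1428571428571428571428
# ...and the same 22 digits produced two at a time.
print_segments(BigInt.new(1), BigInt.new(7), 2, 11)  # => 1428571428571428571428
```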
To get a sense of how large 102 KB is, I logged the effective sizes of those lookup tables (needs crystal-lang/perf-tools#13).

Source code:

```crystal
require "perf_tools/mem_prof"
require "html"

class Array(T)
  # TODO: some constants do not get registered in `PerfTools::MemProf`
  def fallback_reachable_size
    {% if T < Value %}
      instance_sizeof(self) + sizeof(T) * @capacity
    {% else %}
      0
    {% end %}
  end
end

def log_size(obj, name)
  reachable = PerfTools::MemProf.object_size(obj)
  if reachable == 0 && obj.responds_to?(:fallback_reachable_size)
    reachable = obj.fallback_reachable_size
  end
  puts "#{name} : #{obj.class}"
  puts " #{obj.size} elements"
  puts " #{reachable} bytes reachable"
end

module Unicode
  def self.log_sizes
    {% for cvar in @type.class_vars %}
      ::log_size({{cvar}}, "Unicode.{{cvar}}")
    {% end %}
  end

  log_sizes
end

struct String::Grapheme
  log_size(codepoints, "String::Grapheme.codepoints")
end

log_size(HTML::SINGLE_CHAR_ENTITIES, "HTML::SINGLE_CHAR_ENTITIES")
log_size(HTML::DOUBLE_CHAR_ENTITIES, "HTML::DOUBLE_CHAR_ENTITIES")

module Float::Printer::Dragonbox
  log_size(ImplInfo_Float32::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float32::CACHE")
  log_size(ImplInfo_Float64::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float64::CACHE")
end

module Float::Printer::RyuPrintf
  log_size(POW10_SPLIT, "Float::Printer::RyuPrintf::POW10_SPLIT")
  log_size(POW10_SPLIT_2, "Float::Printer::RyuPrintf::POW10_SPLIT_2")
end
```

Output:
In short, the Ryu-printf tables are indeed comparable to the tables used for Unicode normalization, or for HTML entity (un)escaping. However, the code used to populate those tables is much larger than the tables' contents. Here I compile the following code, and then compare the resulting binary's size against a blank source file, on my Apple M2 (at the moment the new tables won't be codegen'ed at all unless the methods that use them are actually called):

```crystal
Float::Printer::RyuPrintf.d2fixed(1.23, 5)
Float::Printer::RyuPrintf.d2exp(1.23, 5)
```

On the other hand, if the new tables are backed by
There's a functional difference in the purposes, though. With the lookup tables for the Ryu algorithm, the situation is different: they're not strictly necessary for the general task, just for this specific implementation. If you want to print floating-point numbers, there are other algorithms which don't need those tables.

For most general programs that's a good deal, because size usually doesn't matter much. So it's probably fine as a default. However, there might be use cases where you don't want that compromise. Maybe we could allow choosing the algorithm at compile time? I suppose that's always possible by monkey-patching the stdlib, but perhaps a more stable mechanism could be helpful. Alternative implementations don't necessarily need to be distributed in the stdlib, though.
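For what it's worth, here is a rough sketch of what such a compile-time switch could look like; the module, method names, and the `small_float_tables` flag are all made up for illustration and are not part of the stdlib.

```crystal
# Hypothetical sketch: select a float-printing implementation with a
# compile-time flag (e.g. `crystal build -Dsmall_float_tables app.cr`).
# Both branches below are stand-ins, not real stdlib entry points.
module MyFloatPrinter
  def self.print(value : Float64, io : IO) : Nil
    {% if flag?(:small_float_tables) %}
      io.printf("%g", value)  # stand-in for a smaller, table-free code path
    {% else %}
      io << value             # stand-in for the table-backed fast path
    {% end %}
  end
end

MyFloatPrinter.print(1.5, STDOUT)
puts
```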
It was just brought to my attention by a friend that Crystal uses the Grisu3 algorithm for converting floating point numbers to strings. Apparently switching to the Ryu algorithm could offer a boost in performance. CC: @Zenohate
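For context, a tiny illustration (mine, not from the issue) of the conversion being discussed: Grisu3, Ryu, and Dragonbox all aim to produce the shortest decimal string that parses back to the same Float64, and differ mainly in how fast they do it (assuming the usual IEEE-754 double behavior).

```crystal
x = 0.1 + 0.2
puts x                 # => 0.30000000000000004 (shortest string that parses back to exactly x)
puts x.to_s.to_f == x  # => true -- the round-trip property these algorithms guarantee
```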