Use Ryu algorithm for floating point to string conversion #8441
Numbers look great. Has a comparison with Grisu2 (no Grisu3, though) here, among others.
I will tackle an implementation based on the C one. It will take time though, and may bear no fruit. I am currently fighting the new overflow-safe Int a lot while doing all of that. I am copying the API of the Grisu3 class for now so that it can be easy to replace, and I'm writing the code in a separate library that can be merged later. Best regards
Excited to see what you can come up with @Zenohate
Oh my... what a rabbit hole. Why did I start following links for Ryu on GitHub 😆 Anyway... there is a discussion about implementing Ryu in Go. Lots of interesting stuff. This comment in particular might help with correctness checking: golang/go#15672 (comment)
Also, as mentioned in the paper
I guess we could always keep Grisu3 around for floats bigger than 64 bits, then.
There's also no good way to have static lookup tables in Crystal like they do in C, but once there's a Ryu implementation we'll see.
@asterite There is also a Java version https://github.com/ulfjack/ryu/tree/master/src/main/java/info/adams/ryu but I guess you mean Crystal needs static lookup tables closer to C's for such cases? Could that allow even more optimizations?
I guess C will dump those static lookup tables into the data segment of the program. In Java it seems they are initialized at runtime. In Crystal it'll probably have to be the same. Maybe it'll still be fast enough.
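To illustrate the point (a generic sketch, not stdlib code): a lookup-table constant in Crystal is populated by ordinary code that runs at program start (or on first use), rather than being emitted directly into the binary's data segment the way a C `static const` array is.

```crystal
# Generic illustration: a small power-of-five table as a Crystal constant.
# The initializer below is regular code executed at runtime, not data baked
# into the executable like a C `static const uint64_t[]` would be.
POW5_TABLE = Array(UInt64).new(26) { |i| 5_u64 ** i }

puts POW5_TABLE[10] # => 9765625
```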
Maybe Crystal can use something similar to Ruby's
The Ryu readme does mention Grisu3 in some comparisons, FWIW (Ryu is stated to be significantly faster).
There is also Dragonbox now.
Dragonbox implements only the shortest round-trip conversion; while this is good enough for
@HertzDevil Hello, thank you for porting Dragonbox, I'm the author of the algorithm. This is just some information in case you're interested. The biggest problem of Ryu printf is the enormous size (102 KB, IIRC) of its static table. I actually have an implementation of (almost) the same algorithm which only requires 39 KB, here. It's not as extensively tested as Dragonbox, though. Also, it is probably a little bit slower than the original implementation for

In fact, I'm working on an alternative algorithm (which works similarly to Ryu printf but uses a different formula for the core computation), which I expect to be faster than Ryu printf while requiring only 4-5 KB (or even half or a quarter of that if you can afford to sacrifice more performance). At this point I have an almost-working implementation, but I sort of stopped finishing it due to other things in my life having higher priority. Maybe early next year I'll publish the work and advertise it through Reddit.
We have other similarly sized lookup tables (Unicode properties), so that is probably not an issue for now. But I will definitely keep an eye out for that Ryu Printf replacement.
I made a working implementation of said algorithm (here is the repo, and here is the explanation) last December but forgot to mention it here. Currently I consider the implementation largely incomplete, though (I believe) it works fine for all

Roughly speaking, for the first few digits (up to 17~19) it performs the usual 128-bit x 64-bit multiplication that Dragonbox/Ryu/etc. do, re-using the same table, and for further digits it falls back to a slower mechanism relying on an additional table. For this slower fallback mechanism, there are several tunable parameters that determine the trade-off between the size of the additional static data and the required number of multiplications per digit. The most significant parameter is something I call the segment length, which, roughly speaking, is the number of digits that are generated "at once". I tried setting this segment length to 22 and to 252, resulting in data sizes of 3680 bytes and 580 bytes, respectively. In the first case, it performs several 192-bit x 64-bit multiplications to obtain 22 digits, while in the second case it performs 960-bit x 64-bit multiplications instead. It seems the performance of the first case is more or less equivalent to Ryu-printf; it wasn't a lot faster than Ryu-printf, unfortunately. You can refer to the benchmark graph I included in the repo.

Currently I'm thinking that it will probably perform better if I extend the Dragonbox table a little bit (which allows some simplification of the first phase of printing the first few digits) and also set the segment length to 18 instead of 22, at the expense of a slightly larger table, but I haven't run any experiments.
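As a rough illustration of the segment idea (a toy sketch added here for clarity, not code from the linked repo): digits of a fraction can be produced a fixed-size segment at a time, where each segment costs one wide multiplication whose width grows with the segment length, while the number of segments needed shrinks; in the real algorithm the segment length also determines how much extra static data is needed.

```crystal
require "big"

# Toy sketch (not the algorithm from the repo above): emit the decimal digits
# of numerator/denominator in fixed-size segments. Each segment needs one wide
# multiplication by 10^segment_length followed by a division, so a longer
# segment length means fewer, but wider, multiplications per digit printed.
def print_segments(numerator : BigInt, denominator : BigInt, segment_length : Int32, segments : Int32)
  step = BigInt.new(10) ** segment_length
  remainder = numerator % denominator
  segments.times do
    remainder *= step                             # one wide multiplication per segment
    digits = remainder // denominator
    remainder = remainder % denominator
    print digits.to_s.rjust(segment_length, '0')  # pad so leading zeros are kept
  end
  puts
end

# 22 digits of 1/7, produced in a single 22-digit segment...
print_segments(BigInt.new(1), BigInt.new(7), 22, 1)  # => 1428571428571428571428
# ...and the same 22 digits produced two at a time.
print_segments(BigInt.new(1), BigInt.new(7), 2, 11)  # => 1428571428571428571428
```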
To get a sense of how large 102 KB is, I logged the effective sizes of those lookup tables (needs crystal-lang/perf-tools#13).

Source code:

```crystal
require "perf_tools/mem_prof"
require "html"

class Array(T)
  # TODO: some constants do not get registered in `PerfTools::MemProf`
  def fallback_reachable_size
    {% if T < Value %}
      instance_sizeof(self) + sizeof(T) * @capacity
    {% else %}
      0
    {% end %}
  end
end

def log_size(obj, name)
  reachable = PerfTools::MemProf.object_size(obj)
  if reachable == 0 && obj.responds_to?(:fallback_reachable_size)
    reachable = obj.fallback_reachable_size
  end
  puts "#{name} : #{obj.class}"
  puts " #{obj.size} elements"
  puts " #{reachable} bytes reachable"
end

module Unicode
  def self.log_sizes
    {% for cvar in @type.class_vars %}
      ::log_size({{cvar}}, "Unicode.{{cvar}}")
    {% end %}
  end

  log_sizes
end

struct String::Grapheme
  log_size(codepoints, "String::Grapheme.codepoints")
end

log_size(HTML::SINGLE_CHAR_ENTITIES, "HTML::SINGLE_CHAR_ENTITIES")
log_size(HTML::DOUBLE_CHAR_ENTITIES, "HTML::DOUBLE_CHAR_ENTITIES")

module Float::Printer::Dragonbox
  log_size(ImplInfo_Float32::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float32::CACHE")
  log_size(ImplInfo_Float64::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float64::CACHE")
end

module Float::Printer::RyuPrintf
  log_size(POW10_SPLIT, "Float::Printer::RyuPrintf::POW10_SPLIT")
  log_size(POW10_SPLIT_2, "Float::Printer::RyuPrintf::POW10_SPLIT_2")
end
```

Output:
In short, the Ryu-printf tables are indeed comparable to the tables used for Unicode normalization, or for HTML entity (un)escaping. However, the code used to populate those tables is much larger than the tables' contents. Here I compile the following code, and then compare the resulting binary's size against a blank source file, on my Apple M2 (at the moment the new tables won't be codegen'ed at all unless the methods that use them are actually called):

```crystal
Float::Printer::RyuPrintf.d2fixed(1.23, 5)
Float::Printer::RyuPrintf.d2exp(1.23, 5)
```

On the other hand, if the new tables are backed by
There's a functional difference in the purposes, though. With the lookup tables for the Ryu algorithm, the situation is different: they're not strictly necessary for the general task, just for this specific implementation. If you want to print floating-point numbers, there are other algorithms which don't need those tables.

For most general programs that's a good deal, because size usually doesn't matter much. So it's probably fine as a default. However, there might be use cases where you don't want that compromise. Maybe we could allow choosing the algorithm at compile time? I suppose that's always possible by monkey-patching the stdlib, but perhaps a more stable mechanism could be helpful. Alternative implementations don't necessarily need to be distributed in the stdlib, though.
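For what it's worth, here is a rough sketch of what such a compile-time switch could look like; the module, method names, and the `small_float_tables` flag are all made up for illustration and are not part of the stdlib.

```crystal
# Hypothetical sketch: select a float-printing implementation with a
# compile-time flag (e.g. `crystal build -Dsmall_float_tables app.cr`).
# Both branches below are stand-ins, not real stdlib entry points.
module MyFloatPrinter
  def self.print(value : Float64, io : IO) : Nil
    {% if flag?(:small_float_tables) %}
      io.printf("%g", value)  # stand-in for a smaller, table-free code path
    {% else %}
      io << value             # stand-in for the table-backed fast path
    {% end %}
  end
end

MyFloatPrinter.print(1.5, STDOUT)
puts
```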
It was just brought to my attention by a friend that Crystal uses the Grisu3 algorithm for converting floating point numbers to strings. Apparently switching to the Ryu algorithm could offer a boost in performance. CC: @Zenohate
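For context, a tiny illustration (mine, not from the issue) of the conversion being discussed: Grisu3, Ryu, and Dragonbox all aim to produce the shortest decimal string that parses back to the same Float64, and differ mainly in how fast they do it (assuming the usual IEEE-754 double behavior).

```crystal
x = 0.1 + 0.2
puts x                 # => 0.30000000000000004 (shortest string that parses back to exactly x)
puts x.to_s.to_f == x  # => true -- the round-trip property these algorithms guarantee
```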