Let geoms pass parameters to scales. ScaleContinuous methods to map values to colours #5031

Closed
wants to merge 4 commits into from

zeehio
Contributor

@zeehio zeehio commented Nov 6, 2022

This is the base of one of the pillars of:

Currently:

  • geoms map data frame columns to aesthetics.
  • scales transform those column values into aesthetics values.
  • geoms turn the transformed values into renderable objects (grobs).

The user can freely combine geoms with scales, and use multiple geoms with the same scale. The independence between geoms and scales lets most geoms work with most scales and that is a great core part of ggplot2.

While this independence is great, there are some implementation details left to ggplot2 that leave room for improvement in terms of performance. In particular, as shown on #4989, a plot that can be built and rendered in <6 seconds may take ~45 seconds in ggplot2.

One of the major bottlenecks is the mapping of column values into aesthetic values, done by the scale$map() method. What can we do to make it faster?

  1. Make the underlying implementation of mapping values to colours faster
  2. Letting the scales know what format the geom expects for the aesthetic values
  3. Changing how values are mapped to colours

Make the underlying implementation of mapping values to colours faster

The first point is mostly taken care of in other pull requests in the scales and farver packages, and it covers things like:

  • Improving the missing value replacement
  • Reducing intermediate memory copies and sweeps on the data

Letting the scales know what format the geom expects for the aesthetic values

The second point needs some communication between geoms and scales. For example, when a geom maps data to a fill or colour aesthetic, the scale transforms the column values into a character vector ("#ff0000", ...). Some geoms do not use character colours but native colours (integers, as used by nativeRaster objects), so they must convert the format when rendering (e.g. https://github.com/zeehio/ggmatrix/blob/98445bf28caaca1022c03a542b8b4541034566a2/R/geom_matrix_raster.R#L123). If the geom can tell the scale that it would rather have colours in native format, and the scale can pass that request on to the palette, the intermediate character representation can be avoided, with significant performance benefits. This pull request defines a way for geoms to communicate with scales, but the example described in this paragraph is left for a future pull request.
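To make the two colour formats concrete, here is a minimal base R sketch of the conversion a geom currently has to do itself. It assumes that the integers stored in nativeRaster objects follow R's internal packed colour layout (red in the lowest byte, then green, blue and alpha). `pack_native()` is a hypothetical helper written for illustration; it is not part of ggplot2 or of this pull request.

```r
# Hypothetical helper: convert character colours to packed integers.
# Assumption: nativeRaster integers use R's internal layout,
# red + green * 2^8 + blue * 2^16 + alpha * 2^24, stored as signed 32-bit.
pack_native <- function(colours) {
  m <- grDevices::col2rgb(colours, alpha = TRUE)  # 4 x n matrix, values 0-255
  packed <- m["red", ] + m["green", ] * 2^8 +
    m["blue", ] * 2^16 + m["alpha", ] * 2^24      # computed as doubles
  # Wrap values above the signed 32-bit range into negative integers.
  packed <- ifelse(packed >= 2^31, packed - 2^32, packed)
  as.integer(packed)
}

hex <- c("#FF0000", "#00FF00", "#0000FF")  # what the scale returns today
native <- pack_native(hex)                 # what a raster-based geom needs
```

If the scale could hand over `native` directly, this extra pass over every colour would disappear. The sketch ignores one edge case: the bit pattern that R reserves for `NA_integer_` cannot be represented this way.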

Changing how values are mapped to colours

The third point is how values are mapped to colours, and it is what this pull request is concerned with. The pull request focuses on ScaleContinuous because it is one of the most common scales, but similar adjustments could be applied to other scales if desired.

ScaleContinuous maps values to palette colours as follows:

  1. unique values are found
  2. unique values are mapped to colors
  3. colors are matched to the original vector
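The three steps above can be sketched in base R as follows. This is an illustration rather than the actual ScaleContinuous code: `pal()` is a toy palette and `map_unique()` a hypothetical name, and the input is assumed to be already rescaled to [0, 1].

```r
# Toy palette: maps numbers in [0, 1] to hex colours (white to red).
pal <- function(x) {
  rgb_mat <- grDevices::colorRamp(c("white", "red"))(x)  # n x 3 matrix, 0-255
  grDevices::rgb(rgb_mat[, 1], rgb_mat[, 2], rgb_mat[, 3], maxColorValue = 255)
}

map_unique <- function(x, palette) {
  uniq <- unique(x)          # 1. find the unique values
  cols <- palette(uniq)      # 2. map only those to colours
  cols[match(x, uniq)]       # 3. match the colours back to the original vector
}

x <- c(0, 0.5, 0.5, 1, 0)    # assumed to be already rescaled to [0, 1]
colours <- map_unique(x, pal)
```

Here the palette is called on three values instead of five. The saving grows with the number of duplicates, and it turns into pure overhead when there are none.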

When most values are unique, this mapping could be faster by simply mapping all values to colors, without finding and matching unique values first. In some cases the geom can guess or know whether that is going to be the case.
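As a rough way to try this out, the sketch below compares the two approaches on a vector of mostly unique doubles. The function names and the toy palette are assumptions made for illustration; the timings depend on the machine and on the palette, so no numbers are claimed here.

```r
# Toy palette: maps numbers in [0, 1] to hex colours (white to red).
pal <- function(x) {
  rgb_mat <- grDevices::colorRamp(c("white", "red"))(x)
  grDevices::rgb(rgb_mat[, 1], rgb_mat[, 2], rgb_mat[, 3], maxColorValue = 255)
}

# "raw": map every value directly, no unique()/match() pass.
map_raw <- function(x, palette) palette(x)

# "unique": the current approach, shown for comparison.
map_unique <- function(x, palette) {
  uniq <- unique(x)
  palette(uniq)[match(x, uniq)]
}

set.seed(1)
x <- runif(1e5)  # doubles with (almost certainly) no duplicates

# Both approaches must return the same colours.
same <- identical(map_raw(x, pal), map_unique(x, pal))

# Compare the cost of each approach on this input.
t_raw <- system.time(map_raw(x, pal))
t_unique <- system.time(map_unique(x, pal))
```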

This pull request establishes a way for geoms to communicate parameters to scales, and specifically use those parameters to define three different mapping_methods. By default the current "unique" approach is used. The geom may specify "raw" or "binned" instead.

The geom defines a new field, scale_params, that will typically be a list (or a function that takes the computed params and returns that list). The list is named by aesthetic, and for each aesthetic it provides a list of options.

For instance, the geom may now specify scale_params = list(fill=list(mapping_method = "raw")) to tell the scale corresponding to the fill aesthetic to use a "raw" mapping method, that is, to map values without finding unique values first. The "raw" method is usually faster than the current "unique" method when, for instance, the data consists of doubles without duplicate values.
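A scale could read these options roughly as sketched below. Only `scale_params` and `mapping_method` come from this pull request; `resolve_mapping_method()` and the `interpolate` parameter are hypothetical names used for illustration, and the real lookup may differ.

```r
# Hypothetical sketch of how a scale could read the per-aesthetic options.
resolve_mapping_method <- function(scale_params, aesthetic, params = list(),
                                   default = "unique") {
  # scale_params may be a function of the computed params.
  if (is.function(scale_params)) {
    scale_params <- scale_params(params)
  }
  method <- scale_params[[aesthetic]]$mapping_method
  if (is.null(method)) default else method
}

# Static form: a list named by aesthetic.
static_params <- list(fill = list(mapping_method = "raw"))

# Function form: `interpolate` is a made-up geom parameter.
dynamic_params <- function(params) {
  method <- if (isTRUE(params$interpolate)) "binned" else "raw"
  list(fill = list(mapping_method = method))
}

resolve_mapping_method(static_params, "fill")    # "raw"
resolve_mapping_method(static_params, "colour")  # "unique" (no hint given)
resolve_mapping_method(dynamic_params, "fill",
                       params = list(interpolate = TRUE))  # "binned"
```

Falling back to "unique" when no hint is given keeps the current behaviour for every geom that does not opt in.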

Besides the default "unique" and the new "raw" mapping methods, we also allow the geom to ask for the "binned" mapping method, where the geom specifies the number of intervals to use, e.g. scale_params = list(fill=list(mapping_method = "binned", mapping_method_bins = 256)). The mapping process is then as follows:

  • values are binned in N intervals
  • intervals are mapped to colors

This approach is "lossy" (we have a maximum of N different colours), but it can be much faster, and the result is almost indistinguishable from that of the other mapping methods.
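One possible way to implement the binned approach is sketched below in base R. It assumes equal-width bins over values already rescaled to [0, 1], each bin taking the colour of its midpoint; the binning used in the actual commits may differ, and `map_binned()` and `pal()` are illustrative names.

```r
# Toy palette: maps numbers in [0, 1] to hex colours (white to red).
pal <- function(x) {
  rgb_mat <- grDevices::colorRamp(c("white", "red"))(x)
  grDevices::rgb(rgb_mat[, 1], rgb_mat[, 2], rgb_mat[, 3], maxColorValue = 255)
}

map_binned <- function(x, palette, n_bins = 256) {
  breaks <- seq(0, 1, length.out = n_bins + 1)
  # Bin index in 1..n_bins for every value (values at 1 go to the last bin).
  bin <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)
  # The palette is evaluated only n_bins times, at the bin midpoints.
  mids <- (breaks[-1] + breaks[-(n_bins + 1)]) / 2
  palette(mids)[bin]
}

set.seed(1)
x <- runif(1e4)
cols <- map_binned(x, pal, n_bins = 16)
length(unique(cols))  # at most 16 distinct colours
```

The palette cost no longer depends on the length of `x`; only the cheap `findInterval()` lookup does.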

Questions/Discussion

  • Shouldn't this "mapping_method" be just a scale argument?
    Yes... with a "but maybe". Yes, that makes sense. If the "mapping_method" is a relevant argument for the scale it could be one of the scale_*_gradient(...) arguments. However it seems a rather "internal" argument and it won't be easy for a regular user to see its effect. An alternative could be to sample the vector we want to map and, based on the density of unique values in the sample, we could choose either "unique" or "raw". However, by letting the geom hint the scale we can let the scale use a more efficient default mapping method in some scenarios.
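The sampling alternative mentioned in this answer could look roughly like the sketch below. The function name, the sample size and the threshold are all assumptions chosen for illustration; nothing like this is implemented in the pull request.

```r
# Hypothetical heuristic: choose the mapping method from a sample of x.
choose_mapping_method <- function(x, sample_size = 1000, threshold = 0.5) {
  n <- length(x)
  if (n == 0) {
    return("unique")
  }
  # Look at a random subset only, so the check itself stays cheap.
  idx <- if (n > sample_size) sample.int(n, sample_size) else seq_len(n)
  frac_unique <- length(unique(x[idx])) / length(idx)
  # Many distinct values in the sample: deduplicating is unlikely to pay off.
  if (frac_unique > threshold) "raw" else "unique"
}

choose_mapping_method(rep(1:3, 1000))               # "unique"
choose_mapping_method(seq(0, 1, length.out = 5000)) # "raw"
```

A sample tends to overestimate the share of unique values in the full vector, so any real threshold would need tuning against benchmarks.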

ScaleContinuous maps values to palette colours as follows:

- unique values are found
- unique values are mapped to colors
- colors are matched to the original vector

If most values are unique, we can be faster by simply mapping all values to colors,
without finding and matching unique values first.

In some scenarios the geom can guess or know if that is going to be the case.

The goal of this commit is to let the geom tell the ScaleContinuous scale
how the mapping from values to colours should be done.

By default the existing "unique" approach is used.

The geom may now specify `scale_params = list(fill=list(mapping_method = "raw"))`
to tell the scale corresponding to the fill aesthetic to use a "raw" approach
of mapping values to colours without finding unique values first.

Besides the default "unique" and the new "raw" mapping methods, we also allow
the geom to ask to use the "binned" approach, where the geom specifies a number
of intervals to use and the mapping process is as follows:

- values are binned in N intervals
- intervals are mapped to colors

This approach is "lossy" (we have a maximum of N different colours), but
this can be much faster and have almost no difference with respect to
the other mapping methods.
@aphalo
Contributor

aphalo commented Dec 14, 2022

An alternative is to limit the number of colours used to those that an observer can distinguish and automatically switch to binning when there are more distinct values to be mapped in the data. There is no point in using more hue or lightness values than those that can be perceived as different. (the basis of JPEG).

@teunbrand
Collaborator

Hi Sergio,

I think having extra options to make colour mapping more efficient is a good thing. However, putting scale parameters under the control of geoms seems to me like it goes against the grammar of graphics. I think you're spot on with this point here:

> Shouldn't this "mapping_method" be just a scale argument?

Is there a good reason to implement this at the geom level of things?

@zeehio
Contributor Author

zeehio commented Dec 24, 2022

> An alternative is to limit the number of colours used to those that an observer can distinguish and automatically switch to binning when there are more distinct values to be mapped in the data. There is no point in using more hue or lightness values than those that can be perceived as different. (the basis of JPEG).

You are right. However, the issue with implementing this suggestion is that the mapping from numbers to colours is not always related just to hue or just to brightness; it may be a combination of an arbitrary number of gradients. So it's not easy to tell in advance, without mapping them, the threshold at which two values become the same colour. And if you have already spent the time mapping all the values to colours, then there is not much more to optimize... :-)

@zeehio
Contributor Author

zeehio commented Dec 24, 2022

> Hi Sergio,
>
> I think having extra options to make colour mapping more efficient is a good thing. However, putting scale parameters under the control of geoms seems to me like it goes against the grammar of graphics. I think you're spot on with this point here:
>
> > Shouldn't this "mapping_method" be just a scale argument?
>
> Is there a good reason to implement this at the geom level of things?

To be honest, I thought about this when I was writing the pull request. I will have to change it so it becomes a scale argument.

There is a scenario (not related to this PR) where it makes sense for the geom to set a scale parameter: a geom may prefer to render colours in a character format (like "#FF0000") or in native format (integers, to be used by nativeRaster objects). In that case, it makes sense for the geom to tell the scale "hey, give me the colours as integers if possible", so that the scale returns the colours as integers and an extra conversion is avoided. It's just an implementation detail, but it has a significant impact on performance. If you keep on reviewing all my pull requests (sorry for the extra work) you will come across that one...

I will rewrite this pull request whenever I have time (probably in a couple of weeks)

@zeehio
Contributor Author

zeehio commented Dec 24, 2022

After further reading, most of what I suggest here can already be done with scale_fill_binned(), so I will close this pull request and clean up the pull requests that depend on this one.

@zeehio zeehio closed this Dec 24, 2022