geom_raster() of a matrix: Performance analysis and improvements #4989
Comments
A custom geom to improve efficiency
library(bench)
library(ggplot2)
library(cowplot)
library(reshape2) # for melt
library(forcats)
library(grid)
library(rlang)
Summary
Profiling:
pv <- profvis::profvis({
  gplt <- naive_strategy(mat_big)
  benchplot(gplt)
})
pv
I don’t know how to summarize and represent the output of profvis in a
Total time: 38 seconds
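For a plain-text summary of a profile, one option is base R's sampling profiler, which profvis builds on. The sketch below is only an illustration under assumptions: `busy_work()` is a hypothetical stand-in for `benchplot(gplt)`, and the table it prints is not the output reported in this issue.

```r
# Hypothetical workload standing in for benchplot(gplt); it keeps the CPU
# busy long enough for the sampling profiler to collect samples.
busy_work <- function(seconds = 0.5) {
  start <- proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - start < seconds) {
    invisible(sort(runif(1e5)))
  }
}

prof_file <- tempfile(fileext = ".out")
utils::Rprof(prof_file, interval = 0.01)  # start the sampling profiler
busy_work()
utils::Rprof(NULL)                        # stop profiling

prof_summary <- utils::summaryRprof(prof_file)
head(prof_summary$by.total, 10)  # functions ranked by total (inclusive) time
prof_summary$sampling.time       # total sampled time, in seconds
```

The `by.total` table can be pasted into an issue as text, which an interactive flame graph cannot.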
Our naive strategy is spending a huge amount of time in mapping the
geom_matrix_raster:
If you want to try this geom, you can try installing my
# Copyright 2022 Sergio Oller Moreno <[email protected]>
# This file is part of the ggmatrix package and it is distributed under the MIT license terms.
# Check the ggmatrix package license information for further details.
#' Raster a matrix as a rectangle, efficiently
#'
#'
#' @param matrix The matrix we want to render in the plot
#' @param xmin,xmax,ymin,ymax Coordinates where the corners of the matrix will
#' be centered. By default they are taken from rownames (x) and colnames (y) respectively.
#' @param interpolate If `TRUE`, interpolate linearly, if `FALSE` (the default) don't interpolate.
#' @param flip_cols,flip_rows Flip the rows and columns of the matrix. By default we flip the columns.
#' @inheritParams ggplot2::geom_raster
#'
#' @export
geom_matrix_raster <- function(matrix, xmin = NULL, xmax = NULL, ymin = NULL, ymax = NULL,
                               interpolate = FALSE,
                               flip_cols = TRUE,
                               flip_rows = FALSE,
                               show.legend = NA,
                               inherit.aes = TRUE)
{
  data <- data.frame(values = c(matrix))
  mapping <- aes(fill = .data$values)
  if (is.null(xmin)) {
    xmin <- as.numeric(rownames(matrix)[1L])
  }
  if (is.null(xmax)) {
    xmax <- as.numeric(rownames(matrix)[nrow(matrix)])
  }
  if (is.null(ymin)) {
    ymin <- as.numeric(colnames(matrix)[1L])
  }
  if (is.null(ymax)) {
    ymax <- as.numeric(colnames(matrix)[ncol(matrix)])
  }
  if (nrow(matrix) > 1L) {
    x_step <- (xmax - xmin)/(nrow(matrix) - 1L)
  } else {
    x_step <- 1
  }
  if (ncol(matrix) > 1L) {
    y_step <- (ymax - ymin)/(ncol(matrix) - 1L)
  } else {
    y_step <- 1
  }
  # we return two layers, one blank to create the axes and handle limits, another
  # rastering the matrix.
  corners <- data.frame(
    x = c(xmin - x_step/2, xmax + x_step/2),
    y = c(ymin - y_step/2, ymax + y_step/2)
  )
  corners_xy <- corners
  x_y_names <- names(dimnames(matrix))
  if (is.null(x_y_names)) {
    x_y_names <- c("rows", "columns")
  }
  colnames(corners) <- x_y_names
  x_name <- rlang::sym(x_y_names[1L])
  y_name <- rlang::sym(x_y_names[2L])
  list(
    layer(
      data = corners, mapping = aes(x=!!x_name, y=!!y_name), stat = StatIdentity, geom = GeomBlank,
      position = PositionIdentity, show.legend = show.legend, inherit.aes = inherit.aes,
      params = list(), check.aes = FALSE
    ),
    layer(
      data = data,
      mapping = mapping,
      stat = StatIdentity,
      geom = GeomMatrixRaster,
      position = PositionIdentity,
      show.legend = show.legend,
      inherit.aes = inherit.aes,
      params = list2(
        mat = matrix,
        matrix_nrows = nrow(matrix),
        matrix_ncols = ncol(matrix),
        corners = corners_xy,
        flip_cols = flip_cols,
        flip_rows = flip_rows,
        interpolate = interpolate
      )
    )
  )
}
GeomMatrixRaster <- ggproto(
  "GeomMatrixRaster", Geom,
  non_missing_aes = c("fill"),
  required_aes = c("fill"),
  default_aes = aes(fill = "grey35"),
  draw_panel = function(self, data, panel_params, coord, mat, matrix_nrows, matrix_ncols,
                        corners, flip_cols, flip_rows, interpolate) {
    if (!inherits(coord, "CoordCartesian")) {
      rlang::abort(c(
        "GeomMatrixRaster only works with coord_cartesian"
      ))
    }
    corners <- coord$transform(corners, panel_params)
    if (inherits(coord, "CoordFlip")) {
      byrow <- TRUE
      mat_nr <- matrix_ncols
      mat_nc <- matrix_nrows
      nr_dim <- c(matrix_nrows, matrix_ncols)
    } else {
      byrow <- FALSE
      mat_nr <- matrix_nrows
      mat_nc <- matrix_ncols
      nr_dim <- c(matrix_ncols, matrix_nrows)
    }
    x_rng <- range(corners$x, na.rm = TRUE)
    y_rng <- range(corners$y, na.rm = TRUE)
    # Encode the already-mapped fill colours as native integers
    mat <- matrix(
      farver::encode_native(data$fill),
      nrow = mat_nr,
      ncol = mat_nc,
      byrow = byrow
    )
    if (flip_cols) {
      rev_cols <- seq.int(mat_nc, 1L, by = -1L)
      mat <- mat[, rev_cols, drop = FALSE]
    }
    if (flip_rows) {
      rev_rows <- seq.int(mat_nr, 1L, by = -1L)
      # The empty column index is needed: mat[rev_rows, drop = FALSE] would
      # index the matrix as a vector instead of reversing its rows.
      mat <- mat[rev_rows, , drop = FALSE]
    }
    nr <- structure(
      mat,
      dim = nr_dim,
      class = "nativeRaster",
      channels = 4L
    )
    rasterGrob(nr, x_rng[1], y_rng[1],
               diff(x_rng), diff(y_rng), default.units = "native",
               just = c("left","bottom"), interpolate = interpolate)
  },
  draw_key = draw_key_rect
)
efficient_strategy <- function(mat) {
  gplt <- ggplot() +
    geom_matrix_raster(matrix = mat) +
    scale_fill_gradient(trans = "log2")
  gplt
}
cowplot::plot_grid(
  naive_strategy(mat_small) + labs(title = "naive"),
  efficient_strategy(mat_small) + labs(title = "efficient"),
  ncol = 2
)
Same results, how about performance?
bm_efficient <- bench::mark(
  efficient = {
    gplt <- efficient_strategy(mat_big)
    benchplot(gplt)
  },
  iterations = 1L
)
bm_efficient
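The geom above treats the row and column coordinates as cell centres, so the raster extends half a step beyond the first and last coordinate on each axis. A minimal worked sketch of that arithmetic follows; `cell_extent()` is a hypothetical helper written for illustration and is not part of the geom.

```r
# Hypothetical helper mirroring the x_step / corners arithmetic of
# geom_matrix_raster(): n cell centres evenly spaced between min_coord and
# max_coord, padded by half a step on each side.
cell_extent <- function(min_coord, max_coord, n) {
  step <- if (n > 1L) (max_coord - min_coord) / (n - 1L) else 1
  c(lower = min_coord - step / 2, upper = max_coord + step / 2)
}

# Five rows centred at 10, 20, 30, 40, 50: the step is 10,
# so the raster spans 5 to 55.
cell_extent(10, 50, 5L)

# A single row falls back to a step of 1, giving a unit-wide cell.
cell_extent(3, 3, 1L)
```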
Comparison of all strategies:
Our new strategy is much better than
strategies <- rbind(
  bm_baseline,
  bm_fast_fair,
  bm_efficient
)
strategies
ggplot(strategies) +
  geom_col(aes(
    x = forcats::fct_reorder(as.character(expression), as.numeric(min)),
    y = as.numeric(min),
    fill = as.character(expression))
  ) +
  coord_flip() +
  labs(x = "Strategy", y = "CPU time (s)") +
  guides(fill = "none")
Our efficient strategy is twice as fast compared to
We will next profile and target the slowest parts of the efficient strategy.
Profiling the efficient strategy
library(bench)
library(ggplot2)
library(cowplot)
library(reshape2) # for melt
library(forcats)
library(grid)
library(rlang)
Summary
We profile the efficient strategy described before. The main bottlenecks
Profiling the efficient strategy
pv <- profvis::profvis({
  gplt <- efficient_strategy(mat_big)
  benchplot(gplt)
})
pv
I don’t know how to summarize and represent the output of profvis in a
Total time: 18 seconds
Proposed changes to ggplot2
The next message will cover improvements in scale mapping.
Summary
We see how a plot of a 4k x 3k matrix can be made around 8 times faster than when using geom_raster() (45 seconds to 5.6 seconds). The differences in performance tell us that geom_raster() may not be the best choice to rasterize a matrix.
We can bring the timing further down to 1.5 seconds if we omit handling of missing values and reduce the palette of colours, but these are shortcuts ggplot2 can’t make.
In this issue we will see an efficient way of rasterizing a matrix. We'll see which are the main ggplot2 bottlenecks affecting the performance of that efficient approach and we'll get to some pull requests to address those issues.
This issue is structured in several messages:
All code is included for reproducibility, but you are not expected to linger on the details.
If I happen to catch your attention, I'm looking for opportunities, ideally starting around summer 2023. Happy to do remote work from Barcelona (Spain) or Mexico and open to relocation if needed.
Introduction: A small example
In my field of work it is common to have a matrix that we have to plot with something similar to filled.contour or geom_raster. The matrix has two associated axes, one for rows and one for columns. We often need a scale transformation.
Here is a small example of the data:
mat_small
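As an illustration of the kind of data described here, the sketch below builds a small matrix whose row and column names carry the axis coordinates. It is a hypothetical stand-in (`mat_demo`, with made-up coordinates and random values), not the `mat_small` used in this issue.

```r
# Hypothetical example data; the coordinates and values are made up.
set.seed(1)
x_axis <- seq(10, 50, by = 10)     # coordinates associated with the rows
y_axis <- seq(0.5, 2.0, by = 0.5)  # coordinates associated with the columns

mat_demo <- matrix(
  2^runif(length(x_axis) * length(y_axis), min = 0, max = 8),
  nrow = length(x_axis),
  ncol = length(y_axis),
  dimnames = list(rows = as.character(x_axis), columns = as.character(y_axis))
)
mat_demo

# The axes can be recovered from the dimnames when plotting.
as.numeric(rownames(mat_demo))
as.numeric(colnames(mat_demo))
```

The values span several powers of two, which is the kind of range where a log2 scale transformation helps.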
We can use geom_raster() to plot it. Let’s call this the naïve strategy, because we just use ggplot in a naïve way. This approach is great and it works, but we’ll see that it does not scale very well…
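The definition of naive_strategy() is not shown in this text, so the following is only a guess at what such an approach typically looks like: reshape the matrix into one data frame row per cell and hand it to geom_raster(). The names `matrix_to_long()`, `long_demo` and `naive_sketch` are hypothetical, and the plot is only built when ggplot2 is installed.

```r
# Hypothetical example matrix with numeric coordinates in its dimnames.
mat_demo <- matrix(
  2^(1:12), nrow = 4, ncol = 3,
  dimnames = list(rows = as.character(1:4), columns = as.character(1:3))
)

# Long format: one data frame row per matrix cell (column-major order).
matrix_to_long <- function(mat) {
  data.frame(
    x = as.numeric(rownames(mat))[as.vector(row(mat))],
    y = as.numeric(colnames(mat))[as.vector(col(mat))],
    value = as.vector(mat)
  )
}
long_demo <- matrix_to_long(mat_demo)

# Assumed shape of a naive plot: every cell goes through the usual
# ggplot2 data, scale and geom machinery.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  naive_sketch <- ggplot(long_demo) +
    geom_raster(aes(x = x, y = y, fill = value)) +
    scale_fill_gradient(trans = "log2")
}
```

For a 4000 x 3000 matrix the long data frame has 12 million rows, which is where the scaling problem discussed next comes from.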
Scaling with the naïve strategy:
The same problem, with a 4000 x 3000 matrix:
This is what it looks like:
naive_strategy(mat_big)
bm_baseline
Cutting corners strategy:
This is “as fast as I can make it”. It is useful as a reference for what can be done, but it is not realistic to expect ggplot2 to be this fast, due to these shortcuts being taken:
Let’s apply this to the small matrix:
bm_cut_corners
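The summary names two shortcuts: skipping the handling of missing values and reducing the palette of colours. A rough sketch of what they might look like in base R follows. It is not the code behind bm_cut_corners; `values_to_palette()` is a hypothetical helper, and the viridis palette is an arbitrary choice.

```r
# Hypothetical sketch: quantise log2-scaled values into a fixed palette of
# 256 colours. It assumes strictly positive, non-constant values and no
# NAs, which is exactly the corner being cut.
values_to_palette <- function(mat, n_colours = 256L) {
  palette <- grDevices::hcl.colors(n_colours, palette = "viridis")
  logv <- log2(mat)
  rng <- range(logv)
  # Map each value to a palette index between 1 and n_colours.
  idx <- 1L + as.integer(floor((logv - rng[1]) / (rng[2] - rng[1]) * (n_colours - 1L)))
  matrix(palette[idx], nrow = nrow(mat), ncol = ncol(mat))
}

mat_demo <- matrix(2^seq(0, 11), nrow = 4, ncol = 3)
col_demo <- values_to_palette(mat_demo)
raster_demo <- grDevices::as.raster(col_demo)  # ready for grid::rasterGrob()
```

Indexing into a precomputed palette avoids computing a colour per cell, at the cost of colour resolution.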
Fast and fair strategy
If we avoid cutting corners, we can still get quite decent performance. Here we take care of missing values (if there are any) and we don’t limit the palette to 256 colours.
Let’s apply this to the small matrix to assess correctness visually:
bm_fast_fair
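To make the two "fair" requirements concrete, here is a minimal base R sketch that interpolates a colour for every value (no fixed palette) and assigns a dedicated colour to missing values. It is an assumption about the general technique, not the code behind bm_fast_fair; `values_to_colours()` and its default colours are hypothetical choices.

```r
# Hypothetical sketch: continuous colour mapping on a log2 scale with
# explicit handling of missing values. Assumes non-constant positive values.
values_to_colours <- function(mat, low = "#132B43", high = "#56B1F7",
                              na_colour = "grey50") {
  logv <- log2(mat)
  rng <- range(logv, na.rm = TRUE)
  scaled <- (logv - rng[1]) / (rng[2] - rng[1])  # rescale to [0, 1]
  ramp <- grDevices::colorRamp(c(low, high))     # linear RGB interpolation
  ok <- !is.na(scaled)
  out <- rep(na_colour, length(mat))             # missing cells keep na_colour
  rgb_ok <- ramp(scaled[ok])                     # one RGB row per non-missing cell
  out[ok] <- grDevices::rgb(rgb_ok[, 1], rgb_ok[, 2], rgb_ok[, 3],
                            maxColorValue = 255)
  matrix(out, nrow = nrow(mat), ncol = ncol(mat))
}

mat_demo <- matrix(2^seq(0, 11), nrow = 4, ncol = 3)
mat_demo[2, 2] <- NA
col_demo <- values_to_colours(mat_demo)
```

The extra work compared with a fixed palette is one interpolation per non-missing cell, done in a single vectorised call.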
Comparison of all strategies:
We can see how ggplot2 creates a lot of extra copies: it allocates and deallocates 28GB of RAM. This hints at room for improvement. ggplot2 is taking more than 40 seconds, while other approaches need close to 5 seconds.
There is clearly room for improvement, and addressing it is my goal in this issue.
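The sketch below does not reproduce the 28GB figure. It only illustrates one reason per-cell data frames are expensive: the long format stores coordinates for every cell, so each copy is larger than the matrix itself. The matrix size and variable names are hypothetical.

```r
# Hypothetical, scaled-down comparison of the memory footprint of a matrix
# and of its long-format data frame (one row per cell).
mat_demo <- matrix(runif(400 * 300), nrow = 400, ncol = 300)

long_demo <- data.frame(
  x = as.vector(row(mat_demo)),  # integer row index per cell
  y = as.vector(col(mat_demo)),  # integer column index per cell
  value = as.vector(mat_demo)
)

size_matrix <- as.numeric(utils::object.size(mat_demo))
size_long <- as.numeric(utils::object.size(long_demo))

# Ratio of the two footprints; every additional copy of the long data
# frame made along the way pays this overhead again.
size_long / size_matrix
```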