Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grid cells that lack data set to 0 rather than pulling nodata_val from config #16

Open
julietcohen opened this issue Sep 16, 2023 · 12 comments

Comments

@julietcohen
Copy link
Collaborator

In viz-raster/Raster.py, __as_array() assigns 0 to all grid cells that lack data here. Instead, this fill_value should be pulled from the config, which allows users to specify an integer, float, None, or np.nan.

This bug was identified while processing Ingmar's lake change data. In this dataset, the visualized attribute has a range of negative to positive values that communicates the rate of lake size change. Therefore, 0 is not a valid nodata_val because it is a real value in the data that means no change. When processing the data with the fill_value hard coded to 0, but with a nodata_val in the config set to any of the options besides 0, the resulting data is visualized obviously incorrectly (see the lake change issue here).

@julietcohen
Copy link
Collaborator Author

julietcohen commented Sep 28, 2023

Update

I branched both viz-raster and viz-staging to integrate the nodata_val option from the config into the deduplication config output from get_raster_config(). This config is passed to from_vector() here. Then this value is passed to __create_raster_from_stats_df(), which calls __as_array() and applies this value to grid cells with no data value here.

This worked for the rasters (and therefore the web tiles) that represent the polygons. But this approach didn't succeed in removing the yellow layer that stretches from prime meridian to antimeridian. I did a test run on the first 5000 polygons in one of the lake change files, with the nodata_val set to 999:
image
image

@julietcohen
Copy link
Collaborator Author

The web tiling step fails at the step to_image() in WebImage.py if the nodata_val is set to None in the config. Note: for this test run, in WebImage.py, the nodata_val argument was manually set to None to match the config, because this argument's default is not set to pull from the config (yet).

in the terminal:

/home/jcohen/viz-raster/pdgraster/WebImage.py:129: RuntimeWarning: invalid value encountered in cast
  image_data_scaled = image_data_scaled.astype(int)

and in the log:

ERROR:logger:Error creating web tile for tile Tile(x=2, y=0, z=1) from 
ERROR:logger:list index out of range

A deeper look into the error reveals that it occurs at the specified line because there are still remaining np.nan values in image_data_scaled after the cell values are scaled and the nodata_mask values are converted to 256. There are no None values.

@julietcohen
Copy link
Collaborator Author

"Middle" value in the palette is assigned to non-tiled area rendered by Cesium

As pictured above, a confusing solid-color layer expands across half of the world when I visualize the lake size change dataset. Importantly, this layer does not obscure the polygons themselves or even the area immediately around the polygons that's within the tiles produced by the workflow. Rather, the solid layer is present just in the area outside of the polygon tiles. In the palette specified in this config, I "manually" assign a number of colors Ingmar suggested (rather than using a palette name) that span from the smallest value in the dataset (specified as -2 in the val_range) to the largest value (specified as 2 in the val_range). The color of the solid layer is whatever I specify for the middle color (what should be 0).

Also, it goes without saying that the number of tiles is very small for these samples I am processing, and the number of tiles increases as I increase the number of polygons as input (of course, this has always been the case). I mention this to emphasize that if the workflow was producing all the colors we see in cesium, the number of tiles would be huge because the color cover half the earth, and the number would remain consistent because the color visualized in cesium covers half the earth no matter how many polygons I input. This all leads me to believe that the solid layer covering half the earth is partially due to how ceisum is reading my input. Why is cesium reading my 0-value ("middle" color) and applying it to the non-tiled area when I use a non-0 nodata_val, but it does not do this when I use 0 as the no-data_val?

@julietcohen
Copy link
Collaborator Author

julietcohen commented Nov 17, 2023

Mapping the values of a lower resolution raster shows what we expect. Polygons of lake size change are values that range mostly from -2 to +2 (with few exceptions as we see in the data distribution plotted earlier), there are many 999 values that fall within the tiles created from the workflow (that translate to transparent white in the PNG's created from these rasters), and 0 values that populate all cells outside of the tiles. I think the 2 ways to solve the problem we see in Cesium is to 1) get the values outside the tiles to be 999 instead of 0, or 2) communicate to cesium that all values outside the tiles should be visualized as transparent white as well. The second option is kinda the same as the first, because we need to change the cells outside the tiles to not be 0 ever if they follow my palette, because I assign a color to 0 because some lakes may be 0.

Here's all 3 values (0, 999, and other) plotted together:
0 = red (I know, it's super faint 😒)
999 = blue
other = green
image

Because the plot colors were faint and hard to distinguish when plotted together, here are 3 maps that each plot one type of value:
image

image image

@julietcohen
Copy link
Collaborator Author

After chatting with Ian, we determined that the issue is likely not a cesium thing and is probably just in the way we resample the rasters, since the "middle" palette value (that's assigned to 0) is visually present in the lower-z web tiles. Since the web tiles are just color representations of the raster values, I'll have to pinpoint the specific step in the raster resampling where we fill in the pixel values that fall outside the tiles. There is probably a part in that code where those values are hard-coded to 0, and instead they need to pull the nodata_val from the config.

For example, see the following lower-z level web tile that has 1/4 of the tile yellow (one of the higher resolution tiles that are summarized into a lower resolution tile?)

image

@julietcohen
Copy link
Collaborator Author

!!! See code here for resampling rasters

@iannesbitt
Copy link

Nice find!

@julietcohen
Copy link
Collaborator Author

After playing around with the viz-raster/psgraster/Raster.py and viz-raster/psgraster/RasterTiler.pycode and changing the resampling option in the config (and feeling like I'm going in circles!), I find myself trying to figure out what's happening under the hood of rasterio.merge.merge(). I'm so very confident based on intermediate outputs that my changes to pull the nodata_val from the config into __merge_and_resample() is working. In this step, we have several geotiffs stored as an object rasters, all from the same z-level, that we feed into rasterio.merge.merge() to output a combined raster with the same characteristics (CRS, number of bands which is just 1 in this case for simplicity, dtype, and shape). Before we feed these rasters into merge(), none of them have values of 0. As a side note, these rasters are allowed to have a value of zero if a lake did not change size. Just for the sake of simplicity, we feed in rasters that do not have 0 as a value for the attribute of interest.
When we merge, we are now careful to specify two things:

  1. The nodata argument for merge() (see the documentation) is the nodata_val we pull from the config.
  2. The new array that we create after we execute merge() now has a fill_value that pulls the nodata_val we pull from the config as well. This new array populates every pixel with this fill value. We use this new array to specify the destination of the reprojection just afterwards.

Unfortunately, when the merged raster is output, it does indeed have some values of 0 (for the lower-res rasters). It also has values that match the nodata_val we specified, which is great.

I started exploring the differences in intermediate outputs when input data to the workflow includes np.nan values (the lake change data did include these as well as inf and -inf values, but I cleaned them). This is to test the behavior of how merge() populates cells with the nodata_val if the nodata_val is present in the rasters that are being merged, versus when the nodata_val is not present in those rasters.

@julietcohen
Copy link
Collaborator Author

julietcohen commented Dec 5, 2023

Note: Looking into rasterio.merge.merge(), it seems like here the function fills the desitnation array with 0's instead of the no data value. Perhaps if this is changed from np.zeros() to np.full(), we can get the result we want.

But just afterwards, this step seems to populate the cells in the array with the nodata value, but only if it is in range. Perhaps the values we have been testing for the nodata value are not in range...

@robyngit
Copy link
Member

robyngit commented Dec 6, 2023

Looks like a very probable solution to me!

@julietcohen
Copy link
Collaborator Author

Update:

Cloned repo for rasterio and created new env with local installs for viz-raster, viz-staging, and rasterio. Note: for some reason importing rasterio did not work in the env until I also installed conda install -c conda-forge gcc=12.1.0. Must be some weird dependency issue with the latest github release of rasterio that was not present in the pip version.

I edited the merge.py script line noted above to instead be:

dest = np.full((output_count, output_height, output_width), 
                   fill_value = 999, 
                   dtype=dt)

To match my testing viz-workflow script that specifies the nodata_val as 999 in the config. This did not produce errors in the workflow but also did still produce the yellow layer of 0 values in the output web tiles. The search continues!

@julietcohen
Copy link
Collaborator Author

The progress I have made towards this bug fix exists in 2 branches, one for viz-raster and one for viz-staging that have the same name: bug-16-nodata_val. Both branches need to be checked out with a python env actviated that has local installs for these 2 packages, so that changes to the packages' code can be tested with small workflow runs from start to finish (staging through web tiling)

The reason I have a branch in both repos for the same issue is that the bug fix requires changes to both repos because viz-staging defines the configuration, which includes the user's specification for the no data value. That variable is then passed onto the rasterization stage, when the actual no data value is applied to raster cells.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Status: No status
Development

When branches are created from issues, their pull requests are automatically linked.

3 participants