Display the entire IWP layer #24

Open
robyngit opened this issue Sep 1, 2022 · 10 comments
Labels
data available (The complete dataset is on the datateam server and ready to process), layer (Displaying a specific data product in the PDG portal), pdg (Permafrost Discovery Gateway), Permafrost: Surface Features data layer (category: Permafrost: Surface Features)

Comments

robyngit commented Sep 1, 2022

This issue is to track the progress of generating web tiles & 3D tiles for the entire Ice Wedge Polygon dataset.

robyngit added the pdg (Permafrost Discovery Gateway) and layer (Displaying a specific data product in the PDG portal) labels on Sep 1, 2022

robyngit commented Sep 2, 2022

Reminders:


julietcohen commented Feb 22, 2023

High Ice IWP run 2/21 - out-of-memory error on Delta cancelled staging

  • scheduled job to process all shapefiles for Alaska, Canada, & Russia on 5 nodes, 24 hours
  • used modified viz-raster to avoid ray error and still be able to write raster_summary.csv to update ranges in web tiling step

Staging

  • files written to /tmp on each node.
  • CPU usage looked good at the start, fluctuated between 40-60% on the head and worker nodes:

[screenshot: CPU usage across the head and worker nodes]

  • Memory usage crept up to ~93% after staging about half of the high ice files and stabilized there. I/O wait also increased (staged files were written to /tmp more slowly as time went on).

[screenshot: memory usage and I/O wait]

  • 11,954 staged files were written across all 5 nodes before the connection was lost. The total number of files to process for high ice is 17,039.

[screenshot: staged file counts]

The 11,954 staged files were transferred to /scratch and the job was cancelled after this; it ran for 4.5 hours.

@julietcohen

High Ice IWP run 2/22

  • scheduled job to process all shapefiles for Alaska, Canada, & Russia on 11 nodes, 20 hours
  • increasing # of nodes hopefully decreases the amount of memory used on each node, avoiding the memory leak
  • could not do more nodes or hours based on the amount of CPU credits we have remaining

Staging

  • memory increased on each node to ~70% after processing ~50% of the files, then increased to close to 90% by the time 70% of the files were staged.
  • CPU usage was very high: it started at >90% and stabilized at ~70% per node

[screenshot: CPU and memory usage per node]

  • By the time the vast majority of the files were staged, the CPU dropped to <1% on all nodes according to glances.
  • tmux output shows staged files are still being written quickly. Staging completed in less than 2 hours.

[screenshots: tmux output of staged files being written]

Merging Staged

Went well overall; there was one error output, and this is the only one I saw:

[screenshot: merge error message]

By the time merging concluded, the head node contained 15,113 files (2.03 GB)

Raster Highest

  • Went very well, but suspiciously fast. 1.69 minutes to rasterize all 15,113 raster highest files.
  • 3 nodes have /tmp/geotiff dirs this time, while in all my practice runs only the head node ever created a /tmp/geotiff dir 🤷🏻‍♀️

Raster Lower

  • Went very well, but again suspiciously fast. 5.84 minutes to rasterize all parent geotiffs.
  • wrote directly to scratch
  • total number of files in geotiff dir in all z-levels: 82,373
  • wrote raster_summary.csv, but the formatting was messed up. This occurred before, and is noted in an issue here. I downloaded the csv, removed the few lines that were malformed (data just split between the wrong cells), and uploaded the clean version to /scratch. (One possible programmatic version of that cleanup is sketched below.)
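
For reference, a hedged sketch of how those malformed rows could be dropped programmatically rather than by hand. This assumes pandas ≥ 1.3 (for the `on_bad_lines` option) and that the bad rows are the ones with a field count that doesn't match the header; the file paths are placeholders:

```python
import pandas as pd

# Read raster_summary.csv, skipping any row whose field count does not match
# the header, then write the cleaned copy back out for the web tiling step.
summary = pd.read_csv("raster_summary.csv", on_bad_lines="skip")
summary.to_csv("raster_summary_clean.csv", index=False)
```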

Web Tiling

  • took 8 minutes
  • created 82,373 web tiles

Done!


julietcohen commented Feb 23, 2023

Investigating the sparseness of IWP in the new web tiles

The web tiles produced by the new batch of IWP data are far sparser than the last batch of web tiles.

| Old IWP Data (2022) | New IWP Data (2023) |
| --- | --- |
| 22,319 shp files for Alaska, Canada, Russia | 17,039 shp files for Alaska, Canada, Russia |
| 4,267 for Alaska only | 1,169 for Alaska only |
| 9,606 for Canada only | 7,198 for Canada only |
| 8,446 for Russia only | 8,172 for Russia only |
| 5,356,353 web tiles | 82,373 web tiles |
| ratio = ~240 web tiles created for every shp file | ratio = ~5 web tiles created for every shp file |

Feb 22, 2023 workflow run:

  1. staging processed 17,039 shp files
  2. merging resulted in 15,113 gpkg files
  • The point of merging is to combine the staged files from all nodes onto the head node, but we do not blindly copy over staged files from worker nodes that are already present in the head node, because that would overwrite the file that already exists there. We do not want to overwrite the staged file in the head node because, even though these files may have the same name (tile ID), they may contain different polygons.
  • a) If a tile does not yet exist in the head node, we simply move it there.
  • b) If a tile does exist in the head node, we check whether the files in the two nodes are identical.
  • c) If the tiles are identical, we just skip copying the file to the head node.
  • d) If the tiles are not identical, we append the polygons into one gdf and save that file to the head node.
  • This results in the number of staged tiles in the head node being the total sum of all staged tiles across all nodes, minus the tiles that were already present in the head node. (A minimal code sketch of this merge logic follows this list.)
  3. raster highest produced 15,114 tif files at z-level 15
  4. raster lower produced 82,272 tif files (sum of all z-levels)
  5. web tiling produced 82,272 png files (sum of all z-levels)
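
A minimal sketch of that per-tile merge logic, simplified from the description above. This is not the actual viz-workflow code: the directory layout, deduplication details, and error handling are stripped down, and it assumes tiles are stored as GeoPackages under parallel head/worker directory trees.

```python
import filecmp
import shutil
from pathlib import Path

import geopandas as gpd
import pandas as pd

def merge_one_tile(worker_tile: Path, worker_root: Path, head_root: Path) -> None:
    """Move, skip, or merge one staged tile from a worker node into the head node tree."""
    head_tile = head_root / worker_tile.relative_to(worker_root)
    if not head_tile.exists():
        # a) Tile not yet in the head node: simply move it there.
        head_tile.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(worker_tile), str(head_tile))
    elif filecmp.cmp(worker_tile, head_tile, shallow=False):
        # c) Identical files: skip copying.
        return
    else:
        # d) Same tile ID but different polygons: append into one gdf and rewrite.
        combined = pd.concat(
            [gpd.read_file(head_tile), gpd.read_file(worker_tile)],
            ignore_index=True,
        )
        combined.to_file(head_tile, driver="GPKG")
```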

@julietcohen

Update on high, med, and low ice processing

High ice has been processed completely and is up on the production portal, with a link to the published high ice package archived on the ADC.

Low and medium ice are in progress on Delta. We upgraded the allocation to a higher tier (Explore --> Discover) and exchanged enough ACCESS credits into storage and GPU hours to process all of low and medium together, through all steps, without having to transfer files in between steps from Delta to Datateam and then remove them from Delta to save space.

Staging medium and low went smoothly. Used 20 nodes and transferred all 8,247,460 staged files to /scratch.

Merging is going somewhat smoothly, but running into the same errors as documented before, which are rare compared to the successful merges. As a reminder, if a tile is not present in the head node but is present in a worker node, the file is copied from the worker node to the head node. If the tile is in the head node but is different from the same tile in the worker node, the tile is merged (deduplicated) from the worker node into the head node. Some of the errors printed in the terminal are below:

[screenshots: merge error messages printed in the terminal]

When I investigated these errors before, I was not able to find the source. Finding out why certain files are corrupted during the merge is a high priority for improving the workflow.

0 errors were documented during staging.

@julietcohen

Update on IWP for all regions

IWP for all regions (high, low, medium) has been processed through staging, merging, rasterization, and web tiling. The high region was processed separately from low and medium (which were processed together) on Delta because of memory limitations (especially during the merging step) and job time limitations (we can only process so many files within the max hours allowed per job, and each step needs to complete within 1 job because checkpoints were not built into the ray workflow).

The IWP tiles are therefore within 2 layers, displayed on the demo portal:
[screenshot: the two IWP layers listed on the demo portal]

The deduplication within each of the two workflow runs went well. Because the merging occurred within each run and has not been executed for all 3 regions together, there are strips of duplicated tiles where the high region overlaps with either the low region or the medium region. An example from northern Alaska:
[screenshot: strip of duplicated tiles in northern Alaska]

We discussed different approaches to combining them. I would have to obtain more credits (easy to do) to merge them together on Delta (merging is the step that takes the longest even when the regions were processed separately), but I would likely hit the memory and job time limitations (not related to credits). The other solution is to do it on an NCEAS machine, which would remove the time limit and potentially the memory limitation as well, but I will need to adjust the code to work in parallel on that machine (a rough sketch of what that could look like is below).
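
Purely as an illustration of that adaptation, a minimal sketch of driving the per-tile merge with Python's `concurrent.futures` on a single machine instead of Slurm/Ray. The path and worker count are made up, and `merge_one_tile` here is only a stand-in for the real per-tile merge step (see the merge sketch in the earlier comment):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def merge_one_tile(tile_path: Path) -> Path:
    # Stand-in for the per-tile merge/deduplication logic.
    ...
    return tile_path

if __name__ == "__main__":
    staged_root = Path("/home/shares/iwp/staged")  # hypothetical path on the NCEAS machine
    tile_paths = sorted(staged_root.rglob("*.gpkg"))
    # One worker process per core; no Slurm wall-clock limit applies on a standalone machine.
    with ProcessPoolExecutor(max_workers=32) as pool:
        for _ in pool.map(merge_one_tile, tile_paths, chunksize=100):
            pass
```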

Anna's comment:
I would wait to publish the new data until you have it merged with the already published data, and then call it v2 (with a new DOI). The dataset that is published and that is up on the PDG is enough for people to understand what the data is about.

mbjones moved this to Ready in Data Layers on Jan 11, 2024
@elongano

Category: Permafrost Subsurface Features

elongano added the Permafrost: Surface Features data layer label (category: Permafrost: Surface Features) and removed the priority: high label on Jan 29, 2024
julietcohen moved this from Ready to In Progress in Data Layers on May 28, 2024
@julietcohen

IWP dataset on Google Kubernetes Engine

With the successful execution of a small run of the Kubernetes & parsl workflow on the Google Kubernetes Engine (GKE) (nice work @shishichen! 🎉), we have an updated game plan for processing the entire IWP dataset (high, med, and low) within 1 run, with deduplication between all regions and adjacent files.

  1. I will follow Shishi's documentation to execute my own viz workflow run with the few IWP tiles, which will allow me to accept her pull request into the viz-workflow repo develop branch
  2. Shishi and I will meet to discuss any other workflow parameters outside of the viz config (there may be certain decisions that are specific to GKE, such as how many workers to use, if we can fit all steps into 1 run or not if there is a time restraint, etc.)
  3. One or both of us will run the GKE workflow on a larger subset of data, like just the high ice subset in Alaska, and closely monitor the job to ensure processing runs in parallel and no files are lost, and to track how many credits are burned for just that region
  4. Do some math to make sure we will have enough GKE credits for the full run
  5. Execute the full run


mbjones commented Jul 17, 2024

@julietcohen @shishichen a quick thought as we're preparing for this layer integration - this is probably obvious to you, but I thought I'd throw it out there just in case. As the high, medium, and low images have been tiled and deduplicated separately, we need to combine the two output datasets, dealing with duplicate polygons. I think the main issue is that we need to deduplicate the regions where High data overlaps with Med/Low data. This is not the whole dataset, and should primarily be on the boundaries of where the datasets overlap. If we query to find the list of tiles/images that overlap at the boundaries of those datasets, that list should be much smaller than the full list of all dataset images and tiles, and would save a huge amount of processing time, at the cost of a more complicated selection process for images and then a merging process of old and new tiles.

As an example, I made up the following scenario with High (grey) and Med/Low (salmon) images. In this case, only images H1, H2, ML1, and ML2 need to be reprocessed, and they only affect the tiles in rows 3 and 4 -- the tiles in rows 1, 2, 5, and 6 can be copied across straight to the output dataset without any reprocessing. All of this can be determined ahead of time via calculations on the image footprints, which should be very fast. Does that make sense to you? One thing I wondered about was whether the images like H3 that overlap H1 in row 3 would have an impact on tile row 3. Need to think about that.

[diagram: hypothetical grid of High (grey) and Med/Low (salmon) images overlapping tile rows 1-6]
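
A rough sketch of that pre-selection using image footprint geometries. The file names, layer contents, and CRS alignment are all assumptions here (the real footprint files live with the IWP data packages), so this is only the shape of the calculation, not a drop-in script:

```python
import geopandas as gpd
import pandas as pd

# Hypothetical footprint and tile-grid files, assumed to share a CRS.
high = gpd.read_file("high_footprints.gpkg")
med_low = gpd.read_file("med_low_footprints.gpkg")
tiles = gpd.read_file("tile_boundaries_z15.gpkg")

# Region where the High and Med/Low datasets overlap.
overlap = gpd.overlay(high[["geometry"]], med_low[["geometry"]], how="intersection")
overlap_geom = overlap.unary_union

# Images that touch the overlap region need re-deduplication.
high_redo = high[high.intersects(overlap_geom)]
med_low_redo = med_low[med_low.intersects(overlap_geom)]

# Tiles affected by those images must be re-made; every other tile
# can be copied straight to the output dataset without reprocessing.
redo_geom = pd.concat([high_redo, med_low_redo]).unary_union
affected_tiles = tiles[tiles.intersects(redo_geom)]
print(f"{len(affected_tiles)} of {len(tiles)} tiles need reprocessing")
```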

@julietcohen

Thanks for the description and the visual, Matt! That all aligns with my understanding as well.

Reminders for where the data is stored and published:

DOI: A2KW57K57

This DOI is associated with the published metadata package that will be updated with the tiles that have all been deduplicated between high, med, and low.


Datateam:/var/data/10.18739/A2KW57K57/ contains all regions of the IWP detections and footprints (high, med, low)

Datateam:/var/data/10.18739/A2KW57K57/iwp_geopackage_high contains only the high ice output of staged tiles from the viz workflow

Datateam:/var/data/10.18739/A2KW57K57/iwp_geotiff_high contains only the high ice output of geotiff tiles from the viz workflow

DOI: A24F1MK7Q

This DOI is not associated with a metadata package. This DOI only exists as a subdirectory within Datateam:/var/data/10.18739/ in order to organize the output for low and medium regions.


Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geopackage_low_medium contains only the low and medium ice output of staged tiles from the viz workflow

Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geotiff_low_medium contains only the low and medium ice output of geotiff tiles from the viz workflow
