
Remove empty and unnecessary shapefiles from the IWP input data on delta #13

Closed
robyngit opened this issue Dec 8, 2022 · 10 comments

robyngit commented Dec 8, 2022

The IWP workflow will run more efficiently in the future if we remove empty files before processing.

In searching for footprint files that were missing, @eliasm56 found that some files are empty or otherwise should not be included in the workflow. He said:

Any files that I didn't find footprints for can be disregarded because they're either empty, or we don't even have footprints for them and I shouldn't have sent them in the first place.

We should remove these empty or unnecessary files from the input directory on Delta. The first step is to come up with a list of files to remove: those contained in the files-missing-footprints.json list but not in the recovered_footprints.csv list.

Related to PermafrostDiscoveryGateway/pdg-portal#24


julietcohen commented Dec 15, 2022

Initial list exploration

import pandas as pd
import os
import json
import re

# read in the list of files MISSING footprints, in 'read' mode
with open('/u/julietcohen/shapefiles_cleaning/files-missing-footprints.json', 'r') as f:
    missing_fps = json.load(f)

# read in csv with RECOVERED footprints
recovered_fp_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/recovered_footprints.csv')
# convert column to list
recovered_fps = recovered_fp_csv['file'].tolist()

# check length of each list
print(f'{len(missing_fps)} files are missing footprints and {len(recovered_fps)} footprints have been recovered')

4870 files are missing footprints and 1242 footprints have been recovered

  • The missing footprints list contains 4870 longer filepaths, each of which may or may not contain one of the 1242 shorter recovered filepaths. We want to retain only the subset of the 4870 files in the missing footprints list that contain a matching string from the recovered footprints list.

Create list of recovered footprints that match a file with a missing footprint

# for each longer filepath that represents a missing footprint,
# check if ANY of the shorter filepaths are within it,
# if the shorter filepath is within one longer filepath, add it to the matching list
# if the shorter filepath is not within any longer filepath, do nothing
matching_rec = []
for missing_fp in missing_fps:
    for recovered_fp in recovered_fps:
        if recovered_fp in missing_fp:
            matching_rec.append(recovered_fp)

print(f'{len(matching_rec)} recovered footprints match a file with a missing footprint')

1286 recovered footprints match a file with a missing footprint

MRE to demonstrate what the above loop is doing:

long_list = ['catdogbunny', 'squirrelbearmonkey', 'tigerelephantgorilla']
short_list = ['dog', 'bear', 'bird']

matching = []
for long in long_list:
    for short in short_list:
        if short in long:
            matching.append(short)

print(f'{len(matching)} shorter strings are within the longer strings')

2 shorter strings are within the longer strings

Note that the number of recovered footprints that match a file that is missing a footprint, 1286, is larger than the total number of recovered footprints, 1242. This means there are duplicates within the list of matching files. Perhaps this is because each footprint file contains footprints for multiple input files, as noted here by Robyn.

Remove duplicates within list of matching files

Converting the list into a set removes duplicates. Convert the set back to a list.

matching_rec_uniques = list(set(matching_rec))
len(matching_rec_uniques)

949

949 unique recovered footprints can be paired with a file that is missing a footprint.

Subset files-missing-footprints list into a 'keep' list and a 'remove' list

We can use this list of matching files to subset the list of files that lack a footprint into a 'keep' list. re.search() looks for its first argument, a pattern, within its second argument, a larger string. By iterating over both lists, we search for every matching footprint identifier within every file that is missing a footprint and add the file to the 'keep' list when its footprint is found. (Note that re.search() treats the first argument as a regular expression; re.escape() would be safer in general, but these identifiers contain no regex metacharacters.)

keep = []
for missing_fp in missing_fps:
    for matching_rec_unique in matching_rec_uniques:
        if re.search(matching_rec_unique, missing_fp):
            keep.append(missing_fp)

len(keep)

990

As mentioned above, the 'keep' list length is likely greater than 949 because each footprint file contains footprints for multiple inputs. The end goal is to remove all the 'keep' strings from the original files-missing-footprints list; the files that remain are those we want to remove.

@julietcohen

Create list of files that were missing footprints but whose footprints have been recovered

# duplicate the original list of all files that lack a footprint
missing_fps_duplicate = missing_fps.copy()

# iterate thru files that were missing footprints, for some of which the footprints have been recovered
# remove the files for which footprints have been recovered
# from the total (original) list of files that were missing footprints
# this subset list are the files we should remove from the inputs for IWP workflow
for missing_fp in missing_fps:
    for matching_fp in keep:
        if matching_fp == missing_fp:
            missing_fps_duplicate.remove(missing_fp)

print(f'{len(missing_fps_duplicate)} should be removed as input because they are missing a footprint & footprint was not recovered.')

3880 should be removed as input because they are missing a footprint & footprint was not recovered.

MRE to demonstrate what the above loop is doing:

missing_animals = ['cat', 'dog', 'mouse', 'dog', 'elephant']
animals_keep = ['dog', 'elephant']

missing_animals_duplicate = missing_animals.copy()

for missing_animal in missing_animals:
    for matching_animal in animals_keep:
        if matching_animal == missing_animal:
            missing_animals_duplicate.remove(missing_animal)

print(missing_animals_duplicate)

['cat', 'mouse']

Now to convert these steps into a script with list comprehension rather than for loops!
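For reference, a minimal sketch of that refactor, using a comprehension and a set difference instead of the nested loops (untested; assumes the missing_fps and matching_rec_uniques lists built above):

# keep any file whose path contains one of the unique recovered footprint IDs
keep = [
    missing_fp
    for missing_fp in missing_fps
    if any(rec in missing_fp for rec in matching_rec_uniques)
]

# a set difference replaces the remove-while-iterating loop,
# at the cost of dropping duplicate filepaths and the original ordering
files_to_remove = sorted(set(missing_fps) - set(keep))
print(f'{len(files_to_remove)} files should be removed as input')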

julietcohen self-assigned this Dec 15, 2022
@julietcohen

Script with for loops

create list of files to remove
import pandas as pd
import json
import re

# read in the list of files MISSING footprints, in 'read' mode
with open('/path/to/files-missing-footprints.json', 'r') as f: # CHANGE THIS PATH
    missing_fps = json.load(f)

# read in csv with RECOVERED footprints, with column name 'file'
recovered_fp_csv = pd.read_csv('/path/to/recovered_footprints.csv', header = None) # CHANGE THIS PATH
# convert column to list
recovered_fps = recovered_fp_csv[0].tolist()
print(f'{len(missing_fps)} files are missing footprints and {len(recovered_fps)} footprints have been recovered')

# create list of recovered footprints that are within a missing footprint
# for each longer filepath that represents a file without a footprint,
# check whether each shorter filepath is within it,
# if the shorter filepath is within the longer filepath, add it to the matching_rec list
# if the shorter filepath is not within any longer filepath, do nothing
matching_rec = []
for missing_fp in missing_fps:
    for recovered_fp in recovered_fps:
        if recovered_fp in missing_fp:
            matching_rec.append(recovered_fp)
print(f'{len(matching_rec)} recovered footprints match a missing footprint')

# there are more recovered footprints that match a file 
# that is missing a footprint
# than the total number of recovered footprints because there are duplicates
# because each fp contains fp for multiple inputs
# so convert to set to remove duplicates, and convert back into list
matching_rec_uniques = list(set(matching_rec))

# the files that contain any of the strings in `matching_rec_uniques` are the files 
# we want to retain in `files-missing-footprints.json`
# all other files should be removed
keep = []
for missing_fp in missing_fps:
    for matching_rec_unique in matching_rec_uniques:
        if re.search(matching_rec_unique, missing_fp):
            keep.append(missing_fp)

# create list of files that were missing footprints 
# that now have recovered footprints
missing_fps_duplicate = missing_fps.copy()
# iterate thru files that were missing footprints, for some of which the footprints have been recovered
# remove the files for which footprints have been recovered
# from the total (original) list of files that were missing footprints
# this subset list are the files we should remove from the inputs for IWP workflow
for missing_fp in missing_fps:
    for matching_fp in keep:
        if matching_fp == missing_fp:
            missing_fps_duplicate.remove(missing_fp)
print(f'{len(missing_fps_duplicate)} should be removed as input because they are missing a footprint & footprint was not recovered.')

# write the list of IWP files to remove to a json file
# (pass the list directly to json.dump; running it through json.dumps first
# would double-encode it as a JSON string containing JSON)
with open("/path/to/u/files_to_remove.json", "w") as outfile: # CHANGE THIS PATH
    json.dump(missing_fps_duplicate, outfile)

Also saved to Delta at: /u/julietcohen/shapefiles_cleaning/remove_unnecessary_files_forLoops.py

@julietcohen

Deeper investigation into matching recovered footprints to files that are missing footprints

Example of a file path of a file missing a footprint:

/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/russia/230_250_251_iwp/WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008_u16rf3413_pansh/WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008_u16rf3413_pansh.shp

Example of an identifier for a recovered footprint:
WV03_20160803221654_1040010021643200_16AUG03221654-M1BS-500854877100_01_P002

It is clear that this footprint identifier is meant to match a portion of the shp filename. We can extract the filenames from the paths using os.path.basename() and remove the extension easily with os.path.splitext(), reducing the long filepath above to just:
WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008_u16rf3413_pansh

This means that if we remove the trailing portion _u16rf3413_pansh of the above filename, it should match exactly the identifier for the footprint for that file. However, I originally did not take this approach because not all the files that are missing footprints match this file naming structure, but most do.
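A minimal sketch of that reduction, using the example path above:

import os

path = ('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/'
        'high_ice/russia/230_250_251_iwp/'
        'WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008_u16rf3413_pansh/'
        'WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008_u16rf3413_pansh.shp')

# strip the directory and the .shp extension
base = os.path.splitext(os.path.basename(path))[0]

# drop the trailing '_u16rf3413_pansh', i.e. the last two underscore-delimited parts
fp_id = '_'.join(base.split('_')[:-2])
print(fp_id)
# WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008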

  • 4839/4870 base filenames of the files that are missing footprints are 92 characters long, such as the base filename example above.
  • 1/4870 base filenames are < 92 characters: infer_shp
  • 30/4870 base filenames are > 92 characters (see the tally sketch below), such as:
WV02_20120813231405_103001001B8A8E00_12AUG13231405-M1BS_R4C1-052719605010_02_P002_u16rf3413_pansh

which has an inserted _R4C1,
and:

WV02_20140724202918_1030010035AB5C00_14JUL24202918-M1BS_R10C1-500106216070_03_P001_u16rf3413_pansh

which has an inserted _R10C1.
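The tally above can be reproduced with a short sketch (assumes the missing_fps list loaded earlier):

import os
from collections import Counter

# length of each base filename, with directory and extension stripped
base_names = [os.path.splitext(os.path.basename(fp))[0] for fp in missing_fps]
length_counts = Counter(len(name) for name in base_names)
print(length_counts.most_common())
# expect 92-character names to dominate (4839 of 4870, per the counts above)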

So matching recovered footprint identifiers exactly to their respective files wouldn't be too much work, since there isn't much diversity in base filename structures. There was just enough for me to shy away from that approach at the start. I think it would be worth trying, though, to explain the confusing numbers reported above with my generalized approach, like how there were more matches of footprints to files than there were footprints in the first place.

@julietcohen

Duplicates detected within lists

As seen in the previous comment, we are working with roughly 3 unique footprint ID code formats. Since 1 of those formats is very different from the others (infer_shp), we remove it for now, leaving a list of 4869 filepaths that are missing footprints and roughly follow the formats:

WV02_20120813231405_103001001B8A8E00_12AUG13231405-M1BS_R4C1-052719605010_02_P002_u16rf3413_pansh

and

WV02_20140724202918_1030010035AB5C00_14JUL24202918-M1BS_R10C1-500106216070_03_P001_u16rf3413_pansh

Now use Robyn's custom functions to extract the ID code of the footprint for each IWP file that is missing a footprint. These functions are from here and work similarly to the os functions used above, but also remove the trailing characters from the filename that are not part of the footprint ID.

import os

def get_base_name(path):
    """
    Get the base name of a file, without the extension
    """
    return os.path.basename(path).split('.')[0]

def id_from_input_path(input):
    """
        Get just the IWP file 'ID' code from the full path name that
        includes a two-part suffix
    """
    input = get_base_name(input)
    parts = input.split('_')
    parts = parts[:-2]
    input = '_'.join(parts)
    return input

# for each filepath that is missing a footprint, extract the 'ID' code
missing_fp_IDcodes = [id_from_input_path(missing_fp) for missing_fp in missing_fps]

missing_fp_IDcodes[0]

'WV03_20180806035825_104001004023FB00_18AUG06035825-M1BS-502531224100_01_P008'

Iterate through the ID codes of the files that were missing footprints, and create a list of files missing footprints to keep because their footprints have been recovered.

keep = []
for missing_fp_IDcode in missing_fp_IDcodes:
    for recovered_fp in recovered_fps:
        if recovered_fp == missing_fp_IDcode:
            keep.append(missing_fp_IDcode)

print(f'{len(keep)} should be kept as input because their footprint was recovered.')

1286 should be kept as input because their footprint was recovered.

This is the same amount of matches identified by the previous method above! It's great when 2 different approaches return the same result 🎉 Now that we have double confirmed that there are duplicates, let's identify an example of those duplicate files.

@julietcohen

Footprint ID codes are duplicates, subdirs for shp files are not

Identify some footprint ID codes that are present in multiple files that lack footprints:

from collections import Counter

duplicate_fp_ids = [item for item, count in Counter(missing_fp_IDcodes).items() if count > 1]
print(duplicate_fp_ids[0:5])

['WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001', 'WV03_20160721024007_10400100209FE300_16JUL21024007-M1BS-500854844010_01_P001', 'WV03_20160803025042_104001001F527400_16AUG03025042-M1BS-500849252010_01_P001', 'WV02_20200907013951_10300100AB009600_20SEP07013951-M1BS-504694393080_01_P002', 'WV02_20160729025332_103001005A5F0300_16JUL29025332-M1BS-500849766040_01_P001']

Find the full filepaths of one example of 2 files that have the same footprint ID code:

known_dup = 'WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001'
ex_dup_missing_fps = []
for missing_fp in missing_fps:
    if known_dup in missing_fp:
        ex_dup_missing_fps.append(missing_fp)

ex_dup_missing_fps
['/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/russia/268_284_iwp/WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001_u16rf3413_pansh/WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001_u16rf3413_pansh.shp',
 '/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/russia/269_285_iwp/WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001_u16rf3413_pansh/WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001_u16rf3413_pansh.shp']

These two filepaths differ in their subdir that comes after russia: 268_284_iwp versus 269_285_iwp

Let's check out another pair of files that have matching footprint ID codes to see if they differ in the same location in the filepath:

known_dup = 'WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002'
ex_dup_missing_fps = []
for missing_fp in missing_fps:
    if known_dup in missing_fp:
        ex_dup_missing_fps.append(missing_fp)

ex_dup_missing_fps
['/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/canada/122_123_124_125_133_134_iwp/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh.shp',
 '/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/canada/110_111_112_113_iwp/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh.shp']

These two filepaths also differ in the subdir following the region, this time canada instead of russia:
122_123_124_125_133_134_iwp versus 110_111_112_113_iwp
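A quick sketch to surface exactly where two paths with the same footprint ID diverge, assuming ex_dup_missing_fps holds the pair of paths above:

# zip the two paths component-by-component and keep only the mismatches
a, b = ex_dup_missing_fps
diff = [(x, y) for x, y in zip(a.split('/'), b.split('/')) if x != y]
print(diff)
# [('122_123_124_125_133_134_iwp', '110_111_112_113_iwp')]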

@julietcohen

Majority of documented shp files that are missing footprints are empty (contain no geometries)

Example plotting a random IWP shapefile with geometries present (not one of the files in the list of files missing a footprint):

import geopandas as gpd

example_shp = gpd.read_file('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/canada/106_107_iwp/WV02_20100713194911_1030010006B35600_10JUL13194911-M1BS-500085170050_01_P006_u16rf3413_pansh/WV02_20100713194911_1030010006B35600_10JUL13194911-M1BS-500085170050_01_P006_u16rf3413_pansh.shp')

example_shp.plot(figsize=(6,6))

[image: plot of the example shapefile's IWP geometries]

example_shp.head()

[image: example_shp.head() showing attribute rows]

Plot two of the shp files listed in the files-missing-footprints.json that share a footprint ID code, as mentioned in previous comment (canada example):

ex_dup_missing_fps

['/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/canada/122_123_124_125_133_134_iwp/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh.shp',
 '/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/canada/110_111_112_113_iwp/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh/WV02_20200809210719_10300100AB302700_20AUG09210719-M1BS-504570635080_01_P002_u16rf3413_pansh.shp']

Both are empty:

canada_122 = gpd.read_file(ex_dup_missing_fps[0])
canada_110 = gpd.read_file(ex_dup_missing_fps[1])
canada_110.plot()

[image: empty plot; canada_110 has no geometries]

canada_110.head()

[image: canada_110.head() returning no rows]

Of the shp files listed as missing footprints, determine how many are empty

# re-read in missing_fps to ensure we are working with the raw list
with open('/u/julietcohen/shapefiles_cleaning/files-missing-footprints.json', 'r') as f:
    missing_fps = json.load(f)

empty_shp_files = []
nonempty_shp_files = []
for file in missing_fps:
    gdf = gpd.read_file(file)
    nrows = gdf.shape[0] # 0 = rows, 1 = cols
    if nrows == 0:
        empty_shp_files.append(file)
    else:
        nonempty_shp_files.append(file)
# NOTE: if running this loop again, do it in parallel
# it took 141 minutes to run without parallelization

Result:
4058 shp files from the list of 4870 shp files missing footprints are empty.
812 shp files from the list of 4870 shp files missing footprints are not empty.
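If this check needs to be re-run, here is a sketch of the same loop parallelized with a process pool (geopandas must be installed and missing_fps defined as above; the worker count of 8 is an arbitrary choice):

import geopandas as gpd
from concurrent.futures import ProcessPoolExecutor

def is_empty_shp(file):
    """Return (file, True) if the shapefile contains zero rows."""
    gdf = gpd.read_file(file)
    return file, len(gdf) == 0

if __name__ == '__main__':  # guard required for process pools on spawn-based platforms
    empty_shp_files = []
    nonempty_shp_files = []
    with ProcessPoolExecutor(max_workers=8) as pool:
        for file, empty in pool.map(is_empty_shp, missing_fps):
            (empty_shp_files if empty else nonempty_shp_files).append(file)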


julietcohen commented Dec 27, 2022

List of shp filepaths that are missing footprints, and have been identified as lacking geometries: empty_files_missing_fp.csv

  • mix of Russia, Canada, Alaska

List of shp filepaths that are missing footprints, and were not identified as lacking geometries:
nonempty_files_missing_fp.csv

  • all are in Alaska?


julietcohen commented Dec 28, 2022

Compare list of shp files missing footprints that are NOT empty to list of recovered footprints

Spoiler alert: the following code shows there are duplicates in the list of recovered footprints

import pandas as pd
import geopandas as gpd
import os
import re
from collections import Counter

# read in shp files missing footprints that are NOT empty
nonempty_files_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/exported_lists/nonempty_files_missing_fp.csv', header = None)
# convert column to list
nonempty_files = nonempty_files_csv[0].tolist()
# 812 shp files

# read in csv with RECOVERED footprints from Elias
recovered_fp_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/imported_lists/recovered_footprints.csv', header = None)
# convert column to list
recovered_fps = recovered_fp_csv[0].tolist()
# 1242 footprint files, though some of these are for empty shapefiles,
# so we need to filter this list for just the recovered fps that apply to shp files with geometries

# define Robyn's functions to extract the ID code from a IWP file
def get_base_name(path):
    """
    Get the base name of a file, without the extension
    """
    return os.path.basename(path).split('.')[0]

def id_from_input_path(input):
    """
        Get just the IWP file 'ID' code from the full path name that
        includes a two-part suffix
    """
    input = get_base_name(input)
    parts = input.split('_')
    parts = parts[:-2]
    input = '_'.join(parts)
    return input

# for each filepath that is missing a footprint and has geometries, extract the 'ID' code
missing_fp_IDcodes = [id_from_input_path(nonempty_file) for nonempty_file in nonempty_files]
# still 812 shp files

# iterate thru the footprint ID codes of the shp files that are missing footprints and have geoms, 
# and create a list of those to keep bc their footprints have been recovered
keep = []
for missing_fp_IDcode in missing_fp_IDcodes:
    for recovered_fp in recovered_fps:
        if recovered_fp == missing_fp_IDcode:
            keep.append(missing_fp_IDcode)

# 1085 footprints were recovered for shp files that have geoms and are missing footprints
# since 1085 > 812, there are duplicates in one list or both lists

# convert shp file list to set to check for duplicates
missing_fp_IDcodes_set = list(set(missing_fp_IDcodes))
# len(missing_fp_IDcodes_set) = 812 so there are no duplicates in missing_fp_IDcodes

# convert recovered footprints list to set to check for duplicates
unique_recovered_fps = list(set(recovered_fps))
# len(unique_recovered_fps) = 949 so there were 1242 - 949 = 293 duplicates in list of recovered footprints

Compare list of shp files missing footprints that are NOT empty to list of recovered footprints with duplicates removed

import pandas as pd
import geopandas as gpd
import os
import re
import json

# read in shp files missing footprints that are NOT empty
nonempty_files_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/exported_lists/nonempty_files_missing_fp.csv', header = None)
# convert column to list
nonempty_files = nonempty_files_csv[0].tolist()
# 812 shp files

# read in csv with RECOVERED footprints from Elias
recovered_fp_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/imported_lists/recovered_footprints.csv', header = None)
# convert column to list
recovered_fps = recovered_fp_csv[0].tolist()
# 1242 footprint files, some of these are for empty shapefiles, and 293 are duplicates,
# so we need to filter this list for unique values, then filter for
# just the recovered fps that apply to shp files with geometries

recovered_fps_unq = list(set(recovered_fps))
# 949 files

# define Robyn's functions to extract the ID code from a IWP file
def get_base_name(path):
    """
    Get the base name of a file, without the extension
    """
    return os.path.basename(path).split('.')[0]

def id_from_input_path(input):
    """
        Get just the IWP file 'ID' code from the full path name that
        includes a two-part suffix
    """
    input = get_base_name(input)
    parts = input.split('_')
    parts = parts[:-2]
    input = '_'.join(parts)
    return input

# for each filepath that is missing a footprint and has geometries, extract the 'ID' code
missing_fp_IDcodes = [id_from_input_path(nonempty_file) for nonempty_file in nonempty_files]
# 812 shp files still

# iterate thru the ID codes of the files that were missing footprints and have geoms, 
# and find the matching records of the ID's of recovered footprints
matching_recs = []
for missing_fp_IDcode in missing_fp_IDcodes:
    for recovered_fp_unq in recovered_fps_unq:
        if recovered_fp_unq == missing_fp_IDcode:
            matching_recs.append(missing_fp_IDcode)

# 812 shp files should be kept as input because their footprint has been recovered and they contain geoms
# this makes sense, because every unique fp that was recovered matches a shp file with geometries
# that is missing a fp

# the shp files that contain any of the strings in `matching_recs` are the files 
# we want to retain in `files-missing-footprints.json`, all other files should be removed
shp_files_keep = []
for nonempty_file in nonempty_files:
    for matching_rec in matching_recs:
        if re.search(matching_rec, nonempty_file):
            shp_files_keep.append(nonempty_file)

# 812 shp filepaths to keep

# convert this list of files to keep into files to remove
# by removing all those filepaths from the original list of shp files
# that are missing footprints, some of which have no geometries

# read in list of ALL files that are missing footprints, some of which are empty
with open('/u/julietcohen/shapefiles_cleaning/imported_lists/files-missing-footprints.json', 'r') as f:
    all_missing_fps = json.load(f)

# iterate thru the original list of ALL filepaths that were missing footprints,
# regardless if they have geometries,
# and remove the filepaths for which footprints have been recovered
# so this subset list are the files we should remove from the inputs for IWP workflow
files_to_remove = all_missing_fps.copy()
for missing_fp in all_missing_fps:
    for shp_file_keep in shp_files_keep:
        if shp_file_keep == missing_fp:
            files_to_remove.remove(shp_file_keep)
print(f'{len(files_to_remove)} should be removed as input because they either lack geometries or the footprint was not recovered.')

Output:
4058 should be removed as input because they either lack geometries or the footprint was not recovered.

files_to_remove.csv

If the most recent workflow makes sense to @robyngit and/or Elias, I would like to close this issue. Aspects of this issue that made it confusing are:

  • footprint ID duplicates in the file lists provided
  • the fact that some of the shp files were empty

This makes me think we need to re-evaluate how we create the footprint IDs from the shp file names. For example, instead of using a subset of the shp file name, can we use the entire shp file name? To discuss with Robyn before the next start-to-finish IWP run.
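For illustration only, a rough sketch of what exact matching on the entire shp file name could look like, if the footprint identifiers were regenerated to carry the full basename (this full-basename convention is hypothetical; the current recovered_footprints.csv is not keyed this way):

import os

def full_basename(path):
    """Full shp filename without directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]

# hypothetical: with full-basename footprint IDs, matching becomes
# an exact set lookup instead of substring searching
recovered_set = set(recovered_fps)
keep = [fp for fp in missing_fps if full_basename(fp) in recovered_set]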

@julietcohen

Closing this issue because other team members are going to re-process the IWP files and re-structure the footprints directory to match those new files. This issue is only applicable to the current IWP dataset, which will likely not be used in the future.
