Remove empty and unnecessary shapefiles from the IWP input data on delta #13
Initial list exploration
import pandas as pd
import os
import json
import re
# read in list of MISSING footprints with 'read' mode
with open('/u/julietcohen/shapefiles_cleaning/files-missing-footprints.json', 'r') as f:
missing_fps = json.load(f)
# read in csv with RECOVERED footprints
recovered_fp_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/recovered_footprints.csv')
# convert column to list
recovered_fps = recovered_fp_csv['file'].tolist()
# check length of each list
print(f'{len(missing_fps)} files are missing footprints and recovered {len(recovered_fps)} footprints')
Create list of recovered footprints that match a file with a missing footprint
# for each longer filepath that represents a file with a missing footprint,
# check if ANY of the shorter filepaths are within it,
# if the shorter filepath is within one longer filepath, add it to the matching list
# if the shorter filepath is not within any longer filepath, do nothing
matching_rec = []
for missing_fp in missing_fps:
for recovered_fp in recovered_fps:
if recovered_fp in missing_fp:
matching_rec.append(recovered_fp)
print(f'{len(matching_rec)} recovered footprints match a file with a missing footprint')
MRE to demonstrate what the above loop is doing:
long_list = ['catdogbunny', 'squirrelbearmonkey', 'tigerelephantgorilla']
short_list = ['dog', 'bear', 'bird']
matching = []
for long in long_list:
for short in short_list:
if short in long:
matching.append(short)
print(f'{len(matching)} shorter strings are within the longer strings')
Note that the number of recovered footprints that match a file that is missing a footprint, 1286, is larger than the total number of recovered footprints, 1242. This means there are duplicates within the list of matching files. Perhaps this is because each footprint contains footprints for multiple inputs, as noted here by Robyn.
Remove duplicates within list of matching files
Converting the list into a set removes duplicates. Convert the set back to a list.
matching_rec_uniques = list(set(matching_rec))
len(matching_rec_uniques)
949 unique recovered footprints can be paired with a file that is missing a footprint.
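As a side note (a minimal sketch, not part of the original workflow): converting to a set discards the order in which matches were found, so if order matters, the duplicates can instead be removed with dict.fromkeys, which preserves insertion order in Python 3.7+.
# order-preserving deduplication of the matching_rec list built above
matching_rec_uniques_ordered = list(dict.fromkeys(matching_rec))
print(len(matching_rec_uniques_ordered))  # same count as list(set(matching_rec))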
Subset: create list of files that were missing footprints whose footprints have been recovered
# duplicate the original list of all files that lack a footprint
missing_fps_duplicate = missing_fps.copy()
# iterate thru files that were missing footprints, for some of which the footprints have been recovered
# remove the files for which footprints have been recovered
# from the total (original) list of files that were missing footprints
# this subset list contains the files we should remove from the inputs for the IWP workflow
# note: `keep` (built in the full script below) is the list of filepaths whose footprint was recovered
for missing_fp in missing_fps:
for matching_fp in keep:
if matching_fp == missing_fp:
missing_fps_duplicate.remove(missing_fp)
print(f'{len(missing_fps_duplicate)} should be removed as input because they are missing a footprint & footprint was not recovered.')
MRE to demonstrate what the above loop is doing:
missing_animals = ['cat', 'dog', 'mouse', 'dog', 'elephant']
animals_keep = ['dog', 'elephant']
missing_animals_duplicate = missing_animals.copy()
for missing_animal in missing_animals:
for matching_animal in animals_keep:
if matching_animal == missing_animal:
missing_animals_duplicate.remove(missing_animal)
print(missing_animals_duplicate)
Now to convert these steps into a script with list comprehension rather than for loops!
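As a sketch of that conversion (assuming the missing_fps and recovered_fps lists from above; this is not the final script, and note it adds each file to keep at most once, unlike the loop version):
# recovered footprint IDs that appear somewhere in a path that is missing a footprint
matching_rec = [recovered_fp for missing_fp in missing_fps for recovered_fp in recovered_fps if recovered_fp in missing_fp]
matching_rec_uniques = list(set(matching_rec))
# files that are missing a footprint but whose footprint ID was recovered
keep = [missing_fp for missing_fp in missing_fps if any(rec in missing_fp for rec in matching_rec_uniques)]
# files to remove: missing a footprint and the footprint was not recovered
files_to_remove = [missing_fp for missing_fp in missing_fps if missing_fp not in keep]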
Script with for loops: create list of files to remove
import pandas as pd
import json
import re
# read in list of MISSING footprints with 'read' mode
with open('/path/to/files-missing-footprints.json', 'r') as f: # CHANGE THIS PATH
missing_fps = json.load(f)
# read in csv with RECOVERED footprints (no header row)
recovered_fp_csv = pd.read_csv('/path/to/recovered_footprints.csv', header = None) # CHANGE THIS PATH
# convert column to list
recovered_fps = recovered_fp_csv[0].tolist()
print(f'{len(missing_fps)} files are missing footprints and {len(recovered_fps)} footprints have been recovered')
# create list of recovered footprints that are within a missing footprint
# for each longer filepath that represents a file without a footprint,
# check if any of the shorter filepaths are within it,
# if the shorter filepath is within the longer filepath, add it to the matching_rec list
# if the shorter filepath is not within any longer filepath, do nothing
matching_rec = []
for missing_fp in missing_fps:
for recovered_fp in recovered_fps:
if recovered_fp in missing_fp:
matching_rec.append(recovered_fp)
print(f'{len(matching_rec)} recovered footprints match a missing footprint')
# there are more recovered footprints that match a file
# that is missing a footprint
# than the total number of recovered footprints because there are duplicates
# because each fp contains fp for multiple inputs
# so convert to set to remove duplicates, and convert back into list
matching_rec_uniques = list(set(matching_rec))
# the files that contain any of the strings in `matching_rec_unique` are the files
# we want to retain in `files-missing-footprints.json`
# all other files should be removed
keep = []
for missing_fp in missing_fps:
for matching_rec_unique in matching_rec_uniques:
if re.search(matching_rec_unique, missing_fp):
keep.append(missing_fp)
# create list of files that were missing footprints
# that now have recovered footprints
missing_fps_duplicate = missing_fps.copy()
# iterate thru files that were missing footprints, for some of which the footprints have been recovered
# remove the files for which footprints have been recovered
# from the total (original) list of files that were missing footprints
# this subset list contains the files we should remove from the inputs for the IWP workflow
for missing_fp in missing_fps:
for matching_fp in keep:
if matching_fp == missing_fp:
missing_fps_duplicate.remove(missing_fp)
print(f'{len(missing_fps_duplicate)} should be removed as input because they are missing a footprint & footprint was not recovered.')
# write the list of IWP files to remove from the input to a json file
# (json.dump serializes the list directly; calling json.dumps first and then
# json.dump would double-encode the list as a quoted string)
with open("/path/to/u/files_to_remove.json", "w") as outfile: # CHANGE THIS PATH
    json.dump(missing_fps_duplicate, outfile)
Also saved to Delta at:
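A quick sanity check (a sketch, using the hypothetical output path above): read the file back and confirm the count matches what was printed.
with open("/path/to/u/files_to_remove.json", "r") as f:  # CHANGE THIS PATH
    files_to_remove_check = json.load(f)
print(len(files_to_remove_check))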
Deeper investigation into matching recovered footprints to files that are missing footprints
Example of a file path of a file missing a footprint:
Example of an identifier for a recovered footprint:
It is clear that this footprint identifier is meant to match a portion of the shp filename. We can extract the filenames from the paths, and if we remove the trailing portion (the inserted suffix) from each filename, the remaining base name should match a recovered footprint identifier.
So matching recovered footprint identifiers exactly to their respective footprints wouldn't be too much work, since there isn't a ton of diversity in base filename structures. There was just enough for me to shy away from that approach at the start. I think it would be worth trying, though, to explain the confusing numbers reported above with my generalized approach, like how there were more matches for footprints to files than there were footprints in the first place.
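A minimal sketch of that exact-match idea (assuming the missing_fps and recovered_fps lists from the earlier snippets; this is not the code actually used here): strip the trailing suffix from each base filename and compare against the recovered footprint identifiers with set operations, which also avoids the nested loops.
import os

def id_from_path(path):
    # drop the extension, then the last two underscore-separated parts of the base name
    base = os.path.basename(path).split('.')[0]
    return '_'.join(base.split('_')[:-2])

missing_ids = {id_from_path(fp) for fp in missing_fps}
matched_ids = missing_ids & set(recovered_fps)
print(f'{len(matched_ids)} footprint IDs match exactly')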
Duplicates detected within lists
As seen in the previous comment, we are working with probably 3 unique footprint ID code formats, and 1 of those formats is very different from the others.
Now use Robyn's custom functions to extract the ID code of the footprint for each IWP file that is missing a footprint. These functions are from here.
def get_base_name(path):
"""
Get the base name of a file, without the extension
"""
return os.path.basename(path).split('.')[0]
def id_from_input_path(input):
"""
Get just the IWP file 'ID' code from the full path name that
includes a two-part suffix
"""
input = get_base_name(input)
parts = input.split('_')
parts = parts[:-2]
input = '_'.join(parts)
return input
# for each filepath that is missing a footprint, extract the 'ID' code
missing_fp_IDcodes = [id_from_input_path(missing_fp) for missing_fp in missing_fps]
missing_fp_IDcodes[0]
Iterate through the ID codes of the files that were missing footprints, and create a list of files missing footprints to keep because their footprints have been recovered.
keep = []
for missing_fp_IDcode in missing_fp_IDcodes:
for recovered_fp in recovered_fps:
if recovered_fp == missing_fp_IDcode:
keep.append(missing_fp_IDcode)
print(f'{len(keep)} should be kept as input because their footprint was recovered.')
This is the same number of matches identified by the previous method above! It's great when 2 different approaches return the same result 🎉 Now that we have double-confirmed that there are duplicates, let's identify an example of those duplicate files.
Footprint ID codes are duplicates, subdirs for shp files are not
Identify some footprint ID codes that are present in multiple files that lack footprints:
from collections import Counter
duplicate_fp_ids = [item for item, count in Counter(missing_fp_IDcodes).items() if count > 1]
print(duplicate_fp_ids[0:5])
['WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001', 'WV03_20160721024007_10400100209FE300_16JUL21024007-M1BS-500854844010_01_P001', 'WV03_20160803025042_104001001F527400_16AUG03025042-M1BS-500849252010_01_P001', 'WV02_20200907013951_10300100AB009600_20SEP07013951-M1BS-504694393080_01_P002', 'WV02_20160729025332_103001005A5F0300_16JUL29025332-M1BS-500849766040_01_P001']
Find the full filepaths of one example of 2 files that have the same footprint ID code:
known_dup = 'WV02_20180812024422_103001007FB6CB00_18AUG12024422-M1BS-502522698100_01_P001'
ex_dup_missing_fps = []
for missing_fp in missing_fps:
if known_dup in missing_fp:
ex_dup_missing_fps.append(missing_fp)
These two filepaths differ in the subdir that comes after russia:
Let's check out another pair of files that have matching footprint ID codes to see if they differ in the same location in the filepath:
These two filepaths also differ in the subdir following the region, this time canada instead of russia:
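A small sketch (not from the original comment) of how the differing subdirectory can be located programmatically, assuming ex_dup_missing_fps holds the pair of paths found above:
from pathlib import PurePosixPath

def differing_parts(path_a, path_b):
    # compare the two paths component by component and return the parts that differ
    return [(a, b) for a, b in zip(PurePosixPath(path_a).parts, PurePosixPath(path_b).parts) if a != b]

print(differing_parts(ex_dup_missing_fps[0], ex_dup_missing_fps[1]))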
Majority of documented shp files that are missing footprints are empty (contain no geometries)
Example plotting a random IWP shapefile with geometries present (not one of the files in the list of files missing a footprint):
import geopandas as gpd
example_shp = gpd.read_file('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/canada/106_107_iwp/WV02_20100713194911_1030010006B35600_10JUL13194911-M1BS-500085170050_01_P006_u16rf3413_pansh/WV02_20100713194911_1030010006B35600_10JUL13194911-M1BS-500085170050_01_P006_u16rf3413_pansh.shp')
example_shp.plot(figsize=(6,6))
example_shp.head()
Plot two of the shp files listed in the list of files missing a footprint:
List of shp filepaths that are missing footprints, and have been identified as lacking geometries: empty_files_missing_fp.csv
List of shp filepaths that are missing footprints, and were not identified as lacking geometries: nonempty_files_missing_fp.csv
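For reference, a sketch of how these two lists could be produced (an assumed approach, not necessarily the exact code used): read each shp file that is missing a footprint and check whether it contains any geometries.
import geopandas as gpd

empty_files, nonempty_files = [], []
for missing_fp in missing_fps:  # list loaded from files-missing-footprints.json
    gdf = gpd.read_file(missing_fp)
    # route the filepath to the appropriate list based on whether any rows exist
    (empty_files if gdf.empty else nonempty_files).append(missing_fp)
print(f'{len(empty_files)} empty files, {len(nonempty_files)} files with geometries')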
Compare list of shp files missing footprints that are NOT empty to list of recovered footprints
Spoiler alert: the following code shows there are duplicates in the list of recovered footprints
import pandas as pd
import geopandas as gpd
import os
import re
from collections import Counter
# read in shp files missing footprints that are NOT empty
nonempty_files_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/exported_lists/nonempty_files_missing_fp.csv', header = None)
# convert column to list
nonempty_files = nonempty_files_csv[0].tolist()
# 812 shp files
# read in csv with RECOVERED footprints from Elias
recovered_fp_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/imported_lists/recovered_footprints.csv', header = None)
# convert column to list
recovered_fps = recovered_fp_csv[0].tolist()
# 1242 footprint files, some of these are for empty shapefiles tho,
# so we need to filter this list for just the recovered fps that apply to shp files with geometries
# define Robyn's functions to extract the ID code from a IWP file
def get_base_name(path):
"""
Get the base name of a file, without the extension
"""
return os.path.basename(path).split('.')[0]
def id_from_input_path(input):
"""
Get just the IWP file 'ID' code from the full path name that
includes a two-part suffix
"""
input = get_base_name(input)
parts = input.split('_')
parts = parts[:-2]
input = '_'.join(parts)
return input
# for each filepath that is missing a footprint and has geometries, extract the 'ID' code
missing_fp_IDcodes = [id_from_input_path(nonempty_file) for nonempty_file in nonempty_files]
# still 812 shp files
# iterate thru the footprint ID codes of the shp files that are missing footprints and have geoms,
# and create a list of those to keep bc their footprints have been recovered
keep = []
for missing_fp_IDcode in missing_fp_IDcodes:
for recovered_fp in recovered_fps:
if recovered_fp == missing_fp_IDcode:
keep.append(missing_fp_IDcode)
# 1085 footprints were recovered for shp files that have geoms and are missing footprints
# since 1085 > 812, there are duplicates in one list or both lists
# convert shp file list to set to check for duplicates
missing_fp_IDcodes_set = list(set(missing_fp_IDcodes))
# len(missing_fp_IDcodes_set) = 812 so there are no duplicates in missing_fp_IDcodes
# convert recovered footprints list to set to check for duplicates
unique_recovered_fps = list(set(recovered_fps))
# len(unique_recovered_fps) = 949 so there were 1242 - 949 = 293 duplicates in list of recovered footprints
Compare list of shp files missing footprints that are NOT empty to list of recovered footprints with duplicates removed
import pandas as pd
import geopandas as gpd
import os
import re
import json
# read in shp files missing footprints that are NOT empty
nonempty_files_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/exported_lists/nonempty_files_missing_fp.csv', header = None)
# convert column to list
nonempty_files = nonempty_files_csv[0].tolist()
# 812 shp files
# read in csv with RECOVERED footprints from Elias
recovered_fp_csv = pd.read_csv('/u/julietcohen/shapefiles_cleaning/imported_lists/recovered_footprints.csv', header = None)
# convert column to list
recovered_fps = recovered_fp_csv[0].tolist()
# 1242 footprint files, some of these are for empty shapefiles, and 293 are duplicates,
# so we need to filter this list for unique values, then filter for
# just the recovered fps that apply to shp files with geometries
recovered_fps_unq = list(set(recovered_fps))
# 949 files
# define Robyn's functions to extract the ID code from a IWP file
def get_base_name(path):
"""
Get the base name of a file, without the extension
"""
return os.path.basename(path).split('.')[0]
def id_from_input_path(input):
"""
Get just the IWP file 'ID' code from the full path name that
includes a two-part suffix
"""
input = get_base_name(input)
parts = input.split('_')
parts = parts[:-2]
input = '_'.join(parts)
return input
# for each filepath that is missing a footprint and has geometries, extract the 'ID' code
missing_fp_IDcodes = [id_from_input_path(nonempty_file) for nonempty_file in nonempty_files]
# 812 shp files still
# iterate thru the ID codes of the files that were missing footprints and have geoms,
# and find the matching records of the ID's of recovered footprints
matching_recs = []
for missing_fp_IDcode in missing_fp_IDcodes:
for recovered_fp_unq in recovered_fps_unq:
if recovered_fp_unq == missing_fp_IDcode:
matching_recs.append(missing_fp_IDcode)
# 812 shp files should be kept as input because their footprint has been recovered and they contain geoms
# this makes sense, because every unique fp that was recovered matches a shp file with geometries
# that is missing a fp
# the shp files that contain any of the strings in `matching_rec` are the files
# we want to retain in `files-missing-footprints.json`, all other files should be removed
shp_files_keep = []
for nonempty_file in nonempty_files:
for matching_rec in matching_recs:
if re.search(matching_rec, nonempty_file):
shp_files_keep.append(nonempty_file)
# 812 shp filepaths to keep
# convert this list of files to keep into files to remove
# by removing all those filepaths from the original list of shp files
# that are missing footprints, some of which have no geometries
# read in list of ALL files that are missing footprints, some of which are empty
with open('/u/julietcohen/shapefiles_cleaning/imported_lists/files-missing-footprints.json', 'r') as f:
all_missing_fps = json.load(f)
# iterate thru the original list of ALL filepaths that were missing footprints,
# regardless if they have geometries,
# and remove the filepaths for which footprints have been recovered
# so this subset list are the files we should remove from the inputs for IWP workflow
files_to_remove = all_missing_fps.copy()
for missing_fp in all_missing_fps:
for shp_file_keep in shp_files_keep:
if shp_file_keep == missing_fp:
files_to_remove.remove(shp_file_keep)
print(f'{len(files_to_remove)} should be removed as input because they either lack geometries or the footprint was not recovered.')
Output:
If the most recent workflow makes sense to @robyngit and/or Elias, I would like to close this issue. Aspects of this issue that made it confusing are:
This makes me think we need to re-evaluate how we create the footprint IDs from the shp file names. For example, instead of using a subset of the shp file name, can we use the entire shp file name? To discuss with Robyn before the next IWP run from start to finish.
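As a hypothetical sketch of that idea (the names below are made up for illustration): key each footprint by the full shp base name instead of a truncated ID, so the lookup from input file to footprint is unambiguous.
import os

def full_id(path):
    # the entire base filename, extension stripped
    return os.path.basename(path).split('.')[0]

# recovered_footprint_paths is a hypothetical list of footprint file paths
footprint_lookup = {full_id(fp): fp for fp in recovered_footprint_paths}
iwp_file = '/path/to/some_iwp_file.shp'  # hypothetical input path
footprint = footprint_lookup.get(full_id(iwp_file))  # None if no footprint exists for this file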
Closing this issue because other team members are going to re-process the IWP files and restructure the footprints directory to match those new files. This issue is only applicable to the current IWP dataset, which will likely not be used in the future.
The IWP workflow will run more efficiently in the future if we remove empty files before processing.
In searching for footprint files that were missing, @eliasm56 found that some files are empty or otherwise should not be included in the workflow. He said:
We should remove these empty or unnecessary files from the input directory on delta. The first step is to come up with a list of files to remove. The files to remove are those that are contained in the files-missing-footprints.json list but not contained in the recovered_footprints.csv list.
Related to PermafrostDiscoveryGateway/pdg-portal#24
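A minimal sketch of that rule (the file names are the ones attached above; the substring matching mirrors the exploration in the comments and is a simplification of the final script):
import json
import pandas as pd

with open('files-missing-footprints.json') as f:
    missing = json.load(f)
recovered = pd.read_csv('recovered_footprints.csv', header=None)[0].tolist()

# keep a file only if none of the recovered footprint IDs appear in its path
to_remove = [fp for fp in missing if not any(rec in fp for rec in recovered)]
print(f'{len(to_remove)} files to remove')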