[WIP] Add local blur ("blob") augmentations #18
Conversation
Pass the data-erasing config to the training routine. Example:

    erasing_config = {"probability": 0.75,
                      "threshold": threshold,
                      "lim_blob_depth": [5, 15],
                      "lim_blob_width": [5, 15],
                      "lim_blob_height": [5, 15],
                      "lim_gauss_diffusion": [1, 6],
                      "verbose": False,
                      "save_path": "../../trash_bin/",
                      "num_steps_save": 1000}

Make sure the config is consistent with the batch size. For that reason, check_data_erasing_config can be called before the training routine. Example:

    try:
        check_data_erasing_config(test_batch.numpy()[0], **erasing_config)
    except (IncorrectLimits, IncorrectType, OSError) as internal_error:
        print(internal_error)
        raise
    except Exception:
        print("an unexpected error happened during the data erasing config check")
        raise
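For context, here is a self-contained sketch of that check. The import path, the concrete threshold value, and the dummy input shape are assumptions made for illustration and are not taken from this PR:

```python
import numpy as np

# Assumed import path; the module may still be renamed within this PR.
from elektronn3.data.data_erasing import (
    check_data_erasing_config, IncorrectLimits, IncorrectType
)

erasing_config = {
    "probability": 0.75,
    "threshold": 0.5,  # placeholder; the original example uses a pre-defined `threshold` variable
    "lim_blob_depth": [5, 15],
    "lim_blob_width": [5, 15],
    "lim_blob_height": [5, 15],
    "lim_gauss_diffusion": [1, 6],
    "verbose": False,
    "save_path": "../../trash_bin/",
    "num_steps_save": 1000,
}

# Dummy (C, D, W, H) sample standing in for test_batch.numpy()[0].
dummy_sample = np.zeros((1, 48, 96, 96), dtype=np.float32)

try:
    check_data_erasing_config(dummy_sample, **erasing_config)
except (IncorrectLimits, IncorrectType, OSError) as err:
    print(err)
    raise
```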
@ravil-mobile Please look into the review comments.
elektronn3/training/trainer.py
Outdated
# since the new raw data batch is not pinned
inp = deepcopy(inp)

apply_data_erasing(batch=inp.numpy()[0], **self.data_erasing_config)
We really shouldn't do any augmentations directly during the training loop. It slows down training and can lead to accidental gradient computation. All augmentations should be done in Dataset classes (like PatchCreator in our case) so they can be automatically offloaded to DataLoader background processes and don't have to use PyTorch tensors. See http://pytorch.org/docs/master/data.html.
To be clear, the proposed non-geometric augmentations should be performed on the inp numpy array after the normalization step here:
elektronn3/elektronn3/data/cnndata.py, line 193 in cacafe9:
inp = (inp - self.mean) / self.std
PyTorch tensors should not be involved in augmentations.
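A minimal sketch of what that could look like in a Dataset-style __getitem__; the apply_random_blurring name follows the function introduced later in this PR, but its import path and the surrounding code are assumptions:

```python
import numpy as np
import torch

# Assumed import path; the module may still be renamed within this PR.
from elektronn3.data.data_erasing import apply_random_blurring

def getitem_sketch(raw_patch: np.ndarray, mean: float, std: float, erasing_config: dict = None):
    """Sketch of a Dataset.__getitem__: augment the numpy array first, convert to a tensor last."""
    inp = (raw_patch.astype(np.float32) - mean) / std  # existing normalization step (pure numpy)
    if erasing_config is not None:
        apply_random_blurring(inp, **erasing_config)   # in-place augmentation on the numpy array
    return torch.as_tensor(inp)                        # no PyTorch tensors before this point
```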
I totally agree with that. The problem is that I haven't figured out where exactly to insert my augmentation. Whenever I look at the DataLoader and the code related to it, I end up in the "guts" of PyTorch, which is not the right place to insert my function. Please propose where exactly it should go.
As I wrote above, it should be performed directly after the input normalization here:
elektronn3/elektronn3/data/cnndata.py, line 193 in cacafe9:
inp = (inp - self.mean) / self.std
elektronn3/training/trainer.py
Outdated
# to avoid messing up samples within the data loader
# WARNING: possible it slows down the performance
# since the new raw data batch is not pinned
inp = deepcopy(inp)
If you want to copy a tensor x, use x.clone(). That's not necessary here though, because augmentations shouldn't be performed here anyway (see the comment below).
elektronn3/data/blob_generator.py
Outdated
class BlobGenerator:
    """ A class instance generates blobs with arbitrary
    spacial size and location within specified domain.
- *spatial
- *the specified
- "domain" is ambiguous. Perhaps something like "coordinate bounds"?
elektronn3/data/blob_generator.py
Outdated
class BlobGenerator:
    """ A class instance generates blobs with arbitrary
    spacial size and location within specified domain.
    The domain size is usually the special size of the batch.
*spatial size of the input sample. We are not dealing with batches here.
elektronn3/data/blob_generator.py
Outdated
""" A class instance generates blobs with arbitrary | ||
spacial size and location within specified domain. | ||
The domain size is usually the special size of the batch. | ||
The user is responsible to pass correct parameters. |
Since this class is only internally used (and a programmer is kind of always responsible for passing correct parameters), I think this line can be deleted.
elektronn3/data/data_erasing.py
Outdated
blob = generator.create_blob()

for k in range(blob.z_min, blob.z_max + 1):
(Low priority:) Avoid nested for loops. You can rewrite them with itertools.
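For illustration, a sketch of collapsing the three nested ranges with itertools.product; the Blob namedtuple here is only a stand-in for the PR's blob object:

```python
from itertools import product
from collections import namedtuple

Blob = namedtuple("Blob", "z_min z_max x_min x_max y_min y_max")  # stand-in for the PR's blob class
blob = Blob(0, 2, 10, 12, 20, 22)

# One flat loop over every voxel coordinate in the blob's bounding box,
# replacing three nested for loops.
for k, i, j in product(range(blob.z_min, blob.z_max + 1),
                       range(blob.x_min, blob.x_max + 1),
                       range(blob.y_min, blob.y_max + 1)):
    pass  # process voxel (k, i, j)
```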
elektronn3/data/data_erasing.py
Outdated
if np.random.rand() > probability:
    return

channels, batch_depth, batch_width, batch_height = batch.shape
As noted above, none of these values are related to a batch, so the names should be updated.
elektronn3/data/data_erasing.py
Outdated
    return

channels, batch_depth, batch_width, batch_height = batch.shape
batch_volume = batch_depth * batch_width * batch_height
(Low priority:) volume = np.prod(inp.shape[1:]) (given that batch is renamed to inp).
elektronn3/data/data_erasing.py
Outdated
          blob.x_min:blob.x_max + 1,
          blob.y_min:blob.y_max + 1]

diffuseness = np.random.randint(low=lim_gauss_diffusion[0],
(Low priority:) IMO "diffuseness" is not an optimal name for the sigma parameter of a gaussian blur filter. I'd just call it something like gaussian_sigma or gaussian_std. But that's just my opinion.
elektronn3/data/data_erasing.py
Outdated
print("erased percentage for channel (%i): %f" % | ||
(batch_indx, erasing_percentage)) | ||
|
||
if save_path and num_steps_save: |
This shouldn't be inside the main function. If you still need this, you can put it into its own utility function.
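As an illustration of that split, the saving could live in a small helper like the one below; the function name, the HDF5 format, and the file naming are assumptions, not the PR's actual code:

```python
import os
import h5py
import numpy as np

def save_augmented_volume(inp: np.ndarray, step: int, save_path: str, num_steps_save: int) -> None:
    """Dump the augmented volume to disk every num_steps_save steps (debugging aid only)."""
    if save_path is None or num_steps_save is None or step % num_steps_save != 0:
        return
    os.makedirs(save_path, exist_ok=True)
    with h5py.File(os.path.join(save_path, f"erased_step{step}.h5"), "w") as f:
        f.create_dataset("raw", data=inp, compression="gzip")
```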
elektronn3/data/data_erasing.py
Outdated
# Check the user's specified blob size.
# First entry of each list must be less than the second one
if lim_blob_depth[0] >= lim_blob_depth[1]:
Can you please squash those checks (until line 214) into one or two checks and report all dimensions in case of an error? IMO it would even be fine to replace all these verbose checks with assertions. If any of the assertions fail, you will still know where they occur and can debug them properly.
Another idea would be to use relative size limits in the augmentation function (relative to the input size), instead of specifying fixed sizes. That would have the benefit that all those checks will no longer be needed and you won't have to re-think those sizes when varying input patch sizes in PatchCreator. A sketch of that idea follows below.
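To sketch the relative-limit idea (the helper name and the fractions are hypothetical, not from this PR):

```python
import numpy as np

def absolute_blob_limits(spatial_shape, rel_limits=(0.05, 0.2)):
    """Convert relative blob size limits (fractions of the input shape) to per-axis voxel limits.

    spatial_shape: (D, H, W) of the input patch; rel_limits: (min_frac, max_frac).
    Hypothetical helper illustrating the reviewer's suggestion.
    """
    lo, hi = rel_limits
    return [(max(1, int(lo * s)), max(1, int(hi * s))) for s in spatial_shape]

# Example: for a (48, 96, 96) patch this yields per-axis (min, max) blob extents.
print(absolute_blob_limits((48, 96, 96)))  # [(2, 9), (4, 19), (4, 19)]
```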
elektronn3/data/data_erasing.py
Outdated
          blob.x_min:blob.x_max + 1,
          blob.y_min:blob.y_max + 1]

diffuseness = np.random.randint(low=lim_gauss_diffusion[0],
Why is this an integer? That looks like an arbitrary limitation for the sigma parameter of gaussian blurring. It should be a random float value.
It should also make use of scipy.ndimage.gaussian_filter's support for per-axis sigma values: you can enhance the versatility of the blurring effect by passing a sequence of three independent random values as the per-axis sigma.
Most importantly, the depth axis should optionally take input data anisotropy into account (which is very common in 3D data sets, including ours). This should be controlled by an additional aniso parameter in which the user can specify per-axis anisotropy, or at least a z-axis anisotropy factor like in:
elektronn3/elektronn3/data/transformations.py, lines 433 to 435 in a18301d:
aniso_factor: float
    Anisotropy factor that determines an additional scaling in ``z``
    direction.
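A sketch of per-axis random float sigmas with an anisotropy factor; the function and parameter names are illustrative, not the PR's final API:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_local_blur(region: np.ndarray, sigma_range=(1.0, 6.0), aniso_factor: float = 2.0) -> np.ndarray:
    """Blur a (D, H, W) region with independent per-axis sigmas drawn as floats.

    The z (depth) sigma is divided by aniso_factor so that anisotropic data
    (coarser sampling along z) is not over-blurred along the depth axis.
    """
    sigma = np.random.uniform(sigma_range[0], sigma_range[1], size=3)
    sigma[0] /= aniso_factor                      # depth axis
    return gaussian_filter(region, sigma=sigma)

blurred = random_local_blur(np.random.rand(8, 32, 32).astype(np.float32))
```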
examples/train_unet_neurodata.py
Outdated
iterator = st.loader.__iter__()
test_batch, test_target = iterator.__next__()

try:
You can just remove the try/except block because it exits the process anyways. check_data_erasing_config(test_batch.numpy()[0], **erasing_config) is enough.
examples/train_unet_neurodata.py
Outdated
)
st.train(max_steps)

iterator = st.loader.__iter__()
Don't use double underscore methods, they are meant to be private. In this case, what you want is test_inp, _ = next(iter(st.loader)).
You don't actually need to load an example batch to find out the values that are relevant for your check. You can find the spatial input shape at
'patch_shape': (48, 96, 96),
Please replace the batch parameter in check_data_erasing_config() with a patch_shape parameter (spatial_shape could be a better name though) and use the shape from the data_init_kwargs dict for it.
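A sketch of the suggested call; it assumes check_data_erasing_config is changed to accept a spatial_shape tuple and reuses the erasing_config defined earlier in the training script:

```python
# data_init_kwargs as defined in examples/train_unet_neurodata.py (only the relevant key shown).
data_init_kwargs = {'patch_shape': (48, 96, 96)}

# Hypothetical signature after the suggested change: a shape tuple instead of an example batch.
check_data_erasing_config(spatial_shape=data_init_kwargs['patch_shape'], **erasing_config)
```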
examples/train_unet_neurodata.py
Outdated
test_batch, test_target = iterator.__next__()

try:
    check_data_erasing_config(test_batch.numpy()[0], **erasing_config)
It's less than ideal that this check is performed inside of a user's training script. Once you've addressed the comments above, you should move it to an elektronn3-internal module and perform the check automatically from PatchCreator if the new augmentations are used.
examples/train_unet_neurodata.py
Outdated
erasing_config = {"probability" : 0.75,
                  "threshold": threshold,
                  "lim_blob_depth": [5, 15],
The default depth should be halved because of the anisotropy of neuro_data.
73a554e: Please don't simultaneously rename files and make major changes to them in one commit, because this messes with the git diffs (making all review comments "outdated"). Renaming should be in a separate commit so that git can always understand that files were only moved/renamed, not deleted/created. You can check if renaming was correctly detected by git by checking
Edit: I just tried to fix the review by force-pushing a modified version of 73a554e, but it's still broken (showing unchanged parts as "outdated").
#18 This commit is the same as 73a554e, but without the data_erasing renaming. Co-authored-by: mdraw <[email protected]>
Force-pushed from b9208e1 to 38fdafa.
elektronn3/data/data_erasing.py
Outdated
class ScheduledParameter(object):
    """ The class is responsible for a parameter scheduling along an iterative
    process according to either the linear or exponential growth. The user
    specifies the initial value, the target one, growth type and the number
*target one -> maximum value
(because that's closer to the parameter name; "target" can be confused with training targets)
elektronn3/data/data_erasing.py
Outdated
process according to the exponential law. The user specifies
the initial value, the target one and the number of steps
along which the variable will be gradually scaled.
class ScheduledParameter(object):
Parameter is even more confusing than Variable, because "parameter" generally refers to learnable weights when talking about neural networks. How about just Scheduler or ScalarScheduler?
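For illustration only, a compact sketch of what a ScalarScheduler interface could look like (linear growth only; everything beyond the suggested class name is an assumption):

```python
class ScalarScheduler:
    """Linearly grows a scalar from `value` to `max_value` over `interval` steps.

    If max_value or interval is None, the value stays constant.
    """
    def __init__(self, value, max_value=None, interval=None):
        self.value = value
        self.max_value = max_value
        self.step_size = ((max_value - value) / interval
                          if max_value is not None and interval is not None else 0.0)

    def step(self):
        if self.step_size and self.value < self.max_value:
            self.value = min(self.value + self.step_size, self.max_value)
        return self.value

sched = ScalarScheduler(0.1, max_value=0.5, interval=4)
print([round(sched.step(), 2) for _ in range(6)])  # [0.2, 0.3, 0.4, 0.5, 0.5, 0.5]
```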
elektronn3/data/data_erasing.py
Outdated
the initial value, the target one and the number of steps
along which the variable will be gradually scaled.
class ScheduledParameter(object):
    """ The class is responsible for a parameter scheduling along an iterative
*Scheduler for a scalar value.
elektronn3/data/data_erasing.py
Outdated
along which the variable will be gradually scaled.
class ScheduledParameter(object):
    """ The class is responsible for a parameter scheduling along an iterative
    process according to either the linear or exponential growth. The user
*the linear -> linear (without "the")
elektronn3/data/data_erasing.py
Outdated
If the user doesn't specify the target value or the interval the variable
works as a constant
If the user doesn't specify the target value or the interval,
the parameter works as a constant
*Missing "." at the end of the sentence.
elektronn3/data/data_erasing.py
Outdated
""" Prints the current value of the parameter on the screen | ||
during an iterative process. The function counts number of | ||
step() calls and prints information each time when the number | ||
of the calls is even with respect to steps_per_report |
*"is divisible by ``steps_per_report``."
elektronn3/data/data_erasing.py
Outdated
def __eq__(self, other):
    return self.value == other

If the used doesn't pass the number of steps_per_report the function
*user
Avoid negative statements "if not a, then not b". Write it like
"""Logs information every ``steps_per_report`` steps (if ``steps_per_report`` is set)."""
elektronn3/data/data_erasing.py
Outdated
steps_per_report - int
    number of step between information update on the screen
The special case steps_per_report = None should be mentioned here, especially since it's the default.
elektronn3/data/data_erasing.py
Outdated
    """
    def __init__(self, value, max_value=None, interval=None, steps_per_report=None):

        logger = logging.getLogger('elektronn3log')
Don't tie the logger to a specific class. Use a module-level logger instead, like in all other modules of elektronn3 that perform logging.
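For reference, a minimal sketch of the module-level pattern; the 'elektronn3log' logger name follows the code excerpt above, the class body is only illustrative:

```python
import logging

# Module-level logger, defined once at the top of data_erasing.py rather than inside a class.
logger = logging.getLogger('elektronn3log')

class ScalarScheduler:
    def step(self):
        logger.info('scheduler step')  # all methods reuse the shared module-level logger
```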
elektronn3/data/data_erasing.py
Outdated
        -------
        None
        """
        if self.steps_per_report:
if self.steps_per_report suggests that it's a bool. This should be if self.steps_per_report is not None. None checks should always be explicit.
We can re-do the renaming of data_erasing.py later, just before merging the PR (I am a little scared that renaming it now will mess up the review again).
one that contained all latest updates of the refactoring
The documentation of the "ScalarScheduler" class was corrected. The "apply_random_blurring" function call was moved from the training function to the "__getitem__" method of the "PatchCreator" class.
for k, i, j in product(range(region.coords_lo[0], region.coords_hi[0] + 1),
                       range(region.coords_lo[1], region.coords_hi[1] + 1),
                       range(region.coords_lo[2], region.coords_hi[2] + 1)):
    intersection.add((k, i, j))
And disable writing augmented files to disk during training