Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy Loading Overhaul #651

Draft
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

buzzingwires
Copy link
Contributor

@buzzingwires buzzingwires commented Jul 4, 2023

Included in this pull request are a number of changes to the way images are loaded by vimiv, with special focus on lazy loading. The code is most certainly not ready to be merged with the upstream project. Tests need to be written for it, the code generally needs to be cleaned up and the style made consistent, and any ideas that don't work well with the vision for upstream vimiv must be removed.

The rationale for these changes is primarily to accommodate the author's niche and specific use case. That is browsing sets of thousands of images, while minimizing memory usage and redundant disk IO. These images are typically sorted externally and fed in through stdin, so their order must be maintained.

The intention was to make all changes opt-in by means of configuration variables, the default values of which would maintain the original behavior of the software. These changes are already seeing use in my personal fork, so I will not be offended if they are turned down.

The following configuration variables are added to change the behavior of the software:

thumbnail.save : If set to False, thumbnails will not be cached to the drive.

thumbnail.max_ahead : Only this number of thumbnails after the current selection will be loaded. Changing the selection will modify the range of loaded thumbnails accordingly. If zero, load unlimited, as is the default.

thumbnail.max_behind : Like thumbnail.max_ahead, but for thumbnails preceding the current selection.

thumbnail.max_count : Thumbnails will not be unloaded unless the number of them exceeds this. Note, that this does not cause thumbnails to be forcibly unloaded, so this variable should probably be renamed. Anyway, set to 0 to immediately unload out of range thumbnails.

image.id_by_extension : If set to True, use the extension to assume the filetype, instead of reading with imghdr. This prevents large sets of images from being all opened when starting.

image.imghdr_fallback : If set to True and an image fails to load by a reader guessed by image.id_by_extension, try using imghdr to find an appropriate reader. In light of the pending replacement of imghdr, this should perhaps be renamed.

sort.image_order and sort.directory_order also now have a passthrough sort option, which will maintain the order that images were provided to vimiv from the command line or stdin. Will probably remove sort.directory_order, since it's unlikely to be very useful. And I don't think it works properly.

If this option is set to False, then do not save generated thumbnails to the disk.
In order to reduce memory usage, the new 'thumbnail.max_ahead'
and 'thumbnail.max_behind' can be used to specify a limit to the
number of thumbnails after the current selection that can be loaded.
Thumbnails outside of this range will be unloaded, and will be loaded
again when they enter the range. Setting either of these variables to
0 will remove the loading limit, to restore previous behavior.
The 'thumbnail.max_count' configuration setting prevents thumbnails
from being unloaded if they are under this count, even if they fall
outside the range specified by 'thumbnail.max_ahead' and
'thumbnail.max_behind'. If this value is 0, it is ignored.
ignored.
For sort 'image_order' and 'directory_order', the passthrough setting
can be used to sort images in the order they were first encountered
by the software, whether passed from the command line, or from
directory monitoring.
vimiv typically uses imghdr to scan input files for valid images.
This potentially results in a great deal of startup disk IO for large
lists of files. When `image.id_by_extension` is enabled, the extension
of the file will naively be believed to represent the correct image
format.
When this option is set to true, if the result of attempting to load
an image using a reader determined by file extension fails, imghdr
will be used in its place as a last ditch effort to find a working
reader.
@karlch
Copy link
Owner

karlch commented Jul 4, 2023

Wow, thanks a bunch for the PR and all the work that went into this!

Would you mind splitting this into four distinct PRs for discussion purposes, as I do see parts being merged quickly, others needing more thoughts / discussion / time?

Some initial thoughts:

  1. thumbnail.save: this would create quite some computational overhead in each step, no? Ages ago in the old version we removed this option with the implementation of the freedesktop.org thumbnail management standard (option to disable thumbnail cache? vimiv#14). The computational overhead is worth avoiding disk writes for you? I suppose the path also changes quite often?
  2. thumbnail.max_behind, thumbnail.max_ahead and thumbnail.max_count: funky, that must have been some fiddling 😄 Selling point is memory usage? If so, I like the idea, but agree with max_count being confusing. If the only goal is memory usage, I don't see why max_count is needed, if in addition it is CPU strain / disk reads, then I need more thinking, but lean towards removing that part.
  3. passthrough sorting setting: can definitely be added, but currently unsure what the added value compared to a simple "none": lambda x: 0 ordering function is. Only use-case I see is keeping the CLI order, like you mention. The others are more or less random for me. If you have additional use-cases, would love to hear them 😊
  4. Loading by extension instead of magic-bytes: as you mention, definitely makes sense to wait for Replace imghdr #650 and then build upon this. I do see value in this as a setting, as the disk-reading strain has come up before in Slow when opening pictures on slow file systems (e.g., remote file systems) #363, and would love to merge a subsequent clean implementation. Feel free to just open an issue for this one, as the code would unfortunately have to be mostly re-written or at least moved after Replace imghdr #650 😞

@@ -161,14 +164,18 @@ def _on_monitor_fs_changed(self, value: bool) -> None:
def _load_directory(self, directory: str) -> None:
"""Load supported files for new directory."""
self._dir = directory
self._images, self._directories = self._get_content(directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you remove the assignment of self._directories by purpose? For me (i.e. without modifying the config or anything), this leads to no directories getting listed in the library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reproduced your issue- this was most-certainly a careless deletion on my part. Thank you for catching it. Of all the changes, those to 'working_directory.py' will probably need the most careful scrutiny. For whatever reason, I had the most difficult time here.

@buzzingwires
Copy link
Contributor Author

buzzingwires commented Jul 4, 2023

Sure thing! And, I don't mind at all. The whole point of this draft pull request was to gauge sentiments towards these changes and figure out how to structure them.

  1. My reasoning is a partly that espoused in Add option to hide library at startup #14. To avoid SSD writes, and also a preference for avoiding an endlessly growing thumbnail cache. Using freedesktop.org's thumbnail standard was definitely the right idea, in my opinion. However, the situation described where other software inevitably generates thumbnails is not the case for me, probably because I rarely use graphical file managers.

    My answer to the computational overhead question is to use the max_behind, max_ahead and max_count settings being used to only load images that the user is presently concerning themself with, rather than loading every image listed. These should be able to be tuned such that the next images are loaded by the time the user manages to scroll to them.

  2. The idea behind max_count as it is tentatively named, is to reduce overhead in the situation of the user jumping to an arbitrary position in the thumbnail list, then jumping back to another. Both actions might put the view out of range of max_ahead and max_behind, causing the files to loaded and unloaded in quick succession. max_count allows for some out of range images to remain in memory. So, the goal is to allow some flexibility in memory conservation vs CPU usage.

  3. Indeed, the passthrough setting is pretty much exclusively to preserve command line order. I'm sorry, I'm a little confused what you mean by the others seeming random to you.

    The approach of using lambda x: 0 is better than a special condition to skip the sorting process, but in order for it to properly work in the even that the sort order is changed, a record of the command line order must be maintained.

  4. Got it. I'll hold off until Replace imghdr #650 is rolled out, and adapt my changes to that.

@karlch
Copy link
Owner

karlch commented Jul 4, 2023

Thanks for your explanations!

  1. Makes sense, happy to merge the option then, but should definitely be false at default (as you nicely did). In any case I would still keep the max_* settings in a separate PR, as those probably involve more review and discussion.
  2. Understood, and sounds good to me keeping all of them. This one is probably the most involved to test and review.
  3. "Others random" only refers to what happens when I e.g. move to a new directory in the library, or look at the library at all in this case, os.listdir and friends throw files in an order which just seems random to me, no clue how it works under the hood.
    Ahh, now I see why this was added. I personally don't think this additional boilerplate is worth it though. If one wants the startup "unsorted", just set the setting upon startup. I would assume you always have this one set to passthrough or none by default?
  4. Perfect 👍

@buzzingwires
Copy link
Contributor Author

buzzingwires commented Jul 4, 2023

Sure thing! I have nothing to add concerning points 1, 2, and 4.

For number 3 yeah, I see what you mean now. According to the Python docs, os.listdir is an unspecified order, and I'd speculate it's probably dependent on whatever interface the operating system provides to list a directory.

And yeah, I would have passthrough/none set by default. Arguably, none/passthrough should just naively take whatever existing order, so it'd make sense for changing the sort mode to wipe out the original command line order. With the present behavior, it could be called commandline rather than none or passthrough. Part of me thinks it'd be more consistent and flexible to have a sorting option that restored the original order, but its use would indeed be fairly incidental and its implementation pretty messy. I can totally understand a decision not to incorporate it.

A tangentially-relevant question to the matter of thumbnail loading: Do you know if there are any memory leaks associated with creating QIcon, QPixmap, and QListWidgetItem? When scrolling to the end and then to the beginning again with the max_ahead and max_behind settings, it seems that at least for me, the memory usage will slowly climb upwards as QIcons are created and assigned to ThumbnailItems. It probably isn't too much of an issue, but it's frustrating to see that something is being allocated and not freed, and not being certain of what that is, and I've tried various methods to 'ensure' objects are deallocated, but to no avail.

@karlch
Copy link
Owner

karlch commented Jul 5, 2023

While I agree that the option to reset the sorting would be more consistent, I would definitely argue that the added complexity is not worth it. Yet another option could be calling it unchanged indicating we don't touch whatever feeds in - or exists already.

I don't know of any, but am certainly not an expert on the matter. Would be a bit surprising to me though if this is an upstream issue. I can reproduce this, but playing around with the memory usage on master just jumping around a few folders I found the behaviour to be quite unpredictable (does increase / decrease when moving to folders with more / less images, but nowhere near the amount I would have expected, sometimes goes in the opposite direction, havoc). Not sure how simple of an issue this is 🤔

@buzzingwires
Copy link
Contributor Author

It really is an edge case with a relatively complex implementation that goes against how Vimiv generally handles arguments. So, I could also imagine it potentially being a pain to work around in the future. Let's just do the sorting option without keeping a record of the initial order, then?

Part of me just wants to keep the name of the sorting mode simple with 'none'. It seems like the first thing that'd come to mind with most people? But, I will defer to you, if you have a preference there.

Concerning the 'memory leak', I honestly was imagining this might be something from the C++ side of QT, or perhaps it's a quirk in how Python's GC works. It doesn't seem like something in vimiv itself, though I could be wrong on all those counts. This may be one of those behaviors that is technically a feature, that there's some reasoning for this extra memory being allocated. Anyway, I think it's a little quirk to keep an eye on.

Finally is the name of thumbnail.max_count. Perhaps it should be something like... thumbnail.unload_threshold

And otherwise, this should be four separate pull requests for

  • thumbnail.save

  • thumbnail.max_ahead, thumbnail.max_behind, thumbnail.max_count

  • image.id_by_extension, image.imghdr_fallback

  • sort.image_order, sort.directory_order

If this all looks good, I think I have what I need to move forward and will close this draft. Thank you for considering my ideas and taking the time to talk them out with me.

@karlch
Copy link
Owner

karlch commented Jul 5, 2023

Perfect, sounds good to me 👍 I am also totally fine with naming it none then.

Definitely worth keeping an eye on the memory issue, especially if this is an area where you have expertise. Personally, I probably won't be able to spend much time and extra thoughts on it given the general time limitations. Happy to help if anything specific comes up though of course.

The four PRs sound good, although the image.id_by_extension part may be better suited as an issue waiting for the mentioned PR to be merged. So three for now.

Concerning the naming of thumbnail.*, I think it would be nice if they had some common base that clarified them. We can discuss this in detail in the specific PR, but something along load_before, load_after and load_max maybe? While nit-picking, I think 0 should mean 0, and -1 = no limit (default). Was going to defer this to the PR, but while we are at it 😊

Thanks for your ideas and taking your time to share and discuss them!

@jcjgraf
Copy link
Contributor

jcjgraf commented Jul 5, 2023

I am a bit late to the party (did not have time yesterday to answer), but I also want to add a few comments 😊

  1. I get your issue with the thumbnail. But some things are not yet completely clear to me, respectively I have the feeling that there is potentially a better way to implement that.
  • Why exactly is there a distinction between max_behinde and max_before? Is that really needed?

  • If I get you correctly, the intention off max_count is that the thumbnails generated while jumping through the library and finding the correct set of images, get eventually removed again (once enough correct images were loaded). Does it also delete already generated thumbnails from disk?

  • Do you intent to change these option during runtime. E.g. once you identified the right location, increase max_before/after?

  • I find the max_count option still a bit weird, as it does not consider the number of thumbnails generated and saved by vimiv in total, or thumbnails generated in different directories.

I have also an issue with the way thumbnails are generated. While my problem is different from yours, my solution may also (partially) solve your issue (or a combination of both solutions). I deal with many RAW images. As the extraction of the embedded thumbnail needs quite some time, and as vimiv generates previews starting from the top, it takes an eternity (up to several minutes) till the thumbnails of the last images, which are the ones I am concerned with (when dealing with an SD card, where the newest images are by default at the bottom), are loaded.

While your solution may (partially) solve the my issue, I still want that in the end, all thumbnails are generated.

To solve my issue, I thought that it would be nice if priority is put onto the thumbnails displayed in the current view. As soon as the view changes, the new set of images is prioritized. Once all images in the view are generated, the vimiv deals with the generation of all other, not yet generated, thumbnails (in a somehow "smart" (load what is probably used next) way).
This behaviour would be inline how it is done in other major image viewer/editing software.

How this addresses your issue. If we add a boolean setting that specifies that only images in the current view should be loaded, it would kind of have the effect you wanted to achieve with max_before/after. It may remove some flexibility (i.e. being able how many images around the selected one should be loaded), but I think it makes the intend of this option much cleaner and applicable to a wider user-base.

The max_count option can also be added to this solution.

Would that also solve your issue?

  1. I have also thought about an option to let vimiv exclusively consider the file extension; If I have a SD card with a few thousand images of a single type, such an elaborated magic byte check is really not needed.
    You will be amazed how easy this will be to add to the new implementation. This option comes pretty much for free 😊

@buzzingwires
Copy link
Contributor Author

Alright, I'll try to respond to everything roughly in order.

@karlch

The sort type will be named none, then!

I don't know if my expertise is QT5's memory allocation in particular, but I'll most certainly keep it in the back of my head, and let you know if I come up with any tangible leads on it.

Definitely still waiting for the imghdr replacement for image.id_by_extension

The changes to thumbnail.* make sense, and I only have one issue. That is load_max might not precisely describe what the option does, because thumbnails that exceed this number are not automatically unloaded if they fall into the range specified by load_before and load_ahead. Rather, images that fall outside of the range will not be unloaded, unless there are more images than that number. That's why I considered unload_threshold.

Anyway! Now to @jcjgraf

Sorry, I might be missing something. I don't think there will be two options to that effect. Originally, they were max_ahead, max_behind, and max_count, though they will be renamed to load_ahead, load_behind, and potentially load_max

Already generated thumbnails will not be deleted from the disk by any of this code. Only potentially unloaded from memory, so they'd need to be reopened, and that can be turned off. One option adds the ability to not write new thumbnails from the disk, so the original will be reopened instead.

For specifics on max_count, maybe the end of my response to karlch would help?

Changing the options during runtime is definitely something I should test, though tentatively, I think they should probably work.

That's a great idea about generating thumbnail files in the background after the current range is displayed, and I'm totally happy to implement it.

Finally, using a single option would be cleaner as you mention. My preference for specifying a number ahead and behind is that it can be tuned to slower hardware, where one would want to preemptively load more thumbnails. Could make max_ahead and max_behind into a single option that uses the same number for either, though.

So to be clear, it should go like this?

Generate and display thumbnails in the range specified by max_ahead and max_behind (or our hypothetical single variable), starting with those that are actually visible. For images outside the range, save the thumbnail files to the disk, if that is enabled (which it is by default).

Definitely happy about the file extension detection. The trickier part of it is trying to use the tentatively named image.imghdr_fallback, where if an image type auto-detected by the extension does not work, then it will try to read the header of the file. Currently, my implementation of this seems to work, but is currently very hacky. This is something that might become clearer when your new implementation comes out, though!

Sorry for the length of this post.

@jcjgraf
Copy link
Contributor

jcjgraf commented Jul 6, 2023

Ah, now I get it. Thanks for taking the time to write such a detailed answers.
Since you seem to want a rather fine-grain control over what is loaded and what not, my proposed solution does not give you the required flexibility. So, do not further bother about what I said.

@buzzingwires
Copy link
Contributor Author

Sure thing! And hey, your feedback still has a lot of great ideas in it. I can definitely incorporate the background thumbnail file generation into my changes, so that the files are still made, even if they aren't loaded yet. I think this should just be added to the thumbnail.save? No need to make another config setting, I think.

On that note, merging max_ahead and max_behind might still be a good idea. I'll defer to @karlch on this, I think, since he already agreed to the first way. I'm not particularly married to the idea of being able to have different amounts loaded ahead and behind, and it would be good to reduce config file complexity.

Will also definitely test changing these settings while the software is running.

@karlch
Copy link
Owner

karlch commented Jul 6, 2023

I agree the points @jcjgraf made are useful, I see two points here:

  • Independent of any changes: generate "visible" thumbnails first, instead of starting at the front. Would not mingle this with any changes, but do this in an additional PR, in case we go for it.
  • Always generate thumbnails, independent of them being loaded in your case. If anything, this requires the max_ahead stuff, not thumbnail.save? Although it should probably check in an earlier stage, if they should be created at all, instead of running through all of them and then discarding, in case thumbnail.save is false 😄 This behaviour would then depend on both. Once again, I would prefer an extra PR, to avoid having monster-PRs 😊

Personally, I won't be using the max_ahead / max_behind stuff. So I also don't really have a preference between one or more settings. Random sample, have you ever set them to different values? 😄

@jcjgraf
Copy link
Contributor

jcjgraf commented Jul 6, 2023

Independent of any changes: generate "visible" thumbnails first, instead of starting at the front. Would not mingle this with any changes, but do this in an additional PR, in case we go for it.

I agree, it is less related than I thought (well, code-wise they are, but not functionality-wise) and both changes are definitely complex enough on their own. I can deal with my idea once your PR is done.

@buzzingwires
Copy link
Contributor Author

I actually think the first point can readily be implemented by changing the sorting of the thumbnail list that's fed into the manager. The machinery to pass thumbnail paths in arbitrary order with their intended index is already there.

What I mean by the second point is, with thumbnail.save, perhaps the files associated with all thumbnails should be created, even if they aren't into the range to be displayed yet. So, it would better accommodate @jcjgraf's use case of slow-loading raw files.

Honestly, I've never tried setting max_ahead and max_behind to different values. I think. Switching the code to a single config value would be trivial.

@karlch
Copy link
Owner

karlch commented Jul 6, 2023

Well, yes and no, depends on how far we want to go. Think of the following process:

  1. vimiv *.jpg or whatever opens a lot of images
  2. Thumbnails are instantly generated from the beginning in the current behaviour
  3. We run G and gt to view the last images .. and wait. The manager is busy generating from the start. So we would have to interact in some way, stopping the process (which is async) and re-start it. This may get out of hand quickly.

If we know in advance, we want to start at the end, passing the images in reverse to image and using the new none sort option would have the same effect.

Anything in between just becomes a messy async communication, not sure we want to go into that rabbit hole.

I will have to take a closer look if the new max_* settings stop this in some way, but I fear anything that involves larger navigation would involve havoc, especially if we want to load "all" but "re-prioritize" on the fly.

EDIT:
forgot, let's go for one setting then instead of two here 👍

@buzzingwires
Copy link
Contributor Author

One config option it is! And, I get what you're saying, now. Basically, off the top of my head, I would probably try to implement the "create all thumbnail files but not load them" functionality by making an option for the thumbnail manager to not send the signal that new images have been loaded. So, the thumbnail will be created on the disk, but the load signal will only be sent when the image comes into range. I think the question to figure out is: What happens the image comes into range while a background thumbnail creation job is running?

Could alternately do it by having the thumbnail GUI component keep track of the paths it has sent for thumbnail file creation. If an image comes into range while its thumbnail is being made, it just lets the old job finish instead of starting a new one. All jobs send the creation signal back, but the signal is ignored unless it's in range.

Yeah, you're right, this is separate pull request material, though. It's not trivial.

@karlch
Copy link
Owner

karlch commented Jul 6, 2023

Definitely extra PR material, agreed 😄

@jcjgraf could you clarify: I expect the extraction takes long, i.e. the generation of the initial QPixmap, loading or writing to disk really is instant afterwards? So we would have to change the order of the actual "generation" process, not just "not load them into the ThumbnailView"?

@jcjgraf
Copy link
Contributor

jcjgraf commented Aug 4, 2023

could you clarify: I expect the extraction takes long,

@karlch Yes, the extraction is what is expensive. I am not at all familiar with this part of the code. I thought it is somewhat related to the generation of the thumbnails, but maybe not. I will have a look at relevant part of the code and propose something, sometimes in the future.

@karlch
Copy link
Owner

karlch commented Aug 5, 2023

Thanks for clarifying, I think the current #665 actually fixes a large part of your problem, as the thumbnails are created from the current position, and any not yet running "creation" threads are actually stopped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants