Guide: Getting started with choosing a Machine Learning CLIP Model for Smart Search #11862
-
Thanks for the write-up! Most of this could be added straight to the docs. Small correction: "efficiency" is related to both the x-axis (MACs) and y-axis (quality). Models that require less number-crunching to get the same quality are more efficient. A general tip to mitigate the search time for larger models is to set the env …
-
Thanks for the writeup! Exactly what I was looking for. FWIW, I have Immich running on a low-power RK3399 machine, so I didn't want to slow things down any further. I know I could potentially use Rockchip hardware acceleration, but it's by far not as straightforward as I'd like, and I'm not at all in the mood for experimenting on this system right now, as it's running a bunch of other services, too. (Also, would HWA affect search speed or just initial processing?) So I started the test drive, coming from …
-
Thanks for this. Also, I did not realize the model was required for individual searches; good to know. Does anyone have any insight into how much better ViT-H-14-378-quickgelu__dfn5b responds over the default in the real world? I know numbers-wise it's a small jump, but do other people find it noticeable?
-
Hey Mertalev/internet stranger, if the content in this post looks good to you, would it be worthwhile to add it to the community guides? https://immich.app/docs/community-guides I would appreciate it if someone would be kind enough to add it on my behalf. I know nothing of software and have never used Git, ever, so it is a bit daunting to try myself.
-
Hey all, thanks for the data provided. I was looking for a good model for German and English usage with "medium" or lower hardware requirements. Based on the benchmarks, my current choice is ViT-L-14-quickgelu.
-
I didn't see this mentioned here, hence adding the comment: use remote machine learning if you have a laptop that is more powerful than the machine running Immich. My MacBook Pro M1 is 35 times faster than my Synology DS918+, for example. I can run a new large model on my entire collection of 50k+ pictures in two days vs. more than two months for the Synology NAS, and that's without hardware acceleration / GPU. I still wish we had a proper OCR job in Immich. CLIP models are just not conceived for recognising text and hence are really bad at it. Ideally, in addition to an OCR job, there would be a way to exclude all text from CLIP, because I've noticed it confuses the model a lot. Pictures with text are often returned for unrelated searches.
-
I am an absolute newbie when it comes to ML, so like many others, I was lost on how to choose a CLIP model. For over a year, I just stuck with the default, mostly because I didn't know what else to choose and because it worked pretty well. But after the recent release of some new models and the post about those being better, I was curious, like many others, about how to proceed. I realized a lot of people were asking questions on the Discord channel, and the dev (mertalev) gave some really helpful advice and information there. Since we don't have an official guide yet, I thought I would curate some of their responses here.
Note: All this information is mostly just a copy/paste or a rephrased version of what I read on Discord. I could be wrong. But hopefully, it helps someone.
Performance Metrics of Different Models 📈
I got these links from a PR (#11468). They contain the performance metrics of many models and should cover most of the models supported by Immich (https://huggingface.co/immich-app).
Monolingual models metrics: https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_retrieval_results.csv
Multilingual models metrics: https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_multilingual_retrieval_results.csv
Easiest way to choose a model
The easiest way to decide which model to choose is to look at the attached interactive plots. I believe these plots were generated using the information from the links listed above. They are in the attached .zip, which contains three .html files that Mertalev shared (GitHub would not let me attach an .html file, so I had to zip them). Download them and open them in any browser. You might need a computer to open them.
clip_models_efficiency_plots.zip
The following are the specs of some of the popular models in tabular form:
Bubble Size -> RAM 💾
The bigger the bubble, the more RAM it's going to consume. Note: The charts don't factor in concurrency (running multiple things at once). At the default setting of 2, you might see a tiny bump in RAM use (like 10-20 MB). Crank it up higher, and you'll notice more of a difference.
Pro tip: Hover over a bubble to get more information about that model, including the RAM.
MACs (x-axis) -> Model Speed 🛩️
If you know FLOPs, MACs is a similar metric. It’s basically how much computation the model needs to do, which relates to how much time the device needs to spend to do them. More MACs = more time your device spends thinking. On a powerful GPU, speed might not be a big deal.
Quality of the search (y-axis)
Higher quality = better search results. As simple as that. I don't know how quality is compared when the different models use different datasets, but it does give a pretty good idea of how a model performs.
Efficiency
"Efficiency" is related to both the x-axis (MACs) and y-axis (Quality). Models that require less number-crunching to get the same quality are more efficient.
For example: the model 'ViT-B-16-SigLIP-256__webli' has MACs = 29.45 billion and quality = 0.767, while the model 'ViT-H-14-378-quickgelu__dfn5b' has a staggering MACs = 542.15 billion and quality = 0.828. Certainly the bigger model will give better results, but whether that increase of ~0.06 in quality (roughly 8%) is worth the extra processing time is something you need to decide.
Remember, these charts just give you a general vibe. Don't stress too much about picking the "perfect" model based on them.
Some Questions You Might Have:
What Does "Slow" Mean? 🐌
When we're talking "slow," we're looking at two major things:

- Initial processing time: running the Smart Search job over your existing library.
- Response time when a search is done: the model also has to process each search query you type.

Both of these can be affected by the model choice. If you are like me, you probably hoped it would only affect the initial processing time. I don't care how long the initial time is, but it does become a bother when every search takes longer to show results. So, pick your poison. But hey, "slow" is relative, right? What is slow to you might be fast for me!

A tip to mitigate the per-search delay: set the env MACHINE_LEARNING_MODEL_TTL=0 for the machine learning container (in your docker compose), which stops the model from being unloaded after inactivity, or set MACHINE_LEARNING_PRELOAD__CLIP=<model-name> for the machine learning container. The latter makes the service load the specified model at startup instead of on the first request. Once a preloaded model is loaded into memory, it stays there until Immich shuts down, which means it will constantly use that RAM. Preloading also ensures that Immich does not unload the model after a period of inactivity. If you have enough RAM, I recommend you preload your model; it has a noticeable effect on the search experience.
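To make that concrete, here is a minimal sketch of how those variables might look on the machine-learning service in your docker-compose.yml. The service name and the model name below are just examples from the default setup; adjust them to match your own deployment:

```yaml
services:
  immich-machine-learning:
    # ...keep your existing image, volumes, etc. as they are...
    environment:
      # 0 disables the inactivity timeout, so a model stays in memory once loaded
      - MACHINE_LEARNING_MODEL_TTL=0
      # or: load this CLIP model at startup and keep it resident in RAM
      - MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-16-SigLIP-256__webli
```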
Time Estimates? 🕰️
Here's the deal: Giving you exact times is impossible. It all depends on your hardware.
What are 'webli', 'DFN-5b', etc.?
Not all models are trained on the same data. It's like they've all read different books: the suffix in the model name tells you the training dataset, e.g. 'webli' is the WebLI dataset, 'dfn5b' is the DFN-5B dataset, 'laion2b' is the LAION-2B dataset, and so on...
How to set the model in Immich
Say you looked at the plots and have a model you want to try. Go to the Hugging Face page for Immich, Ctrl+F to find that model, click on it, and copy the exact name (as in the figure). Then paste that into the settings page under CLIP models (https://immich.app/docs/features/smart-search).
TIP - Both of these conventions are acceptable:
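My assumption is that the two forms are the bare model name and the name prefixed with the immich-app Hugging Face organization, along these lines:

```
ViT-B-16-SigLIP-256__webli
immich-app/ViT-B-16-SigLIP-256__webli
```

If in doubt, copy the exact name shown on the Hugging Face page as described above.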
Some general tips:
FYI: this does mean rerunning the Smart Search job and waiting a while for all of that to finish. I suggest you take a backup of your Postgres database after every trial, so that once you are done with your trials and know what you want, you can just switch back to the corresponding database backup.
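For reference, the official backup docs show a one-liner along these lines for dumping the database. The container name and username below are the defaults from the standard compose file; adjust them if your setup differs:

```bash
# Dump the Immich Postgres database to a compressed file before switching models
docker exec -t immich_postgres pg_dumpall --clean --if-exists --username=postgres \
  | gzip > "immich-db-before-model-switch.sql.gz"
```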
So, there you have it! Happy model hunting! 🎉
Some stats from my setup (do not use these as a reference; they're just to give you an idea):
You can do something similar with your trials to help you decide on one.
ViT-B-32__laion2b_e16: the difference was in milliseconds, not much of a noticeable one. So, I stuck with this.