Thanks for the great work! I am concerned about the computation cost. How much will CogCom increase training costs and inference time?
Hi, thanks for your interest! Compared to VLMs trained on single-image input, each CoM chain may consist of multiple turns of image-text pairs, which could linearly increase training and inference time. We have restricted the maximum number of turns to <= 3 in the data processor. In fact, many CoM chains can reach the answer by re-inputting the image after a single CropZoomIn manipulation on the original image.
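To make the cost argument concrete, here is a minimal sketch of how a single CropZoomIn step could work and why cost scales with the number of turns. This is hypothetical illustration code (the function names `crop_zoom_in` and `run_chain` are my own, not from the CogCom repo), assuming a chain simply re-inputs each manipulated image view:

```python
from PIL import Image

def crop_zoom_in(img: Image.Image, bbox, factor: float = 2.0) -> Image.Image:
    """Crop the region bbox = (left, top, right, bottom) and enlarge it
    by `factor`, mimicking a single CropZoomIn manipulation."""
    region = img.crop(bbox)
    w, h = region.size
    return region.resize((int(w * factor), int(h * factor)))

# Each turn re-inputs an image view, so compute grows roughly linearly
# with the number of turns; the data processor caps this at 3 turns.
MAX_TURNS = 3

def run_chain(img: Image.Image, bboxes) -> list:
    """Build the sequence of image views a multi-turn CoM chain would see."""
    views = [img]  # turn 1: the original image
    for bbox in bboxes[:MAX_TURNS - 1]:
        views.append(crop_zoom_in(views[-1], bbox))
    return views
```

Under this sketch, a one-manipulation chain processes only two image views (original plus one zoomed crop), which is why the common single-CropZoomIn case stays cheap.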