
Claim extraction from Images #72

Closed
Tracked by #69
dennyabrain opened this issue Dec 4, 2023 · 20 comments
Labels
level:feature An issue that describes a feature (initiative>feature>ticket) level:ticket:spike priority:high role:ml


@dennyabrain
Contributor

dennyabrain commented Dec 4, 2023

The various challenges involved in making sense of an image found on social media are summarized in this image:
Screenshot 2023-12-04 at 15-13-05 Tech Interventions against Online Harms

The image could be a photograph, a manipulated image, a screenshot, a newspaper clipping, or a meme. We have to devise a solution that extracts claims from these images using a mix of automated and manual methods and can be deployed at population scale.

Some ideas on what type of functionality ML can enable:

  1. Extract text out of images using Tesseract or Google Cloud Vision
  2. Use multimodal model(s) to describe the image and see how well they do for our use case
  3. Extract entity names from images
  4. Fine-tune multilingual and multimodal models for our context.
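Idea 1 can be prototyped in a few lines. A minimal sketch, assuming pytesseract and the eng/hin traineddata files are installed; function names here are illustrative, and the third-party import is kept inside the function so the cleanup helper works without it:

```python
def clean_ocr_text(text: str) -> str:
    """Drop the blank lines Tesseract's layout analysis leaves behind."""
    return "\n".join(line for line in text.splitlines() if line.strip())

def extract_text(image_path: str, langs: str = "eng+hin") -> str:
    """OCR an image with Tesseract and return cleaned text."""
    import pytesseract     # third-party; wraps the tesseract binary
    from PIL import Image  # third-party
    text = pytesseract.image_to_string(Image.open(image_path), lang=langs)
    return clean_ocr_text(text)
```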

This is meant to be a time-bound 5-day spike with the goal of learning as much as possible about how the state of the art in LLMs and ML can help us with claim extraction. We would like to include a working prototype so that we get a good sense of system requirements and pricing. As such, evaluating paid proprietary solutions like ChatGPT could also be part of this.

@dennyabrain
Contributor Author

Tuesday, Wednesday Spike

  • [Aurora, Aatman] NLP survey for image to text
  • Analyze the spreadsheet data to see trends for which categories of user requests are popular
  • Review performance of Text extraction using Tesseract
  • Look at AI4Bharat language models that could be used

Check in on Wednesday 11 am.

@duggalsu

duggalsu commented Dec 6, 2023

Tested Tesseract with English and Hindi LSTM models with multiple psm settings. Hindi OCR does not work with the legacy+LSTM option. Tesseract still cannot handle multi-column text in images.
Refer: https://muthu.co/all-tesseract-ocr-options/
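The psm sweep described above can be scripted. A sketch, assuming pytesseract and the hin traineddata are installed (the pure config helper is separated out; names are illustrative):

```python
def tesseract_config(psm: int, oem: int = 1) -> str:
    """Build a Tesseract config string. oem=1 forces LSTM-only, since the
    legacy+LSTM engine did not work for Hindi in the tests above."""
    return f"--oem {oem} --psm {psm}"

def sweep_psm(image_path: str, lang: str = "hin") -> dict:
    """Run OCR under several page-segmentation modes for comparison."""
    import pytesseract  # third-party
    from PIL import Image
    img = Image.open(image_path)
    # 3 = full auto, 4 = single column, 6 = single block, 11/12 = sparse text
    return {psm: pytesseract.image_to_string(
                img, lang=lang, config=tesseract_config(psm))
            for psm in (3, 4, 6, 11, 12)}
```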

Tested image pre-processing, which degraded text quality and did not improve OCR.

Otherwise, OCR is relatively good for English and Hindi images.

@aatmanvaidya

aatmanvaidya commented Dec 6, 2023

Tested out various models on Hugging Face and looked at Large Vision Models (LVMs). Will attach links soon.

@dennyabrain
Contributor Author

dennyabrain commented Dec 6, 2023

Wednesday :

  • Aatman: to focus on more models. Write up how well they do on our dataset, and note down their shortcomings as well.
  • Aurora: to focus on evaluating more models for image-to-text and image understanding (what the image contains).
  • [Aatman] Identify the 5 most popular categories of images. Some to start off with: memes/posters, non-manipulated images, newspaper clippings, etc.

@dennyabrain
Contributor Author

  • Extracted text might need cleanup. The discipline is called disfluency(?).

@duggalsu

duggalsu commented Dec 6, 2023

It's called disfluency correction for our purpose.
Refer: https://www.semanticscholar.org/search?q=disfluency%20correction&sort=relevance

@duggalsu

duggalsu commented Dec 6, 2023

The technical term for image understanding is Image Captioning
Refer: https://en.wikipedia.org/wiki/Natural_language_generation#Image_captioning

@dennyabrain
Contributor Author

Thanks. I think an added layer of automation that would make for useful claim extraction is detecting the entities (people/landmarks) in a picture. So instead of the extracted claim being "a man is standing next to a building", it would say "politician X is standing next to the Taj Mahal". We could create a dataset of persons of interest to facilitate this.

@dennyabrain
Contributor Author

dennyabrain commented Dec 6, 2023

Found this nice use of traditional image processing to segment portions from newspaper clippings - https://stackoverflow.com/questions/64241837/use-python-open-cv-for-segmenting-newspaper-article

This should also be useful for memes/posters with multiple text portions. I think these techniques might also help segment portions of an image, and those individual segments could then be used for further matching queries.
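The Stack Overflow approach above can be sketched as follows, assuming opencv-python is installed (function names are illustrative): invert and threshold so ink becomes white, dilate so neighbouring characters merge into one blob per block, then crop each blob's bounding box.

```python
def reading_order(boxes):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def segment_text_blocks(image_path: str, out_prefix: str = "block"):
    """Split a newspaper clipping / poster into text-block crops."""
    import cv2  # third-party (opencv-python)
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu + inversion: ink -> white foreground
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilation merges characters of one block into a single blob
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilated = cv2.dilate(thresh, kernel, iterations=4)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = reading_order(cv2.boundingRect(c) for c in contours)
    for i, (x, y, w, h) in enumerate(boxes):
        cv2.imwrite(f"{out_prefix}_{i}.png", img[y:y + h, x:x + w])
    return boxes
```

The kernel size and iteration count would need tuning per image category.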

@aatmanvaidya

aatmanvaidya commented Dec 6, 2023

Identify the 5 most popular categories of images

Categories I could come up with -

  1. Newspaper Clippings
  2. Screenshots - these could be of social media posts, the Inshorts news app, WhatsApp messages, Facebook posts, tweets, etc. Some of these also include memes
  3. Information posters - posters communicating some kind of information, like India's GDP growth, facts about a topic, details about how a political party led to development, or info around sports
  4. Letter(s) - complaint or information letters, e.g. letters to the govt regarding some issue
  5. Other - news headlines

(In the dataset, I saw some images repeat)

Extract Text from Images (Vision Encoder Decoder Models)

  1. nougat-base - a Donut-based model to extract text from images. Works well for short English text; fails when the text is long (newspaper clippings etc.). Doesn't work for Indic languages.
  2. A few other models - perform poorly on both English and Hindi text in images.
  3. Transformer-based OCRs - decent text extraction for short English text; performed poorly for Hindi text in images (some gibberish pops out)
  4. Awesome Transformer Based OCR - https://github.com/EriCongMa/awesome-transformer-ocr
  5. LayoutLM - https://huggingface.co/impira/layoutlm-document-qa - this is more for image understanding, but can also sometimes extract text - doesn't work for Indic languages.
  6. Azure AI Vision - https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr - This supports Hindi as per the Microsoft article.
  7. A question - what tools are others using for similar applications?
  8. Easy OCR - https://github.com/JaidedAI/EasyOCR
  9. Keras OCR - https://github.com/faustomorales/keras-ocr
  10. Multilingual OCR for Indic Scripts
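Of the tools above, EasyOCR is quick to try. A sketch, assuming the easyocr package is installed (it downloads detection weights on first run); `join_readings` and the confidence floor are illustrative choices:

```python
def join_readings(readings, min_conf: float = 0.3) -> str:
    """Join EasyOCR detections -- (bbox, text, confidence) triples --
    keeping only those above a confidence floor."""
    return " ".join(text for _bbox, text, conf in readings if conf >= min_conf)

def easyocr_text(image_path: str, langs=("en", "hi")) -> str:
    """Run EasyOCR over an image and return the joined text."""
    import easyocr  # third-party; downloads model weights on first use
    reader = easyocr.Reader(list(langs))
    return join_readings(reader.readtext(image_path))
```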

Detect the entities(people/landmarks) in a picture

  1. VisualBERT - https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert
  2. https://huggingface.co/nlpconnect/vit-gpt2-image-captioning (Aurora has also found this)
  3. https://huggingface.co/google/vit-base-patch16-224 - prints out different objects in an image.
  4. https://huggingface.co/openai/clip-vit-base-patch32 - this is the model by OpenAI. The only drawback I found is that we have to input the prediction labels, and it then computes which of the labels has the highest chance of being in the photo.
  5. GIT (GenerativeImage2Text) based models - describe what the image is about
  6. Vision-and-Language Transformer (ViLT) - https://huggingface.co/dandelin/vilt-b32-finetuned-vqa
    • The best part of this model is that you can ask it questions about the image, like "What is on the top of the tower?" or "What is the man eating?"
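Trying the ViLT VQA model above is a few lines with the transformers pipeline. A sketch, assuming transformers and a vision backend (Pillow, torch) are installed; `ask_about_image` is an illustrative wrapper:

```python
def top_answer(predictions) -> str:
    """Pick the highest-scoring answer from a VQA pipeline's output."""
    return max(predictions, key=lambda p: p["score"])["answer"]

def ask_about_image(image_path: str, question: str) -> str:
    """Ask a free-form question about an image with ViLT."""
    from transformers import pipeline  # third-party; downloads the model
    vqa = pipeline("visual-question-answering",
                   model="dandelin/vilt-b32-finetuned-vqa")
    return top_answer(vqa(image=image_path, question=question))
```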

Gibberish Text Detection

  1. https://stackoverflow.com/questions/68867789/python-pytesseract-module-returning-gibberish-from-an-image
  2. https://stackoverflow.com/questions/57377470/tesseract-showing-gibberish
  3. https://stackoverflow.com/questions/39835546/how-to-remove-gibberish-that-exhibits-no-pattern-using-python-nltk
  4. https://medium.com/analytics-vidhya/text-processing-tools-i-wish-i-knew-earlier-a6960e16a9c9
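A first-pass gibberish filter does not need ML. A minimal sketch of a heuristic score (thresholds and rules are assumptions to tune against our data, not an established method):

```python
import re

def gibberish_score(text: str) -> float:
    """Fraction of tokens that look like OCR noise: mostly non-alphanumeric
    characters, or Latin strings with no vowels (a common Tesseract failure
    mode). 0.0 = clean, 1.0 = all noise."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def looks_noisy(tok: str) -> bool:
        alnum = sum(ch.isalnum() for ch in tok)
        if alnum / len(tok) < 0.5:
            return True
        if re.fullmatch(r"[A-Za-z]+", tok) and not re.search(r"[aeiouAEIOU]", tok):
            return len(tok) > 2  # allow short acronyms like "TV"
        return False

    return sum(looks_noisy(t) for t in tokens) / len(tokens)
```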

Large Vision Models (LVM)

  1. Sequential Modeling Enables Scalable Learning for Large Vision Models.
  2. LVM-Med
  3. LayoutLMV2

Other

  1. https://huggingface.co/blog/vision_language_pretraining
  2. https://huggingface.co/docs/transformers/main/en/model_doc/vision-encoder-decoder
  3. An encoder-decoder based framework for hindi image caption generation
  4. A Scaled Encoder Decoder Network for Image Captioning in Hindi

GPT4-Vision

  1. The documentation itself has good examples on how to use Vision API - https://platform.openai.com/docs/guides/vision
  2. GPT-4-Vision Interesting Uses and Examples Thread (2023) - A great code example on how to use GPT4-Vision
  3. https://tmmtt.medium.com/how-to-use-gpt-4-vision-api-ba6b57af569c
  4. Various use case examples of GPT4 Vision with python code - https://github.com/Anil-matcha/GPT-4-Vision-Chatbot/tree/main
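A sketch of a GPT-4 Vision call, following the shape the Dec-2023 vision docs showed (model and field names may have changed since; `describe_image` and the default question are illustrative). It assumes the openai Python client ≥1.0 and an `OPENAI_API_KEY` in the environment:

```python
import base64

def vision_messages(question: str, image_b64: str) -> list:
    """Build the chat payload for a single image question."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]

def describe_image(image_path: str,
                   question: str = "What claim does this image make?") -> str:
    from openai import OpenAI  # third-party; reads OPENAI_API_KEY from env
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = OpenAI().chat.completions.create(
        model="gpt-4-vision-preview",
        messages=vision_messages(question, b64),
        max_tokens=300,
    )
    return resp.choices[0].message.content
```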

SAM

  1. https://github.com/kadirnar/segment-anything-video
  2. https://blog.roboflow.com/how-to-use-segment-anything-model-sam/
  3. https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-anything-with-sam.ipynb
  4. Many CV models (even for segment) - https://github.com/roboflow/notebooks
  5. https://www.youtube.com/watch?v=D-D6ZmadzPE
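Automatic mask generation with SAM is a short script. A sketch, assuming the segment-anything package, opencv-python, and a downloaded ViT-B checkpoint; the area floor is an illustrative filter for our segment-and-index idea:

```python
def keep_large_masks(masks, min_area: int = 5000):
    """Drop tiny segments; SAM's mask dicts carry an 'area' field."""
    return [m for m in masks if m["area"] >= min_area]

def segment_image(image_path: str,
                  checkpoint: str = "sam_vit_b_01ec64.pth"):
    """Generate all masks for an image with SAM, keeping the larger ones."""
    import cv2  # third-party
    from segment_anything import (sam_model_registry,
                                  SamAutomaticMaskGenerator)  # third-party
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    return keep_large_masks(SamAutomaticMaskGenerator(sam).generate(image))
```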

@dennyabrain
Contributor Author

@aatmanvaidya can you try out two things that I believe will be useful pre-processing steps regardless of what model we use :

  1. Segmenting images to split an image into its components, could be pictures (in information poster), text blobs (in news paper clippings) etc
  2. Face detection and saving the face in a different file
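Step 2 can be prototyped with OpenCV's bundled Haar cascade. A sketch, assuming opencv-python is installed; the padding helper and output naming are illustrative:

```python
def pad_box(x, y, w, h, pad, img_w, img_h):
    """Expand a detection box by `pad` px, clamped to the image bounds,
    so each saved crop keeps some context around the face."""
    x0, y0 = max(0, x - pad), max(0, y - pad)
    return x0, y0, min(img_w, x + w + pad) - x0, min(img_h, y + h + pad) - y0

def save_faces(image_path: str, out_prefix: str = "face", pad: int = 10) -> int:
    """Detect faces and write each crop to its own file; returns the count."""
    import cv2  # third-party; ships the Haar cascade files
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    ih, iw = img.shape[:2]
    for i, (x, y, w, h) in enumerate(faces):
        px, py, pw, ph = pad_box(x, y, w, h, pad, iw, ih)
        cv2.imwrite(f"{out_prefix}_{i}.png", img[py:py + ph, px:px + pw])
    return len(faces)
```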

@dennyabrain dennyabrain self-assigned this Dec 11, 2023
@aatmanvaidya

aatmanvaidya commented Dec 11, 2023

Summary

From my perspective, writing a rough pipeline that could be followed

Once we have the image, we could follow a process like this

  1. Identify text using simple image processing techniques.
    • Image extraction tools cannot extract text properly where text is present in columns (newspaper clippings are a popular example)
  2. Extract text from that identified text portion using Tesseract or EasyOCR
  3. Remove gibberish from the text.
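The three steps above compose naturally. A minimal sketch that takes the steps as callables (the segmenter, OCR engine, and gibberish filter are the tools discussed in this thread; the function name and shape are illustrative):

```python
from typing import Callable, Iterable, List

def run_pipeline(image_path: str,
                 segment: Callable[[str], Iterable],
                 ocr: Callable[[object], str],
                 is_gibberish: Callable[[str], bool]) -> List[str]:
    """Step 1: segment into text regions; step 2: OCR each region;
    step 3: drop empty or gibberish results."""
    texts = (ocr(region) for region in segment(image_path))
    return [t for t in texts if t.strip() and not is_gibberish(t)]
```

Keeping the steps pluggable lets us swap Tesseract for EasyOCR, or the OpenCV segmenter for SAM, without touching the pipeline.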

@dennyabrain
Contributor Author

Swair had a long response to this, I am cherry picking insights and typing here :

  1. For our image segmentation task, Swair recommended Meta's SAM model
  2. He said paying for GPT-4 Vision could be an interesting exercise to compare performance. His back-of-the-napkin calculation was that it should cost about $3.5 per 1,000 images.
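That napkin estimate translates directly into spike-budget arithmetic (the rate is Swair's estimate, not OpenAI's published pricing, which varies with image resolution and output tokens):

```python
def gpt4v_cost_usd(n_images: int, rate_per_1000: float = 3.5) -> float:
    """Estimated GPT-4 Vision cost at the back-of-the-napkin rate above."""
    return n_images * rate_per_1000 / 1000
```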

He also said our approach of segmenting relevant portions and indexing it might be interesting/publishable.

@dennyabrain
Contributor Author

@aatmanvaidya @duggalsu, can you each summarize, in a 5-line blurb, the various text extraction models and libraries you used and your conclusions?

@dennyabrain
Contributor Author

dennyabrain commented Dec 11, 2023

@aatmanvaidya

aatmanvaidya commented Dec 11, 2023

Summary of the CDT report

  • In some applications, multilingual language models outperform models trained on only one language
  • The gap in data availability between languages is known as the resourcedness gap.

@dennyabrain
Contributor Author

dennyabrain commented Dec 12, 2023

We should use today to test out the remaining solutions:

  1. I have the GPT-4 keys, so we should be able to try out GPT-4 Vision.
  2. Let's see how the SAM model from above performs.
  3. Try out Google Cloud Vision too, especially to check how it performs on Indic languages.
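Item 3 is a short script with the official client. A sketch, assuming google-cloud-vision is installed and `GOOGLE_APPLICATION_CREDENTIALS` points at a service-account key; the language-hints helper is the standard way to nudge Vision toward Devanagari:

```python
def language_hints(langs) -> dict:
    """ImageContext dict nudging Vision toward specific scripts."""
    return {"language_hints": list(langs)}

def cloud_vision_ocr(image_path: str, langs=("hi", "en")) -> str:
    """OCR an image with Google Cloud Vision's dense-text mode."""
    from google.cloud import vision  # third-party; needs GCP credentials
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.document_text_detection(
        image=image, image_context=language_hints(langs))
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```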

End of Spike Requirements:
Put together a self-contained slide deck with all your findings. We would like to keep it handy when we talk about the status of, and possibilities for, the claim extraction work. I think a good way to structure the slides would be sections on the problem statements "Extract text from an image" and "Caption an image", then the technique(s) used and the results they gave. Fill it with as many examples as possible; it's best to be able to see them to truly understand. Share the good examples, but also the really bad examples of the tech failing.

@duggalsu

GPT4-Vision does not seem good for any kind of OCR: it will not do OCR for copyrighted articles in English, and does not work well for Hindi.

However, it can describe the image in detail, i.e. do "image captioning", very well; better than the previously tested Hugging Face models.

@tarunima tarunima moved this from Todo to Done in 2023 Q4 Planner Dec 15, 2023
@aatmanvaidya

https://github.com/VikParuchuri/surya

Surya - a SOTA tool for multilingual OCR
Surya is a multilingual document OCR toolkit. It can do:

  • Accurate line-level text detection
  • Text recognition (coming soon)
  • Table and chart detection (coming soon)

It works on a range of documents and languages (see usage and benchmarks for more details).
