
Claim extraction from Images #72

Closed
Tracked by #69
dennyabrain opened this issue Dec 4, 2023 · 20 comments
Labels
level:feature An issue that describes a feature (initiative>feature>ticket) level:ticket:spike priority:high role:ml


@dennyabrain
Contributor

dennyabrain commented Dec 4, 2023

The various challenges involved in making sense of an image found on social media are summarized in this image:
Screenshot 2023-12-04 at 15-13-05 Tech Interventions against Online Harms

The image could be a photograph, a manipulated image, a screenshot, a newspaper clipping, or a meme. We have to devise a solution that extracts claims from these images using a mix of automated and manual methods and can be deployed at population scale.

Some ideas on what type of functionality ML can enable:

  1. Extract text out of images using Tesseract or Google Cloud Vision
  2. Use multimodal model(s) to describe the image and see how well they do for our use case
  3. Extract entity names from images
  4. Fine-tune multilingual and multimodal models for our context.
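Idea 1 can be prototyped in a few lines. A minimal sketch, assuming pytesseract and the eng/hin traineddata files are installed; function names here are illustrative, and the third-party import is kept inside the function so the cleanup helper works without it:

```python
def clean_ocr_text(text: str) -> str:
    """Drop the blank lines Tesseract's layout analysis leaves behind."""
    return "\n".join(line for line in text.splitlines() if line.strip())

def extract_text(image_path: str, langs: str = "eng+hin") -> str:
    """OCR an image with Tesseract and return cleaned text."""
    import pytesseract     # third-party; wraps the tesseract binary
    from PIL import Image  # third-party
    text = pytesseract.image_to_string(Image.open(image_path), lang=langs)
    return clean_ocr_text(text)
```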

This is meant to be a time-bound 5-day spike with the goal of learning as much as possible about how the state of the art in LLMs and ML can help us with claim extraction. We would like to include a working prototype so that we get a good sense of system requirements and pricing. As such, evaluating paid proprietary solutions like ChatGPT could also be part of this.

@dennyabrain
Contributor Author

Tuesday, Wednesday Spike

  • [Aurora, Aatman] NLP survey for image to text
  • Analyze the spreadsheet data to see trends for which categories of user requests are popular
  • Review performance of Text extraction using Tesseract
  • Look at AI4Bharat language models that could be used

Check in on Wednesday 11 am.

@duggalsu

duggalsu commented Dec 6, 2023

Tested Tesseract with English and Hindi LSTM models with multiple psm settings. Hindi OCR does not work with the legacy+LSTM option. Tesseract still cannot handle multi-column text in images.
Refer: https://muthu.co/all-tesseract-ocr-options/
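The psm sweep described above can be scripted. A sketch, assuming pytesseract and the hin traineddata are installed (the pure config helper is separated out; names are illustrative):

```python
def tesseract_config(psm: int, oem: int = 1) -> str:
    """Build a Tesseract config string. oem=1 forces LSTM-only, since the
    legacy+LSTM engine did not work for Hindi in the tests above."""
    return f"--oem {oem} --psm {psm}"

def sweep_psm(image_path: str, lang: str = "hin") -> dict:
    """Run OCR under several page-segmentation modes for comparison."""
    import pytesseract  # third-party
    from PIL import Image
    img = Image.open(image_path)
    # 3 = full auto, 4 = single column, 6 = single block, 11/12 = sparse text
    return {psm: pytesseract.image_to_string(
                img, lang=lang, config=tesseract_config(psm))
            for psm in (3, 4, 6, 11, 12)}
```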

Tested image pre-processing, which degraded text quality and did not improve OCR.

Otherwise, OCR is relatively good for English and Hindi images.

@aatmanvaidya

aatmanvaidya commented Dec 6, 2023

Tested out various models on Hugging Face and looked at Large Vision Models (LVMs). Will attach links soon.

@dennyabrain
Contributor Author

dennyabrain commented Dec 6, 2023

Wednesday :

  • Aatman: to focus on more models. Write up how well they do on our dataset, and note down their shortcomings as well.
  • Aurora: to focus on evaluating more models for image-to-text and image understanding (what the image contains).
  • [Aatman] Identify the 5 most popular categories of images. Some to start off with: memes/posters, non-manipulated images, newspaper clippings, etc.

@dennyabrain
Contributor Author

  • Extracted text might need cleanup. The discipline is called disfluency(?).

@duggalsu

duggalsu commented Dec 6, 2023

It's called disfluency correction for our purpose.
Refer: https://www.semanticscholar.org/search?q=disfluency%20correction&sort=relevance

@duggalsu

duggalsu commented Dec 6, 2023

The technical term for image understanding is Image Captioning
Refer: https://en.wikipedia.org/wiki/Natural_language_generation#Image_captioning

@dennyabrain
Contributor Author

Thanks. I think an added layer of automation that would make for useful claim extraction is detecting the entities (people/landmarks) in a picture. So instead of the extracted claim being "a man is standing next to a building", it would say "politician X is standing next to the Taj Mahal". We could create a dataset of persons of interest to facilitate this.

@dennyabrain
Contributor Author

dennyabrain commented Dec 6, 2023

Found this nice use of traditional image processing to segment portions from newspaper clippings - https://stackoverflow.com/questions/64241837/use-python-open-cv-for-segmenting-newspaper-article

This should also be useful for memes/posters with multiple text portions. I think these techniques might also help segment portions of an image, and those individual segments could then be used for further matching queries.
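The Stack Overflow approach above can be sketched as follows, assuming opencv-python is installed (function names are illustrative): invert and threshold so ink becomes white, dilate so neighbouring characters merge into one blob per block, then crop each blob's bounding box.

```python
def reading_order(boxes):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def segment_text_blocks(image_path: str, out_prefix: str = "block"):
    """Split a newspaper clipping / poster into text-block crops."""
    import cv2  # third-party (opencv-python)
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu + inversion: ink -> white foreground
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilation merges characters of one block into a single blob
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilated = cv2.dilate(thresh, kernel, iterations=4)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = reading_order(cv2.boundingRect(c) for c in contours)
    for i, (x, y, w, h) in enumerate(boxes):
        cv2.imwrite(f"{out_prefix}_{i}.png", img[y:y + h, x:x + w])
    return boxes
```

The kernel size and iteration count would need tuning per image category.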

@aatmanvaidya

aatmanvaidya commented Dec 6, 2023

Identify the 5 most popular categories of images

Categories I could come up with -

  1. Newspaper Clippings
  2. Screenshots - these could be of social media posts, the Inshorts news app, WhatsApp messages, Facebook posts, tweets, etc. Some of these also include memes
  3. Information posters - posters communicating some kind of information, like India's GDP growth, facts about a topic, details about how a political party led to development, or info around sports
  4. Letter(s) - complaint or information letters, e.g. letters to the govt regarding some issue
  5. Other - news headlines

(In the dataset, I saw some images repeat)

Extract Text from Images (Vision Encoder Decoder Models)

  1. nougat-base - a Donut-based model to extract text from images. Works well for short English text; fails when the text is long (newspaper clippings etc.). Doesn't work for Indic languages.
  2. A few other models - perform poorly on both English and Hindi text in images.
  3. Transformer-based OCRs - decent text extraction for short English text; performed poorly for Hindi text in images (some gibberish pops out)
  4. Awesome Transformer Based OCR - https://github.com/EriCongMa/awesome-transformer-ocr
  5. LayoutLM - https://huggingface.co/impira/layoutlm-document-qa - this is more for image understanding, but can also sometimes extract text - doesn't work for Indic languages.
  6. Azure AI Vision - https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr - This supports Hindi as per the Microsoft article.
  7. A question - what tools are others using for similar applications?
  8. Easy OCR - https://github.com/JaidedAI/EasyOCR
  9. Keras OCR - https://github.com/faustomorales/keras-ocr
  10. Multilingual OCR for Indic Scripts
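Of the tools above, EasyOCR is quick to try. A sketch, assuming the easyocr package is installed (it downloads detection weights on first run); `join_readings` and the confidence floor are illustrative choices:

```python
def join_readings(readings, min_conf: float = 0.3) -> str:
    """Join EasyOCR detections -- (bbox, text, confidence) triples --
    keeping only those above a confidence floor."""
    return " ".join(text for _bbox, text, conf in readings if conf >= min_conf)

def easyocr_text(image_path: str, langs=("en", "hi")) -> str:
    """Run EasyOCR over an image and return the joined text."""
    import easyocr  # third-party; downloads model weights on first use
    reader = easyocr.Reader(list(langs))
    return join_readings(reader.readtext(image_path))
```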

Detect the entities(people/landmarks) in a picture

  1. VisualBERT - https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert
  2. https://huggingface.co/nlpconnect/vit-gpt2-image-captioning (Aurora has also found this)
  3. https://huggingface.co/google/vit-base-patch16-224 - prints out different objects in an image.
  4. https://huggingface.co/openai/clip-vit-base-patch32 - this is the model by OpenAI. The only drawback I found is that we have to input the prediction labels, and it then computes which of the labels has the highest chance of being in the photo.
  5. GIT (GenerativeImage2Text) based models - describe what the image is about
  6. Vision-and-Language Transformer (ViLT) - https://huggingface.co/dandelin/vilt-b32-finetuned-vqa
    • The best part of this model is that you can ask it questions about the image, like "What is on the top of the tower?" or "What is the man eating?"
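Trying the ViLT VQA model above is a few lines with the transformers pipeline. A sketch, assuming transformers and a vision backend (Pillow, torch) are installed; `ask_about_image` is an illustrative wrapper:

```python
def top_answer(predictions) -> str:
    """Pick the highest-scoring answer from a VQA pipeline's output."""
    return max(predictions, key=lambda p: p["score"])["answer"]

def ask_about_image(image_path: str, question: str) -> str:
    """Ask a free-form question about an image with ViLT."""
    from transformers import pipeline  # third-party; downloads the model
    vqa = pipeline("visual-question-answering",
                   model="dandelin/vilt-b32-finetuned-vqa")
    return top_answer(vqa(image=image_path, question=question))
```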

Gibberish Text Detection

  1. https://stackoverflow.com/questions/68867789/python-pytesseract-module-returning-gibberish-from-an-image
  2. https://stackoverflow.com/questions/57377470/tesseract-showing-gibberish
  3. https://stackoverflow.com/questions/39835546/how-to-remove-gibberish-that-exhibits-no-pattern-using-python-nltk
  4. https://medium.com/analytics-vidhya/text-processing-tools-i-wish-i-knew-earlier-a6960e16a9c9
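A first-pass gibberish filter does not need ML. A minimal sketch of a heuristic score (thresholds and rules are assumptions to tune against our data, not an established method):

```python
import re

def gibberish_score(text: str) -> float:
    """Fraction of tokens that look like OCR noise: mostly non-alphanumeric
    characters, or Latin strings with no vowels (a common Tesseract failure
    mode). 0.0 = clean, 1.0 = all noise."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def looks_noisy(tok: str) -> bool:
        alnum = sum(ch.isalnum() for ch in tok)
        if alnum / len(tok) < 0.5:
            return True
        if re.fullmatch(r"[A-Za-z]+", tok) and not re.search(r"[aeiouAEIOU]", tok):
            return len(tok) > 2  # allow short acronyms like "TV"
        return False

    return sum(looks_noisy(t) for t in tokens) / len(tokens)
```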

Large Vision Models (LVM)

  1. Sequential Modeling Enables Scalable Learning for Large Vision Models.
  2. LVM-Med
  3. LayoutLMV2

Other

  1. https://huggingface.co/blog/vision_language_pretraining
  2. https://huggingface.co/docs/transformers/main/en/model_doc/vision-encoder-decoder
  3. An encoder-decoder based framework for hindi image caption generation
  4. A Scaled Encoder Decoder Network for Image Captioning in Hindi

GPT4-Vision

  1. The documentation itself has good examples on how to use Vision API - https://platform.openai.com/docs/guides/vision
  2. GPT-4-Vision Interesting Uses and Examples Thread (2023) - A great code example on how to use GPT4-Vision
  3. https://tmmtt.medium.com/how-to-use-gpt-4-vision-api-ba6b57af569c
  4. Various use case examples of GPT4 Vision with python code - https://github.com/Anil-matcha/GPT-4-Vision-Chatbot/tree/main
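A sketch of a GPT-4 Vision call, following the shape the Dec-2023 vision docs showed (model and field names may have changed since; `describe_image` and the default question are illustrative). It assumes the openai Python client ≥1.0 and an `OPENAI_API_KEY` in the environment:

```python
import base64

def vision_messages(question: str, image_b64: str) -> list:
    """Build the chat payload for a single image question."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]

def describe_image(image_path: str,
                   question: str = "What claim does this image make?") -> str:
    from openai import OpenAI  # third-party; reads OPENAI_API_KEY from env
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = OpenAI().chat.completions.create(
        model="gpt-4-vision-preview",
        messages=vision_messages(question, b64),
        max_tokens=300,
    )
    return resp.choices[0].message.content
```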

SAM

  1. https://github.com/kadirnar/segment-anything-video
  2. https://blog.roboflow.com/how-to-use-segment-anything-model-sam/
  3. https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-anything-with-sam.ipynb
  4. Many CV models (even for segment) - https://github.com/roboflow/notebooks
  5. https://www.youtube.com/watch?v=D-D6ZmadzPE
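Automatic mask generation with SAM is a short script. A sketch, assuming the segment-anything package, opencv-python, and a downloaded ViT-B checkpoint; the area floor is an illustrative filter for our segment-and-index idea:

```python
def keep_large_masks(masks, min_area: int = 5000):
    """Drop tiny segments; SAM's mask dicts carry an 'area' field."""
    return [m for m in masks if m["area"] >= min_area]

def segment_image(image_path: str,
                  checkpoint: str = "sam_vit_b_01ec64.pth"):
    """Generate all masks for an image with SAM, keeping the larger ones."""
    import cv2  # third-party
    from segment_anything import (sam_model_registry,
                                  SamAutomaticMaskGenerator)  # third-party
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    return keep_large_masks(SamAutomaticMaskGenerator(sam).generate(image))
```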

@dennyabrain
Contributor Author

@aatmanvaidya can you try out two things that I believe will be useful pre-processing steps regardless of what model we use :

  1. Segmenting images to split an image into its components, could be pictures (in information poster), text blobs (in news paper clippings) etc
  2. Face detection and saving the face in a different file
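Step 2 can be prototyped with OpenCV's bundled Haar cascade. A sketch, assuming opencv-python is installed; the padding helper and output naming are illustrative:

```python
def pad_box(x, y, w, h, pad, img_w, img_h):
    """Expand a detection box by `pad` px, clamped to the image bounds,
    so each saved crop keeps some context around the face."""
    x0, y0 = max(0, x - pad), max(0, y - pad)
    return x0, y0, min(img_w, x + w + pad) - x0, min(img_h, y + h + pad) - y0

def save_faces(image_path: str, out_prefix: str = "face", pad: int = 10) -> int:
    """Detect faces and write each crop to its own file; returns the count."""
    import cv2  # third-party; ships the Haar cascade files
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    ih, iw = img.shape[:2]
    for i, (x, y, w, h) in enumerate(faces):
        px, py, pw, ph = pad_box(x, y, w, h, pad, iw, ih)
        cv2.imwrite(f"{out_prefix}_{i}.png", img[py:py + ph, px:px + pw])
    return len(faces)
```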

@dennyabrain dennyabrain self-assigned this Dec 11, 2023
@aatmanvaidya

aatmanvaidya commented Dec 11, 2023

Summary

From my perspective, writing a rough pipeline that could be followed

Once we have the image, we could follow a process like this

  1. Identify text using simple image processing techniques.
    • Image extraction tools cannot extract text properly where text is present in columns (newspaper clippings are a popular example)
  2. Extract text from that identified text portion using Tesseract or EasyOCR
  3. Remove gibberish from the text.
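The three steps above compose naturally. A minimal sketch that takes the steps as callables (the segmenter, OCR engine, and gibberish filter are the tools discussed in this thread; the function name and shape are illustrative):

```python
from typing import Callable, Iterable, List

def run_pipeline(image_path: str,
                 segment: Callable[[str], Iterable],
                 ocr: Callable[[object], str],
                 is_gibberish: Callable[[str], bool]) -> List[str]:
    """Step 1: segment into text regions; step 2: OCR each region;
    step 3: drop empty or gibberish results."""
    texts = (ocr(region) for region in segment(image_path))
    return [t for t in texts if t.strip() and not is_gibberish(t)]
```

Keeping the steps pluggable lets us swap Tesseract for EasyOCR, or the OpenCV segmenter for SAM, without touching the pipeline.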

@dennyabrain
Contributor Author

Swair had a long response to this, I am cherry picking insights and typing here :

  1. For our image segmentation task, Swair recommended Meta's SAM model
  2. He said paying for GPT-4 Vision could be an interesting exercise to compare performance. His back-of-the-napkin calculation was that it should cost about $3.5 per 1,000 images.
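That napkin estimate translates directly into spike-budget arithmetic (the rate is Swair's estimate, not OpenAI's published pricing, which varies with image resolution and output tokens):

```python
def gpt4v_cost_usd(n_images: int, rate_per_1000: float = 3.5) -> float:
    """Estimated GPT-4 Vision cost at the back-of-the-napkin rate above."""
    return n_images * rate_per_1000 / 1000
```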

He also said our approach of segmenting relevant portions and indexing it might be interesting/publishable.

@dennyabrain
Contributor Author

@aatmanvaidya @duggalsu, can you each summarize, in a 5-line blurb, the various text extraction models and libraries you used and your conclusions?

@dennyabrain
Contributor Author

dennyabrain commented Dec 11, 2023

@aatmanvaidya

aatmanvaidya commented Dec 11, 2023

Summary of the CDT report

  • In some applications, multilingual language models outperform models trained on only one language
  • The gap in data availability between languages is known as the resourcedness gap.

@dennyabrain
Contributor Author

dennyabrain commented Dec 12, 2023

We should use today to test out the remaining solutions:

  1. I have the GPT-4 keys, so we should be able to try out GPT-4 Vision.
  2. Let's see how the SAM model from above performs.
  3. Try out Google Cloud Vision too, especially to check how it performs on Indic languages.
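Item 3 is a short script with the official client. A sketch, assuming google-cloud-vision is installed and `GOOGLE_APPLICATION_CREDENTIALS` points at a service-account key; the language-hints helper is the standard way to nudge Vision toward Devanagari:

```python
def language_hints(langs) -> dict:
    """ImageContext dict nudging Vision toward specific scripts."""
    return {"language_hints": list(langs)}

def cloud_vision_ocr(image_path: str, langs=("hi", "en")) -> str:
    """OCR an image with Google Cloud Vision's dense-text mode."""
    from google.cloud import vision  # third-party; needs GCP credentials
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.document_text_detection(
        image=image, image_context=language_hints(langs))
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```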

End of Spike Requirements:
Put together a self-contained slide deck with all your findings. We would like to keep it handy when we talk about the status of, and possibilities for, the claim extraction work. I think a good way to structure the slides would be sections on the problem statements "Extract text from an image" and "Caption an image", then the technique(s) used and the results they gave. Fill it with as many examples as possible; it's best to be able to see them to truly understand. Share the good examples, but also the really bad examples of the tech failing.

@duggalsu

GPT4-Vision does not seem good for any kind of OCR: it will not do OCR for copyrighted articles in English, and does not work well for Hindi.

However, it can describe the image in detail, i.e. do "image captioning", very well; better than the previously tested Hugging Face models.

@tarunima tarunima moved this from Todo to Done in 2023 Q4 Planner Dec 15, 2023
@aatmanvaidya

https://github.com/VikParuchuri/surya

Surya - a SOTA tool for multilingual OCR
Surya is a multilingual document OCR toolkit. It can do:

  • Accurate line-level text detection
  • Text recognition (coming soon)
  • Table and chart detection (coming soon)

It works on a range of documents and languages (see usage and benchmarks for more details).
