
[DMP 2024]: Clustering large amount of audio #82

Closed
dennyabrain opened this issue Feb 16, 2024 · 35 comments
@dennyabrain
Contributor

dennyabrain commented Feb 16, 2024

Ticket Contents

Description

Feluda allows researchers, fact-checkers and journalists to explore and analyze large quantities of multimedia content. One important modality on Indian social media is audio. The scope of this task is to explore various automated techniques suited for grouping similar audio together and visualizing the results. After consultation with the team, implement an end-to-end workflow that can be used to surface visual or temporal trends in a large collection of audio.

Goals

  • Review literature with our team and do research and prototyping to evaluate state-of-the-art ML and classical DSP techniques
  • Optimize the solution for consistent RAM and CPU usage (limit spikes caused by variables like file size, video length, etc.), since it will need to scale up to millions of videos
  • Integrate the solution into Feluda by creating an operator that adheres to the Feluda operator interface

Expected Outcome

Feluda's goal is to provide a simple CLI or scriptable interface for analysing multimodal social media data. In that vein, all the work that you do should be executable and configurable via scripts and config files. The solution should take Feluda's architecture and its various components into account to identify the best ways to enable this.
The solution should have a way to configure the data source (a database with file IDs or an S3 bucket with files), to specify and implement the data processing pipeline, and to define where the results will be stored. Our current implementation uses S3 and a SQL database as data sources and Elasticsearch for storing results, but additional sources or stores can be added if apt for this project.

Acceptance Criteria

  • Regular interactive demos with the team using a public Jupyter notebook pushed to our experiments repository
  • A working Feluda operator with tests that can be run as an independent worker in the cloud to schedule processing jobs over a large dataset
  • Output structured data that can be passed on to a UI service (web or mobile) for downstream use cases

Implementation Details

One way we have approached this is by using vector embeddings. We have done this to great success to surface visual trends in images: we used a ResNet model to generate vector embeddings and stored them in Elasticsearch. We also used t-SNE to reduce the dimensions of the vector embeddings and then displayed them in a 2D visualization. It can be viewed here.
A detailed report on Feluda's usage in a project to analyze images can be read here.
The relevant Feluda operator can be studied here.
The code for t-SNE is here.
A prior study of various ways to get insights out of images has been documented here.
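For orientation, here is a minimal sketch of that embed-then-reduce pipeline, assuming torchvision's pretrained ResNet-50 and scikit-learn's t-SNE; the image paths are placeholders and this is not the actual Feluda operator code:

```python
import glob

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.manifold import TSNE

# Pretrained ResNet-50 with the classifier head removed, so it outputs 2048-d features.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> np.ndarray:
    """Return a 2048-d embedding for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(img).squeeze(0).numpy()

paths = sorted(glob.glob("images/*.jpg"))          # placeholder image folder
embeddings = np.stack([embed(p) for p in paths])

# Reduce to 2D for plotting; in practice the embeddings would also be indexed in Elasticsearch.
coords = TSNE(n_components=2, perplexity=min(30, len(paths) - 1)).fit_transform(embeddings)
```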

Mockups/Wireframes

This is an interactive visualization of Image clustering done using Feluda.

Doing UI development or integrating with any UI software is not part of this project, but it might help to see what sort of downstream applications we use Feluda for.

Product Name

Feluda

Organisation Name

Tattle

Domain

Open Source Library

Tech Skills Needed

Machine Learning, Python

Mentor(s)

@dennyabrain @duggalsu

Category

Data Science, Machine Learning, Research

@MadhukeshSingh

Hi there @dennyabrain, I'm passionate about machine learning and keen on joining this project.

Here's a bit about myself:
I am Madhukesh Singh, currently studying at the National Institute of Technology, Hamirpur, in my third year.

My experience includes working on image processing, computer vision, and object detection in satellite imagery during my internship as an AI developer at DRDO DYSL.AI.

Is there a preferred method for communicating with the mentors? I'm eager to contact you and explore how I can contribute.

@dennyabrain
Contributor Author

Hi @MadhukeshSingh, we can use this issue to discuss approaches. If you start concretely implementing something, you can make a new issue specific to your approach and we can take the conversation there.

@Tahseen23

"Hi there, @dennyabrain! I want to contribute to this project, but I am new to open-source contribution.
So, can you tell me what I have to do in this project and how to contribute?"

@manisha1301

Hi there @dennyabrain, I'm passionate about machine learning and keen on joining this project, as I bring a robust skill set encompassing advanced machine learning and natural language processing capabilities.
My adaptability, efficiency in information retrieval, and quick learning make me a valuable asset for tasks requiring machine learning, AI-driven insights, data analysis, and language-related applications.
I am equipped to contribute to the team's goal by leveraging cutting-edge AI technology and staying abreast of industry trends.

Here's a bit about myself:
I am Manisha Sharma, currently studying at GD Goenka University, Gurugram, Haryana, in my fourth and final year.

My experience includes working on deep learning, machine learning, artificial neural networks, and cryptanalysis during my internship as an AI developer at SAG, DRDO, and I am currently interning at Interglobe Aviation as a data analyst.

Is there a preferred method for communicating with the mentors? I'm eager to contact you and explore how I can contribute.

@sreyash-layek

Hello @dennyabrain ,
I'm thrilled to delve into the Feluda project and its objectives. After reviewing the documentation, I noticed that my background aligns well with the project's needs.

A little about myself: My name is Sreyash Layek, and I'm currently in my fifth year at the Indian Institute of Technology, Kharagpur, pursuing a Dual Degree (Integrated B.Tech & M.Tech) with a specialization in Signal Processing and Machine Learning.

Over the past three years, I've dedicated myself to exploring Machine Learning, with a particular focus on Computer Vision and Natural Language Processing tasks. I've spent a year working on Speech Processing and Accent Conversion, achieving results close to the state-of-the-art. Additionally, I've developed models for various applications, including Attention Monitoring, Accident Classification, Audio Classification, Emotion Classification, Recommendation Systems, and more.

I bring to the table over five years of experience in Python and three years in Machine Learning and Deep Learning. I'm eager to learn more about the project and discuss how I can contribute. I'd be interested in understanding your expectations and the specific requirements for this project.

Could we explore this further?

@Sbswag

Sbswag commented Apr 11, 2024

Hello @dennyabrain, my name is Surjeet Bijarniya and I am a student at IIT BHU, passionate about machine learning and eager to join this project. I am new to machine learning; could you tell me how I can contribute?

@KAMERAVAMSHI

Hello @dennyabrain! I'm enthusiastic about machine learning and eager to be part of this project.

Allow me to introduce myself:
I'm Kamera Vamshi, currently in the final year of my B.Tech at the National Institute of Technology, Rourkela (NIT Rourkela).

My background involves significant experience in machine learning, Python, and data analysis, which I honed during my internships and projects.

Could you please advise on the preferred method for reaching out to mentors? I'm keen to connect and discuss how I can contribute to the project.

@AkanshuAich

Hi @dennyabrain,

I am Akanshu Aich, a third-year B.Tech student from the International Institute of Information Technology, Bhubaneswar. I am writing to express my interest in contributing to this project as part of DMP 2024. Having thoroughly reviewed the project, I am impressed by its objectives and its potential for impact.

With my background in backend development using Django and the MERN stack, along with hands-on practice in machine learning and DevOps tools such as Docker, I believe I can make valuable contributions to the machine learning part. My experience includes several projects, such as a Society Expenditure Manager using Django, a Real Estate app using the MERN stack, and an Info-Finding Tool using machine learning (LLMs), which I believe align well with the goals of your project.

I am particularly interested in fulfilling the requirements of the project and have some ideas on how to approach it effectively. I am committed to adhering to best practices, contributing high-quality code, and actively collaborating with the project maintainers and community.

I am excited about the opportunity to contribute to "Feluda" and help further its mission. I look forward to discussing potential contributions and how I can best support the project.

Please guide me through the process with your knowledge and experience.

@manavsolkar

Hello @dennyabrain! I'm enthusiastic about machine learning and eager to be part of this project.

Allow me to introduce myself:
I'm Manav Solkar, currently in the second year of my B.Tech at Thakur College of Engineering and Technology (TCET).

I really want to be a part of this and hope that your guidance will help me grow my skill set.

Could you please advise on the preferred method for reaching out to mentors? I'm keen to connect and discuss how I can contribute to the project.

@Tatwansh

Hey @dennyabrain and @duggalsu,
I am interested in working on this project. I have prior experience working on a project with similar objectives on the QAnon dataset. You can check out my work at the link provided.

notebook link: https://www.kaggle.com/code/tatwanshjaiswal/dark-web-language-analysis

I would be happy to receive feedback on how to improve it.

@AbhimanyuSamagra

Do not ask process-related questions about how to apply or whom to contact in this ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, refer to the instructions listed on Unstop; any further queries can be taken up on our Discord channel titled DMP queries.

@ashuashutosh2211

Hey @dennyabrain and @duggalsu, I am Ashutosh, pursuing a B.Tech. in Artificial Intelligence and Data Science at IIT Jodhpur. I am proficient in languages like Python and C++. I have worked on machine learning and deep learning projects such as stock price prediction and a voice-controlled music recommendation system using deep learning.
I am interested in working on this project and applying my skills to it.

@dennyabrain
Contributor Author

Hi everyone,

Thank you for expressing interest in this issue. Depending on your interests and skills, you can take ANY ONE of the following approaches:

  1. Look at the problem statement and propose your approach
    Remember the main problem statement: given a large number of audio files, find a way to group identical and similar audio files. This approach would be ideal for anyone who is interested in or studies ML and/or DSP. By thinking about the problem statement, reviewing the existing literature on it and proposing your approach here, we would all learn something, and the mentors should be able to nudge you in the right direction.

  2. Try getting Feluda working on your machine
    Feluda is a moderately complex piece of software with many moving parts. Getting it working on your machine can itself be a challenge. We have a guide on it here. If you are a software developer/tinkerer, this might be a good place to start, because once you have Feluda working locally and can see the various existing functionalities, that might give you an idea of how to proceed.

  3. Recreate our code in a Jupyter notebook or Google Colab notebook
    We already have some code that takes audio files and converts them into vectors, and code that takes these vectors and clusters them (a rough sketch of such a pipeline follows below). I would take this approach if you are a software engineer with some ML engineering skills and you know your way around ML models. Once you get this working in your notebook, we can try out different pretrained models to evaluate performance.
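If it helps to get started, here is a rough, illustrative stand-in for approach 3 (not Feluda's actual AudioVec operator): it mean-pools MFCCs with librosa purely as a placeholder for whatever embedding model gets evaluated, and the file names are made up.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def audio_to_vector(path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Turn one audio file into a fixed-size vector by mean-pooling MFCC frames."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # mean-pool over time

files = ["clip_001.wav", "clip_002.wav", "clip_003.wav"]     # placeholder dataset
vectors = np.stack([audio_to_vector(f) for f in files])

# Group similar clips together; the number of clusters is a free choice here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(dict(zip(files, kmeans.labels_)))
```

Swapping the MFCC step for a pretrained embedding model is exactly the kind of experiment the notebook is meant to make easy.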

You'll have me or members of our team to guide you if you get stuck on any of these approaches. Taking some concrete steps on any of these three approaches would help us understand your interests and skills, and give you concrete feedback when you get stuck.

All the best!

@vishakha72

Hello @dennyabrain, I really want to contribute to this project. I have good hands-on experience with Python, machine learning, databases, and deep learning. I am a data science student and really enthusiastic about working on your project.
Over the past three years, I have done a lot of real-time projects, and I have also done many internships to gain hands-on experience.
I want to learn and gain deeper experience by working on this project. Please allow me to work on it.

@Satyam0775

Hello @dennyabrain,

I'm eager to contribute to your project. With substantial experience in Python, machine learning, databases, and deep learning, I believe I can make valuable contributions. As a data science student, I've spent the past three years working on various real-world projects and completing internships to hone my skills.
I'm enthusiastic about delving deeper into the field and gaining practical experience through involvement in your project. I'm eager to learn and collaborate effectively. Please consider allowing me to be part of your team.

@dennyabrain dennyabrain changed the title [DMP 2024]: ## Clustering large amount of audio [DMP 2024]: Clustering large amount of audio Apr 23, 2024
@AbhimanyuSamagra

Do not ask process-related questions about how to apply or whom to contact in this ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, refer to the instructions listed on Unstop; any further queries can be taken up on our Discord channel titled DMP queries. Here's a video tutorial on how to submit a proposal for a project.

@Chaithanya512
Contributor

Hi @dennyabrain,

I am Chaithanya Kalyan. I am interested in contributing to this project.

I have experience working with time series signals. As part of the PhysioNet 2023 challenge, I extracted time-domain and frequency-domain features to classify EEG signals (more details here).

I have a question regarding the details of this project and would greatly appreciate clarification:

  1. Does this clustering algorithm have to be scalable to different datasets (like a general framework that can be extended), or is it only for a specific dataset?

I think the following approach is worth trying:
Without extracting traditional audio features, we could train an autoencoder network on a large audio collection to automatically learn a low-level representation of the audio signals and cluster based on these latent representations.
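To make the idea concrete, here is a minimal sketch of what I have in mind (the layer sizes, the log-mel patch shape, and the random batch are placeholders for illustration, not the actual experiment):

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    def __init__(self, n_mels: int = 64, n_frames: int = 128, latent_dim: int = 32):
        super().__init__()
        in_dim = n_mels * n_frames
        self.encoder = nn.Sequential(
            nn.Flatten(),                        # (batch, n_mels, n_frames) -> (batch, in_dim)
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                      # latent code used later for clustering
        recon = self.decoder(z)                  # flattened reconstruction
        return recon, z

model = AudioAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# One illustrative training step on a random batch standing in for log-mel spectrogram patches.
batch = torch.rand(8, 64, 128)
recon, z = model(batch)
loss = criterion(recon, batch.flatten(1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
# After training, the latent codes z would be fed to a clustering algorithm such as k-means.
```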

I have tried a similar approach on EEG signals before; you can find that notebook here.

I would be happy to hear your feedback.

contact: [email protected]

@dennyabrain
Contributor Author

Hi @Chaithanya512,

Given that the project focuses on addressing use cases around online misinformation, the dataset we deal with is usually audio/video found on social media. So it can contain a variety of audio: memes, news clippings, amateur recordings from phones, etc.

Is there a quick way to validate whether the autoencoder approach would be suitable for this use case? What is your rationale for preferring it over extracting traditional audio features?

@Chaithanya512
Contributor

Chaithanya512 commented Apr 23, 2024

Thank you for the feedback; I am currently working on the code to validate the use of autoencoders.

Compared to traditional, hand-crafted features, autoencoders have the potential to capture a wider range of features. While traditional audio features are valuable, they might miss some subtle patterns in the data that autoencoders can discover.

I have a follow-up question (it might be a naive one) about your response; please correct me if I am wrong.

I'm curious: do you think traditional audio features are effective in clustering misinformation versus non-misinformation? Do those features differ between misinformation and non-misinformation?

@dennyabrain
Contributor Author

So we won't be using the clusters to classify something as "misinformation" or "not misinformation". We're hoping to use clustering as a way to find a first level of grouping within a large dataset. So most likely the clusters will be something high level like "memes", "amateur smartphone recordings", etc. If we are lucky, we could aspire to thematic labels like "politics", "health", etc.

An example of clustering we did on images is here - https://tattle.co.in/articles/covid-whatsapp-public-groups/t-sne/
The clusters we got then were Screenshots (Social Media), Screenshots (Other), Medical Supplies, Paper Documents, Religious Imagery, etc.

@Chaithanya512
Contributor

Chaithanya512 commented Apr 24, 2024

Thank you for the clarification; that makes sense now. So we are using clustering only to find high-level labels/pseudo-labels. I have found this paper that uses labeled data (text only) to categorize misinformation posters or active citizens on social media. It got me thinking: if we could obtain transcriptions of the audio content (if that is possible), that information could significantly enhance our clustering efforts.

@dennyabrain
Contributor Author

@Chaithanya512 yes, that would certainly help. In fact, when we do clustering for images, we often try to extract any text out of them as a way to get a richer dataset. You can certainly try transcriptions for audio content. One challenge might be that we are dealing with non-English languages and also low-quality audio.
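For illustration only, here is one minimal way to try multilingual transcription with the open-source Whisper package (the model size and file name below are placeholders, and this is not part of Feluda):

```python
import whisper

model = whisper.load_model("small")        # pick the size based on your RAM/CPU budget
result = model.transcribe("clip_001.wav")  # language is auto-detected, including many Indian languages
print(result["language"], result["text"])

# The transcript (or a text embedding of it) could then be combined with the audio
# embedding to give the clustering step a richer feature set.
```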

@preeti13456

preeti13456 commented Apr 24, 2024

Hey, can I work on this issue? I have worked on speech attenuation in the past, so I am somewhat familiar with the problem statement. Kindly let me know.

@Ahmedfurkhan

Hey! I want to work on this.

@Ankita-Mohan

Hi there, @dennyabrain,
I am Ankita Mohan, a third-year student at Kalinga Institute of Industrial Technology, Odisha. I'm passionate about machine learning and keen on joining this project. Moreover, I have a deep understanding of clustering algorithms, as I have done projects on clustering.
I am eager to contribute and to gain your guidance for the same.

@Pushkar0730

I would definitely like to work on it ☺️

@dennyabrain
Contributor Author

Hi all, thanks for your enthusiasm. Please let me know if you have any specific ideas on how you would go about the project.

Please refer to this comment for some suggested ways to move forward #82 (comment)

@PriyalPB

Hi @dennyabrain! I'm a third-year student from Cummins, Pune.

I'm thrilled to join your audio clustering project and offer my skill set: I have a strong background in machine learning, deep learning (CNNs), NLP, DSP and Python, which seems to fit well with what you're looking for.
I'm excited to explore how my expertise can help the project. Furthermore, integrating computer vision with the ML advancements could lead to a seamlessly automated system.
I'm eager to discuss further avenues where I can make meaningful contributions. Could we schedule a meeting to discuss this in more detail?

@CodeSage4

My skills in machine learning (computer vision, NLP) and experience with speech processing align well with the Feluda project. I'm a motivated student with 3+ years of Python experience and 2 years in ML/DL. Eager to discuss how I can contribute!

@VDinesh03

Hi @dennyabrain, I am V Dinesh, a third-year Mechanical Engineering student at Army Institute of Technology, Pune. I'm passionate about machine learning and keen on joining this project. In addition, I have a strong grasp of clustering algorithms, acquired through hands-on experience from multiple projects focused on implementing and fine-tuning various clustering techniques. These projects have given me a comprehensive understanding of the underlying principles, nuances, and practical applications of clustering algorithms across diverse domains, allowing me to navigate complex datasets, identify patterns, and extract meaningful insights. I am enthusiastic about contributing my expertise and eager to receive your guidance to further enhance my capabilities.

@pandharkardeep

Hi @dennyabrain. I am Deep Pandharkar, a second-year Data Science Engineering student from DJ Sanghvi College of Engineering, Mumbai. I have some experience in CV as well as NLP. My passion for ML makes me keen on joining this project. In addition, I have practised a lot with vector embeddings as part of my NLP projects. I also have coding experience in data structures and algorithms. Eager to discuss how I can contribute.

@Sufia-ahmad

I am Sufia, and I graduated with a B.Tech in CSE. I am a data scientist and also a full-stack developer, but I am a fresher: I have only completed 6 months of training in the field and one month of internship, so I would like to do this internship.

@aatmanvaidya
Collaborator

aatmanvaidya commented Jun 18, 2024

Weekly Goals

Week 1

  • Set up my local development environment and workflow

Week 2

  • Set up Feluda and run tests for the AudioVecEmbedding operator
  • Collect a dataset of 150-200 audio files
  • Run the Feluda AudioVec operator on the dataset, reduce dimensions using t-SNE and make a visual plot; this will act as a baseline for us
  • Try out different embedding models

Week 3

  • play the audio from the plot in colab
  • try out more embedding models - AST
  • do a review of lit for any video-based transformer models that are trained on audio
  • try out k-means clustering on Audio Data using Feluda's AudioVec Embedding operator

Week 4

  • Let's keep trying out more transformer-based / CNN ensemble-based models
  • Evaluate clustering results
  • Do a review of lit for other clustering algorithms

Week 5

  • exam break

Week 6

  • Keep exploring embedding models
  • start looking at sampling strategies for audio files.

Week 7

  • Run k-means clustering on CLAP, AST, CED and Feluda AudioVec
  • Profile CLAP, AST for RAM and CPU
  • read and implement sampling strategies.

Week 8

  • Finalize the embedding model
  • Create a custom audio dataset in Indian context
  • Run K-means clustering on it and visualise using jupyter scatter
  • read sampling strategies.

Week 9

  • run finalised approach on prod data
  • exploratory analysis on how to improve results on custom data
  • run affinity clustering.

Week 10

  • write an operator using CLAP
  • write a clustering operator
  • run CLAP and CED on prod data

Week 11

  • clustering operator
  • test for clustering operator
  • documentation on the CLAP operator.
  • return t-SNE x,y coordinates in the clustering operator (a rough sketch of this output shape follows after this plan).

Week 12

  • finish writing the worker
  • document the reduction operator
  • document the worker
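As a purely hypothetical sketch of the output shape the clustering operator is aiming for (a cluster label plus t-SNE x,y per file, so a downstream UI can plot it), assuming scikit-learn and random placeholder embeddings; this is not the final Feluda operator interface:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def cluster_with_coords(embeddings: np.ndarray, ids: list, n_clusters: int = 5):
    """Return structured records: one dict per file with its cluster label and 2D coordinates."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    coords = TSNE(n_components=2, perplexity=min(30, len(ids) - 1),
                  random_state=0).fit_transform(embeddings)
    return [
        {"id": i, "cluster": int(c), "x": float(x), "y": float(y)}
        for i, c, (x, y) in zip(ids, labels, coords)
    ]

# Example with 200 fake 512-d embeddings standing in for real audio vectors.
fake = np.random.rand(200, 512).astype(np.float32)
records = cluster_with_coords(fake, [f"audio_{k}" for k in range(200)])
```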

@Chaithanya512
Contributor

Chaithanya512 commented Jun 26, 2024

Weekly Learnings and Updates:

Week 1:

  • Set up Feluda and ran tests for the AudioVec operator.

Week 2:

  • Collected a dataset of 314 audio files.

  • Established a baseline evaluation for the performance of AudioVec Operator on the dataset using a visual clustering plot.
    (attached plot: env_data_audiovec)

  • Tried out OpenL3 embeddings model.
    (attached plot: env_data_openl3)

Colab file: https://colab.research.google.com/drive/1lBrWCyUsuCSTOEUUqDwfc6FzpQWO0ETt?usp=sharing

Week 3:

Week 4:

(results screenshot attached)

Week 5:

  • Exam break

Week 6:

  • Learned about knowledge distillation when I came across the CED and EfficientAT embedding models. It is a technique where a smaller model (also called the student model) is trained to mimic a larger original model (also called the teacher model). This way, the smaller model maintains high performance while being less computationally expensive. (A tiny sketch of the distillation loss follows after this week's notes.)
  • Tested the CED, EfficientAT, and BYOL-A audio embedding models.
  • Explored different pre-processing steps used to handle variable-length audio. Although most of the models used the usual techniques like trimming, chunking, and resampling, the CLAP model used an interesting technique where an audio clip longer than T seconds is downsampled to T seconds, and frames of T seconds are also sliced from the front, middle, and back of the clip. The stack of these four frames is passed on for further processing. The authors claim that this captures both the local and global features of the audio data.
  • The CED model stood out with minimal time and memory usage. This model nearly perfectly clustered the audio clippings into their respective categories. Image below:
    (cluster plot attached)
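For reference, here is a tiny sketch of the standard distillation loss mentioned above (the logits and labels are random placeholders; CED and EfficientAT have their own training recipes, so this only shows the general idea):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix a softened KL term (match the teacher) with the usual cross-entropy label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
```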

Week 7:

  • Examined k-means clustering on the AudioVec, CLAP, AST, and CED models.
  • The CED model effectively clustered audio files into homogeneous groups for categories like 'laughing', 'telugu-politics', 'hen', 'clock_alarm', 'crow', 'assamese-education', 'keyboard_typing', 'water_drops', and 'siren'. Mixed clusters included 'hindi-politics' and 'telugu-politics', various entertainment and politics categories, and 'hen' and 'rooster'. Still, CED's performance was good, making it a strong candidate for the operator.
  • Similarly, CLAP and AST also performed well. See the Colab notebook for more details.
  • Profiled the AudioVec, CLAP, AST, and CED models. Information about the time taken and memory used by these models can be found here.
  • Inspired by the CLAP model's technique for capturing the local and global features of an audio clip, I implemented a functionality that slices the audio into three parts (front, middle, and back) to limit the computational resources needed to embed long audio files (a rough sketch follows after this list).
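A rough sketch of that slicing step (not the exact implementation; the sample rate and the T value below are assumptions):

```python
import numpy as np
import librosa

def three_slices(path: str, sr: int = 16000, t_sec: float = 10.0):
    """Return front / middle / back segments of t_sec seconds each from one audio file."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    win = int(t_sec * sr)
    if len(y) <= win:                       # short clip: pad once and use as-is
        return [np.pad(y, (0, win - len(y)))]
    mid_start = (len(y) - win) // 2
    return [y[:win], y[mid_start:mid_start + win], y[-win:]]

# Each slice is embedded separately and the vectors are pooled (e.g. averaged),
# which keeps RAM/CPU roughly constant regardless of the original file length.
```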

Week 8:

  • Prepared a custom dataset in the Indian context covering four languages (Assamese, Bengali, Kannada, Telugu) across genres like education, entertainment, politics, devotional content, business, and motivation.
  • Tested the results on the finalized models, along with the sampling technique mentioned in the CLAP paper.
  • Examined clustering algorithms for the resulting embeddings.

Week 9:

  • Examined various parametric and non-parametric clustering algorithms on the finalized models.
  • Tested the results of the finalized models, and also the Whisper encoder model, on the custom dataset.
  • Discussed possible reasons for the relatively lower performance on the Indian-context dataset; we concluded that the models fail to capture the semantic relations in the audio files.

Week 10:

Week 11:

Week 12 :

  • Migrated the CLAP operator to Hugging Face to reduce unnecessary dependencies.
  • Completed the implementation of the clustering worker (PR: [81] - add worker for media clustering #379).
  • Completed the documentation for the clustering worker and the dimension reduction operator.

@MadhalasaSJ

Hi @dennyabrain, my name is Madhalasa, and I've recently completed my B.E. in AI & ML from RNSIT, Bangalore. I've also done an internship at Infosys Springboard. As a fresher passionate about machine learning, I'm eager to contribute to this project. Is there a preferred method for communicating with the mentors and contributing to this project?
