Michael Rossetti is a data scientist, software developer, and machine learning researcher. He has worked as a polling data analyst for a winning US Presidential campaign, a data analytics director for a Silicon Valley startup, and a technology consultant for the US Government. He teaches courses in data science, computer science, and software development, and conducts research in applied machine learning.
Machine learning researcher with extensive experience in industry and academia. Proficient in supervised methods including regression and classification, and unsupervised methods including dimensionality reduction and clustering. Familiar with reinforcement learning, deep learning, and neural networks, including convolutional and recurrent networks. Experienced in model training, validation, and optimization, including hyperparameter tuning through techniques like grid search with cross-validation. Specializes in natural language processing (NLP), content recommendation systems, and development of novel applications for large language models (LLMs), including retrieval augmented generation (RAG).
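The grid-search-with-cross-validation step mentioned above can be sketched in plain Python. The dataset, the k-NN stand-in model, and the hyperparameter grid are all toy values for illustration, not from any actual project:

```python
import statistics

# Toy 1-D dataset (illustrative): feature value -> binary label.
X = [0.1, 0.2, 0.25, 0.35, 0.4, 0.6, 0.7, 0.8, 0.85, 0.9]
y = [0,   0,   0,    0,    0,   1,   1,   1,   1,    1]

def knn_predict(train_x, train_y, query, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > k else 0

def cv_accuracy(k, n_folds=5):
    """Mean held-out accuracy of k-NN across n_folds folds."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        tr_x = [x for i, x in enumerate(X) if i not in test_idx]
        tr_y = [t for i, t in enumerate(y) if i not in test_idx]
        hits = [knn_predict(tr_x, tr_y, X[i], k) == y[i] for i in test_idx]
        scores.append(sum(hits) / len(hits))
    return statistics.mean(scores)

# Grid search: score each candidate hyperparameter by cross-validation,
# then keep the best-scoring one.
grid = [1, 3, 5]
best_k = max(grid, key=cv_accuracy)
```

In practice this is what scikit-learn's `GridSearchCV` automates; the sketch just makes the evaluate-each-candidate-on-held-out-folds loop explicit.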
Created AI tools in Python to provide automated and impartial assessments of US Presidential debates. Performed retrieval augmented generation using a large language model (Claude from Anthropic). Leveraged AI to identify memorable quotes, evaluate moderator fairness, determine which candidate won the debate, and estimate how likely certain demographic groups are to vote for each candidate.
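The retrieval half of a RAG pipeline like this one can be sketched without the model call. The transcript chunks and the term-overlap scorer below are illustrative stand-ins (a production system would use embedding similarity and an actual LLM call, e.g. to Claude):

```python
# Minimal sketch of the retrieval step in retrieval augmented generation.
# The real pipeline sends the assembled prompt to an LLM; here the model
# call is left out so the retrieval logic stands on its own.

def tokenize(text):
    return [w.strip(".,?!\"'").lower() for w in text.split()]

def score(query, chunk):
    """Simple term-overlap relevance score between a query and a chunk."""
    return len(set(tokenize(query)) & set(tokenize(chunk)))

def retrieve(query, chunks, top_n=2):
    """Return the top_n most relevant transcript chunks for the query."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_n]

def build_prompt(query, chunks):
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Use only the excerpts below to answer.\n\n{context}\n\nQuestion: {query}"

# Hypothetical transcript chunks and a question about moderator fairness:
transcript = [
    "The moderator asked both candidates an equal number of questions.",
    "Candidate A discussed the economy and inflation at length.",
    "Candidate B was interrupted twice during the foreign policy segment.",
]
prompt = build_prompt("Was the moderator fair to both candidates?", transcript)
# `prompt` would then be sent to the LLM.
```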
Created AI tools in Python to automate the grading of student homework documents. Performed retrieval augmented generation using large language models (ChatGPT from OpenAI, and LLaMA from Meta). Used prompt engineering to improve the agent’s grading performance. Validated the proof of concept.
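The prompt-engineering side of a grading agent largely comes down to iterating on a prompt template. The rubric and wording below are hypothetical, shown only to illustrate the shape of such a template:

```python
# Sketch of rubric-driven prompt construction (hypothetical rubric and
# wording). The resulting prompt, plus the student's document, goes to the
# LLM for grading; refining this template is the prompt-engineering loop.

RUBRIC = {
    "correctness": "Does the submission produce the required outputs?",
    "style": "Is the code readable and idiomatic?",
    "documentation": "Are functions and decisions explained?",
}

def build_grading_prompt(assignment, submission_text):
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    return (
        f"You are a strict but fair grader for the assignment: {assignment}.\n"
        f"Grade the submission on each criterion from 0 to 10:\n{criteria}\n\n"
        f"Submission:\n{submission_text}\n\n"
        "Respond with one line per criterion: '<criterion>: <score> - <reason>'."
    )

prompt = build_grading_prompt("Homework 3", "def add(a, b): return a + b")
```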
Classified users in social networks based on the content of their posts. Trained classification models (Logistic Regression, Random Forest, and XGBoost) to predict whether or not a given user is an automated "bot". Achieved a 95% F1 score and a 98% ROC-AUC score on test data. Compared results across different text embedding methods (TF-IDF, Word2Vec, and embedding models from OpenAI).
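The TF-IDF embedding step can be sketched in plain Python. The example posts are made up, and a 1-nearest-neighbor rule stands in for the actual classifiers (Logistic Regression, Random Forest, XGBoost):

```python
import math
from collections import Counter

# Hypothetical labeled posts for illustration only.
docs = [
    ("buy followers now cheap promo link",         "bot"),
    ("free promo click link buy now",              "bot"),
    ("had a great time hiking with friends today", "human"),
    ("my dog learned a new trick this weekend",    "human"),
]

def tfidf(corpus):
    """Return one sparse TF-IDF weight dict per document."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc.split()))
    vectors = []
    for doc in corpus:
        words = doc.split()
        tf = Counter(words)
        vectors.append({w: (c / len(words)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = tfidf([text for text, _ in docs])

def classify(post):
    """Label a new post by its most similar training post (1-NN on cosine)."""
    vec = tfidf([text for text, _ in docs] + [post])[-1]
    best = max(range(len(docs)), key=lambda i: cosine(vec, vectors[i]))
    return docs[best][1]
```

In practice scikit-learn's `TfidfVectorizer` handles the vectorization; the point here is that each post becomes a weighted bag-of-words vector before any model sees it.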
Assessed the similarity of hashtags in a given Twitter discussion, based on hashtag co-occurrence in user profiles, to identify and monitor disinformation-related content on social networks. Applied dimensionality reduction methods (PCA, t-SNE, and UMAP) and clustering methods (HDBSCAN) to identify groups of related hashtags, including obscure hashtags associated with disinformation campaigns.
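The co-occurrence representation behind this can be sketched directly: each hashtag is described by the other hashtags it appears alongside in profiles. The profiles below are hypothetical, and the sketch stops at pairwise similarity (the real project fed such vectors into PCA/t-SNE/UMAP and HDBSCAN):

```python
import math
from collections import defaultdict

# Hypothetical user profiles, each a set of hashtags.
profiles = [
    {"#news", "#politics", "#truth"},
    {"#news", "#politics"},
    {"#truth", "#wakeup"},
    {"#cats", "#pets"},
]

def cooccurrence_vectors(profile_sets):
    """Map each hashtag to counts of the hashtags it co-occurs with."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tags in profile_sets:
        for a in tags:
            for b in tags:
                if a != b:
                    vectors[a][b] += 1
    return vectors

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = cooccurrence_vectors(profiles)
# "#news" and "#politics" share co-occurrence context, so their vectors are
# similar; "#cats" shares none with "#news".
```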
Assessed the similarity of artists and songs, for music recommendation purposes. Wrote Python code to download audio files from YouTube and extract audio features such as tempo. Applied dimensionality reduction methods (PCA, t-SNE, and UMAP) to identify related artists and songs.
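The dimensionality-reduction step can be sketched with plain PCA (t-SNE and UMAP need third-party libraries). Rows are songs, columns are audio features such as tempo; the feature values are made up for illustration:

```python
import numpy as np

# Hypothetical audio features: tempo (BPM), loudness (dB), energy.
features = np.array([
    [120.0,  -5.0, 0.80],  # song A
    [122.0,  -5.5, 0.82],  # song B: close to A, should project nearby
    [ 70.0, -20.0, 0.20],  # song C: slow and quiet, far from A and B
])

def pca(X, n_components=2):
    """Project X onto its top principal components."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return centered @ top

embedding = pca(features)
# Distances in the reduced space reflect similarity: songs A and B land
# close together, song C far away.
```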
Analyzed the role of automated accounts called "bots" in spreading disinformation across social networks. Developed Python scripts to extract hundreds of millions of data points from the Twitter API. Architected Google BigQuery databases and ETL pipelines to store large-scale data. Wrote SQL queries and Python scripts to perform data analysis and conduct statistical tests. Trained, evaluated, and deployed natural language processing models to classify a user's political sentiments based on their social media posts. Achieved 88% accuracy on test data using benchmark models (Logistic Regression and Naive Bayes), and 96% accuracy using BERT.
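The Naive Bayes benchmark can be sketched in plain Python (BERT and the data pipeline are out of scope here). The training posts and labels below are hypothetical; the real models were trained on labeled social media data:

```python
import math
from collections import Counter

# Hypothetical labeled posts for a two-class political sentiment task.
train = [
    ("lower taxes and strong borders",  "right"),
    ("protect the second amendment",    "right"),
    ("healthcare is a human right",     "left"),
    ("climate action and equality now", "left"),
]

def fit(data):
    """Per-class word counts and class priors for multinomial Naive Bayes."""
    counts = {label: Counter() for _, label in data}
    priors = Counter(label for _, label in data)
    for text, label in data:
        counts[label].update(text.split())
    return counts, priors

def predict(text, counts, priors):
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, -math.inf
    for label, prior in priors.items():
        total = sum(counts[label].values())
        lp = math.log(prior / sum(priors.values()))
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out a class.
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

counts, priors = fit(train)
```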