Portfolio

Analyzing Civil Servants data using Unsupervised Learning techniques

This report applies unsupervised learning, specifically Principal Component Analysis (PCA) and k-means clustering, to the annual declarations of personal interests and investments filed by Mongolian civil servants[1]. The dataset, originally scraped from a dynamic web platform and collected by the Mongolian Data Club for investigative journalism training, covers 2016 to 2021; this analysis uses the 2021 data. PCA reduced the dimensionality of the dataset, with the first 16 components capturing approximately 81.64% of the total variance. K-means clustering was then applied to the principal components to identify distinct clusters. The aim was to uncover underlying trends and group similar declarations, a quantitative analysis that has not previously been attempted on this dataset.
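The PCA-then-k-means pipeline described above can be sketched roughly as follows. This is a minimal NumPy illustration on synthetic data, not the report's actual code: the function names `pca_fit` and `kmeans`, the 0.80 variance target, the choice of three clusters, and the random input matrix are all assumptions made for the example.

```python
import numpy as np


def pca_fit(X, variance_target=0.80):
    """Project X onto the fewest principal components whose cumulative
    explained variance reaches variance_target (hypothetical helper)."""
    # Standardize columns so no single feature dominates the variance.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # SVD of the standardized data: rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    cumulative = np.cumsum(explained)
    # Smallest number of components whose cumulative share meets the target.
    k = min(int(np.searchsorted(cumulative, variance_target)) + 1, len(S))
    return Z @ Vt[:k].T, cumulative[:k]


def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Synthetic stand-in for the declarations data (assumption: 200 rows, 10 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))

scores, cumulative = pca_fit(X, variance_target=0.80)
labels, centers = kmeans(scores, n_clusters=3)
print(f"{scores.shape[1]} components capture {cumulative[-1]:.2%} of the variance")
```

Clustering on the component scores rather than the raw features keeps the distance computation in a lower-dimensional space, which is the usual motivation for running PCA first.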

Analyzing Youth Substance Use Based on Family, Religious, and Education Background

This study uses the National Survey on Drug Use and Health (NSDUH) to explore different aspects of youth drug use with decision trees and ensemble methods. The project addresses three prediction tasks: whether an individual uses alcohol (binary classification), how frequently they used alcohol over the past year (multi-class classification), and the age at which they first consumed alcohol (regression). Several variables proved to be significant predictors across the models: educational achievement (EDUSCHGRD2), demographic factors such as race (NEWRACE2) and income level (INCOME), lifetime marijuana use (YFLMJMO), and alcohol consumption standards (STNDALC). The gradient boosting model for binary classification achieved an accuracy of 83.12%. The multi-class classification and regression models performed relatively poorly but still offered some insight into the factors related to youth substance abuse. These findings highlight the role of both socio-demographic variables and substance use history in predicting drug use, point to the importance of family background and education in mitigating risky behaviors, and provide insights for improving preventive strategies.

Analyzing and Predicting Dwelling Occupancy in Washington State

The aim of this study is to predict whether a dwelling is occupied by owners or renters, based on features describing individual demographics and housing characteristics. We used Support Vector Machines (SVM) to classify the dataset, and our models achieved up to 82% accuracy. Age and number of bedrooms were the most significant predictors of whether a dwelling is owned or rented, followed by predictors such as average household income and cost of utilities. Among the models we built, the linear kernel is recommended for its robustness and simplicity. The findings of this study provide a thorough analysis of the factors influencing dwelling occupancy and uncover deeper patterns that will be useful to real estate professionals in understanding housing trends.
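To illustrate what a linear-kernel SVM does, the sketch below trains a soft-margin linear SVM by subgradient descent on the hinge loss, using NumPy only. It is not the study's implementation: the helper names, the hyperparameters, and the synthetic two-feature data (standing in for standardized age and number of bedrooms) are assumptions for illustration.

```python
import numpy as np


def train_linear_svm(X, y, C=1.0, lr=0.1, epochs=200):
    """Minimize 0.5*||w||^2 + C * mean(hinge loss) by subgradient descent.
    Labels in y must be -1 or +1 (hypothetical helper)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1  # points inside the margin or misclassified
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0) / n
        grad_b = -C * y[violated].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return np.where(X @ w + b >= 0, 1, -1)


# Synthetic data (assumption): +1 = owner-occupied, -1 = renter-occupied.
rng = np.random.default_rng(0)
n_per_class = 200
owners = rng.normal(loc=1.0, scale=0.5, size=(n_per_class, 2))
renters = rng.normal(loc=-1.0, scale=0.5, size=(n_per_class, 2))
X = np.vstack([owners, renters])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# Shuffle, then hold out the last 100 rows for evaluation.
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

w, b = train_linear_svm(X_train, y_train)
accuracy = float(np.mean(predict(X_test, w, b) == y_test))
print(f"held-out accuracy: {accuracy:.2%}")
```

With a linear kernel the learned weights `w` can be read directly as per-feature contributions to the decision, which is one reason a linear model is often preferred when simplicity matters.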

Exploring The Sound of Seattle Birds with Neural Networks

In this report, neural networks are employed to identify bird species found in Seattle. We used spectrograms derived from Xeno-Canto’s Birdcall competition dataset, which consists of 10 high-quality MP3 sound clips for each of the 12 selected bird species[1]. Three additional MP3 bird call recordings were available for external testing. The primary goal is to classify bird species based on their distinct vocalizations. We developed two custom neural network models: a binary classification model that distinguishes between the American Crow and the Blue Jay, and a multi-class classification model capable of identifying any of the 12 bird species. Predictions on the three external test clips are used to assess the effectiveness of the models. Neural networks with different architectures and parameters were compared to find the most efficient model. For hidden layers, various activation functions such as ReLU, softmax, and Leaky ReLU were used. The report concludes with a discussion of alternative modeling approaches and the suitability of neural networks for this specific application.
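For reference, the activation functions named above can be written in a few lines of NumPy. The sketch below also runs one forward pass through a single-hidden-layer network with a 12-way output. The layer sizes, the random weights, and the placement of softmax at the output layer (its conventional position for multi-class classification) are assumptions for illustration, not the report's architecture.

```python
import numpy as np


def relu(x):
    """Zero out negative values, pass positive values through."""
    return np.maximum(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha."""
    return np.where(x > 0, x, alpha * x)


def softmax(logits):
    """Turn each row of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def forward(x, W1, b1, W2, b2, hidden_activation=relu):
    """One hidden layer followed by a softmax output layer."""
    hidden = hidden_activation(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)


# Hypothetical sizes: 64 input features per clip, 32 hidden units, 12 species.
rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 64, 32, 12
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

batch = rng.normal(size=(4, n_features))  # stand-in for flattened spectrograms
probs = forward(batch, W1, b1, W2, b2, hidden_activation=leaky_relu)
predicted_class = probs.argmax(axis=1)
```

Swapping `hidden_activation` between `relu` and `leaky_relu` is the kind of architectural variation the comparison above refers to.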

Blood Bank Management System (BBMS)

The Blood Bank Management System (BBMS) is an application designed to facilitate the querying and visualization of blood bank data. It provides a user-friendly interface to execute predefined SQL queries, access donor information, manage blood inventory, and visualize data analytics through integrated Tableau visualizations.
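One way to picture "predefined SQL queries" is a small registry of named, parameterized statements that the interface can run on request. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The table names, columns, query names, and sample rows are all hypothetical and are not taken from the BBMS itself.

```python
import sqlite3

# Hypothetical predefined queries, keyed by a short name the UI could expose.
PREDEFINED_QUERIES = {
    "donors_by_blood_type": (
        "SELECT name FROM donors WHERE blood_type = ? ORDER BY name"
    ),
    "units_in_stock": (
        "SELECT blood_type, SUM(units) FROM inventory "
        "GROUP BY blood_type ORDER BY blood_type"
    ),
}


def run_query(conn, name, params=()):
    """Look up a predefined query by name and execute it with bound parameters."""
    sql = PREDEFINED_QUERIES[name]
    return conn.execute(sql, params).fetchall()


def build_demo_db():
    """Create an in-memory database with an assumed schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE donors (id INTEGER PRIMARY KEY, name TEXT, blood_type TEXT);
        CREATE TABLE inventory (id INTEGER PRIMARY KEY, blood_type TEXT, units INTEGER);
        INSERT INTO donors (name, blood_type)
            VALUES ('Ana', 'O+'), ('Ben', 'A-'), ('Cy', 'O+');
        INSERT INTO inventory (blood_type, units)
            VALUES ('O+', 4), ('A-', 2), ('O+', 3);
        """
    )
    return conn


conn = build_demo_db()
o_positive_donors = run_query(conn, "donors_by_blood_type", ("O+",))
stock = run_query(conn, "units_in_stock")
```

Keeping the SQL in a fixed registry and passing user input only as bound parameters means the interface never builds query strings from user text.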