Big Data Computing Projects
This post highlights a series of big data analytics and machine learning projects I developed using Apache Spark and PySpark. I originally built the NBA Shot Log and NYC Parking Violations projects in Hadoop and later translated them to Spark. Together, these projects demonstrate my ability to process large datasets, build scalable machine learning models, and extract actionable insights:
Toxic Comment Classification
Preprocessed text data using Spark MLlib’s pipeline, transforming raw comments into sparse TF-IDF vectors. Applied logistic regression and random forest models to detect toxic language at scale.
Heart Disease Prediction
Developed a binary classification model in PySpark using logistic regression to predict the likelihood of heart disease based on patient medical data.
Census Income Classification
Trained and evaluated a logistic regression model in Spark to predict whether an individual earns over $50K annually, using demographic data from the UCI Adult dataset.
NYC Parking Violations Analysis
Processed New York City parking violation data in PySpark to identify the specific date (month and day) with the highest number of recorded violations.
NBA Shot Log: Comfortable Zone Analysis
Applied KMeans clustering in PySpark to analyze NBA shot log data and identify players’ most successful shooting zones on the court.
You can view the source code and explore these projects in more detail on my GitHub!