Project Showcase

The collection.

Deep Learning NSFW Image Classifier

Deep Learning | Computer Vision

Learn More

SMS Spam Detection Classifier

NLP | Machine Learning | Text Classification

Learn More

IMDB Movie Review Sentiment Classifier

NLP | Deep Learning |  Sentiment Analysis | Text Classification

Learn More

Reddit Post Authorship & Behavioral Analysis

NLP | Reddit API | Data Visualization | Text Classification 

Learn More

Salmon Weight Prediction with Support Vector Regression

Machine Learning | Regression | Data Cleaning

Learn More

Deep Learning NSFW Image Classifier

Project Type: Image classification using TensorFlow and deep learning
Type: Deep Learning | Computer Vision
Tools: Python | TensorFlow | Keras | OpenCV | NumPy | Reddit API | Jupyter | GitHub

Overview:
Developed a deep learning model to detect and classify unsolicited explicit images (commonly known as “dick pics”) to promote safer digital environments. The model distinguishes between explicit and non-explicit content with 99.26% accuracy using a custom-built image dataset.

Key Technologies:

  • Languages & Frameworks: Python, TensorFlow, Keras, Scikit-learn
  • Libraries & Tools: OpenCV (image processing), Pandas, NumPy, Matplotlib, PRAW (Reddit API), Jupyter Notebook, VS Code
  • OS & Virtualization: Windows 10/11, Linux Mint (via VMware), Anaconda Navigator
  • Version Control: GitHub

Process Highlights:

  • Data Collection & Preprocessing:
    • Manually gathered and web-scraped over 250K unique images
    • Removed duplicates, resized images, maintained aspect ratios, labeled data
  • Model Architecture:
    • 5 Conv2D layersFlatten6 Dense layers
    • Final output layer uses sigmoid for binary classification
    • Trained using binary cross-entropy loss, Adam optimizer
  • Training & Evaluation:
    • Dataset split: 50% training, 25% validation, 25% testing
    • Batch size: 128, Epochs: 35
    • Achieved 99.26% accuracy and 0.0213 loss on test set (56K+ samples)

Impact:
Successfully built a high-accuracy model capable of flagging explicit content, which could be integrated into social media platforms or messaging apps to improve user safety and content moderation.

Gallery

Placeholder Image
Placeholder Image
Placeholder Image
Placeholder Image
Placeholder Image

    SMS Spam Detection Classifier

    Project Type: Image classification using TensorFlow and deep learning
    Type: NLP | Machine Learning | Text Classification
    Tools: Python | Scikit-learn | TF-IDF | Naive Bayes | Logistic Regression

    GitHub: https://github.com/birdmoney11/portfolio-samples/tree/main/NLP%20SMS

    Overview:

    Developed a natural language processing (NLP) pipeline to automatically classify SMS messages as spam or ham (not spam) using classic machine learning models and vectorization techniques. Achieved a 97% test accuracy using a Naive Bayes classifier, and cross-verified with a Logistic Regression model.

    Problem & Motivation

    With SMS marketing growing at over 20% CAGR worldwide, filtering spam is increasingly critical. This project aims to implement a scalable spam detection system based on publicly available SMS datasets and common NLP techniques.

    Dataset

    • Source: UCI SMS Spam Collection
    • Samples: 5,574 SMS messages (747 spam / 4,827 ham)
    • Languages: English (UK & Singapore regional differences noted)
    • Challenge: Linguistic and vocabulary variation created natural biases in the dataset, which were mitigated through preprocessing and normalization.

    Tools & Technologies

    • Languages: Python
    • Libraries: Scikit-learn, NLTK, Pandas, NumPy, Seaborn, Matplotlib
    • Models: Multinomial Naive Bayes, Logistic Regression
    • Vectorization: Bag of Words, TF-IDF, 1–3 gram range
    • Preprocessing: HTML unescape, punctuation/stopword removal, lemmatization (WordNet)
    • Evaluation: Confusion Matrix, Classification Report, Heatmaps

    Implementation Workflow

    1. Data Cleaning: HTML decoding, noise removal, lowercasing
    2. Tokenization & Lemmatization: Preprocessed into root words for better semantic matching
    3. Feature Extraction:
      • CountVectorizer (BoW)
      • TF-IDF with unigrams, bigrams, and trigrams (max 2,500 features)
    4. Model Training & Testing:
      • Stratified 75/25 train-test split
      • Applied both Naive Bayes and Logistic Regression for comparative analysis

    Results

    Naive Bayes Classifier:

    • Test Accuracy: 97%
    • Spam Precision: 0.77
    • Spam Recall: 0.98
    • F1-Score: 0.86

    Logistic Regression:

    • Test Accuracy: 96%
    • Slightly lower spam precision than Naive Bayes

    Both models performed well, but Naive Bayes showed higher robustness in detecting spam with fewer false negatives—critical for spam detection applications.

     

    IMDB Movie Review Sentiment Classifier

    Project Type: Sentiment analysis using TensorFlow and deep learning
    Type: NLP | Deep Learning |  Sentiment Analysis | Text Classification
    Tools: Python | TensorFlow | Keras 

    GitHubhttps://github.com/birdmoney11/portfolio-samples/tree/main/ML%20IMDB

    Overview:

    A binary classification project using deep learning to determine sentiment polarity (positive or negative) of movie reviews from the IMDB dataset.

     Problem & Dataset

    • Task: Classify movie reviews as either positive (≥7/10) or negative (≤4/10) to avoid ambiguity from neutral scores.
    • Dataset: IMDB dataset from keras.datasets, with 50,000 pre-tokenized reviews, split evenly into training and test sets. Each review is encoded using a Bag-of-Words (BoW) technique.

     Model Architecture

    • Built with Keras Sequential API, the model comprises:
      • Embedding layer for word vector representation
      • GlobalAveragePooling1D to flatten embeddings
      • Dense layers with ReLU activation
      • Sigmoid output for binary classification

    Tools & Technologies

    • Languages: Python
    • Libraries: TensorFlow, Keras, NumPy, Pandas, Matplotlib, Seaborn
    • IDE: Jupyter Notebook, VS Code
    • Evaluation Tools: Confusion Matrix, Classification Report (precision, recall, f1-score)

    Evaluation & Results

    • Accuracy: 87.52%
    • Loss: 0.3723
    • Metrics:
      • Positive review recall: 0.90
      • Negative review precision: 0.89
      • Balanced F1-scores (~0.88) indicate strong performance for both classes

    Key Features & Highlights

    • Applied Hold-Out Validation to split data into training and validation subsets.
    • Used classification metrics and visualization (confusion matrix, precision-recall plots) for in-depth model evaluation.
    • Demonstrated ability to preprocess data, build text pipelines, and tune neural networks.

    Reddit Post Authorship & Behavioral Analysis

    Project Type: NLP and Data Visualization
    Type: NLP | Reddit API | Data Visualization | Text Classification 
    Tools: Python | Pandas | Matplotlib | Seaborn

    GitHubhttps://github.com/birdmoney11/portfolio-samples/tree/main/DS%20Reddit%20post

    Project Objective

    Analyzed the writing style, vocabulary, and activity patterns of two Reddit users—Shittymorph (SM) and GuyWithFacts (GWF)—to evaluate the hypothesis: Are these accounts operated by the same person?
    This project combines natural language processing (NLP), behavioral analysis, and custom visualization to explore authorship attribution in real-world online forums.

    Tools & Technologies

    • Languages & Libraries: Python, Pandas, NumPy, Seaborn, Matplotlib
    • APIs: PRAW (Python Reddit API Wrapper)
    • NLP Techniques: Tokenization, stopword removal, custom regex filtering
    • IDE: Jupyter Notebook

    Data Collection & Preprocessing

    • Fetched Reddit comments using PRAW for both SM and GWF accounts
    • Removed repeated quotes and “copypasta” (e.g., Undertaker meme lines) using regex and str.replace()
    • Applied lowercasing, punctuation removal, and stopword filtering
    • Created separate labeled datasets for side-by-side analysis

    Feature Engineering & Analysis

    • Word Cloud Visualizations: Captured vocabulary richness and dominant word themes
    • Comment Length Histograms: Showed SM preferred short posts (peaking at 50–100 words), while GWF leaned toward long-form replies (100–300+ words)
    • Hourly Activity Heatmaps: Compared time-of-day posting patterns across users
    • Lexical Analysis: Identified differences in tone, sentiment, and topic focus

    Key Results

    • Distinct Posting Behavior: SM used more informal, meme-driven language; GWF’s posts were longer and more introspective
    • Unique Vocabularies: Clear divergence in top-used words—no major overlap
    • Time-of-Day Activity: Posting hours differed, reinforcing behavioral distinctions

    Conclusion

    The evidence strongly suggests SM and GWF are separate individuals. This analysis successfully applied NLP, real-world API data scraping, and visual storytelling to approach a forensic linguistics-style question.

    Skills Demonstrated

    • Natural Language Processing (NLP)
    • Data Wrangling & Cleaning
    • API Integration (PRAW)
    • Custom Text Preprocessing with Regex
    • Data Visualization & Exploratory Analysis
    • Behavioral Pattern Recognition

    Salmon Weight Prediction with Support Vector Regression

    Project Type: Sentiment analysis using TensorFlow and deep learning
    Type: Machine Learning | Regression | Data Cleaning
    Tools: Python | Scikit-learn | NumPy | Pandas | Matplotlib | Seaborn

    GitHub: https://github.com/birdmoney11/portfolio-samples/tree/main/DS%20Salmon%20size

    Project Overview

    This project builds a regression-based machine learning model to predict the weight of Pacific salmon based solely on their length, enabling potential real-time applications in ecological monitoring and AI-powered species recognition systems.

    Dataset

    • Source: Alaska Salmon Survey data (via Kaggle)
    • Size: ~14 million records (filtered to ~677,000 valid samples)
    • Species: Chinook, Chum, Coho, Pink, Sockeye
    • Key Features: Length (input), Weight (target variable)
    • Use Case: Simulates AI applications in wildlife management like Norway’s invasive species detection system

    Methodology

    1. Data Cleaning
      • Removed nulls, duplicates, and inconsistent records
      • Focused on biologically valid combinations (e.g., Chinook species with realistic weight-length pairs)
    2. Exploratory Data Analysis
      • Found correlations between age, gender, and weight
      • Verified statistical distributions across species
    3. Modeling: Support Vector Regression (SVR)
      • R² Score
      • Mean Absolute Error (MAE)
      • Root Mean Square Error (RMSE)
      • Used scikit-learn's SVR() with RBF kernel
      • Input: Salmon length
      • Target: Weight
      • Train/test split: 80/20
      • Performance evaluated using:
    4. Visualization
      • Created 2x2 subplots showing prediction vs. actual values
      • Compared multiple models (SVR #1 vs SVR #2) for robustness across different data subsets

    Results & Insights

    • SVR predicted weight values that closely followed the actual regression curve
    • Model #2 performed better at higher-length values
    • Key Insight: With high-quality, cleaned data, simple ML models like SVR can yield highly accurate predictions
    • Limitation: Highly processed data may inflate apparent accuracy—future work could explore raw data performance

    Real-World Relevance

    • Enables non-invasive, real-time predictions of fish weight using only visual input
    • Could power camera-based AI systems to assist in species detection, invasive species filtering, or population studies
    • Highlights use of ML in marine biology, sustainability, and automation

    Tools & Technologies

    • Python, NumPy, Pandas, Matplotlib, Seaborn
    • Scikit-learn (SVR, model selection, metrics)
    • Jupyter Notebook, Visual Studio Code