Fake Job Posting Detector

Overview

A classifier that flags fraudulent job postings, built as a portfolio piece to practice the full ML lifecycle: data exploration, feature engineering, model selection, evaluation, and deployment as an interactive demo.

The dataset is the EMSCAD corpus of roughly 18,000 real job postings, about 5% of which are confirmed fraudulent. This class imbalance is the central challenge.
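
To make the imbalance concrete, here is a small worked example using only the round figures quoted above (18,000 postings, 5% fraud). The helper name precision_recall_f1 is made up for illustration and is not taken from the project code. It shows why plain accuracy is a poor yardstick here: a model that never flags anything still looks accurate.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Round numbers from the overview, not the exact dataset counts.
n_total = 18_000
n_fraud = 900  # 5% of 18,000

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # trivial baseline: call every posting legitimate

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.95 recall=0.00 f1=0.00
```

The baseline scores 95% accuracy while catching no fraud at all, which is why a fraud-class metric such as F1 is the more informative number.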

Approach

Started with simple TF-IDF features on the job description text plus structured features (location, has_company_logo, employment_type). Compared logistic regression, random forest, and gradient boosting baselines.
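
A minimal sketch of how such a feature set and baseline comparison might be wired up with scikit-learn and pandas. This is an illustration, not the project's actual code: the column names description and fraudulent, the build_pipeline helper, the classifier settings, and the toy rows are all assumptions. Only location, has_company_logo, and employment_type are named in the write-up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_pipeline(classifier):
    """Combine TF-IDF text features with structured features, then classify."""
    features = ColumnTransformer(
        transformers=[
            # TfidfVectorizer needs a 1-D column of strings, so the column
            # is given as a bare name rather than a list.
            ("text", TfidfVectorizer(), "description"),
            # Categorical columns are one-hot encoded; unseen values are ignored.
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["location", "employment_type"]),
            # The 0/1 logo flag is passed through unchanged.
            ("flag", "passthrough", ["has_company_logo"]),
        ]
    )
    return Pipeline([("features", features), ("clf", classifier)])


# Hypothetical baseline settings, untuned.
baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Made-up rows that only demonstrate the expected column layout.
toy = pd.DataFrame(
    {
        "description": [
            "Earn money fast from home, no experience needed",
            "Senior backend engineer for a payments platform",
            "Wire transfer assistant, send bank details today",
            "Registered nurse for night shifts in a city hospital",
            "Data entry clerk, pay an upfront fee to start",
            "Marketing manager for a consumer electronics brand",
        ],
        "location": ["US", "GB", "US", "DE", "US", "GB"],
        "employment_type": ["Part-time", "Full-time", "Part-time",
                            "Full-time", "Part-time", "Full-time"],
        "has_company_logo": [0, 1, 0, 1, 0, 1],
        "fraudulent": [1, 0, 1, 0, 1, 0],
    }
)
X_toy = toy.drop(columns="fraudulent")
y_toy = toy["fraudulent"]

# Fit every baseline through the same feature pipeline.
fitted = {name: build_pipeline(clf).fit(X_toy, y_toy) for name, clf in baselines.items()}
```

Keeping the feature step inside the pipeline means each model sees identical inputs, so differences in score come from the classifier rather than the preprocessing. On the real data, missing values in the categorical columns would likely need filling first.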

After hyperparameter tuning and SMOTE oversampling to address the class imbalance, the gradient boosting model achieved an F1 of 0.78 on the fraud class of the held-out set. That matters in a domain where false negatives (fraudulent postings that slip through) cost users.
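
For readers unfamiliar with SMOTE, the sketch below shows the core idea in plain Python: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. It is a simplified, SMOTE-style illustration and not the project's implementation, which most likely relied on a library. The function name smote_oversample and its parameters are made up for this example.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority samples to the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D fraud samples standing in for rows of the training split.
train_fraud = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(train_fraud, n_synthetic=6, k=2)
```

Oversampling of this kind is normally applied to the training split only, so that the held-out score reflects the real class balance. Synthetic points in the evaluation data would inflate the fraud-class metrics.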

Stack