I am partnering with Malaysia's Ministry of Tourism to refine their country's pet adoption process. "Will a stray be adopted or not?" is the driving question. An advanced decision tree model is used as the binary classification model predictor. This model has ~81% (precsion) predicting if a Malaysian stray (cat or dog) will be adopted.
Real world data was imported from PetFinder.my (via Kaggle). The training dataset had over 11000 entries with about 19 attributes descrbing the animal. It will surely help protect Malaysian states health (locals and tourists) while making the adoption process more cost and labor efficient.
Malaysian Strays, 2021
Poltical Animals,2021
VetFuturist,Unknown
Malaysia's Ministry of Tourism is partnering with the local pet adoption agencies to minmize stray animals in the country. They spend about RM10.3 ($2.3Million) managing pounds and euthanization cost.
The ministry cares about correctly predicting adoptable animals well. The adoptable animals will be prioritized in a shelter or pound. It is not determined what will happen with ther other animals. The Ministry of Tourism has another phase to find the best solutions that utilize the animals and minimizes Euthanasia.
Ministry of Tourism (Main Stakeholder)
- Want to understand which animals are the most adoptable. (Phase 1:Current Model)
- Their goal is to improve the safety, attraction of area and soothe political upheival about all the strays.
Adoption Agency (Secondary Stakeholder) - Minimize Euthenasia and maximize holding animals as long as possible
Pet Finder supplied data for about 19,000 adoption entries for dogs and cats in each of Malaysias states.
Kaggle PetFinder, 2018
- Initial Target Classes Distributions:
- One Day Adoption: 410
- One Week Adoption: 3090
- One Month Adoption: 3259
- Two/Three Month Adoption: 4037
- No Adoption: 4197
- Modified Target Classes Distributions:
- Adopted: 10796
- Not Adopted: 4197
Chosen Metrics:
- Primarily want Precision to maximize TP and minimize FP of adoptees.
- Secondary we want Recall and F1 to minimize FN and because of imbalance.
F1 Score Metric, Joos Kortanje, 2021
- Dummy Model:
- Baseline:
- Logistic Regression
- KNN
- XGBoost
- Randon Forest
- Neural Net
- Pipeline (Scaled and Smoted)and GridsearchCV
- Logistic Regression
- KNN
- XGBoost
- Randon Forest
- Neural Net
XGBoost with GridsearchCV Tuning
- StandardScaler()
- SMOTE (random_state=42, sampling_strategy= Minority)
- XGBClassifier (criterion='entropy',max_depth=6, learning_rate=0.01,n_estimators=120,gamma=3,random_state=42)
Results Summary
- They both have an average of ~13% FP rate from the Confusion Matrix. This is a decreased FP from baseline.
- Test precision metrics of ~81%.
- The model performs better (87%) than the base Dummy Classifier (~72%)
- The best modeol based on performance and FP and FN percentage. Slightly better than the Random Forest model.
The top 5 featureswere:
- Age
- Breed1
- Color1
- PhotoAmt
- Gender
- The top two features are the same as the results from the initial correlation
- PhotoAmt and Color were initial weakly correlated to the Adoption Speed
- The Breed would help understand the animal without the Type explicitly known
- Knowing the photoAmt may help us to know that animals with photos and possibly multiple may have an easier time being adopted
- Deeper explantion of attributes
- Target Class imbalance
- Synthetic Data used
- HW Resources: Hyperparamter tuning resource intensive
- Time for more hypertuning
- Analysis on numerical tabular attributes only
- Use model with an experienced rescuer
- Prioritize adoption form attributes
- The feature importance shows with are the driving factors to understand if a animal will be adopted
- Plan cost with ~16% margin
- Acquire photos of all rescues
- Photo Analysis ( Image Classification)
- Textual Analysis (Natural Language Processing )
- Recommend specific attributes to rescue first
Repository Organization
- .gitignore
- License
- data
- images
- reproducibility
- notebook_capstone.ipynb
- notebook.capston.pdf
- presentation_capston.pdf



