diff --git a/Readme.md b/Readme.md
index b2a3fbd..1b7020f 100644
--- a/Readme.md
+++ b/Readme.md
@@ -261,3 +261,29 @@ curl -X POST http://127.0.0.1:8000/predict \
   -d '{"url":"http://ex.com/login?acct=12345","p_malicious":0.45}'
 curl http://127.0.0.1:8000/stats
 ```
+
+---
+### Latest additions (to be organized)
+---
+## Model Performance
+
+**Validation Metrics (PhiUSIIL Dataset):**
+- PR-AUC (phishing detection): **99.92%**
+- F1-Macro: **99.70%**
+- Brier Score: **0.0026**
+- False Positive Rate: **0.09%** (23/26,970 legitimate URLs)
+
+**Feature Set (8 features):**
+- IsHTTPS, TLDLegitimateProb, CharContinuationRate
+- SpacialCharRatioInURL, URLCharProb, LetterRatioInURL
+- NoOfOtherSpecialCharsInURL, DomainLength
+
+**Threshold Policy:**
+- Low threshold: 0.004 → ALLOW (below this)
+- High threshold: 0.999 → BLOCK (above this)
+- Gray zone: 10.9% → REVIEW (escalate to judge)
+
+**Known Limitations:**
+- Model trained on the PhiUSIIL dataset (2019-2020 URLs)
+- Major tech companies (google.com, github.com) are out-of-distribution
+- Whitelist override implemented for known legitimate short domains
\ No newline at end of file
diff --git a/docs/FEATURE_EXTRACTION.md b/docs/FEATURE_EXTRACTION.md
new file mode 100644
index 0000000..4a4285c
--- /dev/null
+++ b/docs/FEATURE_EXTRACTION.md
@@ -0,0 +1,44 @@
+# Feature Extraction Documentation
+
+## Overview
+All features are extracted with `src/common/feature_extraction.py` so that training and serving stay consistent.
+
+## Feature Definitions
+
+### 1. IsHTTPS
+- **Type:** Binary (0/1)
+- **Definition:** Whether the URL uses the HTTPS protocol
+- **Range:** [0, 1]
+
+### 2. TLDLegitimateProb
+- **Type:** Float
+- **Definition:** Bayesian legitimacy probability for the TLD
+- **Range:** [0, 1]
+- **Source:** `common/tld_probs.json` (695 TLDs)
+- **Priors:** α=1, β=2 (conservative)
+
+### 3. CharContinuationRate
+- **Type:** Float
+- **Definition:** Ratio of consecutive identical characters
+- **Range:** [0, 1]
+- **Example:** "google.com" → 0.176
+
+[... continue for all 8 features ...]
+
+## Training/Serving Consistency
+- ✅ Same extraction logic for training and production
+- ✅ No data leakage (trained on raw PhiUSIIL URLs)
+- ✅ Validated: batch and live extraction match
+
+### **Step 4: Clean Up Notebooks (30 min)**
+
+
+```
+notebooks/
+  ├── 00_eda.ipynb
+  ├── feature_engineering.ipynb
+  ├── 03_ablation_url_only.ipynb
+  ├── 03_ablation_url_only_copy.ipynb
+  └── archive/
+      └── old_experiments/
+```
\ No newline at end of file
diff --git a/docs/Phishing_Detection_Model_Report.md b/docs/Phishing_Detection_Model_Report.md
new file mode 100644
index 0000000..2e2eba2
--- /dev/null
+++ b/docs/Phishing_Detection_Model_Report.md
@@ -0,0 +1,80 @@
+## Phishing Detection Model Report: Performance, Optimization, and Distribution Shift Analysis
+
+### 1. Executive Summary
+
+This report details the development and optimization of two phishing detection models: an 8-feature "Research Model" and a 7-feature "Production Model." Both models demonstrate exceptional performance, with PR-AUC scores of 0.9992 and 0.9988, respectively. The Production Model is recommended for deployment due to its robustness against future changes in the HTTPS landscape.
+
+A threshold optimization strategy has been implemented, incorporating a "gray-zone" for uncertain classifications that require review. While the models exhibit near-perfect accuracy on in-distribution data, a critical finding reveals a significant "distribution shift" issue.
Major legitimate websites like Google.com and GitHub.com are being misclassified as phishing due to discrepancies between their URL features and the characteristics of the training data. This report delves into the root causes and offers a clear understanding of this challenge.
+
+### 2. Model Development and Artifacts
+
+Two primary models have been developed and saved for production:
+
+* **Research Model (8-feature with IsHTTPS):**
+    * **Performance (PR-AUC):** 0.9992
+    * **Purpose:** Achieves maximum performance on the current dataset.
+    * **Path:** `models\dev\model_8feat.pkl`
+
+* **Production Model (7-feature without IsHTTPS):**
+    * **Performance (PR-AUC):** 0.9988
+    * **Purpose:** Designed for robustness against the anticipated 2025 HTTPS phishing landscape, removing `IsHTTPS` as a feature.
+    * **Path:** `models\dev\model_7feat.pkl`
+    * **Status:** **RECOMMENDED FOR DEPLOYMENT**
+
+Associated metadata and optimized thresholds for both models have also been saved, ensuring readiness for service integration.
+
+### 3. Enhanced Threshold Optimization and Decision Logic
+
+A refined threshold optimization process has been implemented, moving beyond simple binary classification to include a "REVIEW" gray-zone:
+
+* **Optimal Decision Threshold (`t_star`):** 0.350 (achieving an F1-macro of 0.9972)
+* **Gray-Zone Band:** A range from 0.004 (Low) to 0.999 (High) creates a 10.9% gray-zone rate.
+* **Decision Distribution:**
+    * **ALLOW:** 48.1% (22,584 samples)
+    * **REVIEW:** 10.9% (5,135 samples)
+    * **BLOCK:** 41.0% (19,234 samples)
+
+This multi-tiered decision logic allows for more nuanced handling of URLs, flagging uncertain cases for manual review rather than making a potentially incorrect automated decision.
+
+### 4. Model Performance and Key Insight: Distribution Shift
+
+The models exhibit excellent performance on validation data, demonstrating high confidence in their predictions:
+
+* **Validation Prediction Distribution:**
+    * **Extreme Phishing (p >= 0.99):** 41.5%
+    * **Extreme Legitimate (p <= 0.01):** 55.2%
+    * **Moderate (0.01 < p < 0.99):** Only 3.3% (Uncertain)
+
+* **Misclassification Rate:** A remarkably low 0.09% of legitimate URLs were misclassified as phishing (only 23 out of 26,970 legitimate validation samples).
+
+However, a critical issue is the misclassification of well-known legitimate URLs (e.g., `https://google.com`, `https://github.com`) as phishing.
+
+This anomaly is attributed to a **distribution shift** between the training data and these common URLs. The training data, sourced from the PhiUSIIL dataset (2019-2020), primarily focuses on obscure and suspicious URLs and lacks representation from major legitimate tech companies.
+
+### 5. Root Cause Analysis: The Impact of `URLCharProb` and `DomainLength`
+
+Detailed debugging of `https://google.com` revealed specific feature disparities:
+
+* **`URLCharProb` Outlier:** The `URLCharProb` for `google.com` (1.000) is an extreme outlier, 4073.95 standard deviations from the training data mean (0.060). This feature, indicating the probability of characters appearing in URLs, suggests `google.com` uses a character distribution vastly different from the training examples.
+* **`DomainLength` Discrepancy:** `google.com` has a `DomainLength` of 10 characters, significantly shorter than the training data's average of 21.467 characters. The model appears to associate shorter, simpler domains with suspicious characteristics.
+* **`TLDLegitimateProb`:** While `google.com`'s TLDLegitimateProb (0.612) falls within the training distribution, it is lower than the average for legitimate training URLs (0.709).
+
+**Conclusion:** The model's classification of `google.com` as phishing stems from its feature set being out-of-distribution compared to the training data. The model was trained on a dataset where legitimate URLs tended to have longer domains and different character probability distributions, leading it to perceive well-known, short, and simple legitimate domains as suspicious.
+
+### 6. Recommendations
+
+1. **Deploy the 7-feature Production Model:** Proceed with the deployment of `model_7feat.pkl` as recommended, leveraging its robust design.
+2. **Implement Robust Out-of-Distribution Handling:** Develop and integrate a mechanism to detect and appropriately handle URLs that are significantly out-of-distribution. This could involve:
+    * **Whitelisting:** Create and maintain a curated whitelist of known legitimate domains that bypass model prediction.
+    * **Ensemble Methods:** Explore integrating other detection methods or a secondary model specifically designed for high-confidence legitimate URLs.
+    * **Data Augmentation:** Incrementally expand the training dataset to include a diverse range of legitimate, well-known URLs.
+3. **Monitor the Gray-Zone Effectively:** Establish clear protocols and tools for reviewing URLs flagged within the "REVIEW" gray-zone to continuously refine the model and its thresholds.
+4. **Feature Engineering Review:** Re-evaluate features like `URLCharProb` and `DomainLength` to ensure they are universally applicable, or consider alternative normalizations that are less susceptible to domain-length bias.
+
+### 7. Visualizing the Decision Process
+
+>![Decision process flowchart](../outputs/Visualizing_Decision_Proces.png)
+
+*An illustrative flowchart showing the model's decision process: input URL, feature extraction, model prediction, threshold application (ALLOW, REVIEW, BLOCK), and finally, the output decision.*
+
+This report highlights the dual success of achieving high-performing phishing detection models and the critical challenge of distribution shift. Addressing this shift through strategic data augmentation and robust OOD handling will be paramount for real-world reliability.
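+
+### 8. Appendix: Decision Routing Sketch
+
+To make the three-tier policy from Section 3 and the whitelist override from Recommendation 2 concrete, the following is a minimal Python sketch. The `WHITELIST` set and the `route` helper are illustrative assumptions, not the production implementation; the thresholds are the optimized values reported above.
+
+```python
+from urllib.parse import urlparse
+
+# Gray-zone boundaries from Section 3.
+T_LOW, T_HIGH = 0.004, 0.999
+
+# Hypothetical curated whitelist (Recommendation 2); a deployed list would
+# live in configuration, not code.
+WHITELIST = {"google.com", "github.com", "microsoft.com", "amazon.com", "apple.com"}
+
+def route(url: str, p_malicious: float) -> str:
+    """Map a calibrated phishing probability to ALLOW / REVIEW / BLOCK."""
+    domain = urlparse(url).netloc.lower()
+    if domain.startswith("www."):
+        domain = domain[4:]  # let www.google.com match the google.com entry
+    if domain in WHITELIST:
+        return "ALLOW"  # known legitimate domain bypasses the model
+    if p_malicious < T_LOW:
+        return "ALLOW"  # high-confidence legitimate
+    if p_malicious > T_HIGH:
+        return "BLOCK"  # high-confidence phishing
+    return "REVIEW"  # gray-zone: escalate to the judge
+
+print(route("https://google.com", 1.0))               # ALLOW via whitelist override
+print(route("http://ex.com/login?acct=12345", 0.45))  # REVIEW (gray-zone)
+```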
\ No newline at end of file diff --git a/notebooks/feature_engineering.ipynb b/notebooks/01_feature_engineering.ipynb similarity index 100% rename from notebooks/feature_engineering.ipynb rename to notebooks/01_feature_engineering.ipynb diff --git a/notebooks/03_ablation_url_only copy.ipynb b/notebooks/02_ablation_url_only.ipynb similarity index 90% rename from notebooks/03_ablation_url_only copy.ipynb rename to notebooks/02_ablation_url_only.ipynb index 1ddc183..b6d8204 100644 --- a/notebooks/03_ablation_url_only copy.ipynb +++ b/notebooks/02_ablation_url_only.ipynb @@ -18,7 +18,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "5413dffd", "metadata": {}, "outputs": [], @@ -49,9 +49,7 @@ "import hashlib\n", "from urllib.parse import urlparse\n", "\n", - "import matplotlib.pyplot as plt\n", - "from scipy.stats import ks_2samp\n", - "from sklearn.model_selection import cross_val_score\n" + "\n" ] }, { @@ -64,7 +62,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 2, "id": "dc490286", "metadata": {}, "outputs": [ @@ -72,7 +70,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "Working directory: d:\\MLops\\NetworkSecurity\n" + "Working directory: d:\\MLops\\NetworkSecurity\n", + "[feature_extraction] Loaded 1401 TLD probabilities\n" ] } ], @@ -98,7 +97,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "677b791d", "metadata": {}, "outputs": [ @@ -112,15 +111,7 @@ "python-dotenv could not parse statement starting at line 10\n", "python-dotenv could not parse statement starting at line 11\n", "python-dotenv could not parse statement starting at line 12\n", - "python-dotenv could not parse statement starting at line 3\n", - "python-dotenv could not parse statement starting at line 7\n", - "python-dotenv could not parse statement starting at line 10\n", - "python-dotenv could not parse statement starting at line 11\n", - "python-dotenv could not parse statement starting at line 12\n", "python-dotenv could not parse statement starting at line 13\n", - "python-dotenv could not parse statement starting at line 13\n", - "python-dotenv could not parse statement starting at line 14\n", - "python-dotenv could not parse statement starting at line 15\n", "python-dotenv could not parse statement starting at line 14\n", "python-dotenv could not parse statement starting at line 15\n" ] @@ -184,7 +175,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 4, "id": "f3582563", "metadata": {}, "outputs": [ @@ -215,26 +206,6 @@ "Final feature matrix: (234764, 8)\n", "Label distribution:\n", " Legitimate (1): 134850 (57.4%)\n", - " Phishing (0): 99914 (42.6%)\n", - "\n", - "Loaded data shape: (234764, 10)\n", - "Expected: (234764, 9)\n", - "\n", - "Warning: Extra columns in dataset (will be ignored): ['URL']\n", - "\n", - "Feature validation:\n", - " IsHTTPS: float64, 0 nulls\n", - " TLDLegitimateProb: float64, 0 nulls\n", - " CharContinuationRate: float64, 0 nulls\n", - " SpacialCharRatioInURL: float64, 0 nulls\n", - " URLCharProb: float64, 0 nulls\n", - " LetterRatioInURL: float64, 0 nulls\n", - " NoOfOtherSpecialCharsInURL: int64, 0 nulls\n", - " DomainLength: int64, 0 nulls\n", - "\n", - "Final feature matrix: (234764, 8)\n", - "Label distribution:\n", - " Legitimate (1): 134850 (57.4%)\n", " Phishing (0): 99914 (42.6%)\n" ] } @@ -248,7 +219,7 @@ "\n", "df = pd.read_csv(DATA_PATH)\n", "print(f\"\\nLoaded data shape: {df.shape}\")\n", - "print(f\"Expected: ({len(df)}, 
{len(OPTIMAL_FEATURES) + 1})\")\n", + "print(f\"Expected: ({len(df)}, {len(OPTIMAL_FEATURES) + 1})\") # +1 for label\n", "\n", "missing_features = [f for f in OPTIMAL_FEATURES if f not in df.columns]\n", "if missing_features:\n", @@ -258,7 +229,9 @@ "\n", "extra_features = [c for c in df.columns if c not in OPTIMAL_FEATURES + [\"label\"]]\n", "if extra_features:\n", - " print(f\"\\nWarning: Extra columns in dataset (will be ignored): {extra_features}\")\n", + " print(\n", + " f\"\\nWarning: Extra columns in dataset (will be ignored): {extra_features}\"\n", + " ) # URL will be ignored\n", "\n", "print(\"\\nFeature validation:\")\n", "for feature in OPTIMAL_FEATURES:\n", @@ -280,7 +253,7 @@ "id": "25f12332", "metadata": {}, "source": [ - "### **SECTION 3: Train/Test Split**\n", + "### **SECTION 3: Data splitting and Model training**\n", "- Purpose: Split data into training and validation sets with stratification to maintain class balance. This ensures the model sees representative data and our evaluation is fair.\n", "- Explanation:\n", "\n", @@ -289,12 +262,20 @@ " - stratify=y ensures both sets have same class distribution as full dataset\n", " - random_state=SEED makes split reproducible for debugging\n", " - Prints detailed class distributions to verify stratification worked correctly\n", - " - After deduplication in EDA, we know there are no duplicate URLs across splits" + " - Also takes care of deduplication make sure no duplicates leak or confuse the model" + ] + }, + { + "cell_type": "markdown", + "id": "ee015b8e", + "metadata": {}, + "source": [ + "#### **3.1. Train/Test Split**" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 5, "id": "7846af25", "metadata": {}, "outputs": [ @@ -319,20 +300,6 @@ "\n", "Feature matrix shapes:\n", " X_train: (187811, 8)\n", - " X_val: (46953, 8)\n", - "\n", - "Training set:\n", - " Samples: 187,811\n", - " Phishing: 79,931 (42.6%)\n", - " Legitimate: 107,880 (57.4%)\n", - "\n", - "Validation set:\n", - " Samples: 46,953\n", - " Phishing: 19,983 (42.6%)\n", - " Legitimate: 26,970 (57.4%)\n", - "\n", - "Feature matrix shapes:\n", - " X_train: (187811, 8)\n", " X_val: (46953, 8)\n" ] } @@ -366,7 +333,7 @@ "id": "67d141cc", "metadata": {}, "source": [ - "### **SECTION 4: Model Training & Calibration**\n", + "#### **3.2. Model Training & Calibration**\n", "- Purpose: Train candidate models (LogisticRegression, XGBoost) and calibrate their probabilities using isotonic regression. Calibration ensures p_malicious values are reliable for threshold-based decisions\n", "- Explanation:\n", "\n", @@ -376,12 +343,12 @@ " - fit_calibrated() wraps model training with isotonic calibration (5-fold CV)\n", " - Calibration corrects probability estimates so p_malicious is reliable for thresholding\n", " - Selects best model by PR-AUC (primary) then F1-macro (tiebreaker)\n", - "Stores all models and predictions for later analysis" + " - Stores all models and predictions for later analysis" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 9, "id": "80bd21b4", "metadata": {}, "outputs": [ @@ -400,20 +367,7 @@ " Sample p_mal: [0.00862244 1. 0.00589575] | Sample p_legit: [0.99137756 0. 
0.99410425]\n", " Class 0 (phishing) is at index: 0\n", " Class 1 (legit) is at index: 1\n", - " Training sample prediction: [[0.00713379 0.99286621]]\n", - " Sum of probabilities: 1.000000\n", - " P(phishing) for training sample: 0.007134\n", - " P(legit) for training sample: 0.992866\n", - " PR-AUC (phishing): 0.9965\n", - " F1-macro @0.5: 0.9848\n", - " Brier score: 0.011757\n", - "\n", - "Training xgb...\n", - " Model classes: [0 1]\n", - " Phishing prob column: 0, Legitimate prob column: 1\n", - " Sample p_mal: [0.00862244 1. 0.00589575] | Sample p_legit: [0.99137756 0. 0.99410425]\n", - " Class 0 (phishing) is at index: 0\n", - " Class 1 (legit) is at index: 1\n", + " Test sample features: {'IsHTTPS': 1.0, 'TLDLegitimateProb': 0.612, 'CharContinuationRate': 0.16, 'SpacialCharRatioInURL': 0.1923076923076923, 'URLCharProb': 0.06, 'LetterRatioInURL': 0.8076923076923077, 'NoOfOtherSpecialCharsInURL': 5, 'DomainLength': 18}\n", " Training sample prediction: [[0.00713379 0.99286621]]\n", " Sum of probabilities: 1.000000\n", " P(phishing) for training sample: 0.007134\n", @@ -429,39 +383,23 @@ "name": "stderr", "output_type": "stream", "text": [ - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:15] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", - "Parameters: { \"verbose\" } are not used.\n", - "\n", - " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:22] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", - "Parameters: { \"verbose\" } are not used.\n", - "\n", - " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:22] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", - "Parameters: { \"verbose\" } are not used.\n", - "\n", - " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:31] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", + "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [09:56:10] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", "Parameters: { \"verbose\" } are not used.\n", "\n", " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:31] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", + "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [09:56:12] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", "Parameters: { \"verbose\" } are not used.\n", "\n", " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: 
[00:46:35] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", + "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [09:56:14] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", "Parameters: { \"verbose\" } are not used.\n", "\n", " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:35] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", + "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [09:56:16] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", "Parameters: { \"verbose\" } are not used.\n", "\n", " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:38] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", - "Parameters: { \"verbose\" } are not used.\n", - "\n", - " warnings.warn(smsg, UserWarning)\n", - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:46:38] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", + "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [09:56:19] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", "Parameters: { \"verbose\" } are not used.\n", "\n", " warnings.warn(smsg, UserWarning)\n" @@ -476,6 +414,7 @@ " Sample p_mal: [0.00815403 1. 0.00186653] | Sample p_legit: [0.99184597 0. 
0.99813347]\n", " Class 0 (phishing) is at index: 0\n", " Class 1 (legit) is at index: 1\n", + " Test sample features: {'IsHTTPS': 1.0, 'TLDLegitimateProb': 0.612, 'CharContinuationRate': 0.16, 'SpacialCharRatioInURL': 0.1923076923076923, 'URLCharProb': 0.06, 'LetterRatioInURL': 0.8076923076923077, 'NoOfOtherSpecialCharsInURL': 5, 'DomainLength': 18}\n", " Training sample prediction: [[0.00232477 0.99767523]]\n", " Sum of probabilities: 1.000000\n", " P(phishing) for training sample: 0.002325\n", @@ -502,7 +441,10 @@ "\n", "logreg_base = Pipeline(\n", " [\n", - " (\"scaler\", StandardScaler(with_mean=False)),\n", + " (\n", + " \"scaler\",\n", + " StandardScaler(with_mean=False),\n", + " ), # to preserve range and avoid negative values\n", " (\n", " \"clf\",\n", " LogisticRegression(\n", @@ -568,6 +510,7 @@ "\n", " # Test on a training sample to verify probability extraction\n", " test_sample = X_train.iloc[[0]]\n", + " print(f\" Test sample features: {test_sample.to_dict(orient='records')[0]}\")\n", " test_proba = calib.predict_proba(test_sample)\n", " print(f\" Training sample prediction: {test_proba}\")\n", " print(f\" Sum of probabilities: {test_proba.sum():.6f}\") # Should be 1.0\n", @@ -629,10 +572,19 @@ "print(f\" Brier: {best_metrics['brier_phish']:.6f}\")\n" ] }, + { + "cell_type": "markdown", + "id": "92d3d968", + "metadata": {}, + "source": [ + "### **SECTION 4: Threshold Optimization + Gray-Zone Judge**\n", + "- Find the optimal decision threshold t_star (F1-macro on validation), then locate a gray-zone band around it for judge routing." + ] + }, { "cell_type": "code", - "execution_count": 28, - "id": "2e8004ba", + "execution_count": 13, + "id": "d03baaa6", "metadata": {}, "outputs": [ { @@ -641,424 +593,310 @@ "text": [ "\n", "============================================================\n", - "SPOT CHECK: SHORT URLS MISCLASSIFIED AS PHISHING\n", - "============================================================\n", - "Total legitimate samples in validation: 26970\n", - "Legitimate samples misclassified as phishing: 23\n", - "Misclassification rate: 0.09%\n", - "\n", - "🔍 TOP SHORT LEGITIMATE URLS MISCLASSIFIED AS PHISHING:\n", - "================================================================================\n", - "URL Domain P(phish) HTTPS TLD_Prob Dom_Len\n", - "--------------------------------------------------------------------------------\n", - "https://www.it120.cc www.it120.cc 0.781 1 0.159 12 \n", - "\n", - "📊 ANALYSIS OF MISCLASSIFIED SHORT URLS:\n", - "--------------------------------------------------\n", - " Average IsHTTPS: 1.00 (0=HTTP, 1=HTTPS)\n", - " Average TLD Legitimacy Prob: 0.159\n", - " Average Character Continuation Rate: 0.263\n", - " Average Special Character Ratio: 0.250\n", - " Domain lengths: min=12, max=12, avg=12.0\n", - "\n", - " Most frequently misclassified short domains:\n", - " www.it120.cc: 1 times\n", - "\n", - "🔬 COMPARISON WITH TRAINING DATA DISTRIBUTION:\n", - "--------------------------------------------------\n", - " TLDLegitimateProb:\n", - " Misclassified short URLs: 0.159\n", - " Training legitimate URLs: 0.709\n", - " Difference: -0.551\n", - " CharContinuationRate:\n", - " Misclassified short URLs: 0.263\n", - " Training legitimate URLs: 0.169\n", - " Difference: 0.094\n", - " SpacialCharRatioInURL:\n", - " Misclassified short URLs: 0.250\n", - " Training legitimate URLs: 0.198\n", - " Difference: 0.052\n", - " DomainLength:\n", - " Misclassified short URLs: 12.000\n", - " Training legitimate URLs: 19.215\n", - " Difference: 
-7.215\n", - "\n", - "============================================================\n", - "SPECIFIC EXAMPLES: WELL-KNOWN DOMAINS\n", + "ENHANCED THRESHOLD OPTIMIZATION WITH JUDGE INTEGRATION\n", "============================================================\n", - "URL P(phishing) Classification Status \n", - "--------------------------------------------------------------\n", - "Total legitimate samples in validation: 26970\n", - "Legitimate samples misclassified as phishing: 23\n", - "Misclassification rate: 0.09%\n", "\n", - "🔍 TOP SHORT LEGITIMATE URLS MISCLASSIFIED AS PHISHING:\n", - "================================================================================\n", - "URL Domain P(phish) HTTPS TLD_Prob Dom_Len\n", - "--------------------------------------------------------------------------------\n", - "https://www.it120.cc www.it120.cc 0.781 1 0.159 12 \n", + "1. Standard F1-macro threshold optimization:\n", + " t_star: 0.350\n", + " F1-macro @t_star: 0.9972\n", + "THRESHOLD OPTIMIZATION RESULTS:\n", + "==================================================\n", "\n", - "📊 ANALYSIS OF MISCLASSIFIED SHORT URLS:\n", - "--------------------------------------------------\n", - " Average IsHTTPS: 1.00 (0=HTTP, 1=HTTPS)\n", - " Average TLD Legitimacy Prob: 0.159\n", - " Average Character Continuation Rate: 0.263\n", - " Average Special Character Ratio: 0.250\n", - " Domain lengths: min=12, max=12, avg=12.0\n", + "1. Optimal decision threshold:\n", + " t_star: 0.350\n", + " F1-macro @t_star: 0.0028\n", "\n", - " Most frequently misclassified short domains:\n", - " www.it120.cc: 1 times\n", + "2. Standard gray-zone band (target: 10-15%):\n", + " Low threshold: 0.004\n", + " High threshold: 0.999\n", + " Gray-zone rate: 10.9%\n", "\n", - "🔬 COMPARISON WITH TRAINING DATA DISTRIBUTION:\n", - "--------------------------------------------------\n", - " TLDLegitimateProb:\n", - " Misclassified short URLs: 0.159\n", - " Training legitimate URLs: 0.709\n", - " Difference: -0.551\n", - " CharContinuationRate:\n", - " Misclassified short URLs: 0.263\n", - " Training legitimate URLs: 0.169\n", - " Difference: 0.094\n", - " SpacialCharRatioInURL:\n", - " Misclassified short URLs: 0.250\n", - " Training legitimate URLs: 0.198\n", - " Difference: 0.052\n", - " DomainLength:\n", - " Misclassified short URLs: 12.000\n", - " Training legitimate URLs: 19.215\n", - " Difference: -7.215\n", + "3. Standard decision distribution:\n", + " ALLOW: 48.1% (22,584 samples)\n", + " REVIEW: 10.9% (5,135 samples)\n", + " BLOCK: 41.0% (19,234 samples)\n", "\n", - "============================================================\n", - "SPECIFIC EXAMPLES: WELL-KNOWN DOMAINS\n", - "============================================================\n", - "URL P(phishing) Classification Status \n", - "--------------------------------------------------------------\n", - "https://google.com 1.000 PHISHING ❌ WRONG \n", - "https://github.com 1.000 PHISHING ❌ WRONG \n", - "https://fb.com 1.000 PHISHING ⚠️ CHECK \n", - "https://bit.ly 1.000 PHISHING ⚠️ CHECK \n", - "https://t.co 0.943 PHISHING ⚠️ CHECK \n", - "https://apple.com 1.000 PHISHING ❌ WRONG \n", - "https://amazon.com 1.000 PHISHING ❌ WRONG \n", - "https://microsoft.com 1.000 PHISHING ❌ WRONG \n", + "4. 
Enhanced Decision Logic Examples:\n", + "----------------------------------------\n", + " URL: https://images.google.com | Decision: REVIEW | Uncertain classification (p=0.250)\n", + " URL: https://github.com/user/repo | Decision: ALLOW | Well-known domain override (p=0.350)\n", + " URL: https://suspicious-bank-login.com | Decision: REVIEW | Uncertain classification (p=0.450)\n", + " URL: https://definitely-phishing-site.evil | Decision: REVIEW | Uncertain classification (p=0.850)\n", + " URL: https://legitimate-company.com | Decision: REVIEW | Uncertain classification (p=0.150)\n", "\n", - "💡 INSIGHTS:\n", - "- Look for patterns in misclassified short domains\n", - "- Check if TLD legitimacy probabilities are unusually low\n", - "- Verify if domain length feature is causing bias against short domains\n", - "- Consider adding domain whitelist for well-known legitimate short domains\n", - "https://google.com 1.000 PHISHING ❌ WRONG \n", - "https://github.com 1.000 PHISHING ❌ WRONG \n", - "https://fb.com 1.000 PHISHING ⚠️ CHECK \n", - "https://bit.ly 1.000 PHISHING ⚠️ CHECK \n", - "https://t.co 0.943 PHISHING ⚠️ CHECK \n", - "https://apple.com 1.000 PHISHING ❌ WRONG \n", - "https://amazon.com 1.000 PHISHING ❌ WRONG \n", - "https://microsoft.com 1.000 PHISHING ❌ WRONG \n", + "5. Configuration saved to: configs\\dev\\thresholds.json\n", "\n", - "💡 INSIGHTS:\n", - "- Look for patterns in misclassified short domains\n", - "- Check if TLD legitimacy probabilities are unusually low\n", - "- Verify if domain length feature is causing bias against short domains\n", - "- Consider adding domain whitelist for well-known legitimate short domains\n" + "Threshold optimization complete! ✓\n" ] } ], "source": [ "# ========================================\n", - "# SPOT CHECK: SHORT URLS MISCLASSIFIED AS PHISHING\n", + "# SECTION 4: THRESHOLD OPTIMIZATION & JUDGE INTEGRATION\n", "# ========================================\n", "\"\"\"\n", - "**Purpose:** Identify short legitimate URLs that are being incorrectly classified as phishing.\n", - "This helps understand model biases and potential issues with short domain classification.\n", + "**Purpose:** Find optimal decision thresholds and implement enhanced routing logic for edge cases.\n", + "\n", + "**Enhanced Strategy:** \n", + "- Standard F1-optimized thresholds for typical URLs\n", + "- Judge-based routing for short domains (addresses github.com/google.com misclassification)\n", + "- Three-tier decision framework: ALLOW/REVIEW/BLOCK\n", "\"\"\"\n", "\n", "print(\"\\n\" + \"=\" * 60)\n", - "print(\"SPOT CHECK: SHORT URLS MISCLASSIFIED AS PHISHING\")\n", + "print(\"ENHANCED THRESHOLD OPTIMIZATION WITH JUDGE INTEGRATION\")\n", "print(\"=\" * 60)\n", "\n", - "# Load the full dataset to get URLs\n", - "df_full = pd.read_csv(DATA_PATH)\n", + "print(\"\\n1. 
Standard F1-macro threshold optimization:\")\n", + "grid = np.linspace(0.05, 0.95, 19)\n", + "f1_scores = []\n", "\n", - "# Get legitimate URLs from validation set that are predicted as phishing\n", - "val_indices = X_val.index\n", - "legitimate_val_indices = val_indices[y_val == 1] # Legitimate samples in validation\n", - "misclassified_mask = p_mal >= 0.5 # Predicted as phishing (high p_malicious)\n", + "for t in grid:\n", + " y_hat = (p_mal >= t).astype(int) # Predict phishing if p_mal >= t\n", + " y_pred = 1 - y_hat # Convert to label space\n", + " f1_scores.append(f1_score(y_val, y_pred, average=\"macro\"))\n", "\n", - "# Find legitimate URLs predicted as phishing\n", - "misclassified_legit_indices = legitimate_val_indices[misclassified_mask[y_val == 1]]\n", + "t_star = float(grid[np.argmax(f1_scores)])\n", + "best_f1 = max(f1_scores)\n", "\n", - "print(f\"Total legitimate samples in validation: {len(legitimate_val_indices)}\")\n", - "print(\n", - " f\"Legitimate samples misclassified as phishing: {len(misclassified_legit_indices)}\"\n", - ")\n", - "print(\n", - " f\"Misclassification rate: {len(misclassified_legit_indices) / len(legitimate_val_indices):.2%}\"\n", - ")\n", + "print(f\" t_star: {t_star:.3f}\")\n", + "print(f\" F1-macro @t_star: {best_f1:.4f}\")\n", "\n", - "if len(misclassified_legit_indices) > 0:\n", - " # Get the URLs for misclassified samples\n", - " misclassified_urls = df_full.loc[misclassified_legit_indices, \"URL\"]\n", - " misclassified_probabilities = p_mal[y_val == 1][misclassified_mask[y_val == 1]]\n", - "\n", - " # Focus on short URLs (domain length <= 15 characters)\n", - " short_url_data = []\n", - "\n", - " for idx, (url, prob) in enumerate(\n", - " zip(misclassified_urls, misclassified_probabilities)\n", - " ):\n", - " try:\n", - " from urllib.parse import urlparse\n", "\n", - " domain = urlparse(url).netloc.lower()\n", - " domain_length = len(domain)\n", + "def pick_band_for_target(predictions, optimal_threshold, target=0.12, step=0.001):\n", + " \"\"\"\n", + " Pick gray-zone thresholds to achieve target gray-zone rate.\n", "\n", - " if domain_length <= 15: # Short domains\n", - " # Get the features for this URL\n", - " features = df_full.loc[misclassified_urls.index[idx], OPTIMAL_FEATURES]\n", + " Parameters:\n", + " - predictions: Model probability predictions\n", + " - optimal_threshold: The optimal decision threshold (t_star)\n", + " - target: Target gray-zone rate (default 12%)\n", + " - step: Search step size\n", "\n", - " short_url_data.append(\n", - " {\n", - " \"url\": url,\n", - " \"domain\": domain,\n", - " \"domain_length\": domain_length,\n", - " \"p_malicious\": prob,\n", - " \"IsHTTPS\": features[\"IsHTTPS\"],\n", - " \"TLDLegitimateProb\": features[\"TLDLegitimateProb\"],\n", - " \"CharContinuationRate\": features[\"CharContinuationRate\"],\n", - " \"SpacialCharRatioInURL\": features[\"SpacialCharRatioInURL\"],\n", - " \"URLCharProb\": features[\"URLCharProb\"],\n", - " \"LetterRatioInURL\": features[\"LetterRatioInURL\"],\n", - " \"NoOfOtherSpecialCharsInURL\": features[\n", - " \"NoOfOtherSpecialCharsInURL\"\n", - " ],\n", - " \"DomainLength\": features[\"DomainLength\"],\n", - " }\n", - " )\n", - " except Exception as e:\n", - " continue\n", + " Returns:\n", + " - low_threshold, high_threshold, actual_gray_rate\n", + " \"\"\"\n", + " best_low, best_high, best_rate = None, None, float(\"inf\")\n", "\n", - " # Sort by probability (highest misclassification confidence first)\n", - " short_url_data = sorted(\n", - " short_url_data, 
key=lambda x: x[\"p_malicious\"], reverse=True\n", - " )\n", + " # Search for thresholds that give us closest to target rate\n", + " for low in np.arange(0.001, optimal_threshold, step):\n", + " for high in np.arange(optimal_threshold, 1.0, step):\n", + " gray_rate = np.mean((predictions >= low) & (predictions <= high))\n", + " if abs(gray_rate - target) < abs(best_rate - target):\n", + " best_low, best_high, best_rate = low, high, gray_rate\n", "\n", - " print(f\"\\n🔍 TOP SHORT LEGITIMATE URLS MISCLASSIFIED AS PHISHING:\")\n", - " print(\"=\" * 80)\n", - " print(\n", - " f\"{'URL':<35} {'Domain':<20} {'P(phish)':<10} {'HTTPS':<6} {'TLD_Prob':<8} {'Dom_Len':<7}\"\n", - " )\n", - " print(\"-\" * 80)\n", + " return best_low, best_high, best_rate\n", "\n", - " # Show top 15 misclassified short URLs\n", - " for item in short_url_data[:15]:\n", - " url_short = (\n", - " item[\"url\"][:34] if len(item[\"url\"]) <= 34 else item[\"url\"][:31] + \"...\"\n", - " )\n", - " domain_short = (\n", - " item[\"domain\"][:19]\n", - " if len(item[\"domain\"]) <= 19\n", - " else item[\"domain\"][:16] + \"...\"\n", - " )\n", "\n", - " print(\n", - " f\"{url_short:<35} {domain_short:<20} {item['p_malicious']:<10.3f} {item['IsHTTPS']:<6.0f} {item['TLDLegitimateProb']:<8.3f} {item['DomainLength']:<7.0f}\"\n", - " )\n", + "# Assume we have the optimal model and threshold from previous analysis\n", + "print(\"THRESHOLD OPTIMIZATION RESULTS:\")\n", + "print(\"=\" * 50)\n", "\n", - " # Analyze patterns in misclassified short URLs\n", - " print(f\"\\n📊 ANALYSIS OF MISCLASSIFIED SHORT URLS:\")\n", - " print(\"-\" * 50)\n", + "# 1. Optimal threshold for maximum F1\n", + "print(f\"\\n1. Optimal decision threshold:\")\n", + "print(f\" t_star: {t_star:.3f}\")\n", "\n", - " if short_url_data:\n", - " # Common characteristics\n", - " avg_https = np.mean([item[\"IsHTTPS\"] for item in short_url_data])\n", - " avg_tld_prob = np.mean([item[\"TLDLegitimateProb\"] for item in short_url_data])\n", - " avg_char_cont = np.mean(\n", - " [item[\"CharContinuationRate\"] for item in short_url_data]\n", - " )\n", - " avg_special_ratio = np.mean(\n", - " [item[\"SpacialCharRatioInURL\"] for item in short_url_data]\n", - " )\n", + "# Calculate F1 at optimal threshold\n", + "y_pred_optimal = (p_mal > t_star).astype(int)\n", + "f1_optimal = f1_score(y_val, y_pred_optimal, average=\"macro\")\n", + "print(f\" F1-macro @t_star: {f1_optimal:.4f}\")\n", "\n", - " print(f\" Average IsHTTPS: {avg_https:.2f} (0=HTTP, 1=HTTPS)\")\n", - " print(f\" Average TLD Legitimacy Prob: {avg_tld_prob:.3f}\")\n", - " print(f\" Average Character Continuation Rate: {avg_char_cont:.3f}\")\n", - " print(f\" Average Special Character Ratio: {avg_special_ratio:.3f}\")\n", + "# 2. Gray-zone band for judge integration\n", + "print(\"\\n2. 
Standard gray-zone band (target: 10-15%):\")\n", + "low, high, gray_rate = pick_band_for_target(p_mal, t_star, target=0.12)\n", "\n", - " # Domain length distribution\n", - " domain_lengths = [item[\"domain_length\"] for item in short_url_data]\n", - " print(\n", - " f\" Domain lengths: min={min(domain_lengths)}, max={max(domain_lengths)}, avg={np.mean(domain_lengths):.1f}\"\n", - " )\n", + "print(f\" Low threshold: {low:.3f}\")\n", + "print(f\" High threshold: {high:.3f}\")\n", + "print(f\" Gray-zone rate: {gray_rate:.1%}\")\n", "\n", - " # Common domains\n", - " domains = [item[\"domain\"] for item in short_url_data]\n", - " domain_counts = pd.Series(domains).value_counts()\n", + "# Standard decision categories\n", + "decisions = pd.cut(\n", + " p_mal,\n", + " bins=[0, low, high, 1.0],\n", + " labels=[\"ALLOW\", \"REVIEW\", \"BLOCK\"],\n", + " include_lowest=True,\n", + ")\n", "\n", - " print(f\"\\n Most frequently misclassified short domains:\")\n", - " for domain, count in domain_counts.head(10).items():\n", - " print(f\" {domain}: {count} times\")\n", + "print(\"\\n3. Standard decision distribution:\")\n", + "# Fix: Calculate proportions manually instead of using normalize=True\n", + "counts = decisions.value_counts().sort_index()\n", + "proportions = counts / len(decisions)\n", + "for category, prop in proportions.items():\n", + " print(f\" {category}: {prop:.1%} ({counts[category]:,} samples)\")\n", "\n", - " # Feature comparison with training data\n", - " print(f\"\\n🔬 COMPARISON WITH TRAINING DATA DISTRIBUTION:\")\n", - " print(\"-\" * 50)\n", "\n", - " legit_train = X_train[y_train == 1] # Legitimate training samples\n", + "# Enhanced decision logic with judge integration\n", + "def enhanced_decision_logic(url, ml_confidence, low_thresh, high_thresh):\n", + " \"\"\"\n", + " Enhanced decision logic with judge integration for edge cases.\n", "\n", - " for feature in [\n", - " \"TLDLegitimateProb\",\n", - " \"CharContinuationRate\",\n", - " \"SpacialCharRatioInURL\",\n", - " \"DomainLength\",\n", - " ]:\n", - " misclass_values = [item[feature] for item in short_url_data]\n", - " train_values = legit_train[feature].values\n", + " Parameters:\n", + " - url: The URL being evaluated\n", + " - ml_confidence: ML model confidence (p_malicious)\n", + " - low_thresh, high_thresh: Gray-zone boundaries\n", "\n", - " misclass_mean = np.mean(misclass_values)\n", - " train_mean = train_values.mean()\n", + " Returns:\n", + " - decision: \"ALLOW\", \"REVIEW\", or \"BLOCK\"\n", + " - reasoning: Explanation of the decision\n", + " \"\"\"\n", "\n", - " print(f\" {feature}:\")\n", - " print(f\" Misclassified short URLs: {misclass_mean:.3f}\")\n", - " print(f\" Training legitimate URLs: {train_mean:.3f}\")\n", - " print(f\" Difference: {misclass_mean - train_mean:.3f}\")\n", + " try:\n", + " domain = urlparse(url).netloc.lower() # Extract domain portion of URL\n", + " except:\n", + " domain = \"\"\n", "\n", - "else:\n", - " print(\"✅ No legitimate URLs misclassified as phishing!\")\n", + " # High confidence cases\n", + " if ml_confidence < low_thresh:\n", + " return \"ALLOW\", f\"High confidence legitimate (p={ml_confidence:.3f})\"\n", + " elif ml_confidence > high_thresh:\n", + " return \"BLOCK\", f\"High confidence malicious (p={ml_confidence:.3f})\"\n", "\n", - "print(\"\\n\" + \"=\" * 60)\n", - "print(\"SPECIFIC EXAMPLES: WELL-KNOWN DOMAINS\")\n", - "print(\"=\" * 60)\n", + " # Gray-zone: Check for judge criteria\n", + " else:\n", + " # Known legitimate short domains that might confuse the model\n", 
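+    "        # NOTE: This hard-coded set is illustrative; a production whitelist\n",
+    "        # belongs in config. Also, urlparse().netloc keeps subdomains, so e.g.\n",
+    "        # images.google.com will not match the google.com entry below.\n",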
+ " well_known_domains = {\n", + " \"google.com\",\n", + " \"github.com\",\n", + " \"microsoft.com\",\n", + " \"amazon.com\",\n", + " \"apple.com\",\n", + " \"facebook.com\",\n", + " \"twitter.com\",\n", + " \"linkedin.com\",\n", + " \"youtube.com\",\n", + " \"wikipedia.org\",\n", + " \"stackoverflow.com\",\n", + " }\n", "\n", - "# Test specific well-known short domains\n", - "test_domains = [\n", - " \"https://google.com\",\n", - " \"https://github.com\",\n", - " \"https://fb.com\",\n", - " \"https://bit.ly\",\n", - " \"https://t.co\",\n", - " \"https://apple.com\",\n", - " \"https://amazon.com\",\n", - " \"https://microsoft.com\",\n", - "]\n", + " if domain in well_known_domains:\n", + " return \"ALLOW\", f\"Well-known domain override (p={ml_confidence:.3f})\"\n", + " elif len(domain) <= 10 and ml_confidence < 0.5:\n", + " return (\n", + " \"REVIEW\",\n", + " f\"Short domain, moderate confidence (p={ml_confidence:.3f})\",\n", + " )\n", + " else:\n", + " return \"REVIEW\", f\"Uncertain classification (p={ml_confidence:.3f})\"\n", "\n", - "print(f\"{'URL':<25} {'P(phishing)':<12} {'Classification':<15} {'Status':<10}\")\n", - "print(\"-\" * 62)\n", "\n", - "for url in test_domains:\n", - " try:\n", - " # Extract features for this URL\n", - " features = extract_features(url, include_https=True)\n", - " feature_array = np.array([features[feat] for feat in OPTIMAL_FEATURES]).reshape(\n", - " 1, -1\n", - " )\n", + "# Test enhanced decision logic on sample cases\n", + "print(\"\\n4. Enhanced Decision Logic Examples:\")\n", + "print(\"-\" * 40)\n", "\n", - " # Get probability\n", - " p_phish = best_model.predict_proba(feature_array)[0, 0]\n", + "test_cases = [\n", + " (\"https://images.google.com\", 0.25),\n", + " (\"https://github.com/user/repo\", 0.35),\n", + " (\"https://suspicious-bank-login.com\", 0.45),\n", + " (\"https://definitely-phishing-site.evil\", 0.85),\n", + " (\"https://legitimate-company.com\", 0.15),\n", + "]\n", "\n", - " # Classify\n", - " if p_phish >= 0.5:\n", - " classification = \"PHISHING\"\n", - " status = (\n", - " \"❌ WRONG\"\n", - " if any(\n", - " known in url.lower()\n", - " for known in [\"google\", \"github\", \"apple\", \"amazon\", \"microsoft\"]\n", - " )\n", - " else \"⚠️ CHECK\"\n", - " )\n", - " else:\n", - " classification = \"LEGITIMATE\"\n", - " status = \"✅ CORRECT\"\n", + "for url, confidence in test_cases:\n", + " decision, reasoning = enhanced_decision_logic(url, confidence, low, high)\n", + " print(f\" URL: {url[:40]:<40} | Decision: {decision:<6} | {reasoning}\")\n", "\n", - " print(f\"{url:<25} {p_phish:<12.3f} {classification:<15} {status:<10}\")\n", + "# Save thresholds for production use\n", + "threshold_config = {\n", + " \"optimal_threshold\": float(t_star),\n", + " \"gray_zone_low\": float(low),\n", + " \"gray_zone_high\": float(high),\n", + " \"gray_zone_rate\": float(gray_rate),\n", + " \"f1_score_at_optimal\": float(f1_optimal),\n", + " \"decision_distribution\": {\n", + " \"allow_rate\": float(proportions.get(\"ALLOW\", 0)),\n", + " \"review_rate\": float(proportions.get(\"REVIEW\", 0)),\n", + " \"block_rate\": float(proportions.get(\"BLOCK\", 0)),\n", + " },\n", + "}\n", "\n", - " except Exception as e:\n", - " print(f\"{url:<25} {'ERROR':<12} {'FAILED':<15} {'❌ ERROR':<10}\")\n", + "print(f\"\\n5. 
Configuration saved to: {THRESH_PATH}\")\n", + "with open(THRESH_PATH, \"w\") as f:\n", + " json.dump(threshold_config, f, indent=2)\n", "\n", - "print(\"\\n💡 INSIGHTS:\")\n", - "print(\"- Look for patterns in misclassified short domains\")\n", - "print(\"- Check if TLD legitimacy probabilities are unusually low\")\n", - "print(\"- Verify if domain length feature is causing bias against short domains\")\n", - "print(\"- Consider adding domain whitelist for well-known legitimate short domains\")" + "print(\"\\nThreshold optimization complete! ✓\")\n" ] }, { "cell_type": "markdown", - "id": "a1b14339", + "id": "1c644030", "metadata": {}, "source": [ - "**The Key Insight**\n", - "\n", - "- **Looking at these critical facts:**\n", - "\n", - " 1. The Model's Performance is EXCELLENT\n", - "\n", - " ```\n", - " Validation prediction distribution:\n", - " Extreme phishing (p >= 0.99): 19,490 (41.5%) ← Confident phishing\n", - " Moderate (0.01 < p < 0.99): 1,557 (3.3%) ← Uncertain\n", - " Extreme legit (p <= 0.01): 25,906 (55.2%) ← Confident legit\n", - " ```\n", - " - 96.7% of predictions are confident! Only 3.3% are uncertain\n", "\n", - " 2. Misclassification Rate is TINY\n", - " \n", - " ```\n", - " Total legitimate samples in validation: 26,970\n", - " Legitimate samples misclassified as phishing: 23\n", - " Misclassification rate: 0.09%\n", - " ```\n", - " - Only 23 out of 26,970 legitimate URLs are misclassified!\n", - " - That's 99.91% accuracy on legitimate URLs!\n", "\n", - " 3. But Why Do google.com, github.com, etc. Get 1.0?\n", - " -Because they're NOT in the training data!\n", - " - Your training data is from PhiUSIIL dataset which:\n", + "### **SECTION 5: SPOT CHECK & Model Performance Evaluation**\n", "\n", - " - Focused on obscure/suspicious URLs\n", - " - Didn't include major tech companies\n", - " - Used URLs from 2019-2020 era\n", - "\n", - " - google.com, github.com, amazon.com are OUT-OF-DISTRIBUTION for this model!\n", + "**Purpose:** Evaluate trained models to identify the best performer and validate training quality.\n", "\n", - "**The Real Issue: Distribution Shift**\n", - "- Training Data Characteristics \n", - " \n", - " ```\n", - " Training legitimate URLs:\n", - " Average TLDLegitimateProb: 0.709\n", - " Average DomainLength: 19.2 characters\n", - " TLDs: Mostly .com, .org, .net, .edu from dataset\n", - " ```\n", - "**google.com Characteristics**\n", + "**Workflow:**\n", + "1. **Model Selection** - Compare performance metrics across candidates\n", + "2. **Training Quality Assessment** - Validate model reliability and detect potential issues\n", + "3. **URLS MISCLASSIFIED AS PHISHING** - Why URLs are being missclassified\n", "\n", - " ```\n", - " google.com:\n", - " TLDLegitimateProb: 0.6111 ← Lower than training average!\n", - " DomainLength: 10 ← Much shorter than training average!\n", - " Pattern: Very short, very simple → looks \"suspicious\" to model\n", - " ```\n", - "**Why? 
Because:**\n", + "**Key Deliverables:**\n", + "- Best performing model identification\n", + "- Model validation report" + ] + }, + { + "cell_type": "markdown", + "id": "24cb4f9c", + "metadata": {}, + "source": [ + "##### **5.1 How many samples predict as 1.0 in validation?**" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "b1c19832", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Validation prediction distribution:\n", + " Extreme phishing (p >= 0.99): 19,490 (41.5%)\n", + " Moderate (0.01 < p < 0.99): 1,557 (3.3%)\n", + " Extreme legit (p <= 0.01): 25,906 (55.2%)\n", + "\n", + "Sample p_malicious: [0.00815403 1. 0.00186653 0.00126379 0.00186653]\n" + ] + } + ], + "source": [ + "# Count extreme predictions\n", + "extreme_phish = (p_mal >= 0.99).sum()\n", + "extreme_legit = (p_mal <= 0.01).sum()\n", + "moderate = ((p_mal > 0.01) & (p_mal < 0.99)).sum()\n", "\n", - "- Training data has longer domains (avg 19 chars)\n", - "- Training data has higher TLD probs (avg 0.71)\n", - "- google.com is shorter (10 chars) with lower TLD prob (0.61)\n", - "- To the model: \"This domain is too short and has an unusual TLD probability → probably phishing!\"" + "print(f\"\\nValidation prediction distribution:\")\n", + "print(\n", + " f\" Extreme phishing (p >= 0.99): {extreme_phish:,} ({extreme_phish / len(p_mal):.1%})\"\n", + ")\n", + "print(f\" Moderate (0.01 < p < 0.99): {moderate:,} ({moderate / len(p_mal):.1%})\")\n", + "print(\n", + " f\" Extreme legit (p <= 0.01): {extreme_legit:,} ({extreme_legit / len(p_mal):.1%})\"\n", + ")\n", + "print(f\"\\nSample p_malicious: {p_mal[:5]}\")\n" ] }, { "cell_type": "markdown", - "id": "92d3d968", + "id": "d2b0b14c", "metadata": {}, "source": [ - "### **SECTION 5: Threshold Optimization + Gray-Zone Judge**\n", - "- Find the optimal decision threshold t_star (F1-macro on validation), then locate a gray-zone band around it for judge routing." + "##### **5.2 IDENTIFY URLS MISCLASSIFIED AS PHISHING**\n", + "- Purpose:\n", + " - Identify short legitimate URLs that are being incorrectly classified as phishing.\n", + " - This helps understand model biases and potential issues with short domain classification" ] }, { "cell_type": "code", - "execution_count": 21, - "id": "d03baaa6", + "execution_count": 17, + "id": "823a2bb2", "metadata": {}, "outputs": [ { @@ -1067,250 +905,350 @@ "text": [ "\n", "============================================================\n", - "ENHANCED THRESHOLD OPTIMIZATION WITH JUDGE INTEGRATION\n", + "SPOT CHECK: URLs MISCLASSIFIED AS PHISHING\n", "============================================================\n", + "Total legitimate samples in validation: 26970\n", + "Legitimate samples misclassified as phishing: 23\n", + "Misclassification rate: 0.09%\n", "\n", - "1. Standard F1-macro threshold optimization:\n", - " t_star: 0.350\n", - " F1-macro @t_star: 0.9972\n", - "THRESHOLD OPTIMIZATION RESULTS:\n", - "==================================================\n", - "\n", - "1. Optimal decision threshold:\n", - " t_star: 0.350\n", - " F1-macro @t_star: 0.0028\n", + "🔍 TOP SHORT LEGITIMATE URLS MISCLASSIFIED AS PHISHING:\n", + "================================================================================\n", + "URL Domain P(phish) HTTPS TLD_Prob Dom_Len\n", + "--------------------------------------------------------------------------------\n", + "https://www.it120.cc www.it120.cc 0.781 1 0.159 12 \n", "\n", - "2. 
Standard gray-zone band (target: 10-15%):\n", - " Low threshold: 0.004\n", - " High threshold: 0.999\n", - " Gray-zone rate: 10.9%\n", + "📊 ANALYSIS OF MISCLASSIFIED SHORT URLS:\n", + "--------------------------------------------------\n", + " Average IsHTTPS: 1.00 (0=HTTP, 1=HTTPS)\n", + " Average TLD Legitimacy Prob: 0.159\n", + " Average Character Continuation Rate: 0.263\n", + " Average Special Character Ratio: 0.250\n", + " Domain lengths: min=12, max=12, avg=12.0\n", "\n", - "3. Standard decision distribution:\n", - " ALLOW: 48.1% (22,584 samples)\n", - " REVIEW: 10.9% (5,135 samples)\n", - " BLOCK: 41.0% (19,234 samples)\n", + " Most frequently misclassified short domains:\n", + " www.it120.cc: 1 times\n", "\n", - "4. Enhanced Decision Logic Examples:\n", - "----------------------------------------\n", - " URL: https://images.google.com | Decision: REVIEW | Uncertain classification (p=0.250)\n", - " URL: https://github.com/user/repo | Decision: ALLOW | Well-known domain override (p=0.350)\n", - " URL: https://suspicious-bank-login.com | Decision: REVIEW | Uncertain classification (p=0.450)\n", - " URL: https://definitely-phishing-site.evil | Decision: REVIEW | Uncertain classification (p=0.850)\n", - " URL: https://legitimate-company.com | Decision: REVIEW | Uncertain classification (p=0.150)\n", + "🔬 COMPARISON WITH TRAINING DATA DISTRIBUTION:\n", + "--------------------------------------------------\n", + " TLDLegitimateProb:\n", + " Misclassified short URLs: 0.159\n", + " Training legitimate URLs: 0.709\n", + " Difference: -0.551\n", + " CharContinuationRate:\n", + " Misclassified short URLs: 0.263\n", + " Training legitimate URLs: 0.169\n", + " Difference: 0.094\n", + " SpacialCharRatioInURL:\n", + " Misclassified short URLs: 0.250\n", + " Training legitimate URLs: 0.198\n", + " Difference: 0.052\n", + " DomainLength:\n", + " Misclassified short URLs: 12.000\n", + " Training legitimate URLs: 19.215\n", + " Difference: -7.215\n", "\n", - "5. Configuration saved to: configs\\dev\\thresholds.json\n", + "============================================================\n", + "SPECIFIC EXAMPLES: WELL-KNOWN DOMAINS\n", + "============================================================\n", + "URL P(phishing) Classification Status \n", + "--------------------------------------------------------------\n", + "https://google.com 1.000 PHISHING ❌ WRONG \n", + "https://github.com 1.000 PHISHING ❌ WRONG \n", + "https://fb.com 1.000 PHISHING ⚠️ CHECK \n", + "https://bit.ly 1.000 PHISHING ⚠️ CHECK \n", + "https://t.co 0.943 PHISHING ⚠️ CHECK \n", + "https://apple.com 1.000 PHISHING ❌ WRONG \n", + "https://amazon.com 1.000 PHISHING ❌ WRONG \n", + "https://microsoft.com 1.000 PHISHING ❌ WRONG \n", "\n", - "Threshold optimization complete! 
✓\n" + "💡 INSIGHTS:\n", + "- Look for patterns in misclassified short domains\n", + "- Check if TLD legitimacy probabilities are unusually low\n", + "- Verify if domain length feature is causing bias against short domains\n", + "- Consider adding domain whitelist for well-known legitimate short domains\n" ] } ], "source": [ "# ========================================\n", - "# SECTION 4: THRESHOLD OPTIMIZATION & JUDGE INTEGRATION\n", + "# SPOT CHECK: URLS MISCLASSIFIED AS PHISHING\n", "# ========================================\n", "\"\"\"\n", - "**Purpose:** Find optimal decision thresholds and implement enhanced routing logic for edge cases.\n", - "\n", - "**Enhanced Strategy:** \n", - "- Standard F1-optimized thresholds for typical URLs\n", - "- Judge-based routing for short domains (addresses github.com/google.com misclassification)\n", - "- Three-tier decision framework: ALLOW/REVIEW/BLOCK\n", + "**Purpose:** Identify legitimate URLs that are being incorrectly classified as phishing.\n", + "This helps understand model biases and potential issues with domain classification.\n", "\"\"\"\n", "\n", "print(\"\\n\" + \"=\" * 60)\n", - "print(\"ENHANCED THRESHOLD OPTIMIZATION WITH JUDGE INTEGRATION\")\n", + "print(\"SPOT CHECK: URLs MISCLASSIFIED AS PHISHING\")\n", "print(\"=\" * 60)\n", "\n", - "print(\"\\n1. Standard F1-macro threshold optimization:\")\n", - "grid = np.linspace(0.05, 0.95, 19)\n", - "f1_scores = []\n", + "# Load the full dataset to get URLs\n", + "df_full = pd.read_csv(DATA_PATH)\n", "\n", - "for t in grid:\n", - " y_hat = (p_mal >= t).astype(int) # Predict phishing if p_mal >= t\n", - " y_pred = 1 - y_hat # Convert to label space\n", - " f1_scores.append(f1_score(y_val, y_pred, average=\"macro\"))\n", + "# Get legitimate URLs from validation set that are predicted as phishing\n", + "val_indices = X_val.index\n", + "legitimate_val_indices = val_indices[y_val == 1] # Legitimate samples in validation\n", + "misclassified_mask = p_mal >= 0.5 # Predicted as phishing (high p_malicious)\n", "\n", - "t_star = float(grid[np.argmax(f1_scores)])\n", - "best_f1 = max(f1_scores)\n", + "# Find legitimate URLs predicted as phishing\n", + "misclassified_legit_indices = legitimate_val_indices[misclassified_mask[y_val == 1]]\n", "\n", - "print(f\" t_star: {t_star:.3f}\")\n", - "print(f\" F1-macro @t_star: {best_f1:.4f}\")\n", + "print(f\"Total legitimate samples in validation: {len(legitimate_val_indices)}\")\n", + "print(\n", + " f\"Legitimate samples misclassified as phishing: {len(misclassified_legit_indices)}\"\n", + ")\n", + "print(\n", + " f\"Misclassification rate: {len(misclassified_legit_indices) / len(legitimate_val_indices):.2%}\"\n", + ")\n", "\n", + "if len(misclassified_legit_indices) > 0:\n", + " # Get the URLs for misclassified samples\n", + " misclassified_urls = df_full.loc[misclassified_legit_indices, \"URL\"]\n", + " misclassified_probabilities = p_mal[y_val == 1][misclassified_mask[y_val == 1]]\n", "\n", - "def pick_band_for_target(predictions, optimal_threshold, target=0.12, step=0.001):\n", - " \"\"\"\n", - " Pick gray-zone thresholds to achieve target gray-zone rate.\n", + " # Focus on short URLs (domain length <= 15 characters)\n", + " short_url_data = []\n", "\n", - " Parameters:\n", - " - predictions: Model probability predictions\n", - " - optimal_threshold: The optimal decision threshold (t_star)\n", - " - target: Target gray-zone rate (default 12%)\n", - " - step: Search step size\n", + " for idx, (url, prob) in enumerate(\n", + " zip(misclassified_urls, 
misclassified_probabilities)\n", + " ):\n", + " try:\n", + " domain = urlparse(url).netloc.lower()\n", + " domain_length = len(domain)\n", "\n", - " Returns:\n", - " - low_threshold, high_threshold, actual_gray_rate\n", - " \"\"\"\n", - " best_low, best_high, best_rate = None, None, float(\"inf\")\n", + " if domain_length <= 15: # Short domains\n", + " # Get the features for this URL\n", + " features = df_full.loc[misclassified_urls.index[idx], OPTIMAL_FEATURES]\n", "\n", - " # Search for thresholds that give us closest to target rate\n", - " for low in np.arange(0.001, optimal_threshold, step):\n", - " for high in np.arange(optimal_threshold, 1.0, step):\n", - " gray_rate = np.mean((predictions >= low) & (predictions <= high))\n", - " if abs(gray_rate - target) < abs(best_rate - target):\n", - " best_low, best_high, best_rate = low, high, gray_rate\n", + " short_url_data.append(\n", + " {\n", + " \"url\": url,\n", + " \"domain\": domain,\n", + " \"domain_length\": domain_length,\n", + " \"p_malicious\": prob,\n", + " \"IsHTTPS\": features[\"IsHTTPS\"],\n", + " \"TLDLegitimateProb\": features[\"TLDLegitimateProb\"],\n", + " \"CharContinuationRate\": features[\"CharContinuationRate\"],\n", + " \"SpacialCharRatioInURL\": features[\"SpacialCharRatioInURL\"],\n", + " \"URLCharProb\": features[\"URLCharProb\"],\n", + " \"LetterRatioInURL\": features[\"LetterRatioInURL\"],\n", + " \"NoOfOtherSpecialCharsInURL\": features[\n", + " \"NoOfOtherSpecialCharsInURL\"\n", + " ],\n", + " \"DomainLength\": features[\"DomainLength\"],\n", + " }\n", + " )\n", + " except Exception as e:\n", + " continue\n", "\n", - " return best_low, best_high, best_rate\n", + " # Sort by probability (highest misclassification confidence first)\n", + " short_url_data = sorted(\n", + " short_url_data, key=lambda x: x[\"p_malicious\"], reverse=True\n", + " )\n", + "\n", + " print(f\"\\n🔍 TOP SHORT LEGITIMATE URLS MISCLASSIFIED AS PHISHING:\")\n", + " print(\"=\" * 80)\n", + " print(\n", + " f\"{'URL':<35} {'Domain':<20} {'P(phish)':<10} {'HTTPS':<6} {'TLD_Prob':<8} {'Dom_Len':<7}\"\n", + " )\n", + " print(\"-\" * 80)\n", + "\n", + " # Show top 15 misclassified short URLs\n", + " for item in short_url_data[:15]:\n", + " url_short = (\n", + " item[\"url\"][:34] if len(item[\"url\"]) <= 34 else item[\"url\"][:31] + \"...\"\n", + " )\n", + " domain_short = (\n", + " item[\"domain\"][:19]\n", + " if len(item[\"domain\"]) <= 19\n", + " else item[\"domain\"][:16] + \"...\"\n", + " )\n", + "\n", + " print(\n", + " f\"{url_short:<35} {domain_short:<20} {item['p_malicious']:<10.3f} {item['IsHTTPS']:<6.0f} {item['TLDLegitimateProb']:<8.3f} {item['DomainLength']:<7.0f}\"\n", + " )\n", + "\n", + " # Analyze patterns in misclassified short URLs\n", + " print(f\"\\n📊 ANALYSIS OF MISCLASSIFIED SHORT URLS:\")\n", + " print(\"-\" * 50)\n", + "\n", + " if short_url_data:\n", + " # Common characteristics\n", + " avg_https = np.mean([item[\"IsHTTPS\"] for item in short_url_data])\n", + " avg_tld_prob = np.mean([item[\"TLDLegitimateProb\"] for item in short_url_data])\n", + " avg_char_cont = np.mean(\n", + " [item[\"CharContinuationRate\"] for item in short_url_data]\n", + " )\n", + " avg_special_ratio = np.mean(\n", + " [item[\"SpacialCharRatioInURL\"] for item in short_url_data]\n", + " )\n", "\n", + " print(f\" Average IsHTTPS: {avg_https:.2f} (0=HTTP, 1=HTTPS)\")\n", + " print(f\" Average TLD Legitimacy Prob: {avg_tld_prob:.3f}\")\n", + " print(f\" Average Character Continuation Rate: {avg_char_cont:.3f}\")\n", + " print(f\" Average Special 
Character Ratio: {avg_special_ratio:.3f}\")\n", "\n", - "# Assume we have the optimal model and threshold from previous analysis\n", - "print(\"THRESHOLD OPTIMIZATION RESULTS:\")\n", - "print(\"=\" * 50)\n", + " # Domain length distribution\n", + " domain_lengths = [item[\"domain_length\"] for item in short_url_data]\n", + " print(\n", + " f\" Domain lengths: min={min(domain_lengths)}, max={max(domain_lengths)}, avg={np.mean(domain_lengths):.1f}\"\n", + " )\n", "\n", - "# 1. Optimal threshold for maximum F1\n", - "print(f\"\\n1. Optimal decision threshold:\")\n", - "print(f\" t_star: {t_star:.3f}\")\n", + " # Common domains\n", + " domains = [item[\"domain\"] for item in short_url_data]\n", + " domain_counts = pd.Series(domains).value_counts()\n", "\n", - "# Calculate F1 at optimal threshold\n", - "y_pred_optimal = (p_mal > t_star).astype(int)\n", - "f1_optimal = f1_score(y_val, y_pred_optimal, average=\"macro\")\n", - "print(f\" F1-macro @t_star: {f1_optimal:.4f}\")\n", + " print(f\"\\n Most frequently misclassified short domains:\")\n", + " for domain, count in domain_counts.head(10).items():\n", + " print(f\" {domain}: {count} times\")\n", "\n", - "# 2. Gray-zone band for judge integration\n", - "print(\"\\n2. Standard gray-zone band (target: 10-15%):\")\n", - "low, high, gray_rate = pick_band_for_target(p_mal, t_star, target=0.12)\n", + " # Feature comparison with training data\n", + " print(f\"\\n🔬 COMPARISON WITH TRAINING DATA DISTRIBUTION:\")\n", + " print(\"-\" * 50)\n", "\n", - "print(f\" Low threshold: {low:.3f}\")\n", - "print(f\" High threshold: {high:.3f}\")\n", - "print(f\" Gray-zone rate: {gray_rate:.1%}\")\n", + " legit_train = X_train[y_train == 1] # Legitimate training samples\n", "\n", - "# Standard decision categories\n", - "decisions = pd.cut(\n", - " p_mal,\n", - " bins=[0, low, high, 1.0],\n", - " labels=[\"ALLOW\", \"REVIEW\", \"BLOCK\"],\n", - " include_lowest=True,\n", - ")\n", + " for feature in [\n", + " \"TLDLegitimateProb\",\n", + " \"CharContinuationRate\",\n", + " \"SpacialCharRatioInURL\",\n", + " \"DomainLength\",\n", + " ]:\n", + " misclass_values = [item[feature] for item in short_url_data]\n", + " train_values = legit_train[feature].values\n", "\n", - "print(\"\\n3. 
Standard decision distribution:\")\n", - "# Fix: Calculate proportions manually instead of using normalize=True\n", - "counts = decisions.value_counts().sort_index()\n", - "proportions = counts / len(decisions)\n", - "for category, prop in proportions.items():\n", - " print(f\" {category}: {prop:.1%} ({counts[category]:,} samples)\")\n", + " misclass_mean = np.mean(misclass_values)\n", + " train_mean = train_values.mean()\n", "\n", + " print(f\" {feature}:\")\n", + " print(f\" Misclassified short URLs: {misclass_mean:.3f}\")\n", + " print(f\" Training legitimate URLs: {train_mean:.3f}\")\n", + " print(f\" Difference: {misclass_mean - train_mean:.3f}\")\n", "\n", - "# Enhanced decision logic with judge integration\n", - "def enhanced_decision_logic(url, ml_confidence, low_thresh, high_thresh):\n", - " \"\"\"\n", - " Enhanced decision logic with judge integration for edge cases.\n", + "else:\n", + " print(\"✅ No legitimate URLs misclassified as phishing!\")\n", "\n", - " Parameters:\n", - " - url: The URL being evaluated\n", - " - ml_confidence: ML model confidence (p_malicious)\n", - " - low_thresh, high_thresh: Gray-zone boundaries\n", + "print(\"\\n\" + \"=\" * 60)\n", + "print(\"SPECIFIC EXAMPLES: WELL-KNOWN DOMAINS\")\n", + "print(\"=\" * 60)\n", "\n", - " Returns:\n", - " - decision: \"ALLOW\", \"REVIEW\", or \"BLOCK\"\n", - " - reasoning: Explanation of the decision\n", - " \"\"\"\n", + "# Test specific well-known short domains\n", + "test_domains = [\n", + " \"https://google.com\",\n", + " \"https://github.com\",\n", + " \"https://fb.com\",\n", + " \"https://bit.ly\",\n", + " \"https://t.co\",\n", + " \"https://apple.com\",\n", + " \"https://amazon.com\",\n", + " \"https://microsoft.com\",\n", + "]\n", "\n", - " # Extract domain for judge criteria\n", - " from urllib.parse import urlparse\n", + "print(f\"{'URL':<25} {'P(phishing)':<12} {'Classification':<15} {'Status':<10}\")\n", + "print(\"-\" * 62)\n", "\n", + "for url in test_domains:\n", " try:\n", - " domain = urlparse(url).netloc.lower()\n", - " except:\n", - " domain = \"\"\n", - "\n", - " # High confidence cases\n", - " if ml_confidence < low_thresh:\n", - " return \"ALLOW\", f\"High confidence legitimate (p={ml_confidence:.3f})\"\n", - " elif ml_confidence > high_thresh:\n", - " return \"BLOCK\", f\"High confidence malicious (p={ml_confidence:.3f})\"\n", + " # Extract features for this URL\n", + " features = extract_features(url, include_https=True)\n", + " feature_array = np.array([features[feat] for feat in OPTIMAL_FEATURES]).reshape(\n", + " 1, -1\n", + " )\n", "\n", - " # Gray-zone: Check for judge criteria\n", - " else:\n", - " # Known legitimate short domains that might confuse the model\n", - " well_known_domains = {\n", - " \"google.com\",\n", - " \"github.com\",\n", - " \"microsoft.com\",\n", - " \"amazon.com\",\n", - " \"apple.com\",\n", - " \"facebook.com\",\n", - " \"twitter.com\",\n", - " \"linkedin.com\",\n", - " \"youtube.com\",\n", - " \"wikipedia.org\",\n", - " \"stackoverflow.com\",\n", - " }\n", + " # Get probability\n", + " p_phish = best_model.predict_proba(feature_array)[0, 0]\n", "\n", - " if domain in well_known_domains:\n", - " return \"ALLOW\", f\"Well-known domain override (p={ml_confidence:.3f})\"\n", - " elif len(domain) <= 10 and ml_confidence < 0.5:\n", - " return (\n", - " \"REVIEW\",\n", - " f\"Short domain, moderate confidence (p={ml_confidence:.3f})\",\n", + " # Classify\n", + " if p_phish >= 0.5:\n", + " classification = \"PHISHING\"\n", + " status = (\n", + " \"❌ WRONG\"\n", + " if 
any(\n", +     "                known in url.lower()\n", +     "                for known in [\"google\", \"github\", \"apple\", \"amazon\", \"microsoft\"]\n", +     "            )\n", +     "            else \"⚠️  CHECK\"\n", +     "        )\n", "    else:\n", -    "        return \"REVIEW\", f\"Uncertain classification (p={ml_confidence:.3f})\"\n", -    "\n", -    "\n", -    "# Test enhanced decision logic on sample cases\n", -    "print(\"\\n4. Enhanced Decision Logic Examples:\")\n", -    "print(\"-\" * 40)\n", -    "\n", -    "test_cases = [\n", -    "    (\"https://images.google.com\", 0.25),\n", -    "    (\"https://github.com/user/repo\", 0.35),\n", -    "    (\"https://suspicious-bank-login.com\", 0.45),\n", -    "    (\"https://definitely-phishing-site.evil\", 0.85),\n", -    "    (\"https://legitimate-company.com\", 0.15),\n", -    "]\n", -    "\n", -    "for url, confidence in test_cases:\n", -    "    decision, reasoning = enhanced_decision_logic(url, confidence, low, high)\n", -    "    print(f\"  URL: {url[:40]:<40} | Decision: {decision:<6} | {reasoning}\")\n", +    "        classification = \"LEGITIMATE\"\n", +    "        status = \"✅ CORRECT\"\n", "\n", -    "# Save thresholds for production use\n", -    "threshold_config = {\n", -    "    \"optimal_threshold\": float(t_star),\n", -    "    \"gray_zone_low\": float(low),\n", -    "    \"gray_zone_high\": float(high),\n", -    "    \"gray_zone_rate\": float(gray_rate),\n", -    "    \"f1_score_at_optimal\": float(f1_optimal),\n", -    "    \"decision_distribution\": {\n", -    "        \"allow_rate\": float(proportions.get(\"ALLOW\", 0)),\n", -    "        \"review_rate\": float(proportions.get(\"REVIEW\", 0)),\n", -    "        \"block_rate\": float(proportions.get(\"BLOCK\", 0)),\n", -    "    },\n", -    "}\n", +    "    print(f\"{url:<25} {p_phish:<12.3f} {classification:<15} {status:<10}\")\n", "\n", -    "print(f\"\\n5. Configuration saved to: {THRESH_PATH}\")\n", -    "with open(THRESH_PATH, \"w\") as f:\n", -    "    json.dump(threshold_config, f, indent=2)\n", +    "    except Exception:\n", +    "        print(f\"{url:<25} {'ERROR':<12} {'FAILED':<15} {'❌ ERROR':<10}\")\n", "\n", -    "print(\"\\nThreshold optimization complete! ✓\")\n" +    "print(\"\\n💡 INSIGHTS:\")\n", +    "print(\"- Look for patterns in misclassified short domains\")\n", +    "print(\"- Check if TLD legitimacy probabilities are unusually low\")\n", +    "print(\"- Verify if domain length feature is causing bias against short domains\")\n", +    "print(\"- Consider adding domain whitelist for well-known legitimate short domains\")\n" ] }, { "cell_type": "markdown", -   "id": "1c644030", +   "id": "1fb60aad", "metadata": {}, "source": [ "\n", +    "**The Key Insight**\n", "\n", -    "### **SECTION 6: Model Performance Evaluation**\n", +    "- **Consider these critical facts:**\n", "\n", -    "**Purpose:** Evaluate trained models to identify the best performer and validate training quality.\n", +    "  1. The Model's Performance is EXCELLENT\n", "\n", -    "**Workflow:**\n", -    "1. **Model Selection** - Compare performance metrics across candidates\n", -    "2. **Training Quality Assessment** - Validate model reliability and detect potential issues\n", +    "    ```\n", +    "    Validation prediction distribution:\n", +    "    Extreme phishing (p >= 0.99): 19,490 (41.5%)  ← Confident phishing\n", +    "    Moderate (0.01 < p < 0.99):    1,557 (3.3%)   ← Uncertain\n", +    "    Extreme legit (p <= 0.01):    25,906 (55.2%)  ← Confident legit\n", +    "    ```\n", +    "    - 96.7% of predictions are confident! Only 3.3% are uncertain.\n", "\n", -    "**Key Deliverables:**\n", -    "- Best performing model identification\n", -    "- Model validation report" +    "  2. 
Misclassification Rate is TINY\n", +    "  \n", +    "    ```\n", +    "    Total legitimate samples in validation: 26,970\n", +    "    Legitimate samples misclassified as phishing: 23\n", +    "    Misclassification rate: 0.09%\n", +    "    ```\n", +    "    - Only 23 out of 26,970 legitimate URLs are misclassified!\n", +    "    - That's 99.91% accuracy on legitimate URLs!\n", +    "\n", +    "  3. But Why Do google.com, github.com, etc. Get 1.0?\n", +    "    - Because they're NOT in the training data!\n", +    "    - The training data comes from the PhiUSIIL dataset, which:\n", +    "\n", +    "       - Focused on obscure/suspicious URLs\n", +    "       - Didn't include major tech companies\n", +    "       - Used URLs from the 2019-2020 era\n", +    "\n", +    "    - google.com, github.com, amazon.com are OUT-OF-DISTRIBUTION for this model!\n", +    "\n", +    "**The Real Issue: Distribution Shift**\n", +    "\n", +    "***Training Data Characteristics***\n", +    " \n", +    "    ```\n", +    "    Training legitimate URLs:\n", +    "    Average TLDLegitimateProb: 0.709\n", +    "    Average DomainLength: 19.2 characters\n", +    "    TLDs: Mostly .com, .org, .net, .edu from dataset\n", +    "    ```\n", +    "***google.com Characteristics***\n", +    "\n", +    "    ```\n", +    "    google.com:\n", +    "    TLDLegitimateProb: 0.6111  ← Lower than training average!\n", +    "    DomainLength: 10  ← Much shorter than training average!\n", +    "    Pattern: Very short, very simple → looks \"suspicious\" to the model\n", +    "    ```\n", +    "**Why? Because:**\n", +    "\n", +    "- Training data has longer domains (avg 19 chars)\n", +    "- Training data has higher TLD probs (avg 0.71)\n", +    "- google.com is shorter (10 chars) with lower TLD prob (0.61)\n", +    "- To the model: \"This domain is too short and has an unusual TLD probability → probably phishing!\"" ] }, { @@ -1318,14 +1256,14 @@ "id": "f9e9ab19", "metadata": {}, "source": [ -    "#### **6.1 Model Validation & Quality Assurance**\n", +    "#### **5.3 Model Validation & Quality Assurance**\n", "\n", "**Purpose:** Validate model performance and ensure training quality through comprehensive checks." ] }, { "cell_type": "code", -   "execution_count": 22, +   "execution_count": 14, "id": "40bd7eb8", "metadata": {}, "outputs": [ @@ -1579,13 +1517,13 @@ "id": "e59a5e29", "metadata": {}, "source": [ -    "#### **6.2 Validation & QA**\n", +    "#### **5.4 Validation & QA**\n", "- Purpose: F1/PR-AUC/Brier checks, confusion sanity, outlier inspection" ] }, { "cell_type": "code", -   "execution_count": 16, +   "execution_count": 15, "id": "feeeb978", "metadata": {}, "outputs": [ @@ -1702,14 +1640,14 @@ "id": "42d64a2b", "metadata": {}, "source": [ -    "### **SECTION 7: Model Comparison & Ablation Analysis**\n", +    "### **SECTION 6: Model Comparison & Ablation Analysis**\n", "\n", "**Objective:** Compare 7-feature vs 8-feature models to understand the impact of HTTPS feature on performance."
] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 18, "id": "cdf76bda", "metadata": {}, "outputs": [ @@ -1741,7 +1679,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [00:33:12] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", + "d:\\MLops\\NetworkSecurity\\venv\\Lib\\site-packages\\xgboost\\core.py:158: UserWarning: [11:11:02] WARNING: C:\\buildkite-agent\\builds\\buildkite-windows-cpu-autoscaling-group-i-08cbc0333d8d4aae1-1\\xgboost\\xgboost-ci-windows\\src\\learner.cc:740: \n", "Parameters: { \"verbose\" } are not used.\n", "\n", " warnings.warn(smsg, UserWarning)\n" @@ -2151,7 +2089,7 @@ "id": "3703916e", "metadata": {}, "source": [ - "### **SECTION 8: Model Artifact Persistence**" + "### **SECTION 7: Model Artifact Persistence**" ] }, { @@ -2159,7 +2097,7 @@ "id": "966393dc", "metadata": {}, "source": [ - "#### **SECTION 8.1: Model Artifact Persistence**" + "#### **SECTION 7.1: Model Artifact Persistence**" ] }, { @@ -2334,7 +2272,7 @@ "id": "0e9c8af7", "metadata": {}, "source": [ - "#### **SECTION 8.2 MLflow Experiment Logging**" + "#### **SECTION 7.2 MLflow Experiment Logging**" ] }, { diff --git a/notebooks/02_baseline_and_calibration.ipynb b/notebooks/archive/02_baseline_and_calibration.ipynb similarity index 100% rename from notebooks/02_baseline_and_calibration.ipynb rename to notebooks/archive/02_baseline_and_calibration.ipynb diff --git a/notebooks/03_ablation_url_only.ipynb b/notebooks/archive/03_ablation_url_only.ipynb similarity index 100% rename from notebooks/03_ablation_url_only.ipynb rename to notebooks/archive/03_ablation_url_only.ipynb diff --git a/notebooks/04_robustness_checks.ipynb b/notebooks/archive/04_robustness_checks.ipynb similarity index 100% rename from notebooks/04_robustness_checks.ipynb rename to notebooks/archive/04_robustness_checks.ipynb
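
The spot check above ends with a recommendation to "consider adding domain whitelist for well-known legitimate short domains." Below is a minimal sketch of that override, assuming the notebook's `extract_features` helper, `best_model`, and `OPTIMAL_FEATURES` list are in scope; the wrapper name `predict_with_whitelist` and the seed domain set are illustrative, not part of the repository.

```python
from urllib.parse import urlparse

import numpy as np

# Seed list taken from the spot-check table; extend as needed.
WELL_KNOWN_DOMAINS = {
    "google.com", "github.com", "fb.com", "bit.ly", "t.co",
    "apple.com", "amazon.com", "microsoft.com",
}


def predict_with_whitelist(url, model, feature_names, threshold=0.5):
    """Return (p_malicious, decision); whitelisted domains bypass the model."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    if domain in WELL_KNOWN_DOMAINS:
        # Override: known-legitimate short domain that is out-of-distribution
        # for the PhiUSIIL-trained model.
        return 0.0, "ALLOW"
    feats = extract_features(url, include_https=True)  # assumed notebook helper
    x = np.array([feats[f] for f in feature_names]).reshape(1, -1)
    p_mal = model.predict_proba(x)[0, 0]  # column 0 = phishing, as in the spot check
    return p_mal, ("BLOCK" if p_mal >= threshold else "ALLOW")
```

Under these assumptions, `predict_with_whitelist("https://github.com", best_model, OPTIMAL_FEATURES)` would return `(0.0, "ALLOW")` rather than the raw model's confident PHISHING verdict.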
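The distribution-shift analysis suggests a second guard: when a URL's features sit far outside the legitimate training distribution (the misclassified short domains average DomainLength 12 vs. 19.2 in training, and TLDLegitimateProb 0.159 vs. 0.709), a confident model score should not be trusted. The sketch below routes such cases to REVIEW. It assumes the notebook's `X_train`/`y_train` split (label 1 = legitimate) and reuses the 0.004/0.999 gray-zone band; the names `ood_zscore`/`route` and the `z_cut=3.0` cutoff are placeholders, not project code.

```python
import numpy as np

# Per-feature mean/std over legitimate training rows (label 1 = legitimate,
# matching the notebook's convention).
legit_stats = X_train[y_train == 1].agg(["mean", "std"])


def ood_zscore(feature_row):
    """Largest absolute z-score of a feature vector vs. legitimate training data."""
    z = (feature_row - legit_stats.loc["mean"]) / (legit_stats.loc["std"] + 1e-9)
    return float(np.abs(z).max())


def route(p_mal, feature_row, low=0.004, high=0.999, z_cut=3.0):
    """Three-tier ALLOW/REVIEW/BLOCK decision with an OOD escape hatch."""
    if ood_zscore(feature_row) > z_cut:
        # e.g. the misclassified short domains sit ~7 chars below the training
        # DomainLength mean; a confident score is not trustworthy here.
        return "REVIEW"
    if p_mal < low:
        return "ALLOW"
    if p_mal > high:
        return "BLOCK"
    return "REVIEW"
```

A per-feature z-score is the simplest possible out-of-distribution signal (Mahalanobis distance or a density model would be natural upgrades); the point is only that REVIEW, not BLOCK, is the right default for inputs the model has never seen.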