Refactor and expand social media analyzer #8
This commit refactors the existing Facebook and scam analyzers into a single, generic social media analyzer. The new `social_media_analyzer` supports the following platforms:
- Facebook
- Instagram
- WhatsApp
- TikTok
- Tinder
- Snapchat
- WeChat

The fake profile detector and scam message analyzer have been generalized to be platform-aware. The user is now prompted to select a platform before performing an analysis. The old `facebook_analyzer` and `scam_detector` directories have been removed.
Reviewer's Guide
Refactors and unifies the Facebook and scam analyzers into a single social_media_analyzer package that supports seven platforms by consolidating shared logic, introducing platform-aware heuristics, and providing a unified CLI interface.
Entity relationship diagram for platform-specific advice and legitimate domains
erDiagram
PLATFORM_SPECIFIC_ADVICE {
string platform
list advice
}
LEGITIMATE_DOMAINS {
string platform
list domains
}
PLATFORM_SPECIFIC_ADVICE ||--|{ LEGITIMATE_DOMAINS : "platform"
Class diagram for the new social_media_analyzer package
classDiagram
class fake_profile_detector {
+analyze_profile_based_on_user_input(profile_url, platform)
+guide_reverse_image_search(image_url=None)
+print_platform_specific_advice(platform)
PLATFORM_SPECIFIC_ADVICE : dict
FAKE_PROFILE_INDICATORS : list
}
class scam_detector {
+analyze_text_for_scams(text_content, platform=None)
+is_url_suspicious(url, platform=None)
+get_domain_from_url(url)
+get_legitimate_domains(platform=None)
}
class heuristics {
LEGITIMATE_DOMAINS : dict
URGENCY_KEYWORDS : list
SENSITIVE_INFO_KEYWORDS : list
TOO_GOOD_TO_BE_TRUE_KEYWORDS : list
GENERIC_GREETINGS : list
TECH_SUPPORT_SCAM_KEYWORDS : list
PAYMENT_KEYWORDS : list
URL_PATTERN : regex
SUSPICIOUS_TLDS : list
CRYPTO_ADDRESS_PATTERNS : dict
PHONE_NUMBER_PATTERN : regex
SUSPICIOUS_URL_PATTERNS : list
HEURISTIC_WEIGHTS : dict
}
class main {
+main()
}
fake_profile_detector --|> heuristics : uses
scam_detector --|> heuristics : uses
main --> fake_profile_detector : imports
main --> scam_detector : imports
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `social_media_analyzer/scam_detector.py:29` </location>
<code_context>
- url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
- return re.findall(url_pattern, text)
-
-def get_domain_from_url(url):
- """Extracts the domain (e.g., 'example.com') from a URL."""
- if "://" in url:
- domain = url.split("://")[1].split("/")[0].split("?")[0]
- else: # Handles www.example.com cases without http(s)
</code_context>
<issue_to_address>
Domain extraction logic may not handle URLs with subdomains or ports correctly.
Splitting by delimiters may fail for URLs with subdomains, ports, or credentials. Use urllib.parse.urlparse for reliable domain extraction.
</issue_to_address>
### Comment 2
<location> `social_media_analyzer/scam_detector.py:66` </location>
<code_context>
+ return True, f"URL uses a potentially suspicious TLD."
+
+ # 4. Check if a known legitimate service name is part of the domain, but it's not official
+ for service in LEGITIMATE_DOMAINS.keys():
+ if service != "general" and service in domain:
+ return True, f"URL contains the name of a legitimate service ('{service}') but is not an official domain."
+
</code_context>
<issue_to_address>
Service name substring check may produce false positives for legitimate domains.
This logic may incorrectly flag official domains as suspicious. Please update the check to distinguish between legitimate and unofficial uses of service names.
</issue_to_address>
### Comment 3
<location> `social_media_analyzer/scam_detector.py:94` </location>
<code_context>
- "PAYMENT_REQUEST": PAYMENT_KEYWORDS,
- }
-
- for category, keywords in keyword_checks.items():
- for keyword in keywords:
- if keyword in text_lower:
- message = f"Presence of '{category.replace('_', ' ').title()}' keyword: '{keyword}'"
</code_context>
<issue_to_address>
Simple substring matching for keywords may lead to false positives.
Consider using regular expressions with word boundaries to avoid matching keywords within other words.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# 1. Keyword-based checks
keyword_checks = {
"URGENCY": URGENCY_KEYWORDS,
"SENSITIVE_INFO": SENSITIVE_INFO_KEYWORDS,
"TOO_GOOD_TO_BE_TRUE": TOO_GOOD_TO_BE_TRUE_KEYWORDS,
"GENERIC_GREETING": GENERIC_GREETINGS,
"TECH_SUPPORT": TECH_SUPPORT_SCAM_KEYWORDS,
"PAYMENT_REQUEST": PAYMENT_KEYWORDS,
}
for category, keywords in keyword_checks.items():
for keyword in keywords:
if keyword in text_lower:
message = f"Presence of '{category.replace('_', ' ').title()}' keyword: '{keyword}'"
if message not in indicators_found:
indicators_found.append(message)
score += HEURISTIC_WEIGHTS.get(category, 1.0)
=======
import re
# 1. Keyword-based checks
keyword_checks = {
"URGENCY": URGENCY_KEYWORDS,
"SENSITIVE_INFO": SENSITIVE_INFO_KEYWORDS,
"TOO_GOOD_TO_BE_TRUE": TOO_GOOD_TO_BE_TRUE_KEYWORDS,
"GENERIC_GREETING": GENERIC_GREETINGS,
"TECH_SUPPORT": TECH_SUPPORT_SCAM_KEYWORDS,
"PAYMENT_REQUEST": PAYMENT_KEYWORDS,
}
for category, keywords in keyword_checks.items():
for keyword in keywords:
# Use regex with word boundaries to avoid matching keywords within other words
pattern = r"\b" + re.escape(keyword) + r"\b"
if re.search(pattern, text_lower):
message = f"Presence of '{category.replace('_', ' ').title()}' keyword: '{keyword}'"
if message not in indicators_found:
indicators_found.append(message)
score += HEURISTIC_WEIGHTS.get(category, 1.0)
>>>>>>> REPLACE
</suggested_fix>
def get_domain_from_url(url):
    """Extracts the domain (e.g., 'example.com') from a URL."""
    if "://" in url:
issue: Domain extraction logic may not handle URLs with subdomains or ports correctly.
Splitting by delimiters may fail for URLs with subdomains, ports, or credentials. Use urllib.parse.urlparse for reliable domain extraction.
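A minimal sketch of what the `urllib.parse.urlparse` approach could look like, assuming the function should return only the hostname. The `//` prefix for scheme-less input is an assumption added here so that `www.example.com` style strings still parse; it is not taken from the PR.

```python
from urllib.parse import urlparse


def get_domain_from_url(url):
    """Return the lowercased hostname of a URL, or '' if none can be parsed."""
    # urlparse only fills in the network location when a scheme or a leading
    # '//' is present, so bare inputs such as 'www.example.com/path' get a
    # '//' prefix first (assumption: such inputs should still be accepted).
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    # .hostname drops any 'user:pass@' credentials and any ':port' suffix.
    return (urlparse(url).hostname or "").lower()
```

Note that this still returns the full hostname, subdomains included (for example `login.example.com`); reducing that to a registrable domain would need a public-suffix list, which the standard library does not provide.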
for service in LEGITIMATE_DOMAINS.keys():
    if service != "general" and service in domain:
issue (bug_risk): Service name substring check may produce false positives for legitimate domains.
This logic may incorrectly flag official domains as suspicious. Please update the check to distinguish between legitimate and unofficial uses of service names.
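One way to make that distinction, sketched under the assumption that `LEGITIMATE_DOMAINS` maps each service name to a list of its official domains (as the entity relationship diagram suggests). The helper names `is_official_domain` and `impersonated_service` are hypothetical and do not appear in the PR.

```python
def is_official_domain(domain, official_domains):
    """True if domain is one of the official domains or a subdomain of one."""
    domain = domain.lower().rstrip(".")
    # Require an exact match or a '.'-separated suffix, so that
    # 'notinstagram.com' is not mistaken for 'instagram.com'.
    return any(domain == d or domain.endswith("." + d) for d in official_domains)


def impersonated_service(domain, legitimate_domains):
    """Return the service name a non-official domain borrows, or None."""
    for service, official in legitimate_domains.items():
        if service == "general":
            continue
        if service in domain.lower() and not is_official_domain(domain, official):
            return service
    return None
```

With a sample mapping such as `{"instagram": ["instagram.com"]}`, `www.instagram.com` passes while `instagram.security-update.com` is reported as borrowing the `instagram` name.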
# 1. Keyword-based checks
keyword_checks = {
    "URGENCY": URGENCY_KEYWORDS,
    "SENSITIVE_INFO": SENSITIVE_INFO_KEYWORDS,
    "TOO_GOOD_TO_BE_TRUE": TOO_GOOD_TO_BE_TRUE_KEYWORDS,
    "GENERIC_GREETING": GENERIC_GREETINGS,
    "TECH_SUPPORT": TECH_SUPPORT_SCAM_KEYWORDS,
    "PAYMENT_REQUEST": PAYMENT_KEYWORDS,
}

for category, keywords in keyword_checks.items():
    for keyword in keywords:
        if keyword in text_lower:
            message = f"Presence of '{category.replace('_', ' ').title()}' keyword: '{keyword}'"
            if message not in indicators_found:
                indicators_found.append(message)
                score += HEURISTIC_WEIGHTS.get(category, 1.0)
suggestion: Simple substring matching for keywords may lead to false positives.
Consider using regular expressions with word boundaries to avoid matching keywords within other words.
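To illustrate the difference, here is a small sketch; the helper name `contains_keyword` and the keyword `win` are hypothetical examples, not taken from the project's keyword lists.

```python
import re


def contains_keyword(text, keyword):
    """True if keyword occurs in text as a whole word or phrase, ignoring case."""
    # re.escape keeps characters such as '.' or '?' in a keyword literal.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


# Plain substring matching fires on 'window'; the word-boundary version does not.
substring_hit = "win" in "open the window"                  # True
boundary_hit = contains_keyword("Open the window", "win")   # False
```

One caveat: `\b` only matches next to a word character, so a keyword that begins or ends with punctuation (for example `$$$`) would never match with this pattern and may need separate handling.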
if re.search(pattern, normalized_url, re.IGNORECASE):
    if not domain.endswith(tuple(legitimate_domains)):
        return True, f"URL impersonates a legitimate domain: {pattern}"
suggestion (code-quality): Merge nested if conditions (merge-nested-ifs)
if re.search(pattern, normalized_url, re.IGNORECASE):
    if not domain.endswith(tuple(legitimate_domains)):
        return True, f"URL impersonates a legitimate domain: {pattern}"
if re.search(pattern, normalized_url, re.IGNORECASE) and not domain.endswith(tuple(legitimate_domains)):
    return True, f"URL impersonates a legitimate domain: {pattern}"
Explanation
Too much nesting can make code difficult to understand, and this is especially true in Python, where there are no brackets to help delineate the different nesting levels.
Reading deeply nested code is confusing, since you have to keep track of which conditions relate to which levels. We therefore strive to reduce nesting where possible, and the situation where two `if` conditions can be combined using `and` is an easy win.
google_url = f"https://images.google.com/searchbyimage?image_url={image_url}"
tineye_url = f"https://tineye.com/search?url={image_url}"
print(f"Attempting to open Google Images: {google_url}")
webbrowser.open(google_url)
print(f"Attempting to open TinEye: {tineye_url}")
webbrowser.open(tineye_url)
issue (code-quality): Extract code out into function (extract-method)
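A possible shape for the extracted code, as a sketch only. The function names `build_reverse_image_search_urls` and `open_reverse_image_searches` and the `open_url` parameter are hypothetical. Percent-encoding the image URL with `urllib.parse.quote` is an addition assumed here, not something the PR does.

```python
import webbrowser
from urllib.parse import quote


def build_reverse_image_search_urls(image_url):
    """Return the reverse image search URLs for an image, keyed by service name."""
    # Assumption: percent-encode the image URL so that its own '?' and '&'
    # stay inside the query parameter instead of splitting it.
    encoded = quote(image_url, safe="")
    return {
        "Google Images": f"https://images.google.com/searchbyimage?image_url={encoded}",
        "TinEye": f"https://tineye.com/search?url={encoded}",
    }


def open_reverse_image_searches(image_url, open_url=webbrowser.open):
    """Print and open each search URL; open_url can be swapped out in tests."""
    for name, url in build_reverse_image_search_urls(image_url).items():
        print(f"Attempting to open {name}: {url}")
        open_url(url)
```

Separating URL construction from the browser call keeps the string logic testable without launching a browser.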
    profile_url = input(f"Enter the {platform.capitalize()} profile URL to analyze: ").strip()
    if profile_url:
        fake_profile_detector.analyze_profile_based_on_user_input(profile_url, platform)
    else:
        print("No profile URL entered.")
    break
elif analysis_choice == 2:
    message = input("Paste the message you want to analyze: ").strip()
issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)
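As a rough illustration of the named-expression form (Python 3.8+), the prompt-then-check pattern could be written as below. The wrapper `prompt_for_profile_url` and its `read` parameter are hypothetical, introduced only so the snippet can run without interactive input.

```python
def prompt_for_profile_url(platform, read=input):
    """Ask for a profile URL; return it stripped, or None if nothing was entered."""
    prompt = f"Enter the {platform.capitalize()} profile URL to analyze: "
    # The walrus operator assigns the stripped input and tests it in one step.
    if profile_url := read(prompt).strip():
        return profile_url
    print("No profile URL entered.")
    return None
```

The same pattern would apply to the message prompt in the second branch, which is presumably the other occurrence the [×2] refers to.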
def is_url_suspicious(url, platform=None):
    """
    Checks if a URL is suspicious based on various patterns and lists.
issue (code-quality): We've found these issues:
- Use the built-in function `next` instead of a for-loop (use-next)
- Replace f-string with no interpolated values with string (remove-redundant-fstring)
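The review does not show which loop it means, so the following is only a sketch of both suggestions applied to a suspicious-TLD check. The helper names and the "No suspicious TLD found." message are hypothetical.

```python
def find_suspicious_tld(domain, suspicious_tlds):
    """Return the first listed TLD that the domain ends with, or None."""
    # next() with a default replaces a for-loop that returns on the first match.
    return next((tld for tld in suspicious_tlds if domain.endswith(tld)), None)


def check_tld(domain, suspicious_tlds):
    """Return a (is_suspicious, reason) pair for the domain's TLD."""
    if find_suspicious_tld(domain, suspicious_tlds) is not None:
        # A plain string: the message has no placeholders, so no f-prefix.
        return True, "URL uses a potentially suspicious TLD."
    return False, "No suspicious TLD found."
```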
# Example Usage
test_message = "URGENT: Your Instagram account has unusual activity. Please verify your account now by clicking http://instagram.security-update.com/login to avoid suspension."
analysis_result = analyze_text_for_scams(test_message, platform="instagram")
print(f"--- Analyzing Instagram Scam Message ---")
suggestion (code-quality): Replace f-string with no interpolated values with string (remove-redundant-fstring)
print(f"--- Analyzing Instagram Scam Message ---")
print("--- Analyzing Instagram Scam Message ---")
Summary by Sourcery
Refactor existing Facebook and scam analyzers into a single generic social media analyzer supporting seven platforms, generalizing fake profile and scam message detection under one CLI and cleaning up deprecated code