diff --git a/README.md b/README.md index b1fa028..5c9a197 100644 --- a/README.md +++ b/README.md @@ -49,6 +49,8 @@ Grok Search MCP 是一个基于 [FastMCP](https://github.com/jlowin/fastmcp) 构 - ✅ 动态模型切换(支持切换不同 Grok 模型并持久化保存) - ✅ **工具路由控制(一键禁用官方 WebSearch/WebFetch,强制使用 GrokSearch)** - ✅ **自动时间注入(搜索时自动获取本地时间,确保时间相关查询的准确性)** +- ✅ **多平台支持(自动检测 xAI 官方 / OpenRouter / 通用 OpenAI 兼容平台)** +- ✅ **`GROK_MODEL` 环境变量支持(可配置 `:online` 后缀启用 OpenRouter 原生搜索)** - ✅ 可扩展架构,支持添加其他搜索 Provider @@ -92,7 +94,7 @@ wget -qO- https://astral.sh/uv/install.sh | sh ### Step 1. 安装 Grok Search MCP 使用 `claude mcp add-json` 一键安装并配置: -**注意:** 需要替换 **GROK_API_URL** 以及 **GROK_API_KEY**这两个字段为你自己的站点以及密钥,目前只支持openai格式,所以如果需要使用grok,也需要使用转为openai格式的grok镜像站 +**注意:** 需要替换 **GROK_API_URL**、**GROK_API_KEY** 以及可选的 **GROK_MODEL** 字段。支持 xAI 官方 API、OpenRouter 以及其他 OpenAI 兼容平台。 ```bash claude mcp add-json grok-search --scope user '{ @@ -104,8 +106,9 @@ claude mcp add-json grok-search --scope user '{ "grok-search" ], "env": { - "GROK_API_URL": "https://your-api-endpoint.com/v1", - "GROK_API_KEY": "your-api-key-here" + "GROK_API_URL": "https://openrouter.ai/api/v1", + "GROK_API_KEY": "your-api-key-here", + "GROK_MODEL": "x-ai/grok-4-fast:online" } }' ``` @@ -143,7 +146,52 @@ claude mcp list - API Key 是否有效 - 网络连接是否正常 -### Step 3. 配置系统提示词 +### Step 3. 
多平台支持 + +Grok Search MCP 支持多种 API 平台,根据 `GROK_API_URL` 自动检测: + +| 平台 | URL 特征 | 搜索能力 | 配置示例 | +|------|---------|---------|---------| +| **OpenRouter** | `openrouter.ai` | ✅ 原生实时搜索(需 `:online` 后缀) | `GROK_MODEL=x-ai/grok-4-fast:online` | +| **xAI 官方** | `api.x.ai` | ✅ 原生实时搜索(规划中) | `GROK_MODEL=grok-4-fast` | +| **通用平台** | 其他 URL | ⚠️ 基于 Prompt 的搜索 | `GROK_MODEL=grok-4-fast` | + +#### OpenRouter 用户(推荐) + +使用 OpenRouter 时,在模型名后添加 `:online` 后缀即可启用**原生实时网络搜索**: + +```bash +claude mcp add-json grok-search --scope user '{ + "type": "stdio", + "command": "uvx", + "args": [ + "--from", + "git+https://github.com/GuDaStudio/GrokSearch", + "grok-search" + ], + "env": { + "GROK_API_URL": "https://openrouter.ai/api/v1", + "GROK_API_KEY": "sk-or-v1-xxx", + "GROK_MODEL": "x-ai/grok-4-fast:online" + } +}' +``` + +> **💡 提示**:`:online` 是 OpenRouter 特有的模型变体后缀,会自动启用 xAI 的原生 Web Search + X Search 能力。搜索结果附带真实来源引用。 + +#### 环境变量说明 + +| 环境变量 | 必填 | 默认值 | 说明 | +|---------|------|--------|------| +| `GROK_API_URL` | ✅ | - | API 端点 URL | +| `GROK_API_KEY` | ✅ | - | API 密钥 | +| `GROK_MODEL` | ❌ | `grok-4-fast` | 模型 ID(OpenRouter 用户建议加 `:online` 后缀) | +| `GROK_PROVIDER` | ❌ | 自动检测 | 手动指定平台类型:`xai` / `openrouter` / `generic` | +| `GROK_DEBUG` | ❌ | `false` | 启用调试模式 | +| `GROK_LOG_LEVEL` | ❌ | `INFO` | 日志级别 | +| `GROK_LOG_DIR` | ❌ | `logs` | 日志目录 | + +### Step 4. 配置系统提示词 为了更好的使用Grok Search 可以通过配置Claude Code或者类似的系统提示词来对整体Vibe Coding Cli进行优化,以Claude Code 为例可以编辑 ~/.claude/CLAUDE.md中追加下面内容,提供了两版使用详细版更能激活工具的能力: **💡 提示**:现在可以使用 `toggle_builtin_tools` 工具一键禁用官方 WebSearch/WebFetch,强制路由到 GrokSearch! 
@@ -465,6 +513,15 @@ A: 注册第三方平台 → 获取 API Endpoint 和 Key → 使用 `claude mcp **Q: 配置后如何验证?** A: 在 Claude 对话中说"显示 grok-search 配置信息",查看连接测试结果 +**Q: 使用 OpenRouter 时搜索结果不准确?** +A: 请确保 `GROK_MODEL` 设置了 `:online` 后缀(如 `x-ai/grok-4-fast:online`)。没有 `:online` 后缀时,模型没有实时搜索能力,结果基于训练数据生成。 + +**Q: 支持哪些 API 平台?** +A: 支持所有 OpenAI 兼容格式的 API 平台。推荐使用 OpenRouter(支持原生搜索)或 xAI 官方 API。代码会根据 `GROK_API_URL` 自动检测平台类型。 + +**Q: `:online` 后缀是什么?** +A: 这是 OpenRouter 特有的模型变体后缀,会自动启用该模型的实时网络搜索能力。对于 xAI 模型,会同时启用 Web Search 和 X (Twitter) Search。 + ## 许可证 本项目采用 [MIT License](LICENSE) 开源。 diff --git a/docs/README_EN.md b/docs/README_EN.md index cd62a2a..7c759fa 100644 --- a/docs/README_EN.md +++ b/docs/README_EN.md @@ -49,6 +49,8 @@ Comparison with other search solutions: - ✅ Dynamic model switching (switch between Grok models with persistent settings) - ✅ **Tool routing control (one-click disable built-in WebSearch/WebFetch, force use GrokSearch)** - ✅ **Automatic time injection (automatically gets local time during search for accurate time-sensitive queries)** +- ✅ **Multi-platform support (auto-detect xAI Official / OpenRouter / Generic OpenAI-compatible)** +- ✅ **`GROK_MODEL` environment variable (configure `:online` suffix for OpenRouter native search)** - ✅ Extensible architecture for additional search providers ## Quick Start @@ -101,8 +103,9 @@ claude mcp add-json grok-search --scope user '{ "grok-search" ], "env": { - "GROK_API_URL": "https://your-api-endpoint.com/v1", - "GROK_API_KEY": "your-api-key-here" + "GROK_API_URL": "https://openrouter.ai/api/v1", + "GROK_API_KEY": "your-api-key-here", + "GROK_MODEL": "x-ai/grok-4-fast:online" } }' ``` @@ -113,8 +116,10 @@ Configuration is done through **environment variables**, set directly in the `en | Environment Variable | Required | Default | Description | |---------------------|----------|---------|-------------| -| `GROK_API_URL` | ✅ | - | Grok API endpoint (OpenAI-compatible format) | +| `GROK_API_URL` | ✅ | - | API endpoint URL | | 
`GROK_API_KEY` | ✅ | - | Your API Key | +| `GROK_MODEL` | ❌ | `grok-4-fast` | Model ID (OpenRouter users should add `:online` suffix) | +| `GROK_PROVIDER` | ❌ | Auto-detect | Manually specify platform: `xai` / `openrouter` / `generic` | | `GROK_DEBUG` | ❌ | `false` | Enable debug mode (`true`/`false`) | | `GROK_LOG_LEVEL` | ❌ | `INFO` | Log level (DEBUG/INFO/WARNING/ERROR) | | `GROK_LOG_DIR` | ❌ | `logs` | Log file storage directory | @@ -173,6 +178,39 @@ If you see `❌ 连接失败` or `⚠️ 连接异常`, please check: - API Key is valid - Network connection is working +### Multi-Platform Support + +Grok Search MCP supports multiple API platforms with automatic detection based on `GROK_API_URL`: + +| Platform | URL Pattern | Search Capability | Example | +|----------|------------|-------------------|---------| +| **OpenRouter** | `openrouter.ai` | ✅ Native real-time search (requires `:online` suffix) | `GROK_MODEL=x-ai/grok-4-fast:online` | +| **xAI Official** | `api.x.ai` | ✅ Native real-time search (planned) | `GROK_MODEL=grok-4-fast` | +| **Generic** | Other URLs | ⚠️ Prompt-based search | `GROK_MODEL=grok-4-fast` | + +#### OpenRouter Users (Recommended) + +When using OpenRouter, add the `:online` suffix to the model name to enable **native real-time web search**: + +```bash +claude mcp add-json grok-search --scope user '{ + "type": "stdio", + "command": "uvx", + "args": [ + "--from", + "git+https://github.com/GuDaStudio/GrokSearch", + "grok-search" + ], + "env": { + "GROK_API_URL": "https://openrouter.ai/api/v1", + "GROK_API_KEY": "sk-or-v1-xxx", + "GROK_MODEL": "x-ai/grok-4-fast:online" + } +}' +``` + +> **💡 Tip**: `:online` is an OpenRouter-specific model variant suffix that automatically enables xAI's native Web Search + X Search capabilities. Search results include real source citations. + ### 4. Advanced Configuration (Optional) To better utilize Grok Search, you can optimize the overall Vibe Coding CLI by configuring Claude Code or similar system prompts. 
For Claude Code, edit ~/.claude/CLAUDE.md with the following content:
@@ -492,6 +530,15 @@ A: Register with a third-party platform → Obtain API Endpoint and Key → Conf **Q: How to verify configuration after setup?** A: Say "Show grok-search configuration info" in Claude conversation to check connection test results +**Q: Search results are inaccurate when using OpenRouter?** +A: Make sure `GROK_MODEL` has the `:online` suffix (e.g., `x-ai/grok-4-fast:online`). Without `:online`, the model has no real-time search capability and generates results from training data. + +**Q: Which API platforms are supported?** +A: All OpenAI-compatible API platforms are supported. OpenRouter (with native search) or xAI Official API are recommended. The platform type is auto-detected from `GROK_API_URL`. + +**Q: What is the `:online` suffix?** +A: It's an OpenRouter-specific model variant suffix that enables real-time web search. For xAI models, it enables both Web Search and X (Twitter) Search. + ## License This project is open source under the [MIT License](LICENSE). diff --git a/src/grok_search/config.py b/src/grok_search/config.py index 006d340..9df5043 100644 --- a/src/grok_search/config.py +++ b/src/grok_search/config.py @@ -2,15 +2,18 @@ import json from pathlib import Path + class Config: _instance = None _SETUP_COMMAND = ( - 'claude mcp add-json grok-search --scope user ' + "claude mcp add-json grok-search --scope user " '\'{"type":"stdio","command":"uvx","args":["--from",' '"git+https://github.com/GuDaStudio/GrokSearch","grok-search"],' '"env":{"GROK_API_URL":"your-api-url","GROK_API_KEY":"your-api-key"}}\'' ) _DEFAULT_MODEL = "grok-4-fast" + _ONLINE_PROVIDERS = ("openrouter.ai",) + _XAI_PROVIDERS = ("api.x.ai",) def __new__(cls): if cls._instance is None: @@ -31,14 +34,14 @@ def _load_config_file(self) -> dict: if not self.config_file.exists(): return {} try: - with open(self.config_file, 'r', encoding='utf-8') as f: + with open(self.config_file, "r", encoding="utf-8") as f: return json.load(f) except (json.JSONDecodeError, IOError): return 
{} def _save_config_file(self, config_data: dict) -> None: try: - with open(self.config_file, 'w', encoding='utf-8') as f: + with open(self.config_file, "w", encoding="utf-8") as f: json.dump(config_data, f, ensure_ascii=False, indent=2) except IOError as e: raise ValueError(f"无法保存配置文件: {str(e)}") @@ -102,6 +105,11 @@ def log_dir(self) -> Path: @property def grok_model(self) -> str: + # 优先级:环境变量 > 配置文件 > 默认值 + env_model = os.getenv("GROK_MODEL") + if env_model: + return env_model + if self._cached_model is not None: return self._cached_model @@ -120,6 +128,30 @@ def set_model(self, model: str) -> None: self._save_config_file(config_data) self._cached_model = model + @property + def provider_type(self) -> str: + """检测当前使用的 provider 类型: 'xai', 'openrouter', 'generic'""" + override = os.getenv("GROK_PROVIDER", "").lower().strip() + if override in ("xai", "openrouter", "generic"): + return override + + try: + url = self.grok_api_url.lower() + except ValueError: + return "generic" + + for domain in self._XAI_PROVIDERS: + if domain in url: + return "xai" + for domain in self._ONLINE_PROVIDERS: + if domain in url: + return "openrouter" + return "generic" + + @property + def is_online_model(self) -> bool: + return ":online" in self.grok_model + @staticmethod def _mask_api_key(key: str) -> str: """脱敏显示 API Key,只显示前后各 4 个字符""" @@ -143,12 +175,17 @@ def get_config_info(self) -> dict: "GROK_API_URL": api_url, "GROK_API_KEY": api_key_masked, "GROK_MODEL": self.grok_model, + "GROK_PROVIDER": self.provider_type, + "GROK_ONLINE_SEARCH": self.is_online_model, "GROK_DEBUG": self.debug_enabled, "GROK_LOG_LEVEL": self.log_level, "GROK_LOG_DIR": str(self.log_dir), "TAVILY_ENABLED": self.tavily_enabled, - "TAVILY_API_KEY": self._mask_api_key(self.tavily_api_key) if self.tavily_api_key else "未配置", - "config_status": config_status + "TAVILY_API_KEY": self._mask_api_key(self.tavily_api_key) + if self.tavily_api_key + else "未配置", + "config_status": config_status, } + config = Config() diff 
--git a/src/grok_search/providers/grok.py b/src/grok_search/providers/grok.py index 6e5c1c9..e7a14c6 100644 --- a/src/grok_search/providers/grok.py +++ b/src/grok_search/providers/grok.py @@ -3,11 +3,21 @@ from datetime import datetime, timezone from email.utils import parsedate_to_datetime from typing import List, Optional -from tenacity import AsyncRetrying, retry_if_exception, stop_after_attempt, wait_random_exponential +from tenacity import ( + AsyncRetrying, + retry_if_exception, + stop_after_attempt, + wait_random_exponential, +) from tenacity.wait import wait_base from zoneinfo import ZoneInfo from .base import BaseSearchProvider, SearchResult -from ..utils import search_prompt, fetch_prompt +from ..utils import ( + search_prompt, + fetch_prompt, + online_search_prompt, + online_fetch_prompt, +) from ..logger import log_info from ..config import config @@ -38,21 +48,54 @@ def _needs_time_context(query: str) -> bool: """检查查询是否需要时间上下文""" # 中文时间相关关键词 cn_keywords = [ - "当前", "现在", "今天", "明天", "昨天", - "本周", "上周", "下周", "这周", - "本月", "上月", "下月", "这个月", - "今年", "去年", "明年", - "最新", "最近", "近期", "刚刚", "刚才", - "实时", "即时", "目前", + "当前", + "现在", + "今天", + "明天", + "昨天", + "本周", + "上周", + "下周", + "这周", + "本月", + "上月", + "下月", + "这个月", + "今年", + "去年", + "明年", + "最新", + "最近", + "近期", + "刚刚", + "刚才", + "实时", + "即时", + "目前", ] # 英文时间相关关键词 en_keywords = [ - "current", "now", "today", "tomorrow", "yesterday", - "this week", "last week", "next week", - "this month", "last month", "next month", - "this year", "last year", "next year", - "latest", "recent", "recently", "just now", - "real-time", "realtime", "up-to-date", + "current", + "now", + "today", + "tomorrow", + "yesterday", + "this week", + "last week", + "next week", + "this month", + "last month", + "next month", + "this year", + "last year", + "next year", + "latest", + "recent", + "recently", + "just now", + "real-time", + "realtime", + "up-to-date", ] query_lower = query.lower() @@ -67,12 +110,21 @@ def 
_needs_time_context(query: str) -> bool: return False + RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504} def _is_retryable_exception(exc) -> bool: """检查异常是否可重试""" - if isinstance(exc, (httpx.TimeoutException, httpx.NetworkError, httpx.ConnectError, httpx.RemoteProtocolError)): + if isinstance( + exc, + ( + httpx.TimeoutException, + httpx.NetworkError, + httpx.ConnectError, + httpx.RemoteProtocolError, + ), + ): return True if isinstance(exc, httpx.HTTPStatusError): return exc.response.status_code in RETRYABLE_STATUS_CODES @@ -89,7 +141,10 @@ def __init__(self, multiplier: float, max_wait: int): def __call__(self, retry_state): if retry_state.outcome and retry_state.outcome.failed: exc = retry_state.outcome.exception() - if isinstance(exc, httpx.HTTPStatusError) and exc.response.status_code == 429: + if ( + isinstance(exc, httpx.HTTPStatusError) + and exc.response.status_code == 429 + ): retry_after = self._parse_retry_after(exc.response) if retry_after is not None: return retry_after @@ -125,7 +180,14 @@ def __init__(self, api_url: str, api_key: str, model: str = "grok-4-fast"): def get_provider_name(self) -> str: return "Grok" - async def search(self, query: str, platform: str = "", min_results: int = 3, max_results: int = 10, ctx=None) -> List[SearchResult]: + async def search( + self, + query: str, + platform: str = "", + min_results: int = 3, + max_results: int = 10, + ctx=None, + ) -> List[SearchResult]: headers = { "Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json", @@ -134,12 +196,25 @@ async def search(self, query: str, platform: str = "", min_results: int = 3, max return_prompt = "" if platform: - platform_prompt = "\n\nYou should search the web for the information you need, and focus on these platform: " + platform + platform_prompt = ( + "\n\nYou should search the web for the information you need, and focus on these platform: " + + platform + ) if max_results: - return_prompt = "\n\nYou should return the results in a 
JSON format, and the results should at least be " + str(min_results) + " and at most be " + str(max_results) + " results." + return_prompt = ( + "\n\nYou should return the results in a JSON format, and the results should at least be " + + str(min_results) + + " and at most be " + + str(max_results) + + " results." + ) + + if config.is_online_model: + system_content = online_search_prompt + else: + system_content = search_prompt - # 仅在查询包含时间相关关键词时注入当前时间信息 if _needs_time_context(query): time_context = get_local_time_info() + "\n" else: @@ -150,14 +225,21 @@ async def search(self, query: str, platform: str = "", min_results: int = 3, max "messages": [ { "role": "system", - "content": search_prompt, + "content": system_content, + }, + { + "role": "user", + "content": time_context + query + platform_prompt + return_prompt, }, - {"role": "user", "content": time_context + query + platform_prompt + return_prompt }, ], "stream": True, } - await log_info(ctx, f"platform_prompt: { query + platform_prompt + return_prompt}", config.debug_enabled) + await log_info( + ctx, + f"platform_prompt: {query + platform_prompt + return_prompt}", + config.debug_enabled, + ) return await self._execute_stream_with_retry(headers, payload, ctx) @@ -166,14 +248,23 @@ async def fetch(self, url: str, ctx=None) -> str: "Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json", } + + if config.is_online_model: + system_content = online_fetch_prompt + else: + system_content = fetch_prompt + payload = { "model": self.model, "messages": [ { "role": "system", - "content": fetch_prompt, + "content": system_content, + }, + { + "role": "user", + "content": url + "\n获取该网页内容并返回其结构化Markdown格式", }, - {"role": "user", "content": url + "\n获取该网页内容并返回其结构化Markdown格式" }, ], "stream": True, } @@ -181,21 +272,20 @@ async def fetch(self, url: str, ctx=None) -> str: async def _parse_streaming_response(self, response, ctx=None) -> str: content = "" - full_body_buffer = [] - + annotations = [] + 
full_body_buffer = [] + async for line in response.aiter_lines(): line = line.strip() if not line: continue - + full_body_buffer.append(line) - # 兼容 "data: {...}" 和 "data:{...}" 两种 SSE 格式 if line.startswith("data:"): if line in ("data: [DONE]", "data:[DONE]"): continue try: - # 去掉 "data:" 前缀,并去除可能的空格 json_str = line[5:].lstrip() data = json.loads(json_str) choices = data.get("choices", []) @@ -203,9 +293,11 @@ async def _parse_streaming_response(self, response, ctx=None) -> str: delta = choices[0].get("delta", {}) if "content" in delta: content += delta["content"] + if "annotations" in delta: + annotations.extend(delta["annotations"]) except (json.JSONDecodeError, IndexError): continue - + if not content and full_body_buffer: try: full_text = "".join(full_body_buffer) @@ -213,21 +305,50 @@ async def _parse_streaming_response(self, response, ctx=None) -> str: if "choices" in data and len(data["choices"]) > 0: message = data["choices"][0].get("message", {}) content = message.get("content", "") + if "annotations" in message: + annotations.extend(message["annotations"]) except json.JSONDecodeError: pass - + + if annotations: + content = self._append_citations(content, annotations) + await log_info(ctx, f"content: {content}", config.debug_enabled) return content - async def _execute_stream_with_retry(self, headers: dict, payload: dict, ctx=None) -> str: + @staticmethod + def _append_citations(content: str, annotations: list) -> str: + citations = [] + for ann in annotations: + if ann.get("type") != "url_citation": + continue + citation = ann.get("url_citation", ann) + url = citation.get("url", "") + title = citation.get("title", url) + if url and url not in [c[1] for c in citations]: + citations.append((title, url)) + + if not citations: + return content + + content += "\n\n---\n**Sources:**\n" + for title, url in citations: + content += f"- [{title}]({url})\n" + return content + + async def _execute_stream_with_retry( + self, headers: dict, payload: dict, ctx=None + ) 
-> str: """执行带重试机制的流式 HTTP 请求""" timeout = httpx.Timeout(connect=6.0, read=120.0, write=10.0, pool=None) async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client: async for attempt in AsyncRetrying( stop=stop_after_attempt(config.retry_max_attempts + 1), - wait=_WaitWithRetryAfter(config.retry_multiplier, config.retry_max_wait), + wait=_WaitWithRetryAfter( + config.retry_multiplier, config.retry_max_wait + ), retry=retry_if_exception(_is_retryable_exception), reraise=True, ): diff --git a/src/grok_search/utils.py b/src/grok_search/utils.py index f54b5e9..5925851 100644 --- a/src/grok_search/utils.py +++ b/src/grok_search/utils.py @@ -9,23 +9,24 @@ def format_search_results(results: List[SearchResult]) -> str: formatted = [] for i, result in enumerate(results, 1): parts = [f"## Result {i}: {result.title}"] - + if result.url: parts.append(f"**URL:** {result.url}") - + if result.snippet: parts.append(f"**Summary:** {result.snippet}") - + if result.source: parts.append(f"**Source:** {result.source}") - + if result.published_date: parts.append(f"**Published:** {result.published_date}") - + formatted.append("\n".join(parts)) return "\n\n---\n\n".join(formatted) + fetch_prompt = """ # Profile: Web Content Fetcher @@ -241,3 +242,36 @@ def format_search_results(results: List[SearchResult]) -> str: ## Initialization 作为MCP高效搜索助手,你必须遵守上述Rules,按输出的JSON必须语法正确、可直接解析,不添加任何代码块标记、解释或确认性文字。 """ + + +online_search_prompt = """You are a web search assistant with real-time internet access. +Search the web for the user's query and return results as a pure JSON array. 
+ +Each element must have exactly these fields: +{ + "title": "string, required", + "url": "string, required, valid URL", + "description": "string, required, 20-50 word summary" +} + +Rules: +- Output ONLY valid JSON array, no markdown fences, no explanation +- Use double quotes for all keys and string values +- No trailing commas +- UTF-8 encoding, display CJK characters directly +- Prioritize authoritative and recent sources +- Cross-reference multiple sources for accuracy +""" + + +online_fetch_prompt = """You are a web content fetcher with real-time internet access. +Fetch the given URL and return its complete content as structured Markdown. + +Requirements: +- Preserve the original content structure (headings, lists, tables, code blocks) +- Include a metadata header: source URL, title, fetch timestamp +- Do NOT summarize or modify content - return complete original text +- Convert HTML elements to proper Markdown syntax +- Use UTF-8 encoding +- Remove scripts, styles, ads, and non-content elements +"""
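
Review note: the two pieces of logic this diff introduces, provider detection in `config.py` and citation appending in `providers/grok.py`, can be exercised outside the package. The sketch below is a standalone approximation, not the project's code: `detect_provider`, `append_citations`, `XAI_DOMAINS`, and `OPENROUTER_DOMAINS` are hypothetical names standing in for `provider_type`, `_append_citations`, `_XAI_PROVIDERS`, and `_ONLINE_PROVIDERS`, and the override is passed as an argument rather than read from `GROK_PROVIDER`. It also swaps the diff's list scan for a `seen` set when de-duplicating URLs, which should give the same output.

```python
from typing import Optional

# Hypothetical module-level stand-ins for Config._XAI_PROVIDERS / _ONLINE_PROVIDERS.
XAI_DOMAINS = ("api.x.ai",)
OPENROUTER_DOMAINS = ("openrouter.ai",)


def detect_provider(api_url: str, override: Optional[str] = None) -> str:
    """Return 'xai', 'openrouter', or 'generic' for an API base URL."""
    # A valid override wins; an unrecognized value falls through to URL detection.
    choice = (override or "").lower().strip()
    if choice in ("xai", "openrouter", "generic"):
        return choice

    url = api_url.lower()
    if any(domain in url for domain in XAI_DOMAINS):
        return "xai"
    if any(domain in url for domain in OPENROUTER_DOMAINS):
        return "openrouter"
    return "generic"


def append_citations(content: str, annotations: list) -> str:
    """Append a de-duplicated Markdown source list built from url_citation annotations."""
    citations = []
    seen = set()  # assumption: a set replaces the diff's list scan for duplicates
    for ann in annotations:
        if ann.get("type") != "url_citation":
            continue
        # The citation may be nested under "url_citation" or sit on the annotation itself.
        citation = ann.get("url_citation", ann)
        url = citation.get("url", "")
        title = citation.get("title", url)
        if url and url not in seen:
            seen.add(url)
            citations.append((title, url))

    if not citations:
        return content

    lines = [f"- [{title}]({url})" for title, url in citations]
    return content + "\n\n---\n**Sources:**\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(detect_provider("https://openrouter.ai/api/v1"))
    sample = [
        {"type": "url_citation", "url_citation": {"url": "https://a.example", "title": "A"}},
        {"type": "url_citation", "url": "https://a.example", "title": "duplicate"},
    ]
    print(append_citations("body", sample))
```

Two behaviors are worth confirming against the real code during review: an invalid override value is silently ignored rather than rejected, and a citation whose `title` is present but empty would render as a link with no text.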