431 website scraper#433

Merged
Behzad-rabiei merged 9 commits into main from 431-website-scraper
Feb 19, 2025

Conversation

Behzad-rabiei (Member) commented Feb 19, 2025

Summary by CodeRabbit

  • New Features
    • Introduced a new “website” platform option, allowing users to integrate and manage website-related workflows seamlessly.
    • Enabled automated scheduling capabilities for website operations, including creation, pausing, and deletion.
  • Documentation
    • Updated API documentation with expanded platform options and enhanced metadata details to support website integrations.

coderabbitai bot commented Feb 19, 2025

Walkthrough

This pull request integrates a new "website" platform across various modules. It updates the dependency version of @togethercrew.dev/db in package.json and revises API documentation to include additional platform options such as "telegram" and "website." Service logic is extended to support website scheduling in module updates, with added methods in temporal and core website services. Validation schemas are updated to accommodate website metadata requirements. Minor logging improvements were also introduced in the temporal discourse service.

Changes

File(s) | Change Summary
package.json | Updated @togethercrew.dev/db dependency version from ^3.2.3 to ^3.3.0.
src/docs/{module,platform}.doc.yml | Modified API docs to expand the platform enum and add new metadata descriptions for "telegram" and "website".
src/services/{index, module.service.ts, platform.service.ts} | Added website platform handling in module updates, including conditional scheduling via websiteService and updates to metadata key logic.
src/services/temporal/{discourse.service.ts, website.service.ts} | Introduced logger usage and added temporal scheduling methods for website operations.
src/services/website/{core.service.ts, index.ts} | Added core functions (createWebsiteSchedule, deleteWebsiteSchedule) for managing website schedules and centralized export of website services.
src/validations/{module.validation.ts, platform.validation.ts} | Enhanced metadata validation by adding functions for website-related schemas and updating platform metadata switch cases.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant M as ModuleService
    participant W as WebsiteCoreService
    participant T as TemporalWebsiteService
    participant P as PlatformService

    C->>M: Request update for Hivemind module (Website platform)
    M->>W: Invoke createWebsiteSchedule(platformId)
    W->>T: Call createSchedule(platformId)
    T-->>W: Return scheduleId
    W-->>M: Provide scheduleId
    M->>P: Retrieve platform by ID
    P-->>M: Return platform details
    M->>P: Update platform metadata with scheduleId
    P-->>M: Confirm platform update

Suggested reviewers

  • cyri113

Poem

I’m a rabbit hopping through code so light,
With website platforms now shining bright.
Schedules and metadata dance in the flow,
New methods and logging helping us grow.
In every hop I see improvements unfold—
Celebrating changes, brave and bold!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (8)
src/services/temporal/website.service.ts (4)

11-25: Consider user-configurable scheduling.
By tying the schedule to the current UTC date/time (lines 13–18), every new schedule adopts the weekday, hour, and minute at which the code is invoked. If your use case eventually requires more flexible scheduling or user-defined intervals, you may want to extract this logic (e.g., to environment variables or request parameters) or allow day-of-week/hour overrides.
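One way to read this suggestion is a small helper that keeps the current behavior as the default but accepts overrides. The sketch below is illustrative only: `WeeklySlot`, `resolveWeeklySlot`, and the day-name strings are hypothetical names, not taken from the PR, and the result would still need to be mapped onto whatever calendar spec the schedule client expects.

```typescript
// Hypothetical types for illustration; not taken from the PR's source.
const DAYS = ['SUNDAY', 'MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY'] as const;
type DayName = (typeof DAYS)[number];

interface WeeklySlot {
  dayOfWeek: DayName;
  hour: number; // 0-23, UTC
  minute: number; // 0-59, UTC
}

// Rejects values outside 0..max so a bad override fails early.
function assertInRange(label: string, value: number, max: number): void {
  if (!Number.isInteger(value) || value < 0 || value > max) {
    throw new RangeError(`${label} must be an integer between 0 and ${max}, got ${value}`);
  }
}

// Defaults to the UTC weekday/hour/minute of `now`; any field can be overridden.
function resolveWeeklySlot(now: Date, overrides: Partial<WeeklySlot> = {}): WeeklySlot {
  const slot: WeeklySlot = {
    // getUTCDay() always returns 0-6, so the lookup cannot miss.
    dayOfWeek: overrides.dayOfWeek ?? (DAYS[now.getUTCDay()] as DayName),
    hour: overrides.hour ?? now.getUTCHours(),
    minute: overrides.minute ?? now.getUTCMinutes(),
  };
  assertInRange('hour', slot.hour, 23);
  assertInRange('minute', slot.minute, 59);
  return slot;
}
```

Calling `resolveWeeklySlot(new Date())` reproduces the invocation-time behavior described above, while `resolveWeeklySlot(new Date(), { hour: 3 })` pins the hour and leaves the rest unchanged.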


46-48: Consider using a custom error or logging context.
Currently, the catch block rethrows a standard Error. If desired, wrap it in your consistent error-handling strategy (similar to ApiError) to streamline error reporting and troubleshooting across the codebase.


51-55: Pause schedule error-handling.
Interacting with a non-existent or already-paused schedule might throw runtime errors. If you need to handle or ignore those specific cases gracefully, consider adding a try/catch block.


57-61: Ensure schedule deletion is idempotent.
Similar to pausing, attempting to delete a non-existent schedule can throw. If desired, handle or log such errors more gracefully.
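A minimal sketch of the idempotent pattern, assuming the caller can tell a "not found" error apart from other failures. `Deletable`, `deleteIfExists`, and the `isNotFound` predicate are hypothetical; the actual error type thrown by the schedule client is not shown in this PR, so the check is injected rather than hard-coded.

```typescript
// Minimal structural type: anything with an async delete(), e.g. a schedule handle.
interface Deletable {
  delete(): Promise<void>;
}

// Resolves to true when the target was deleted, false when it was already gone.
// Any other error is rethrown so real failures stay visible.
async function deleteIfExists(
  target: Deletable,
  isNotFound: (err: unknown) => boolean, // caller supplies the client-specific check
): Promise<boolean> {
  try {
    await target.delete();
    return true;
  } catch (err) {
    if (isNotFound(err)) {
      return false;
    }
    throw err;
  }
}
```

The same wrapper shape would apply to pausing, with the predicate extended to cover an already-paused schedule if that case should also be ignored.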

src/services/module.service.ts (2)

66-73: Validate or safeguard platform data.
Your logic checks for updateBody.options.platforms[0].name == undefined and updates metadata accordingly. While functional, consider validating this data structure more robustly (e.g., ensuring the array is not empty, verifying metadata shape) to avoid potential runtime errors and improve maintainability.

🧰 Tools
🪛 Biome (1.9.4)

[error] 66-66: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)
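As a rough illustration of the optional-chain form the linter asks for, the sketch below guards against a missing `options`, a missing `platforms` array, and an empty array in one expression. The `ModuleUpdateBody` and `PlatformUpdate` shapes and the helper name are assumptions made for the example; the real request types live in the repository.

```typescript
// Hypothetical request shapes, reduced to the fields the check touches.
interface PlatformUpdate {
  name?: string;
  platform?: string;
  metadata?: Record<string, unknown>;
}

interface ModuleUpdateBody {
  options?: { platforms?: PlatformUpdate[] };
}

// Returns the first platform entry only when it exists and has no `name` set,
// mirroring the `name == undefined` check described above.
function unnamedFirstPlatform(body: ModuleUpdateBody): PlatformUpdate | undefined {
  const first = body.options?.platforms?.[0];
  return first !== undefined && first.name == null ? first : undefined;
}
```

With this shape, an empty `platforms` array or an absent `options` object yields `undefined` instead of a runtime TypeError.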


80-85: Handle potential errors when creating website schedule.
In this block, you invoke websiteService.coreService.createWebsiteSchedule and then attempt to save the resulting schedule ID to the platform’s metadata. If getPlatformById returns null or schedule creation fails, the subsequent code may silently do nothing. A dedicated try/catch here would allow you to manage failures (e.g., logging or reverting partial updates).

Do you want me to generate an example refactor that adds local error handling for schedule creation or platform retrieval?
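A sketch of what such local error handling might look like, written against injected dependencies so it stays self-contained. `ScheduleDeps`, `PlatformLike`, and `attachWebsiteSchedule` are hypothetical; the signatures of `createWebsiteSchedule`, `deleteWebsiteSchedule`, and `getPlatformById` are assumed here and may differ from the PR's actual code.

```typescript
// Assumed shapes; the repository's real platform model and services may differ.
interface PlatformLike {
  metadata: Record<string, unknown>;
  save(): Promise<void>;
}

interface ScheduleDeps {
  createWebsiteSchedule(platformId: string): Promise<string>;
  deleteWebsiteSchedule(scheduleId: string): Promise<void>;
  getPlatformById(platformId: string): Promise<PlatformLike | null>;
}

// Creates the schedule, stores its ID on the platform, and removes the
// schedule again if the platform lookup or save fails.
async function attachWebsiteSchedule(platformId: string, deps: ScheduleDeps): Promise<string> {
  const scheduleId = await deps.createWebsiteSchedule(platformId);
  try {
    const platform = await deps.getPlatformById(platformId);
    if (platform === null) {
      throw new Error(`Platform ${platformId} not found`);
    }
    platform.metadata = { ...platform.metadata, scheduleId };
    await platform.save();
    return scheduleId;
  } catch (err) {
    // Compensate so a failed metadata update does not leave an orphaned schedule.
    await deps.deleteWebsiteSchedule(scheduleId);
    throw err;
  }
}
```

Note that if the compensating delete itself throws, that error replaces the original one; logging the original error before the delete would be one way to keep both.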

src/services/website/core.service.ts (1)

21-28: Unify error handling strategy.
You use ApiError(590, ...) for failed schedule deletions, which is consistent with your create method. This is good for capturing system-level issues. Consider standardizing on a narrower error code or HTTP status if that suits your application’s design (e.g., 404 for a non-existent schedule).

src/validations/platform.validation.ts (1)

222-231: Consider adding validation for update-specific fields.

The update metadata schema could include additional fields specific to updating website resources, such as:

  • Last scrape timestamp
  • Success/failure metrics
  • Resource status
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e4877c and 23eb600.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)
  • package.json (1 hunks)
  • src/docs/module.doc.yml (2 hunks)
  • src/docs/platform.doc.yml (3 hunks)
  • src/services/index.ts (2 hunks)
  • src/services/module.service.ts (2 hunks)
  • src/services/platform.service.ts (2 hunks)
  • src/services/temporal/discourse.service.ts (1 hunks)
  • src/services/temporal/website.service.ts (1 hunks)
  • src/services/website/core.service.ts (1 hunks)
  • src/services/website/index.ts (1 hunks)
  • src/validations/module.validation.ts (3 hunks)
  • src/validations/platform.validation.ts (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/services/website/index.ts
🧰 Additional context used
🪛 Biome (1.9.4)
src/validations/module.validation.ts

[error] 117-117: Do not add then to an object.

(lint/suspicious/noThenProperty)

src/validations/platform.validation.ts

[error] 147-147: Do not add then to an object.

(lint/suspicious/noThenProperty)


[error] 151-151: Do not add then to an object.

(lint/suspicious/noThenProperty)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / lint / Lint
🔇 Additional comments (14)
src/services/temporal/website.service.ts (2)

27-45: Check handling for existing schedules.
When creating a schedule with a derived “website/” ID, if a schedule with the same ID already exists, a conflict might arise. Confirm whether you want to overwrite, update, or fail in these cases. Temporal’s client.schedule.create can raise errors if the ID is in use.
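The fail-or-reuse decision could be made explicit with a small wrapper like the one below. This is only a sketch: `createOrReuseSchedule`, `ConflictPolicy`, and the `isAlreadyExists` predicate are hypothetical, and the predicate is injected because the exact error the client raises for a duplicate ID is not shown in this PR.

```typescript
type ConflictPolicy = 'fail' | 'reuse';

// Creates a schedule, optionally treating "ID already in use" as success.
async function createOrReuseSchedule(
  scheduleId: string,
  create: (id: string) => Promise<void>,
  isAlreadyExists: (err: unknown) => boolean, // client-specific duplicate check
  policy: ConflictPolicy = 'fail',
): Promise<'created' | 'reused'> {
  try {
    await create(scheduleId);
    return 'created';
  } catch (err) {
    if (policy === 'reuse' && isAlreadyExists(err)) {
      return 'reused';
    }
    throw err; // 'fail' policy, or an unrelated error
  }
}
```

Defaulting to `'fail'` keeps the behavior described in the comment unless a caller opts into reuse.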


64-64: Singleton export confirmed.
Exporting a single instance of TemporalWebsiteService supports straightforward usage across the codebase. Looks good!

src/services/website/core.service.ts (2)

9-19: Confirm schedule collision behaviors.
When calling temporalWebsite.createSchedule, you log the schedule ID and trigger it. If a schedule with the same ID already exists, or if any concurrency constraints fail, the request might raise exceptions that differ from typical flow. Ensure these potential collisions align with your larger application strategy (e.g., whether to update existing or fail fast).


30-33: Core service exports look consistent.
Your default export collects both scheduling functions in one place, keeping them easily discoverable. Great job!

src/services/index.ts (1)

16-16: LGTM!

The addition of websiteService follows the established pattern for service imports and exports.

Also applies to: 40-40

src/services/temporal/discourse.service.ts (1)

1-8: LGTM! Logger integration enhances observability.

The addition of logging configuration follows best practices by:

  • Using a child logger with module context
  • Maintaining separation of concerns
src/services/platform.service.ts (2)

181-182: LGTM! Added support for website platform metadata.

The metadata key 'resources' aligns with the metadata schema defined in the API documentation.


215-215: LGTM! Enhanced platform connection validation.

Added platform name check to prevent duplicate connections across communities.

package.json (1)

31-31: Verify the updated dependency version.

❓ Verification inconclusive

Let's check if version 3.3.0 of @togethercrew.dev/db exists and if there are any security advisories.


🏁 Script executed:

#!/bin/bash
# Description: Check for security advisories and latest versions of the `@togethercrew.dev/db` package.

# Check npm for latest versions
npm view @togethercrew.dev/db versions --json

# Check for security advisories
npm audit @togethercrew.dev/db@3.3.0

Length of output: 1340


Dependency Version Confirmed – Manual Security Review Needed

  • The available versions list confirms that version 3.3.0 of @togethercrew.dev/db exists.
  • The npm audit check did not complete due to a missing lockfile (error ENOLOCK). Please create a lockfile (e.g., using npm i --package-lock-only) and re-run the audit manually to verify that no security advisories affect version 3.3.0.
src/docs/module.doc.yml (2)

172-173: LGTM! Added website platform to module API.

The platform enum has been updated to include the new website platform.


240-242: LGTM! Added metadata description for website platform.

The metadata description for the website platform is consistent with the schema defined in platform.doc.yml.

src/docs/platform.doc.yml (3)

72-72: LGTM! Added website platform to platform creation API.

The platform enum has been updated to include the new website platform.


199-208: LGTM! Added metadata schema for website platform.

The metadata schema for the website platform:

  • Requires 'resources' array with URI format.
  • Consistent with module API documentation.

241-242: LGTM! Added website platform to platform retrieval API.

The platform enum has been updated to include the new website platform.

Comment on lines +82 to +84
const websiteMediaWikiMetadata = () => {
  return Joi.object().keys({});
};

🛠️ Refactor suggestion

Consider enhancing website metadata validation.

The empty object schema for website metadata might be too permissive. Consider adding validation for essential website-related fields such as:

  • Base URL
  • Scraping configuration
  • Rate limiting parameters

Comment on lines +32 to +36
const websiteUpdateMetadata = () => {
  return Joi.object().keys({
    resources: Joi.array().items(Joi.string().uri({ scheme: ['http', 'https'] })),
  });
};

🛠️ Refactor suggestion

Enhance website resource validation with additional safeguards.

While URI validation is good, consider adding:

  1. Rate limiting parameters to prevent aggressive scraping
  2. Allowed domains validation to prevent unauthorized access
  3. Maximum number of resources limit

Example enhancement:

 const websiteMetadata = () => {
   return Joi.object().keys({
     resources: Joi.array()
       .items(Joi.string().uri({ scheme: ['http', 'https'] }))
+      .max(100) // Prevent excessive resource lists
       .required(),
+    rateLimit: Joi.object().keys({
+      requestsPerMinute: Joi.number().min(1).max(60).required(),
+      concurrency: Joi.number().min(1).max(10).required()
+    }).required(),
+    allowedDomains: Joi.array().items(Joi.string().domain()).required()
   });
 };

Also applies to: 101-107

@Behzad-rabiei Behzad-rabiei merged commit 5a1b7b7 into main Feb 19, 2025
13 checks passed