Fetch newer groups & their events #77

@lyqht

The current scraper only fetches events from the 100 most active tech groups, because of the default sort order and pagination imposed by the Meetup website. @danielepolencic therefore suggested fetching newer groups and adding them to the DB first, then fetching events based on the groups stored in the DB.

This issue should be addressed with the following solution:

Step by Step description

Current Implementation:

  1. Fetch the 100 most active groups and their RSS URLs.
  2. Parse the RSS URLs to get the relevant event URLs.
  3. Fetch event details from the event URLs.
  4. If an event doesn't already exist in the events table, add it. Otherwise, update the state of the existing event.

Proposed New Implementation:

Getting groups

  1. Get the 100 newest groups.
  2. If a group doesn't already exist in the groups table, add it. Otherwise, update the state of the existing group.
  3. If any of the fetched groups already exist, stop this task.
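
A minimal sketch of the group-harvesting loop described above, in Python purely for illustration. Everything named here is an assumption rather than part of the existing harvester: `fetch_page` is a hypothetical callable that returns one page of groups, newest first, and the `groups` table schema (`urlname`, `name`, `rss_url`) is made up. The sketch takes one reading of steps 2 and 3: it inserts unseen groups and stops at the first group that is already stored, without updating it.

```python
import sqlite3


def harvest_new_groups(conn, fetch_page, max_pages=50):
    """Insert unseen groups and stop at the first one already stored.

    `fetch_page(page_number)` is a hypothetical callable returning a list of
    dicts with "urlname", "name" and "rss_url" keys, newest groups first.
    Returns the number of groups added during this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS groups ("
        "urlname TEXT PRIMARY KEY, name TEXT, rss_url TEXT)"
    )
    added = 0
    try:
        for page in range(max_pages):
            groups = fetch_page(page)
            if not groups:
                break  # no more pages to scrape
            for group in groups:
                known = conn.execute(
                    "SELECT 1 FROM groups WHERE urlname = ?",
                    (group["urlname"],),
                ).fetchone()
                if known is not None:
                    # Everything older than this group was seen on a previous run.
                    return added
                conn.execute(
                    "INSERT INTO groups (urlname, name, rss_url) VALUES (?, ?, ?)",
                    (group["urlname"], group["name"], group["rss_url"]),
                )
                added += 1
        return added
    finally:
        # Persist whatever was inserted, whether the loop ended or stopped early.
        conn.commit()
```

The `max_pages` cap is a guard for the very first run, when the table is empty and no known group would ever be found to stop the loop.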

Getting events

  1. Based on the groups table, get the RSS URLs and parse them to get the relevant event URLs.
  2. Fetch event details from the event URLs.
  3. If an event doesn't already exist in the events table, add it. Otherwise, update the state of the existing event.
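
The event steps above could look roughly like the following sketch. It assumes each group feed is plain RSS 2.0 with one `<item>` per event and the event URL in `<link>`; the actual Meetup feed layout is not confirmed here. The `events` table schema and the `status` column (standing in for the "state" of an event) are hypothetical as well.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List


def event_urls_from_rss(rss_xml: str) -> List[str]:
    """Return the unique <item><link> URLs of an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    urls: List[str] = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in urls:
            urls.append(link)
    return urls


def create_events_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: the event URL doubles as the unique key.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "url TEXT PRIMARY KEY, title TEXT, status TEXT)"
    )


def upsert_event(conn: sqlite3.Connection, event: dict) -> str:
    """Update the event if its URL is already stored, otherwise insert it."""
    cur = conn.execute(
        "UPDATE events SET title = ?, status = ? WHERE url = ?",
        (event["title"], event["status"], event["url"]),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO events (url, title, status) VALUES (?, ?, ?)",
            (event["url"], event["title"], event["status"]),
        )
        outcome = "inserted"
    else:
        outcome = "updated"
    conn.commit()
    return outcome
```

Fetching the event details (step 2) is left out because it depends on how the existing scraper reads an event page.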

High level overview of tasks:

  • Configure the harvester service to continuously scrape for groups & add them to the DB until it finds a group that already exists in the DB
  • Configure the harvester service to parse RSS from groups existing in the DB
  • Check for duplication of groups & events
  • Add integration tests
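
For the duplication check and the integration tests, one possible shape is a test that runs the save step twice against a throwaway database and asserts the second run adds nothing. This is only a sketch: `save_groups` and the single-column `groups` table are hypothetical, and it relies on a primary key plus SQLite's `INSERT OR IGNORE` to reject duplicates.

```python
import sqlite3
import unittest


def save_groups(conn, urlnames):
    """Store group identifiers, skipping ones already present.

    Returns how many new rows were actually added.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS groups (urlname TEXT PRIMARY KEY)")
    before = conn.execute("SELECT COUNT(*) FROM groups").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO groups (urlname) VALUES (?)",
        [(name,) for name in urlnames],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM groups").fetchone()[0]
    return after - before


class DuplicateGroupsTest(unittest.TestCase):
    def setUp(self):
        # An in-memory database keeps the test isolated from real data.
        self.conn = sqlite3.connect(":memory:")

    def tearDown(self):
        self.conn.close()

    def test_second_run_adds_nothing(self):
        # "a" appears twice in the first batch and must be stored once.
        self.assertEqual(save_groups(self.conn, ["a", "b", "a"]), 2)
        self.assertEqual(save_groups(self.conn, ["a", "b"]), 0)
        total = self.conn.execute("SELECT COUNT(*) FROM groups").fetchone()[0]
        self.assertEqual(total, 2)
```

The same pattern would apply to events, keyed on whatever uniquely identifies an event in the real schema.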
