Binary file added docs/lucytech.jpg
53 changes: 53 additions & 0 deletions readme.md
@@ -1,5 +1,28 @@
# Scraper API

An API service to scrape a URL and get a summary.

## High level architecture

![High level diagram](./docs/lucytech.jpg)

### Components

1. Scrape handler - Handles the initial scrape request.
2. Page handler - Handles subsequent pagination requests.
3. HTML parser - Fetches the HTML content of the given URL and processes it.
4. In-memory storage - Holds fetched information mapped to a random unique key.
5. URL status checker - Checks the statuses of URLs found in the HTML content (see the wiring sketch after this list).
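
The following sketch is a rough illustration of how these components might be wired together in an HTTP server. The route paths (`/scrape`, `/page`), the port, and the handler bodies are assumptions for illustration only and may not match this repository's actual handlers:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Scrape handler: accepts a URL, asks the HTML parser to fetch and process
	// it, stores the result in the in-memory storage under a random key, and
	// returns the first batch of results.
	mux.HandleFunc("/scrape", func(w http.ResponseWriter, r *http.Request) {
		// e.g. read ?url=..., call the HTML parser, check the first batch of
		// URL statuses, store the remainder, and respond with JSON
		w.WriteHeader(http.StatusNotImplemented)
	})

	// Page handler: serves subsequent pages of URL statuses for a stored result.
	mux.HandleFunc("/page", func(w http.ResponseWriter, r *http.Request) {
		// e.g. read ?id=...&page=..., look up the in-memory store, respond
		w.WriteHeader(http.StatusNotImplemented)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```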

### Design concerns

> Checking the statuses of URLs found in the scraped HTML content was identified as the most expensive operation in this application. Even though we use goroutines, the response latency of those URLs has a large impact on system performance and on the response time of this API. To tackle this, we decided not to process every discovered URL on the first request: we check the status of only 10 URLs and keep the rest in memory, with pagination support. The user can then fetch the remaining URLs in batches of 10 via subsequent pagination requests. With this approach we were able to scrape websites containing a huge number of URLs (e.g. yahoo.com) effortlessly and without breaking the system.
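
A minimal sketch of this batching idea is shown below, assuming hypothetical names (`checkFirstBatch`, `URLStatus`) and a 5-second timeout that are not taken from this repository's code:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// URLStatus pairs a URL with the HTTP status code returned for it
// (0 marks an unreachable URL). Hypothetical type, for illustration only.
type URLStatus struct {
	URL    string
	Status int
}

const batchSize = 10

// checkFirstBatch checks at most the first batchSize URLs concurrently and
// returns their statuses along with the remaining, unchecked URLs. In the
// application, the remainder would stay in memory and be served by
// subsequent pagination requests.
func checkFirstBatch(urls []string) (checked []URLStatus, remaining []string) {
	n := batchSize
	if len(urls) < n {
		n = len(urls)
	}
	checked = make([]URLStatus, n)
	remaining = urls[n:]

	client := &http.Client{Timeout: 5 * time.Second}
	var wg sync.WaitGroup
	for i, u := range urls[:n] {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			status := 0
			if resp, err := client.Get(u); err == nil {
				status = resp.StatusCode
				resp.Body.Close()
			}
			checked[i] = URLStatus{URL: u, Status: status} // each goroutine writes only its own index
		}(i, u)
	}
	wg.Wait()
	return checked, remaining
}

func main() {
	statuses, rest := checkFirstBatch([]string{
		"https://example.com",
		"https://example.org",
	})
	fmt.Println(statuses, "unchecked:", len(rest))
}
```

Because each goroutine writes to its own slice index, no extra locking is needed; the unchecked remainder is what the pagination requests would serve later.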

#### Further improvements

* We can replace the in-memory storage with a database.
* We can use a messaging technique to push data changes to the UI in real time.

## How to run using Docker

* Run `docker-compose up --build`
@@ -8,6 +31,36 @@

* Run `go test -coverprofile=coverage.out ./...`
* To check the test coverage, run `go tool cover -func=coverage.out`
* Test coverage output:

```bash
scraper/cmd/main.go:12: main 0.0%
scraper/handlers/scrape.go:19: ScrapeHandler 94.1%
scraper/handlers/scrape.go:51: PageHandler 85.7%
scraper/logger/logger.go:14: Debug 100.0%
scraper/logger/logger.go:18: Info 0.0%
scraper/logger/logger.go:22: Error 100.0%
scraper/services/htmlparser.go:13: FetchPageInfo 100.0%
scraper/services/htmlparser.go:24: ParseHTML 90.9%
scraper/services/htmlparser.go:64: traverse 80.0%
scraper/services/htmlparser.go:74: extractHref 75.0%
scraper/services/htmlparser.go:83: resolveURL 100.0%
scraper/services/htmlparser.go:89: isInternal 100.0%
scraper/services/htmlparser.go:96: containsPasswordInput 100.0%
scraper/services/htmlparser.go:112: extractTitle 100.0%
scraper/services/htmlparser.go:125: extractHtmlVersion 50.0%
scraper/services/urlstatus.go:21: CheckURLStatus 100.0%
scraper/storage/memory.go:15: StorePageInfo 100.0%
scraper/storage/memory.go:24: RetrievePageInfo 100.0%
scraper/storage/memory.go:32: generateID 100.0%
scraper/storage/memory.go:36: randomString 100.0%
scraper/utils/helpers.go:10: CalculateTotalPages 100.0%
scraper/utils/helpers.go:14: CalculatePageBounds 100.0%
scraper/utils/helpers.go:20: min 66.7%
scraper/utils/helpers.go:27: BuildPageResponse 50.0%
total: (statements) 86.6%

```

## API Documentation
