diff --git a/docs/lucytech.jpg b/docs/lucytech.jpg
new file mode 100644
index 0000000..74ce0de
Binary files /dev/null and b/docs/lucytech.jpg differ
diff --git a/readme.md b/readme.md
index 905c12a..4c4e533 100644
--- a/readme.md
+++ b/readme.md
@@ -1,5 +1,28 @@
 # Scraper API
 
+An API service that scrapes a given URL and returns a summary of the page.
+
+## High-level architecture
+
+![High-level diagram](./docs/lucytech.jpg)
+
+### Components
+
+1. Scrape handler - Handles the initial scrape request.
+2. Page handler - Handles subsequent pagination requests.
+3. HTML parser - Fetches the HTML content of the given URL and processes it.
+4. In-memory storage - Holds fetched information mapped to a random unique key.
+5. URL status checker - Checks the statuses of URLs found in the HTML content.
+
+### Design concerns
+
+> Checking the statuses of URLs found in the scraped HTML content is currently the most expensive operation in this application. Although we use goroutines, the response latency of those URLs has a significant impact on system performance and on the response time of this API. To tackle this, we do not check every discovered URL on the first request: we check the status of only 10 URLs and keep the rest in memory, exposed through pagination, so users can fetch the remaining statuses in batches of 10 on subsequent requests. With this approach we were able to scrape websites with a huge number of URLs (e.g. yahoo.com) without breaking the system.
+
+#### Further improvements
+
+* We can replace the in-memory storage with a database.
+* We can use a messaging mechanism to push data changes to the UI in real time.
+
 ## How to run using Docker
 
 * Run `docker-compose up --build`
@@ -8,6 +31,36 @@
 * Run `go test -coverprofile=coverage.out ./...`
 * To check the test coverage run `go tool cover -func=coverage.out`
+* Test coverage output:
+
+```bash
+scraper/cmd/main.go:12: main 0.0%
+scraper/handlers/scrape.go:19: ScrapeHandler 94.1%
+scraper/handlers/scrape.go:51: PageHandler 85.7%
+scraper/logger/logger.go:14: Debug 100.0%
+scraper/logger/logger.go:18: Info 0.0%
+scraper/logger/logger.go:22: Error 100.0%
+scraper/services/htmlparser.go:13: FetchPageInfo 100.0%
+scraper/services/htmlparser.go:24: ParseHTML 90.9%
+scraper/services/htmlparser.go:64: traverse 80.0%
+scraper/services/htmlparser.go:74: extractHref 75.0%
+scraper/services/htmlparser.go:83: resolveURL 100.0%
+scraper/services/htmlparser.go:89: isInternal 100.0%
+scraper/services/htmlparser.go:96: containsPasswordInput 100.0%
+scraper/services/htmlparser.go:112: extractTitle 100.0%
+scraper/services/htmlparser.go:125: extractHtmlVersion 50.0%
+scraper/services/urlstatus.go:21: CheckURLStatus 100.0%
+scraper/storage/memory.go:15: StorePageInfo 100.0%
+scraper/storage/memory.go:24: RetrievePageInfo 100.0%
+scraper/storage/memory.go:32: generateID 100.0%
+scraper/storage/memory.go:36: randomString 100.0%
+scraper/utils/helpers.go:10: CalculateTotalPages 100.0%
+scraper/utils/helpers.go:14: CalculatePageBounds 100.0%
+scraper/utils/helpers.go:20: min 66.7%
+scraper/utils/helpers.go:27: BuildPageResponse 50.0%
+total: (statements) 86.6%
+
+```
 
 ## API Documentation
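
For illustration, here is a minimal sketch of the in-memory storage component described above: scraped page info held in a map keyed by a random unique ID, so pagination requests can retrieve it without re-fetching the page. The names `PageInfo`, `MemoryStore`, `Store`, and `Retrieve` are assumptions for this sketch; the actual code exposes `StorePageInfo`, `RetrievePageInfo`, `generateID`, and `randomString` (see the coverage output), whose signatures are not shown in this diff.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// PageInfo is a stand-in for the parsed page summary held in memory.
type PageInfo struct {
	Title string
	URLs  []string
}

// MemoryStore maps a random unique key to scraped page information.
// A mutex guards the map because handlers may run concurrently.
type MemoryStore struct {
	mu    sync.RWMutex
	pages map[string]PageInfo
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{pages: make(map[string]PageInfo)}
}

// Store saves the page info under a freshly generated random ID and
// returns the ID so the client can reference it in later page requests.
func (s *MemoryStore) Store(info PageInfo) (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	id := hex.EncodeToString(buf)

	s.mu.Lock()
	defer s.mu.Unlock()
	s.pages[id] = info
	return id, nil
}

// Retrieve looks up previously stored page info by its ID.
func (s *MemoryStore) Retrieve(id string) (PageInfo, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	info, ok := s.pages[id]
	return info, ok
}

func main() {
	store := NewMemoryStore()
	id, _ := store.Store(PageInfo{Title: "Example", URLs: []string{"https://example.com"}})
	info, _ := store.Retrieve(id)
	fmt.Println(id, info.Title)
}
```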
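Likewise, a minimal sketch of the batched URL status checking described under Design concerns: fan one batch of 10 URLs out to goroutines, and compute slice bounds for subsequent pagination requests. `checkBatch`, `pageBounds`, `URLStatus`, and the use of HEAD requests are assumptions here, not the actual `CheckURLStatus` or `CalculatePageBounds` implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// batchSize mirrors the 10-URL batches described in the design concerns.
const batchSize = 10

// URLStatus pairs a URL with the HTTP status (or error) observed for it.
type URLStatus struct {
	URL    string
	Status int
	Err    error
}

// checkBatch fans a single batch of URLs out to goroutines and collects
// their statuses. Only one batch is checked per request; the remaining
// URLs stay in memory until a later pagination request asks for them.
func checkBatch(urls []string) []URLStatus {
	client := &http.Client{Timeout: 5 * time.Second}
	results := make([]URLStatus, len(urls))

	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			resp, err := client.Head(u) // HEAD keeps the check cheap; the real checker may use GET
			if err != nil {
				results[i] = URLStatus{URL: u, Err: err}
				return
			}
			resp.Body.Close()
			results[i] = URLStatus{URL: u, Status: resp.StatusCode}
		}(i, u)
	}
	wg.Wait()
	return results
}

// pageBounds returns the slice bounds for the requested 1-indexed page,
// clamped to the number of stored URLs, in the spirit of CalculatePageBounds.
func pageBounds(page, total int) (start, end int) {
	start = (page - 1) * batchSize
	if start > total {
		start = total
	}
	end = start + batchSize
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	urls := []string{"https://example.com", "https://example.org"}
	start, end := pageBounds(1, len(urls))
	for _, r := range checkBatch(urls[start:end]) {
		fmt.Println(r.URL, r.Status, r.Err)
	}
}
```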