Binary file added docs/lucytech.jpg
53 changes: 53 additions & 0 deletions readme.md
@@ -1,5 +1,28 @@
# Scraper API

An API service to scrape a URL and get a summary.

## High level architecture

![High level diagram](./docs/lucytech.jpg)

### Components

1. Scrape handler - Handles the initial scrape request.
2. Page handler - Handles subsequent pagination requests.
3. HTML parser - Fetches the HTML content of the given URL and processes it.
4. In-memory storage - Holds fetched information mapped to a random unique key.
5. URL status checker - Checks the statuses of URLs found in the HTML content (see the wiring sketch after this list).
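
The following sketch is a rough illustration of how these components might be wired together in an HTTP server. The route paths (`/scrape`, `/page`), the port, and the handler bodies are assumptions for illustration only and may not match this repository's actual handlers:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Scrape handler: accepts a URL, asks the HTML parser to fetch and process
	// it, stores the result in the in-memory storage under a random key, and
	// returns the first batch of results.
	mux.HandleFunc("/scrape", func(w http.ResponseWriter, r *http.Request) {
		// e.g. read ?url=..., call the HTML parser, check the first batch of
		// URL statuses, store the remainder, and respond with JSON
		w.WriteHeader(http.StatusNotImplemented)
	})

	// Page handler: serves subsequent pages of URL statuses for a stored result.
	mux.HandleFunc("/page", func(w http.ResponseWriter, r *http.Request) {
		// e.g. read ?id=...&page=..., look up the in-memory store, respond
		w.WriteHeader(http.StatusNotImplemented)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```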

### Design concerns

> Checking the statuses of URLs found in the scraped HTML content was identified as the most expensive operation in this application. Even though we use goroutines, the response latency of those URLs has a large impact on system performance and on the response time of this API. To tackle this, we decided not to process every discovered URL on the first request: we check the status of only 10 URLs and keep the rest in memory, with pagination support. The user can then fetch the remaining URLs in batches of 10 via subsequent pagination requests. With this approach we were able to scrape websites containing a huge number of URLs (e.g. yahoo.com) effortlessly and without breaking the system.
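
A minimal sketch of this batching idea is shown below, assuming hypothetical names (`checkFirstBatch`, `URLStatus`) and a 5-second timeout that are not taken from this repository's code:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// URLStatus pairs a URL with the HTTP status code returned for it
// (0 marks an unreachable URL). Hypothetical type, for illustration only.
type URLStatus struct {
	URL    string
	Status int
}

const batchSize = 10

// checkFirstBatch checks at most the first batchSize URLs concurrently and
// returns their statuses along with the remaining, unchecked URLs. In the
// application, the remainder would stay in memory and be served by
// subsequent pagination requests.
func checkFirstBatch(urls []string) (checked []URLStatus, remaining []string) {
	n := batchSize
	if len(urls) < n {
		n = len(urls)
	}
	checked = make([]URLStatus, n)
	remaining = urls[n:]

	client := &http.Client{Timeout: 5 * time.Second}
	var wg sync.WaitGroup
	for i, u := range urls[:n] {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			status := 0
			if resp, err := client.Get(u); err == nil {
				status = resp.StatusCode
				resp.Body.Close()
			}
			checked[i] = URLStatus{URL: u, Status: status} // each goroutine writes only its own index
		}(i, u)
	}
	wg.Wait()
	return checked, remaining
}

func main() {
	statuses, rest := checkFirstBatch([]string{
		"https://example.com",
		"https://example.org",
	})
	fmt.Println(statuses, "unchecked:", len(rest))
}
```

Because each goroutine writes to its own slice index, no extra locking is needed; the unchecked remainder is what the pagination requests would serve later.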

#### Further improvements

* We can replace the in-memory storage with a database.
* We can use a messaging technique to push data changes to the UI in real time.

## How to run using Docker

* Run `docker-compose up --build`
@@ -8,6 +31,36 @@

* Run `go test -coverprofile=coverage.out ./...`
* To check the test coverage, run `go tool cover -func=coverage.out`
* Test coverage output:

```bash
scraper/cmd/main.go:12: main 0.0%
scraper/handlers/scrape.go:19: ScrapeHandler 94.1%
scraper/handlers/scrape.go:51: PageHandler 85.7%
scraper/logger/logger.go:14: Debug 100.0%
scraper/logger/logger.go:18: Info 0.0%
scraper/logger/logger.go:22: Error 100.0%
scraper/services/htmlparser.go:13: FetchPageInfo 100.0%
scraper/services/htmlparser.go:24: ParseHTML 90.9%
scraper/services/htmlparser.go:64: traverse 80.0%
scraper/services/htmlparser.go:74: extractHref 75.0%
scraper/services/htmlparser.go:83: resolveURL 100.0%
scraper/services/htmlparser.go:89: isInternal 100.0%
scraper/services/htmlparser.go:96: containsPasswordInput 100.0%
scraper/services/htmlparser.go:112: extractTitle 100.0%
scraper/services/htmlparser.go:125: extractHtmlVersion 50.0%
scraper/services/urlstatus.go:21: CheckURLStatus 100.0%
scraper/storage/memory.go:15: StorePageInfo 100.0%
scraper/storage/memory.go:24: RetrievePageInfo 100.0%
scraper/storage/memory.go:32: generateID 100.0%
scraper/storage/memory.go:36: randomString 100.0%
scraper/utils/helpers.go:10: CalculateTotalPages 100.0%
scraper/utils/helpers.go:14: CalculatePageBounds 100.0%
scraper/utils/helpers.go:20: min 66.7%
scraper/utils/helpers.go:27: BuildPageResponse 50.0%
total: (statements) 86.6%

```

## API Documentation
