[#21] [Backend] As a system, I can scrap data from Google search result page by mosharaf13 · Pull Request #43 · nimblehq/google-scrapper-ruby

mosharaf13 · 2023-06-12T09:13:00Z

closes #21

What happened 👀

Added a backend mechanism for scraping data from Google.

Insight 📝

Added a job that takes a search_stat id as input and, after scraping data from Google, updates the relevant columns for it.

Proof Of Work 📹

Testsuite run stat

github-actions · 2023-06-12T09:22:06Z

Code coverage is now at 0.00% (0/204 lines)

Generated by 🚫 Danger

sanG-github

Some early suggestions 😉 Please take a look at your code and re-format it; I saw some empty lines in the service. 🙇
I hope that is useful for you.

app/jobs/google/search_keyword_job.rb

sanG-github · 2023-06-12T10:16:08Z

app/jobs/google/search_keyword_job.rb

+  class SearchKeywordJob < ApplicationJob
+    queue_as :default
+
+    def perform(search_stat_id)


We'd better make the call explicit by using keyword arguments.

Suggested change

- def perform(search_stat_id)
+ def perform(search_stat_id:)

Fixed in 45c5304
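
For reference, a minimal plain-Ruby sketch of what the keyword argument buys at the call site. SearchKeywordJobSketch is a hypothetical stand-in (the real job inherits from ApplicationJob and is enqueued with perform_later), and the return string is only for illustration.

```ruby
# Hypothetical stand-in for the job; no Rails or ActiveJob involved.
class SearchKeywordJobSketch
  # The trailing colon makes search_stat_id a required keyword argument.
  def perform(search_stat_id:)
    "processing search_stat #{search_stat_id}"
  end
end

job = SearchKeywordJobSketch.new

# The call site has to name the argument, which documents what the value is.
keyword_call_result = job.perform(search_stat_id: 42)

# A positional call is rejected with ArgumentError instead of silently working.
positional_call_error =
  begin
    job.perform(42)
    nil
  rescue ArgumentError => error
    error
  end
```

With the keyword form, a caller that passes the wrong kind of value by position fails immediately rather than at lookup time.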

sanG-github · 2023-06-12T10:18:49Z

app/services/google/parser_service.rb

+    end
+
+    def ads_page_urls
+      document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }


We only need to return true/false.

Suggested change

- document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
+ document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].present? }

Reverted to presence in 57afc7e, because filter_map with present? returns a collection of true values, not the actual URLs.

Sorry about that. We can use filter with present? instead:

Suggested change

- document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
+ document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? }
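
To make the difference between the three variants concrete, here is a small plain-Ruby sketch. It assumes nothing from the PR beyond the shape of the call: the present? and presence helpers below are simplified stand-ins for the ActiveSupport methods, and the links array of hashes stands in for the Nokogiri node set.

```ruby
# Simplified stand-ins for ActiveSupport's String#present? and #presence.
def present?(value)
  !value.nil? && !value.strip.empty?
end

def presence(value)
  present?(value) ? value : nil
end

# Hashes stand in for <a> nodes; node['href'] works the same way on both.
links = [
  { 'href' => 'https://a.example' },
  { 'href' => '' },
  { 'href' => nil },
  { 'href' => 'https://b.example' }
]

# filter_map keeps the block's truthy return values, so presence yields URLs.
urls = links.filter_map { |a_tag| presence(a_tag['href']) }

# With present? the block returns booleans, so only `true` values are collected.
flags = links.filter_map { |a_tag| present?(a_tag['href']) }

# filter keeps the original elements, so a map is still needed to get the hrefs.
tags_with_href = links.filter { |a_tag| present?(a_tag['href']) }
hrefs = tags_with_href.map { |a_tag| a_tag['href'] }
```

Here urls and hrefs hold the same two URLs, flags holds two true values, and tags_with_href holds the two original elements.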

app/services/google/parser_service.rb

sanG-github · 2023-06-12T10:20:42Z

spec/fabricators/search_stat_fabricator.rb

  status { rand(1..3) }
  raw_response { FFaker::HTMLIpsum.body }
-  user_id { demo_user.id }
+  user { User.create(email: 'user@demo.com', password: 'Secret@11') }


Why don't we do it explicitly as previously? 💭

Suggested change

- user { User.create(email: 'user@demo.com', password: 'Secret@11') }
+ user { demo_user }

Declaring the user explicitly throws the exception below when running the tests, whereas the implicit creation doesn't throw any error. If the explicit declaration is required, we will have to find another way to solve the problem.

Let's specify the class name explicitly:

Fabricator(:search_stat, class_name: SearchStat) do

Why do you need a specific user for the search_stat? How about generating one randomly, e.g.:

Fabricator(:search_stat, class_name: SearchStat) do
  ...
  user { Fabricate(:user) }
end

When you need to use the fabricator with a specific user, e.g. in the seed file, it can be done like this:

john = User.find(1)
search_stat = Fabricate(:search_stat, user: john)

Sorry about that, I thought we were in the seeds file. How about using Fabricate instead of creating a new one manually?

Suggested change

- user { User.create(email: 'user@demo.com', password: 'Secret@11') }
+ user { Fabricate(:user) }

Fixed in 5c5eaad

sanG-github · 2023-06-12T10:22:56Z

app/jobs/google/search_keyword_job.rb

+
+      html_result = Google::ClientService.new(keyword: search_stat.keyword).call
+
+      raise ClientServiceError unless html_result


It does NOT make total sense as the returned data can be blank, right? 🙊

IMO, I think we can separate it into a method, and then use the begin/rescue to catch the errors.

Fixed in 122de2f
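
A rough sketch of the "separate method plus begin/rescue" idea, in plain Ruby. The client argument is a hypothetical stand-in for Google::ClientService (anything that responds to call), and treating a blank body as an error is an assumption made for illustration, not necessarily what 122de2f does.

```ruby
class ClientServiceError < StandardError; end

# `client` is assumed to return the HTML as a String, or nil/'' when nothing came back.
def fetch_html_result(client)
  html = client.call
  # A blank response is treated as a failure here (an assumption of this sketch).
  raise ClientServiceError, 'Empty response' if html.nil? || html.empty?

  html
rescue ClientServiceError => error
  # Method-level rescue: log and signal failure with nil.
  warn "Fetch failed: #{error.message}"
  nil
end

fetched_html = fetch_html_result(-> { '<html></html>' })
blank_fetch = fetch_html_result(-> { '' })
```

Keeping the rescue inside the helper leaves perform with a single happy path and one place where fetch failures are handled.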

@mosharaf13 I don't know why the changes aren't reflected here.

I am not quite sure about that. This is how the job looks currently.

sanG-github

Some suggestions, please update tests for the missing service.

app/jobs/google/search_keyword_job.rb

sanG-github · 2023-06-16T08:21:09Z

app/jobs/google/search_keyword_job.rb

+    def perform(search_stat_id:)
+      search_stat = SearchStat.find search_stat_id
+      html_result = fetch_html_result(search_stat.keyword)
+      update_search_stat search_stat, ParserService.new(html_response: html_result).call


I think it's more explicit and readable.
First, we call the service and assign the returned data to the variable, then, we pass the variable to the update method.

Suggested change

- update_search_stat search_stat, ParserService.new(html_response: html_result).call
+ parsed_attributes = ParserService.new(html_response: html_result).call
+ update_search_stat(search_stat, parsed_attributes)

Fixed in 40d3740

sanG-github · 2023-06-16T08:21:54Z

app/jobs/google/search_keyword_job.rb

+    queue_as :default
+
+    def perform(search_stat_id:)
+      search_stat = SearchStat.find search_stat_id


What happens if the SearchStat is not found? 🤔 (Maybe it was deleted before the job ran)

Is this really a possibility 🤔? Should this case be handled?

Sure Mosharaf, we must handle this case. 🙏

Fixed in b1fde06

About the fix: execution never reaches the line return unless search_stat when there is no search_stat, because find raises ActiveRecord::RecordNotFound before that line runs.
Take a look at the difference between find and find_by.
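
The difference can be sketched without a database. FakeSearchStat and RecordNotFound below are hypothetical stand-ins that only mimic the documented ActiveRecord behaviour: find raises when the record is missing, find_by returns nil.

```ruby
class RecordNotFound < StandardError; end

# In-memory stand-in for the SearchStat model.
class FakeSearchStat
  RECORDS = { 1 => 'ruby' }.freeze

  # Mimics ActiveRecord's find: raises when the id is missing.
  def self.find(id)
    RECORDS.fetch(id) { raise RecordNotFound, "Couldn't find SearchStat with id=#{id}" }
  end

  # Mimics ActiveRecord's find_by: returns nil when nothing matches.
  def self.find_by(id:)
    RECORDS[id]
  end
end

def perform_with_find(id)
  search_stat = FakeSearchStat.find(id)
  return :skipped unless search_stat # never reached for a missing id

  :processed
end

def perform_with_find_by(id)
  search_stat = FakeSearchStat.find_by(id: id)
  return :skipped unless search_stat # this guard now does its job

  :processed
end

find_outcome =
  begin
    perform_with_find(99)
  rescue RecordNotFound
    :raised
  end
find_by_outcome = perform_with_find_by(99)
```

So the guard clause only makes sense together with find_by; with find, the missing record has to be handled in a rescue.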

sanG-github · 2023-06-16T08:24:44Z

app/jobs/google/search_keyword_job.rb

+      Rails.logger.error("Error while fetching HTML result: #{e.message}")
+      raise ClientServiceError, 'Error fetching HTML result'


We never reach this code, as we already rescue the errors in the ClientService and then return false.
If an error other than HTTParty::Error, Timeout::Error, or SocketError occurs, we can also catch it in the ClientService.

So, should this section be removed?

If an error other than HTTParty::Error, Timeout::Error, or SocketError occurs, we can also catch it in the ClientService.

We can remove it here, but please handle it in the ClientService instead.

Fixed in 582a136
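
To illustrate why the job-level rescue was dead code, here is a plain-Ruby sketch. ClientServiceSketch and fetch_in_job are hypothetical names, HTTParty is left out so the example runs with the standard library only, and the catch-all rescue StandardError is one possible way to handle "other" errors inside the service, not necessarily what 582a136 does.

```ruby
require 'socket'
require 'timeout'

# Stand-in for Google::ClientService; `fetcher` replaces the real HTTP call.
class ClientServiceSketch
  def initialize(fetcher)
    @fetcher = fetcher
  end

  def call
    @fetcher.call
  rescue Timeout::Error, SocketError => error
    warn "Known network error: #{error.class}"
    false
  rescue StandardError => error
    # Anything unexpected is also handled here, so callers only ever see false.
    warn "Unexpected error: #{error.class}"
    false
  end
end

# Stand-in for the job's helper: its rescue can no longer be triggered by the service.
def fetch_in_job(service)
  service.call
rescue StandardError
  :job_rescue_reached
end

timeout_result = fetch_in_job(ClientServiceSketch.new(-> { raise Timeout::Error }))
unexpected_result = fetch_in_job(ClientServiceSketch.new(-> { raise 'boom' }))
```

Both calls come back as false from the service, and :job_rescue_reached is never returned.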

sanG-github · 2023-06-16T08:27:03Z

app/jobs/google/search_keyword_job.rb

+        # rubocop:disable Rails/SkipsModelValidations
+        search_stat.result_links.insert_all attributes[:result_links]
+        # rubocop:enable Rails/SkipsModelValidations


Why do we need to disable this rule? Usually we find a way to work around it instead of disabling the rule each time we face it.

I think we'd better use create instead of insert_all, so that we won't skip any validations, as we don't want to create a record without a URL.

validates :url, presence: true

Fixed in 8051e7c

sanG-github · 2023-06-16T08:48:27Z

app/services/google/client_service.rb

+    # Inspect Http response status code
+    # Any non 200 response code will be logged
+    def valid_result?(result)
+      return true if result&.response&.code == '200'


We can return false early if there is no result, so we don't need the safe navigation operator anymore. ✨

Suggested change

- return true if result&.response&.code == '200'
+ return false unless result
+ return true if result.response.code == SUCCESS_STATUS_CODE

Also, please define a constant for SUCCESS_STATUS_CODE = '200'

Fixed in bce3b3e
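
A small runnable sketch of the suggested shape. The Response and Result structs are hypothetical stand-ins for the HTTParty response object, which exposes the status through result.response.code as a String in the snippet under review.

```ruby
SUCCESS_STATUS_CODE = '200'

# Hypothetical stand-ins for the HTTP client's result objects.
Response = Struct.new(:code)
Result = Struct.new(:response)

def valid_result?(result)
  # Early return removes the need for the safe navigation operator below.
  return false unless result

  result.response.code == SUCCESS_STATUS_CODE
end

ok_result = Result.new(Response.new('200'))
not_found_result = Result.new(Response.new('404'))
```

Returning the comparison directly also means the predicate always yields true or false, never nil.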

sanG-github · 2023-06-16T08:50:12Z

app/services/google/client_service.rb

+    def valid_result?(result)
+      return true if result&.response&.code == '200'
+
+      Rails.logger.warn "Warning: Query Google with '#{@escaped_keyword}' return status code #{result.response.code}"


This should not belong to the valid_result? method; we only expect it to return whether the result is valid, not to write the log. (We can move it to the call method.)

Fixed in bce3b3e

sanG-github · 2023-06-16T08:58:25Z

app/services/google/parser_service.rb

+    AD_CONTAINER_ID = 'tads'
+    ADWORDS_CLASS = 'adwords'


I see that if we use ID or CLASS, we need to manually add the corresponding prefix (# or .) each time.
So, what do you think about defining these as SELECTOR constants, as you have done with NON_ADS_RESULT_SELECTOR?

Suggested change

- AD_CONTAINER_ID = 'tads'
- ADWORDS_CLASS = 'adwords'
+ AD_CONTAINER_ID = '#tads'
+ ADWORDS_CLASS = '.adwords'

# Add a class to all AdWords link for easier manipulation
document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)

ADWORDS_CLASS is the class being added to these links for easier manipulation. Adding . to the constant would mean stripping the . again before using it in this line.
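
One way to reconcile both points, sketched under the assumption that extra constants are acceptable: keep the bare names for add_class and derive the selectors from them. ADWORDS_SELECTOR and AD_CONTAINER_SELECTOR are hypothetical names that do not appear in the PR.

```ruby
# Bare names, usable wherever a class name or id is expected (e.g. add_class).
ADWORDS_CLASS = 'adwords'
AD_CONTAINER_ID = 'tads'

# Derived CSS selectors, usable wherever a selector is expected (e.g. document.css).
ADWORDS_SELECTOR = ".#{ADWORDS_CLASS}"
AD_CONTAINER_SELECTOR = "##{AD_CONTAINER_ID}"
```

Call sites would then read document.css(ADWORDS_SELECTOR) for lookups and add_class(ADWORDS_CLASS) for marking, with no prefix added or stripped by hand.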

sanG-github · 2023-06-16T09:00:18Z

app/services/google/parser_service.rb

+      raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank?
+
+      @html = html_response
+
+      @document = Nokogiri::HTML.parse(html_response)
+
+      # Add a class to all AdWords link for easier manipulation
+      document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)
+
+      # Mark footer links to identify them
+      document.css('#footcnt a').add_class('footer-links')


In this scope, we'd better not raise any errors or modify the document; just initialize the values we need before moving on.

Suggested change

- raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank?
-
- @html = html_response
-
- @document = Nokogiri::HTML.parse(html_response)
-
- # Add a class to all AdWords link for easier manipulation
- document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)
-
- # Mark footer links to identify them
- document.css('#footcnt a').add_class('footer-links')
+ @html = html_response
+
+ @document = Nokogiri::HTML.parse(html_response) if html_response.body

sanG-github · 2023-06-16T09:02:36Z

app/services/google/parser_service.rb

+    end
+
+    # Parse html data and return a hash with the results
+    def call


And then in this method, we will early return if it does not pass the valid? check.

Suggested change

- def call
+ def call
+   return unless valid?

def valid?
  html.present? && document.present?
end

7dda9bb. Did a bit of refactoring based on your suggestions. Please let me know if this covers everything.

Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.

Fixed in bd1df63
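
The overall shape discussed in this thread (a side-effect-free initializer, a valid? guard, and an early return from call) can be sketched in plain Ruby. ParserServiceSketch is a hypothetical stand-in: parsing is stubbed with a Hash instead of a Nokogiri document, and the returned hash is illustrative only.

```ruby
class ParserServiceSketch
  def initialize(html_response:)
    # Only assign state here; no raising and no document manipulation.
    @html = html_response
    # Stub for parsing: the real service builds a Nokogiri document instead.
    @document = html_response.to_s.strip.empty? ? nil : { body: html_response }
  end

  # Returns a hash of parsed data, or nil when the input is unusable.
  def call
    return unless valid?

    { html_length: @html.length }
  end

  private

  def valid?
    !@html.nil? && !@document.nil?
  end
end

parsed = ParserServiceSketch.new(html_response: '<html></html>').call
skipped = ParserServiceSketch.new(html_response: '').call
```

Callers then need to handle a nil return from call, which is the trade-off for not raising in the initializer.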

longnd · 2023-06-16T08:10:37Z

spec/fabricators/search_stat_fabricator.rb

  status { rand(1..3) }
  raw_response { FFaker::HTMLIpsum.body }
-  user_id { demo_user.id }
+  user { User.create(email: 'user@demo.com', password: 'Secret@11') }


Let's specify the class name explicitly:

Fabricator(:search_stat, class_name: SearchStat) do

longnd · 2023-06-16T08:10:40Z

spec/fabricators/search_stat_fabricator.rb

  status { rand(1..3) }
  raw_response { FFaker::HTMLIpsum.body }
-  user_id { demo_user.id }
+  user { User.create(email: 'user@demo.com', password: 'Secret@11') }


Why do you need a specific user for the search_stat? How about generating one randomly, e.g.:

Fabricator(:search_stat, class_name: SearchStat) do
  ...
  user { Fabricate(:user) }
end

When you need to use the fabricator with a specific user, e.g. in the seed file, it can be done like this:

john = User.find(1)
search_stat = Fabricate(:search_stat, user: john)

app/services/google/parser_service.rb

github-actions · 2023-06-20T10:40:13Z

app/services/google/client_service.rb

+      @uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}")
+    end
+
+    def call


⚠️ Has approx 6 statements

github-actions · 2023-06-20T10:40:14Z

app/services/google/client_service.rb

+      return false unless valid_result? result
+
+      result
+    rescue HTTParty::Error, Timeout::Error, SocketError => e


⚠️ Has the variable name 'e'

github-actions · 2023-06-20T10:40:15Z

app/services/google/client_service.rb

+
+    private
+
+    def valid_result?(result)


⚠️ Doesn't depend on instance state (maybe move it to another class?)

github-actions · 2023-06-21T07:59:54Z

app/jobs/google/search_keyword_job.rb

+
+    def fetch_html_result(keyword)
+      Google::ClientService.new(keyword: keyword).call
+    rescue StandardError => e


⚠️ Has the variable name 'e'

sanG-github

Please help to resolve all the remaining comments, and add the missing tests (client_service.rb)

sanG-github · 2023-06-28T03:42:35Z

app/jobs/google/search_keyword_job.rb

+    queue_as :default
+
+    def perform(search_stat_id:)
+      search_stat = SearchStat.find search_stat_id


About the fix: execution never reaches the line return unless search_stat when there is no search_stat, because find raises ActiveRecord::RecordNotFound before that line runs.
Take a look at the difference between find and find_by.

sanG-github · 2023-06-28T03:44:41Z

app/jobs/google/search_keyword_job.rb

+
+      html_result = Google::ClientService.new(keyword: search_stat.keyword).call
+
+      raise ClientServiceError unless html_result


@mosharaf13 I don't know why the changes aren't reflected here.

sanG-github · 2023-06-28T03:47:07Z

app/jobs/google/search_keyword_job.rb

+
+      update_search_stat(search_stat, parsed_attributes)
+    rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError
+      update_keyword_status search_stat, :failed


Moreover, what happens if an error is raised that is not one of these? 🤔 Do we need to roll it back too?

ced1a0e Refactored search_keyword_job and client_service as ClientServiceError exception wasn't handled properly.

We can try wrapping the parser_service call method in a rescue and handling the exception here. In that case, the rescue clause ends up with too many params.

Current:

rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid

With the parser service error added:

rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid, ParserServiceError

Should we keep this many params, or should we create a generic exception class to handle these?
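
A sketch of the generic exception class option, in plain Ruby. Google::ScrapingError is a hypothetical base class name, and the ActiveRecord errors are left out so the example runs without Rails; in the job they would still need to be listed or handled separately.

```ruby
module Google
  # Hypothetical common parent for the service-level errors.
  class ScrapingError < StandardError; end
  class ClientServiceError < ScrapingError; end
  class ParserServiceError < ScrapingError; end
end

# Stand-in for the job body: one rescue entry covers every service error.
def run_step
  yield
  :completed
rescue Google::ScrapingError, ArgumentError => error
  "failed: #{error.class}"
end

client_failure = run_step { raise Google::ClientServiceError }
parser_failure = run_step { raise Google::ParserServiceError }
success = run_step { :anything }
```

New service errors that inherit from the base class are picked up by the same rescue without touching the job.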

sanG-github · 2023-06-28T03:50:04Z

app/services/google/parser_service.rb

+    end
+
+    def ads_page_urls
+      document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }


Sorry about that. We can use filter with present? instead:

Suggested change

- document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
+ document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? }

sanG-github · 2023-06-28T03:51:08Z

app/services/google/parser_service.rb

+    end
+
+    # Parse html data and return a hash with the results
+    def call


Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.

sanG-github · 2023-06-28T03:52:11Z

lib/tasks/search_keyword.rake

+    # Schedule the SearchKeywordJob for background processing
+    Google::SearchKeywordJob.perform_later(search_stat_id: 1)
+
+    puts 'SearchKeywordJob scheduled successfully.'


Just curious: why do we need this rake task? 🤔

I am trying to make present? work with filter, but I keep getting an error (attached).

Here, if I use filter, I have to chain another map call to return only the 'href' values. Would it be worth it 😬?

Also, if I use a select-pluck combo, would that be okay 😀?

If present? doesn't provide a significant performance improvement, wouldn't filter_map with presence be better in that case, considering the conciseness? 🤔

Just curious: why do we need this rake task? 🤔

Removing 👍🏼

github-actions · 2023-06-29T04:18:37Z

spec/services/google/client_service.rb

+      @uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}")
+    end
+
+    def call


⚠️ Has approx 6 statements

github-actions · 2023-06-29T04:18:38Z

spec/services/google/client_service.rb

+      return false unless valid_result? result
+
+      result
+    rescue HTTParty::Error, Timeout::Error, SocketError => e


⚠️ Has the variable name 'e'

github-actions · 2023-06-29T04:57:02Z

app/jobs/google/search_keyword_job.rb

+      update_search_stat_status search_stat, :failed
+    end
+
+    def update_search_stat(search_stat, attributes)


⚠️ Doesn't depend on instance state (maybe move it to another class?)

github-actions · 2023-06-29T04:57:03Z

app/jobs/google/search_keyword_job.rb

+      end
+    end
+
+    def update_search_stat_status(search_stat, status)


⚠️ Doesn't depend on instance state (maybe move it to another class?)

github-actions · 2023-06-29T04:57:05Z

app/services/google/client_service.rb

+      raise ClientServiceError unless valid_result? result
+
+      result
+    rescue HTTParty::Error, Timeout::Error, SocketError, ClientServiceError => e


⚠️ Has the variable name 'e'

github-actions · 2023-06-29T11:15:56Z

app/services/google/parser_service.rb

+      results
+    end
+
+    def result_link_map(urls, type)


⚠️ Doesn't depend on instance state (maybe move it to another class?)

mosharaf13 added 2 commits June 12, 2023 15:16

[#21] Add job to scrap search result from google

20b611e

Merge branches 'ui/list-keywords' and 'backend/scrap-google-search-pa…

10c5a6e

…ge' of github.com:mosharaf13/google-scrapper-ruby into backend/scrap-google-search-page

mosharaf13 added the feature label Jun 12, 2023

mosharaf13 added this to the 0.5.0 milestone Jun 12, 2023

mosharaf13 self-assigned this Jun 12, 2023

mosharaf13 requested a review from a user June 12, 2023 09:13

mosharaf13 requested review from longnd and sanG-github as code owners June 12, 2023 09:13

sanG-github reviewed Jun 12, 2023

mosharaf13 and others added 9 commits June 13, 2023 12:17

[#21] Use keyword argument for search keyword job

45c5304

[#21] Fix search stat job spec

4e60183

[#21] Remove blank lines from parser service

7ee3d01

[#21] Update url count methods of parser service

f8e2338

Merge branch 'ui/list-keywords' of github.com:mosharaf13/google-scrap…

c1669cf

…per-ruby into backend/scrap-google-search-page

[#21] Seed search stat with result links

9aa642f

[#21] Update result link type enum in parser service

3d70f82

[#21] Refactor parser service

8a96872

Co-authored-by: Sang Huynh Thanh <63148598+sanG-github@users.noreply.github.com>

[#21] Refactor search keyword job

122de2f

sanG-github reviewed Jun 16, 2023

longnd reviewed Jun 16, 2023

mosharaf13 added 5 commits June 20, 2023 12:50

[#21] Fabricate random user while fabricating search stat

5c5eaad

[#21] Refactor search keyword job

40d3740

[#21] Refactor search keyword job

8051e7c

[#21] Refactor client service

bce3b3e

[#21] Fix failing unit tests

e584866

github-actions bot reviewed Jun 20, 2023

[#21] In search keyword job handle case of non existant search stat

b1fde06

github-actions bot reviewed Jun 21, 2023

mosharaf13 added 6 commits June 21, 2023 15:29

[#21] Handle exeption during search keyword job

582a136

[#21] Add explicit class name for search stat fabricator

5860585

[#21] Add test for parser service top ad count

ef9224b

[#21] Fix bugs in parser service

57afc7e

[#21] Add tests for parser service

c190d21

[#21] Refactor parser service

7dda9bb

mosharaf13 requested review from longnd and sanG-github June 22, 2023 09:36

Base automatically changed from ui/list-keywords to develop June 23, 2023 03:42

Merge branch 'develop' into backend/scrap-google-search-page

131259d

sanG-github requested changes Jun 28, 2023

mosharaf13 added 2 commits June 29, 2023 09:45

[#21] Add tests for failing scenarios for google search keyword job

cf825ff

Merge branch 'backend/scrap-google-search-page' of github.com:moshara…

278be86

…f13/google-scrapper-ruby into backend/scrap-google-search-page

ghost approved these changes Jun 29, 2023

mosharaf13 requested a review from sanG-github June 29, 2023 04:09

[#21] Add tests for client service

9691391

github-actions bot reviewed Jun 29, 2023

[#21] Refactor search keyword job and client service

ced1a0e

github-actions bot reviewed Jun 29, 2023

mosharaf13 added 3 commits June 29, 2023 12:02

[#21] Rescue active record transaction exception in search keyword job

6b8502e

[#21] Remove unnecessary task class

8f40769

[#21] Reorder parser service methods

bd1df63

github-actions bot reviewed Jun 29, 2023

sanG-github removed their request for review September 26, 2023 02:41
