Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
20b611e
[#21] Add job to scrap search result from google
mosharaf13 Jun 12, 2023
10c5a6e
Merge branches 'ui/list-keywords' and 'backend/scrap-google-search-pa…
mosharaf13 Jun 12, 2023
45c5304
[#21] Use keyword argument for search keyword job
mosharaf13 Jun 13, 2023
4e60183
[#21] Fix search stat job spec
mosharaf13 Jun 13, 2023
7ee3d01
[#21] Remove blank lines from parser service
mosharaf13 Jun 13, 2023
f8e2338
[#21] Update url count methods of parser service
mosharaf13 Jun 14, 2023
c1669cf
Merge branch 'ui/list-keywords' of github.com:mosharaf13/google-scrap…
mosharaf13 Jun 14, 2023
9aa642f
[#21] Seed search stat with result links
mosharaf13 Jun 14, 2023
3d70f82
[#21] Update result link type enum in parser service
mosharaf13 Jun 14, 2023
8a96872
[#21] Refactor parser service
mosharaf13 Jun 15, 2023
122de2f
[#21] Refactor search keyword job
mosharaf13 Jun 15, 2023
5c5eaad
[#21] Fabricate random user while fabricating search stat
mosharaf13 Jun 20, 2023
40d3740
[#21] Refactor search keyword job
mosharaf13 Jun 20, 2023
8051e7c
[#21] Refactor search keyword job
mosharaf13 Jun 20, 2023
bce3b3e
[#21] Refactor client service
mosharaf13 Jun 20, 2023
e584866
[#21] Fix failing unit tests
mosharaf13 Jun 20, 2023
b1fde06
[#21] In search keyword job handle case of non existant search stat
mosharaf13 Jun 21, 2023
582a136
[#21] Handle exeption during search keyword job
mosharaf13 Jun 21, 2023
5860585
[#21] Add explicit class name for search stat fabricator
mosharaf13 Jun 22, 2023
ef9224b
[#21] Add test for parser service top ad count
mosharaf13 Jun 22, 2023
57afc7e
[#21] Fix bugs in parser service
mosharaf13 Jun 22, 2023
c190d21
[#21] Add tests for parser service
mosharaf13 Jun 22, 2023
7dda9bb
[#21] Refactor parser service
mosharaf13 Jun 22, 2023
131259d
Merge branch 'develop' into backend/scrap-google-search-page
mosharaf13 Jun 23, 2023
cf825ff
[#21] Add tests for failing scenarios for google search keyword job
mosharaf13 Jun 29, 2023
278be86
Merge branch 'backend/scrap-google-search-page' of github.com:moshara…
mosharaf13 Jun 29, 2023
9691391
[#21] Add tests for client service
mosharaf13 Jun 29, 2023
ced1a0e
[#21] Refactor search keyword job and client service
mosharaf13 Jun 29, 2023
6b8502e
[#21] Rescue active record transaction exception in search keyword job
mosharaf13 Jun 29, 2023
8f40769
[#21] Remove unnecessary task class
mosharaf13 Jun 29, 2023
bd1df63
[#21] Reorder parser service methods
mosharaf13 Jun 29, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Gemfile
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ gem 'tzinfo-data', platforms: [:mingw, :mswin, :x64_mingw, :jruby] # Windows doe
# gem 'kredis' # Use Kredis to get higher-level data types in Redis
# gem 'bcrypt' # Use Active Model has_secure_password
gem 'devise' # Flexible authentication solution for Rails with Warden
gem 'httparty' # A HTTP client for Ruby.

# Authentications & Authorizations
gem 'pundit' # Minimal authorization through OO design and pure Ruby classes
Expand Down
5 changes: 5 additions & 0 deletions Gemfile.lock
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,9 @@ GEM
globalid (1.1.0)
activesupport (>= 5.0)
hashdiff (1.0.1)
httparty (0.21.0)
mini_mime (>= 1.0.0)
multi_xml (>= 0.5.2)
i18n (1.13.0)
concurrent-ruby (~> 1.0)
i18n-js (3.9.0)
Expand Down Expand Up @@ -226,6 +229,7 @@ GEM
mini_mime (1.1.2)
minitest (5.18.0)
msgpack (1.7.1)
multi_xml (0.6.0)
nap (1.1.0)
net-imap (0.3.4)
date
Expand Down Expand Up @@ -468,6 +472,7 @@ DEPENDENCIES
ffaker
figaro
foreman
httparty
i18n-js (= 3.9.0)
jsbundling-rails
json_matchers
Expand Down
31 changes: 31 additions & 0 deletions app/jobs/google/search_keyword_job.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# frozen_string_literal: true

module Google
class ClientServiceError < StandardError; end

class SearchKeywordJob < ApplicationJob
queue_as :default

def perform(search_stat_id:)
search_stat = SearchStat.find search_stat_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the SearchStat is not found? 🤔 (Maybe it was deleted before the job ran)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a possibility 🤔? Should this case be handled?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure Mosharaf, we must handle this case. 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b1fde06

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the fix, it will never reach the line return unless search_stat when there is no search_stat.
You can take a look at the difference between find and find_by.

html_result = Google::ClientService.new(keyword: search_stat.keyword).call
parsed_attributes = ParserService.new(html_response: html_result).call

update_search_stat(search_stat, parsed_attributes)
rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid
update_search_stat_status search_stat, :failed
end

def update_search_stat(search_stat, attributes)
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

SearchStat.transaction do
search_stat.result_links.create(attributes[:result_links])

search_stat.update! attributes.except(:result_links)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there are some implicit values returned by the Service. Sometimes it can be false, right? 💭

end
end

def update_search_stat_status(search_stat, status)
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

search_stat.update! status: status
end
end
end
36 changes: 36 additions & 0 deletions app/services/google/client_service.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# frozen_string_literal: true

module Google
class ClientService
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a more meaningful name for this service. At first glance, no one knows what does the Client Service will do.

Suggested change
class ClientService
class FetchHtmlByKeywordService

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '\
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'

BASE_SEARCH_URL = 'https://www.google.com/search'

SUCCESS_STATUS_CODE = '200'

def initialize(keyword:, lang: 'en')
@escaped_keyword = CGI.escape(keyword)
@uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}")
end

def call
Copy link

@github-actions github-actions bot Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has approx 6 statements

result = HTTParty.get(@uri, { headers: { 'User-Agent' => USER_AGENT } })
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more readable if we can separate it to a method 😉

Suggested change
result = HTTParty.get(@uri, { headers: { 'User-Agent' => USER_AGENT } })
result = HTTParty.get(@uri, headers})
def headers
{ headers: { 'User-Agent' => USER_AGENT }
end


raise ClientServiceError unless valid_result? result

result
rescue HTTParty::Error, Timeout::Error, SocketError, ClientServiceError => e
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has the variable name 'e'

Rails.logger.error "Error: Query Google with '#{@escaped_keyword}' thrown an error: #{e}"

raise ClientServiceError, 'Error fetching HTML result'
end

private

def valid_result?(result)
Copy link

@github-actions github-actions bot Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

return false unless result
return true if result.response.code == SUCCESS_STATUS_CODE
end
end
end
94 changes: 94 additions & 0 deletions app/services/google/parser_service.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# frozen_string_literal: true

module Google
class ParserService
NON_ADS_RESULT_SELECTOR = 'a[data-ved]:not([role]):not([jsaction]):not(.adwords):not(.footer-links)'
AD_CONTAINER_ID = 'tads'
ADWORDS_CLASS = 'adwords'
Comment on lines +6 to +7
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that if we use ID or CLASS, we need to manually add a corresponding prefix # or . for it.
So, what do you think about defining it as a SELECTOR as you have done with NON_ADS_RESULT_SELECTOR

Suggested change
AD_CONTAINER_ID = 'tads'
ADWORDS_CLASS = 'adwords'
AD_CONTAINER_ID = '#tads'
ADWORDS_CLASS = '.adwords'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      # Add a class to all AdWords link for easier manipulation
      document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)

ADWORDS_CLASS is being added for easier manipulation. Adding . to this constant would result in stripping . before using it in this line.


def initialize(html_response:)
@html = html_response

@document = Nokogiri::HTML.parse(html_response) if html_response.body
end

# Parse html data and return a hash with the results
def call
return unless valid?

mark_adword_links
mark_footer_links

present_parsed_data
end

private

attr_reader :html, :document

def valid?
html.present? && document.present?
end

def mark_adword_links
# Add a class to all AdWords link for easier manipulation
document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)
end

def mark_footer_links
# Mark footer links to identify them
document.css('#footcnt a').add_class('footer-links')
end

def present_parsed_data
{
top_ad_count: ads_top_count,
ad_count: ads_page_count,
non_ad_count: non_ads_result_count,
total_result_count: total_link_count,
raw_response: html,
result_links: result_links,
status: :completed
}
end

def ads_top_count
document.css("##{AD_CONTAINER_ID} .#{ADWORDS_CLASS}").count
end

def ads_page_count
document.css(".#{ADWORDS_CLASS}").count
end

def ads_top_urls
document.css("##{AD_CONTAINER_ID} .#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
end

def ads_page_urls
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to return true/false.

Suggested change
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].present? }

Copy link
Contributor Author

@mosharaf13 mosharaf13 Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting back to presence here 57afc7e as present? returns a collection of only true values, not actual urls

Screen Shot 2023-06-22 at 11 02 51 AM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, we can use filter with present?,

Suggested change
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? }

end

def non_ads_result_count
document.css(NON_ADS_RESULT_SELECTOR).count { |a_tag| a_tag['href'].presence }
end

def non_ads_urls
document.css(NON_ADS_RESULT_SELECTOR).filter_map { |a_tag| a_tag['href'].presence }
end

def total_link_count
document.css('a').count
end

def result_links
results = result_link_map(ads_top_urls, :ads_top)
results += result_link_map(non_ads_urls, :non_ads)

results
end

def result_link_map(urls, type)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

urls.map { |url| { url: url, link_type: type } }
end
end
end
2 changes: 1 addition & 1 deletion db/seeds.rb
Original file line number Diff line number Diff line change
Expand Up @@ -15,5 +15,5 @@
end

10.times do
Fabricate.times(100, :search_stat, user: user)
Fabricate.times(100, :search_stat_parsed_with_links, user: user)
end
10 changes: 6 additions & 4 deletions spec/fabricators/search_stat_fabricator.rb
Original file line number Diff line number Diff line change
@@ -1,8 +1,6 @@
# frozen_string_literal: true

demo_user = User.create(email: 'user@demo.com', password: 'Secret@11')

Fabricator(:search_stat) do
Fabricator(:search_stat, class_name: SearchStat) do
keyword { FFaker::Lorem.word }
ad_count { rand(1..10) }
link_count { rand(1..60) }
Expand All @@ -11,5 +9,9 @@
top_ad_count { rand(1..5) }
status { rand(1..3) }
raw_response { FFaker::HTMLIpsum.body }
user_id { demo_user.id }
user { Fabricate(:user) }
end

Fabricator(:search_stat_parsed_with_links, from: :search_stat) do
result_links(count: FFaker.rand(10) + 1)
end
63 changes: 63 additions & 0 deletions spec/fixtures/vcr/google_search/too_many_requests.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading