[#21] [Backend] As a system, I can scrap data from Google search result page #43
base: develop
Changes from all commits
20b611e
10c5a6e
45c5304
4e60183
7ee3d01
f8e2338
c1669cf
9aa642f
3d70f82
8a96872
122de2f
5c5eaad
40d3740
8051e7c
bce3b3e
e584866
b1fde06
582a136
5860585
ef9224b
57afc7e
c190d21
7dda9bb
131259d
cf825ff
278be86
9691391
ced1a0e
6b8502e
8f40769
bd1df63
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -0,0 +1,31 @@ | ||||
| # frozen_string_literal: true | ||||
|
|
||||
| module Google | ||||
| class ClientServiceError < StandardError; end | ||||
|
|
||||
| class SearchKeywordJob < ApplicationJob | ||||
| queue_as :default | ||||
|
|
||||
| def perform(search_stat_id:) | ||||
| search_stat = SearchStat.find search_stat_id | ||||
|
Collaborator
What happens if the SearchStat is not found? 🤔 (Maybe it was deleted before the job ran)
Contributor
Author
Is this really a possibility 🤔? Should this case be handled?
Collaborator
Sure Mosharaf, we must handle this case. 🙏
Contributor
Author
Fixed in b1fde06
|
||||
| html_result = Google::ClientService.new(keyword: search_stat.keyword).call | ||||
| parsed_attributes = ParserService.new(html_response: html_result).call | ||||
|
|
||||
| update_search_stat(search_stat, parsed_attributes) | ||||
| rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid | ||||
| update_search_stat_status search_stat, :failed | ||||
| end | ||||
|
|
||||
| def update_search_stat(search_stat, attributes) | ||||
|
|
||||
| SearchStat.transaction do | ||||
| search_stat.result_links.create(attributes[:result_links]) | ||||
|
|
||||
| search_stat.update! attributes.except(:result_links) | ||||
|
Collaborator
It seems like there are some implicit values returned by the Service. Sometimes it can be |
||||
| end | ||||
| end | ||||
|
|
||||
| def update_search_stat_status(search_stat, status) | ||||
|
|
||||
| search_stat.update! status: status | ||||
| end | ||||
| end | ||||
| end | ||||
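A note on the "SearchStat not found" thread above: as the diff reads, if `SearchStat.find` raises, `search_stat` is still `nil` when the `rescue` branch runs, so `update_search_stat_status` would call `update!` on `nil` and raise `NoMethodError`. The following is a minimal sketch of one way to separate the two cases, using plain-Ruby stand-ins so it runs without Rails. `FakeSearchStat`, `RecordNotFound`, and `perform_search` are hypothetical names for illustration only, and this is not necessarily how b1fde06 addressed the issue.

```ruby
# Plain-Ruby stand-ins so the sketch runs without Rails; all names are hypothetical.
class RecordNotFound < StandardError; end

class FakeSearchStat
  STORE = {}

  attr_reader :id, :status

  # Mimics ActiveRecord's find: raises when the id is unknown.
  def self.find(id)
    STORE.fetch(id) { raise RecordNotFound, "Couldn't find SearchStat with id=#{id}" }
  end

  def initialize(id)
    @id = id
    @status = :pending
    STORE[id] = self
  end

  def update!(status:)
    @status = status
  end
end

# Mirrors the shape of SearchKeywordJob#perform. The missing-record case is
# rescued separately, so the failure branch never touches a nil record.
def perform_search(search_stat_id, &scrape)
  search_stat = FakeSearchStat.find(search_stat_id)
  scrape.call(search_stat)
  search_stat.update!(status: :completed)
  :completed
rescue RecordNotFound
  # The record is gone, so there is nothing to mark as failed; skip the job.
  :skipped
rescue ArgumentError
  # Safe navigation guards against a nil record in case the lookup itself failed.
  search_stat&.update!(status: :failed)
  :failed
end
```

In a real Active Job class, `discard_on ActiveRecord::RecordNotFound` is another common way to express the "skip" branch declaratively.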
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,36 @@ | ||||||||||||||
| # frozen_string_literal: true | ||||||||||||||
|
|
||||||||||||||
| module Google | ||||||||||||||
| class ClientService | ||||||||||||||
|
Collaborator
We need a more meaningful name for this service. At first glance, no one knows what does the
Suggested change
|
||||||||||||||
| USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '\ | ||||||||||||||
| 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' | ||||||||||||||
|
|
||||||||||||||
| BASE_SEARCH_URL = 'https://www.google.com/search' | ||||||||||||||
|
|
||||||||||||||
| SUCCESS_STATUS_CODE = '200' | ||||||||||||||
|
|
||||||||||||||
| def initialize(keyword:, lang: 'en') | ||||||||||||||
| @escaped_keyword = CGI.escape(keyword) | ||||||||||||||
| @uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}") | ||||||||||||||
| end | ||||||||||||||
|
|
||||||||||||||
| def call | ||||||||||||||
|
|
||||||||||||||
| result = HTTParty.get(@uri, { headers: { 'User-Agent' => USER_AGENT } }) | ||||||||||||||
|
Collaborator
It would be more readable if we extracted this into a separate method 😉
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| raise ClientServiceError unless valid_result? result | ||||||||||||||
|
|
||||||||||||||
| result | ||||||||||||||
| rescue HTTParty::Error, Timeout::Error, SocketError, ClientServiceError => e | ||||||||||||||
|
|
||||||||||||||
| Rails.logger.error "Error: Query Google with '#{@escaped_keyword}' thrown an error: #{e}" | ||||||||||||||
|
|
||||||||||||||
| raise ClientServiceError, 'Error fetching HTML result' | ||||||||||||||
| end | ||||||||||||||
|
|
||||||||||||||
| private | ||||||||||||||
|
|
||||||||||||||
| def valid_result?(result) | ||||||||||||||
|
|
||||||||||||||
| return false unless result | ||||||||||||||
| return true if result.response.code == SUCCESS_STATUS_CODE | ||||||||||||||
| end | ||||||||||||||
| end | ||||||||||||||
| end | ||||||||||||||
longnd marked this conversation as resolved.
|
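The reviewer's suggested change for extracting the request is not shown on this page, so the following is only an illustrative sketch of what such an extraction could look like. It assumes an injectable `http_get` callable in place of `HTTParty.get` so it can run without network access, builds the query with the standard library's `URI.encode_www_form`, and makes `valid_result?` return a strict boolean. `SearchPageFetcher`, `FetchError`, `FakeResponse`, and `fetch_page` are hypothetical names, not part of the PR.

```ruby
require 'uri'

# Hypothetical stand-in for the HTTP response; the PR uses HTTParty's response object.
FakeResponse = Struct.new(:code, :body)

class FetchError < StandardError; end

class SearchPageFetcher
  BASE_SEARCH_URL = 'https://www.google.com/search'
  SUCCESS_STATUS_CODE = '200'

  attr_reader :uri

  # http_get is any callable that takes a URI and returns an object
  # responding to #code and #body (an assumption made for this sketch).
  def initialize(keyword:, http_get:, lang: 'en')
    query = URI.encode_www_form(q: keyword, hl: lang, gl: lang)
    @uri = URI("#{BASE_SEARCH_URL}?#{query}")
    @http_get = http_get
  end

  def call
    result = fetch_page
    raise FetchError, 'Error fetching HTML result' unless valid_result?(result)

    result
  end

  private

  # The request lives in its own method, in the spirit of the review comment.
  def fetch_page
    @http_get.call(@uri)
  end

  # Returns true or false explicitly rather than an implicit nil.
  def valid_result?(result)
    !result.nil? && result.code == SUCCESS_STATUS_CODE
  end
end
```

Passing the HTTP call in as a dependency is a design choice made here purely to keep the example testable; the service in the PR calls `HTTParty.get` directly.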
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,94 @@ | ||||||||||
| # frozen_string_literal: true | ||||||||||
|
|
||||||||||
| module Google | ||||||||||
| class ParserService | ||||||||||
| NON_ADS_RESULT_SELECTOR = 'a[data-ved]:not([role]):not([jsaction]):not(.adwords):not(.footer-links)' | ||||||||||
| AD_CONTAINER_ID = 'tads' | ||||||||||
| ADWORDS_CLASS = 'adwords' | ||||||||||
|
Comment on lines +6 to +7
Collaborator
I see that if we use
Suggested change
Contributor
Author
ADWORDS_CLASS is being added for easier manipulation. Adding |
||||||||||
|
|
||||||||||
| def initialize(html_response:) | ||||||||||
| @html = html_response | ||||||||||
|
|
||||||||||
| @document = Nokogiri::HTML.parse(html_response) if html_response.body | ||||||||||
| end | ||||||||||
|
|
||||||||||
| # Parse html data and return a hash with the results | ||||||||||
| def call | ||||||||||
| return unless valid? | ||||||||||
|
|
||||||||||
| mark_adword_links | ||||||||||
| mark_footer_links | ||||||||||
|
|
||||||||||
| present_parsed_data | ||||||||||
| end | ||||||||||
|
|
||||||||||
| private | ||||||||||
|
|
||||||||||
| attr_reader :html, :document | ||||||||||
|
|
||||||||||
| def valid? | ||||||||||
| html.present? && document.present? | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def mark_adword_links | ||||||||||
| # Add a class to all AdWords link for easier manipulation | ||||||||||
| document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS) | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def mark_footer_links | ||||||||||
| # Mark footer links to identify them | ||||||||||
| document.css('#footcnt a').add_class('footer-links') | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def present_parsed_data | ||||||||||
| { | ||||||||||
| top_ad_count: ads_top_count, | ||||||||||
| ad_count: ads_page_count, | ||||||||||
| non_ad_count: non_ads_result_count, | ||||||||||
| total_result_count: total_link_count, | ||||||||||
| raw_response: html, | ||||||||||
| result_links: result_links, | ||||||||||
| status: :completed | ||||||||||
| } | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def ads_top_count | ||||||||||
| document.css("##{AD_CONTAINER_ID} .#{ADWORDS_CLASS}").count | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def ads_page_count | ||||||||||
| document.css(".#{ADWORDS_CLASS}").count | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def ads_top_urls | ||||||||||
| document.css("##{AD_CONTAINER_ID} .#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def ads_page_urls | ||||||||||
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | ||||||||||
|
Collaborator
We only need to return true/false.
Suggested change
Contributor
Author
Reverting back to
Collaborator
Sorry about that, we can use
Suggested change
|
||||||||||
| end | ||||||||||
|
|
||||||||||
| def non_ads_result_count | ||||||||||
| document.css(NON_ADS_RESULT_SELECTOR).count { |a_tag| a_tag['href'].presence } | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def non_ads_urls | ||||||||||
| document.css(NON_ADS_RESULT_SELECTOR).filter_map { |a_tag| a_tag['href'].presence } | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def total_link_count | ||||||||||
| document.css('a').count | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def result_links | ||||||||||
| results = result_link_map(ads_top_urls, :ads_top) | ||||||||||
| results += result_link_map(non_ads_urls, :non_ads) | ||||||||||
|
|
||||||||||
| results | ||||||||||
| end | ||||||||||
|
|
||||||||||
| def result_link_map(urls, type) | ||||||||||
|
|
||||||||||
| urls.map { |url| { url: url, link_type: type } } | ||||||||||
| end | ||||||||||
| end | ||||||||||
| end | ||||||||||
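To make the shape of `result_links` concrete, here is a small sketch of how the URL extraction and `result_link_map` combine. It uses plain hashes as stand-ins for Nokogiri `<a>` nodes (both support `node['href']`), and an explicit blank check in place of ActiveSupport's `presence`. `present_urls` and the sample URLs are hypothetical and only for illustration.

```ruby
# Hashes stand in for Nokogiri <a> nodes, which also support node['href'] lookups.
# filter_map (Ruby 2.7+) keeps only the truthy block results, dropping blank hrefs.
def present_urls(nodes)
  nodes.filter_map do |node|
    href = node['href']
    href unless href.nil? || href.strip.empty?
  end
end

# Same mapping as in the diff: tag every URL with its link type.
def result_link_map(urls, type)
  urls.map { |url| { url: url, link_type: type } }
end

top_ads = [{ 'href' => 'https://ads.example.com' }, { 'href' => '' }]
organic = [{ 'href' => 'https://example.org' }, {}]

result_links = result_link_map(present_urls(top_ads), :ads_top) +
               result_link_map(present_urls(organic), :non_ads)
```

The resulting array of hashes is the kind of value that `search_stat.result_links.create(attributes[:result_links])` receives in the job.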