[#21] [Backend] As a system, I can scrap data from Google search result page#43
[#21] [Backend] As a system, I can scrap data from Google search result page#43mosharaf13 wants to merge 31 commits intodevelopfrom
Conversation
…ge' of github.com:mosharaf13/google-scrapper-ruby into backend/scrap-google-search-page
|
Code coverage is now at 0.00% (0/204 lines) Generated by 🚫 Danger |
sanG-github
left a comment
There was a problem hiding this comment.
Some early suggestions 😉 Please take a look at your code to re-format it, I saw some empty lines in the service. 🙇
Hope that is useful for you.
| class SearchKeywordJob < ApplicationJob | ||
| queue_as :default | ||
|
|
||
| def perform(search_stat_id) |
There was a problem hiding this comment.
We should better make the call explicit with the keyword arguments.
| def perform(search_stat_id) | |
| def perform(search_stat_id:) |
| end | ||
|
|
||
| def ads_page_urls | ||
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } |
There was a problem hiding this comment.
We only need to return true/false.
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | |
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].present? } |
There was a problem hiding this comment.
Reverting back to presence here 57afc7e as present? returns a collection of only true values, not actual urls
There was a problem hiding this comment.
Sorry about that, we can use filter with present?,
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | |
| document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? } |
| status { rand(1..3) } | ||
| raw_response { FFaker::HTMLIpsum.body } | ||
| user_id { demo_user.id } | ||
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } |
There was a problem hiding this comment.
Why don't we do it explicitly as previously? 💭
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } | |
| user { demo_user } |
There was a problem hiding this comment.
let's specify the class name explicitly
Fabricator(:search_stat, class_name: SearchStat) doThere was a problem hiding this comment.
why do you need to use a specific user for the search_stat, how about generating randomly, e.g.
Fabricator(:search_stat, class_name: SearchStat) do
...
user { Fabricate(:user) }
endwhen you need to use the fabricator with a specific user, e.g. in the Seed file, it can be done like this
john = User.find(1)
search_stat = Fabricate(:search_stat, user: john)There was a problem hiding this comment.
Sorry about that, I thought we were in the seeds file. How about using Fabricate instead of creating a new one manually?
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } | |
| user { Fabricate(:user) } |
|
|
||
| html_result = Google::ClientService.new(keyword: search_stat.keyword).call | ||
|
|
||
| raise ClientServiceError unless html_result |
There was a problem hiding this comment.
It does NOT make total sense as the returned data can be blank, right? 🙊
IMO, I think we can separate it into a method, and then use the begin/rescue to catch the errors.
There was a problem hiding this comment.
@mosharaf13 Don't know why the changes don't reflect here.
There was a problem hiding this comment.
I am not quite sure about that. This is how the job looks currently.
…per-ruby into backend/scrap-google-search-page
Co-authored-by: Sang Huynh Thanh <63148598+sanG-github@users.noreply.github.com>
sanG-github
left a comment
There was a problem hiding this comment.
Some suggestions, please update tests for the missing service.
| def perform(search_stat_id:) | ||
| search_stat = SearchStat.find search_stat_id | ||
| html_result = fetch_html_result(search_stat.keyword) | ||
| update_search_stat search_stat, ParserService.new(html_response: html_result).call |
There was a problem hiding this comment.
I think it's more explicit and readable.
First, we call the service and assign the returned data to the variable, then, we pass the variable to the update method.
| update_search_stat search_stat, ParserService.new(html_response: html_result).call | |
| parsed_attributes = ParserService.new(html_response: html_result).call | |
| update_search_stat(search_stat, parsed_attributes) |
| queue_as :default | ||
|
|
||
| def perform(search_stat_id:) | ||
| search_stat = SearchStat.find search_stat_id |
There was a problem hiding this comment.
What happens if the SearchStat is not found? 🤔 (Maybe it was deleted before the job ran)
There was a problem hiding this comment.
Is this really a possibility 🤔? Should this case be handled?
There was a problem hiding this comment.
Sure Mosharaf, we must handle this case. 🙏
| Rails.logger.error("Error while fetching HTML result: #{e.message}") | ||
| raise ClientServiceError, 'Error fetching HTML result' |
There was a problem hiding this comment.
We never reach that codes as we already rescue the errors in the ClientService and then return false.
Otherwise, if there is another error that is out of HTTParty::Error, Timeout::Error, SocketError, we can also catch it in the ClientService.
There was a problem hiding this comment.
So, should this section be removed?
There was a problem hiding this comment.
If there is another error that is out of HTTParty::Error, Timeout::Error, SocketError, we can also catch it in the ClientService.
We can remove it in this service, but please handle it in the ClientService alternately.
| # rubocop:disable Rails/SkipsModelValidations | ||
| search_stat.result_links.insert_all attributes[:result_links] | ||
| # rubocop:enable Rails/SkipsModelValidations |
There was a problem hiding this comment.
Why do we need to disable this rule, usually we find a way to work around this instead of disabling the rule each time we faced it.
There was a problem hiding this comment.
I think we should better use create instead of insert_all, then we won't miss any validations as we don't want to create a record with an url.
validates :url, presence: true| # Inspect Http response status code | ||
| # Any non 200 response code will be logged | ||
| def valid_result?(result) | ||
| return true if result&.response&.code == '200' |
There was a problem hiding this comment.
We can early return false if there is no result, so we don't need the safe navigator anymore. ✨
| return true if result&.response&.code == '200' | |
| return false unless result | |
| return true if result.response.code == SUCCESS_STATUS_CODE |
There was a problem hiding this comment.
Also, please define a constant for SUCCESS_STATUS_CODE = '200'
| def valid_result?(result) | ||
| return true if result&.response&.code == '200' | ||
|
|
||
| Rails.logger.warn "Warning: Query Google with '#{@escaped_keyword}' return status code #{result.response.code}" |
There was a problem hiding this comment.
It should not belongs to the valid_result? method, we only expect it returns the valid? property of the result, not to write the log. (We can move it to the call method)
| AD_CONTAINER_ID = 'tads' | ||
| ADWORDS_CLASS = 'adwords' |
There was a problem hiding this comment.
I see that if we use ID or CLASS, we need to manually add a corresponding prefix # or . for it.
So, what do you think about defining it as a SELECTOR as you have done with NON_ADS_RESULT_SELECTOR
| AD_CONTAINER_ID = 'tads' | |
| ADWORDS_CLASS = 'adwords' | |
| AD_CONTAINER_ID = '#tads' | |
| ADWORDS_CLASS = '.adwords' |
There was a problem hiding this comment.
# Add a class to all AdWords link for easier manipulation
document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)
ADWORDS_CLASS is being added for easier manipulation. Adding . to this constant would result in stripping . before using it in this line.
| raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank? | ||
|
|
||
| @html = html_response | ||
|
|
||
| @document = Nokogiri::HTML.parse(html_response) | ||
|
|
||
| # Add a class to all AdWords link for easier manipulation | ||
| document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS) | ||
|
|
||
| # Mark footer links to identify them | ||
| document.css('#footcnt a').add_class('footer-links') |
There was a problem hiding this comment.
In this scope, we should better not raise any errors or customize the document, just initialize some values that we need before moving on.
| raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank? | |
| @html = html_response | |
| @document = Nokogiri::HTML.parse(html_response) | |
| # Add a class to all AdWords link for easier manipulation | |
| document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS) | |
| # Mark footer links to identify them | |
| document.css('#footcnt a').add_class('footer-links') | |
| @html = html_response | |
| @document = Nokogiri::HTML.parse(html_response) if html_response.body |
| end | ||
|
|
||
| # Parse html data and return a hash with the results | ||
| def call |
There was a problem hiding this comment.
And then in this method, we will early return if it does not pass the valid? check.
| def call | |
| def call | |
| return unless valid? | |
def valid?
html.present? && document.present?
endThere was a problem hiding this comment.
7dda9bb. Did a bit of refactoring based on your suggestions. Please let me know if this covers everything.
There was a problem hiding this comment.
Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.
| status { rand(1..3) } | ||
| raw_response { FFaker::HTMLIpsum.body } | ||
| user_id { demo_user.id } | ||
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } |
There was a problem hiding this comment.
let's specify the class name explicitly
Fabricator(:search_stat, class_name: SearchStat) do| status { rand(1..3) } | ||
| raw_response { FFaker::HTMLIpsum.body } | ||
| user_id { demo_user.id } | ||
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } |
There was a problem hiding this comment.
why do you need to use a specific user for the search_stat, how about generating randomly, e.g.
Fabricator(:search_stat, class_name: SearchStat) do
...
user { Fabricate(:user) }
endwhen you need to use the fabricator with a specific user, e.g. in the Seed file, it can be done like this
john = User.find(1)
search_stat = Fabricate(:search_stat, user: john)
sanG-github
left a comment
There was a problem hiding this comment.
Please help to resolve all the remaining comments, and add the missing tests (client_service.rb)
| queue_as :default | ||
|
|
||
| def perform(search_stat_id:) | ||
| search_stat = SearchStat.find search_stat_id |
|
|
||
| html_result = Google::ClientService.new(keyword: search_stat.keyword).call | ||
|
|
||
| raise ClientServiceError unless html_result |
There was a problem hiding this comment.
@mosharaf13 Don't know why the changes don't reflect here.
|
|
||
| update_search_stat(search_stat, parsed_attributes) | ||
| rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError | ||
| update_keyword_status search_stat, :failed |
There was a problem hiding this comment.
Moreover, what happens if there is an error that is out of these errors? 🤔 Do we need to roll it back too?
There was a problem hiding this comment.
ced1a0e Refactored search_keyword_job and client_service as ClientServiceError exception wasn't handled properly.
We can try wrapping up parser_service call method in a rescue and handle exception here. In that case, too many params in rescue method.
current
rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid
rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid, ParserServiceError
What we keep this many params or should we create a generic exception class to handle these?
| end | ||
|
|
||
| def ads_page_urls | ||
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } |
There was a problem hiding this comment.
Sorry about that, we can use filter with present?,
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | |
| document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? } |
| end | ||
|
|
||
| # Parse html data and return a hash with the results | ||
| def call |
There was a problem hiding this comment.
Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.
lib/tasks/search_keyword.rake
Outdated
| # Schedule the SearchKeywordJob for background processing | ||
| Google::SearchKeywordJob.perform_later(search_stat_id: 1) | ||
|
|
||
| puts 'SearchKeywordJob scheduled successfully.' |
There was a problem hiding this comment.
Just curiosity why do need this rake task? 🤔
There was a problem hiding this comment.
I am trying to make present? work with filter. But I keep getting an error (attached).
Here, if I use filter, I have to chain another map method to return only 'href' values. Would it be worth it 😬 ?
Also, if I use select-pluck combo, would that be okay 😀 ?
Here, if present? doesn't provide significant performance improvement, wouldn't filter_map be with presence be better in that case? considering the conciseness 🤔
There was a problem hiding this comment.
Just curiosity why do need this rake task? 🤔
Removing 👍🏼
…f13/google-scrapper-ruby into backend/scrap-google-search-page
closes #21
What happened 👀
Insight 📝
Proof Of Work 📹
Testsuite run stat