Conversation
* Create tfidf method for search engine
* Pylint fixes
* Add testing
* Fix linting
* Add numpy to requirements.txt
* Adapt to articles/keyword object return
* Change naming to agreed db convention
* Disable numpy import error

Co-authored-by: Domene99 <jadomene99@gmail.com>
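The TF-IDF scoring the first commit refers to can be sketched roughly as below. This is not the PR's actual code; the function name, the tokenized-document input, and the scoring details are all assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query terms with TF-IDF.

    `documents` maps doc_id -> list of tokens. A hypothetical sketch;
    the PR's real method signature may differ.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in documents.values():
        for term in set(tokens):
            df[term] += 1

    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term appears in no document
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores[doc_id] = score
    return scores
```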
Domene99
left a comment
Overall pretty impressive. I can't pinpoint where the code is becoming so slow; going from 1 to 10 minutes is more than just communication delay.
```python
if doc is not None:
    return doc.to_dict()
else:
    return None
```
Maybe we should return some info for debugging purposes
Something like your last PR where you added more error handling?
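In the spirit of that earlier error-handling PR, one option is to return a small diagnostic payload instead of a bare `None` on a miss. A sketch, assuming an `articles` collection, the `DocumentSnapshot.exists` check from the google-cloud-firestore client, and an error-dict shape invented here:

```python
def get_article_by_id_db(db, article_id):
    """Fetch an article document; on a miss, return debug info
    instead of a bare None.

    The 'articles' collection name and the error-dict shape are
    assumptions for illustration, not the PR's actual code.
    """
    doc = db.collection(u'articles').document(article_id).get()
    if doc.exists:
        return doc.to_dict()
    # Surface enough context for the caller to debug the lookup.
    return {"error": "document_not_found", "article_id": article_id}
```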
I don't know either, but it is probably not on my end, since it is the same code as the in-memory implementation.
I'd like to note that the PR is still in draft form, so there may be a lot of prints and comments in there; those are being addressed.
ssmall
left a comment
Some preliminary feedback to guide your implementation, hope it helps!
```python
from connector import get_article_by_id_db
from connector import get_articles_that_match_keywords_db
```
As @anniefu mentioned, you'll probably want to just have a db_connector.py that defines methods with the same names that you already use in the code, so that you can just change the import statement and not have to change any code.
```python
def save_keywords_in_db(keywords, article):
    """Saves the keywords from an article in memory

    Args:
        keywords (JSON): contains keywords
        article (Article): article object
    """
    for keyword in keywords:
        frequency = article["content"].count(keyword)

        doc_ref = db.collection(u'keywords').where('keyword', '==', keyword)
        doc = doc_ref.get()

        if len(doc) != 0 and doc[0] is not None:
            from_db = doc[0].to_dict()
            print(from_db)
            from_db["matching_articles"][article["id"]] = frequency
            #print(from_db)
            db.collection(u'keywords').document(doc[0].id).set(from_db)
        else:
            to_send = {"keyword": keyword, "matching_articles": {article["id"]: frequency}}
            db.collection(u'keywords').add(to_send)
```
It looks like you're trying to mirror the same structure that you had for the in-memory implementation, which is not necessarily the best implementation. For example, keywords could be a field of the article document that gets queried directly, which eliminates the need to store keywords as a separate collection. See this guide on how to perform different queries against Firestore, particularly the example for array membership.
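The array-membership approach that comment points at could look like the sketch below, using Firestore's `array_contains` operator so the keyword lookup queries the article documents directly. The `articles` collection name and `keywords` field name are assumptions.

```python
def find_articles_with_keyword(db, keyword):
    """Query articles whose 'keywords' array field contains `keyword`,
    removing the need for a separate keywords collection.

    `db` is assumed to be a google.cloud.firestore.Client; the
    collection and field names are illustrative.
    """
    query = db.collection(u'articles').where(u'keywords', u'array_contains', keyword)
    return [doc.to_dict() for doc in query.stream()]
```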
```diff
         self.number = number
         self.content = content
-        self.id = str(number)
+        self.id = 'monterrey'+str(number)
```
Could this ID be generated by Firebase instead?
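If so, the `add()` call on a collection already lets Firestore mint the document ID. A sketch, assuming an `articles` collection; in the google-cloud-firestore client, `add()` returns an `(update_time, DocumentReference)` pair:

```python
def add_article(db, article_data):
    """Create the document with a Firestore-generated ID instead of
    building one like 'monterrey' + str(number) by hand.

    The 'articles' collection name is an assumption.
    """
    _, doc_ref = db.collection(u'articles').add(article_data)
    return doc_ref.id  # the auto-generated document ID
```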
Started implementing the database without unit testing. It probably won't get merged, but here is the code.