Fix of requests propagation to Scheduler from middlewares and engine#276
sibiryakov wants to merge 5 commits into master
Conversation
Guys @ZipFile, @voith, @isra17, what do you think? Now the user can put the spider middleware anywhere in the chain, but requests have to be marked as seeds if they are to be enqueued from the Scrapy spider. Also, there are now clear rules about what passes to the in-memory queue in the scheduler. I think that makes things clearer.
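A rough sketch of the seed-marking rule described above. The `seed` meta key name and the `Request` stand-in are assumptions for illustration only; the real project uses `scrapy.Request` and whatever key the patch actually defines:

```python
# Illustrative sketch only: a spider marks its requests as seeds so
# the Frontera scheduler forwards them to the frontier. Untagged
# requests would instead land in the scheduler's local in-memory
# queue under the rules discussed in this PR.

class Request:
    """Minimal stand-in for scrapy.Request."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

def start_requests(urls):
    # Tag every request produced by the spider as a seed.
    for url in urls:
        yield Request(url, meta={"seed": True})

reqs = list(start_requests(["https://example.com"]))
print(reqs[0].meta["seed"])  # → True
```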
Codecov Report

@@            Coverage Diff             @@
##           master     #276      +/-  ##
==========================================
+ Coverage   70.18%   70.19%   +<.01%
==========================================
  Files          68       68
  Lines        4722     4730       +8
  Branches      633      634       +1
==========================================
+ Hits         3314     3320       +6
+ Misses       1270     1268       -2
- Partials      138      142       +4

Continue to review the full report at Codecov.
    def from_crawler(cls, crawler):
        return cls(crawler)

    def enqueue_request(self, request):
Would it be even clearer to tag requests that are enqueued into the frontier through process_spider_output -> links_extracted (storing the tag in meta, as you did for seeds), and add any other non-tagged request to the local queue? This way you wouldn't need to check for redirects, and other middlewares could themselves choose between local and remote scheduling (such as a middleware that logs in and retries the request without going through the frontier queue again).
I believe adding a middleware which bypasses the frontier and wants to get its requests into the local queue requires planning from the user operating the crawler. They would need to understand all the consequences of this. If we take your login example, the user would need to deal with the response that comes after a previously unseen login request. The Frontera custom scheduler will crash on this, because it lacks frontier_request in meta. Therefore the middleware which logs in would need to intercept this response or figure out something else.
Frontera already has an old mechanism for tagging requests:
https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/converters.py#L41
but at the moment it is not used.
I also see other cases: redirects, or a website that requires obtaining a session. In any of these cases I expect the user to look into the Frontera custom scheduler code and make appropriate changes. I see such customization as an advanced topic, requiring some expert knowledge.
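The routing rule under discussion could be sketched roughly like this. The class shape and the `seed`/`frontier_request` meta keys are assumptions for illustration, not the actual patch:

```python
from collections import deque

class Request:
    """Minimal stand-in for scrapy.Request."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class FronteraSchedulerSketch:
    """Illustrative sketch: route tagged requests to the frontier,
    everything else to the local in-memory queue."""
    def __init__(self):
        self.memory_queue = deque()   # served locally by Scrapy
        self.frontier_links = []      # forwarded to Frontera

    def enqueue_request(self, request):
        meta = request.meta
        # Seeds and frontier-created requests go to the frontier;
        # anything a middleware injects on its own stays local.
        if meta.get("seed") or "frontier_request" in meta:
            self.frontier_links.append(request)
        else:
            self.memory_queue.append(request)
        return True

s = FronteraSchedulerSketch()
s.enqueue_request(Request("https://example.com", {"seed": True}))
s.enqueue_request(Request("https://example.com/login"))
print(len(s.frontier_links), len(s.memory_queue))  # → 1 1
```

Under this split, a login middleware's retry request simply stays in the local queue, which is the trade-off debated above.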
Quite a mess with merge requests you have, I think. Code-wise it looks OK to me. I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?
sibiryakov force-pushed from 0281e2f to 31874dc
sibiryakov force-pushed from 31874dc to c7c3faa
@sibiryakov Code-wise it looks good to me. @isra17 has a good suggestion. I would like to know the pros and cons of your approach vs @isra17's suggestion.
Currently (the master version) every request coming from the engine or middlewares is transformed into a Frontera request.
@ZipFile see the comment to Israel above for suggestions of possible middlewares.
This has these goals: