parsers: create an NLM parser #209
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Args:
    nlm_records (Union[string, scrapy.selector.Selector]): records
    source (Optional[string]): source passed to `__init__`
please document return value
hepcrawl/parsers/nlm.py
day = node.xpath('./Day/text()').extract_first()
month = node.xpath('./Month/text()').extract_first()
year = node.xpath('./Year/text()').extract_first()
return PartialDate(
It's better to use PartialDate.from_parts, which handles empty values and non-numeric months just fine:
In [1]: from inspire_utils.date import PartialDate
In [2]: PartialDate.from_parts(2017, 'Jan')
Out[2]: PartialDate(year=2017, month=1, day=None)
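To illustrate what "handles empty values and non-numeric months" means for the `<Year>`, `<Month>` and `<Day>` text extracted above, here is a minimal standard-library sketch. It is a stand-in for illustration only, not the `inspire_utils` implementation; `parse_date_parts` is a hypothetical helper, and the month lookup assumes the default English (C) locale.

```python
import calendar

# English month names and abbreviations mapped to month numbers.
# Assumption: the process runs under the default C/English locale.
_MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_abbr) if name}
_MONTHS.update({name.lower(): i for i, name in enumerate(calendar.month_name) if name})


def _to_int(value):
    """Return value as an int, or None when it is missing or not numeric."""
    text = str(value).strip() if value is not None else ''
    return int(text) if text.isdigit() else None


def parse_date_parts(year, month=None, day=None):
    """Hypothetical helper: normalise raw date text to (year, month, day)."""
    month_text = str(month).strip() if month is not None else ''
    # Accept both numeric months ('03') and names ('Jan', 'January').
    month_number = _to_int(month_text) or _MONTHS.get(month_text.lower())
    return _to_int(year), month_number, _to_int(day)
```

Missing parts simply come back as `None`, so the caller does not need to guard each `extract_first()` result separately.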
pub_type = self.root.xpath('./PublicationType/text()').extract_first()

if 'Conference' in pub_type or pub_type == 'Congresses':
    return 'proceedings'
I think this is conference paper rather than proceedings, but would need to look at some examples.
I got an example IOP update from @david-caro with a few records, but unfortunately none of them actually have <PublicationType> set. Maybe there will be more records once we get access, or maybe IOP doesn't use the field at all... Meanwhile I found a few examples at NLM, so I think that means you are right?
Looks like it. But I would not be surprised if IOP put its own values there anyway, that have nothing to do with those in the spec.
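Since the sample records in this thread have no `<PublicationType>` at all, `extract_first()` would return `None` and the `'Conference' in pub_type` check would raise a `TypeError`. A minimal sketch of a `None`-safe mapping that also follows the "conference paper" suggestion is shown below. `document_type_from` is a hypothetical helper, and the `'article'` default is an assumption, not something confirmed in this PR.

```python
def document_type_from(pub_type, default='article'):
    """Hypothetical helper: map an NLM PublicationType to a HEP document type.

    `default` is an assumption for records without a usable PublicationType.
    """
    if not pub_type:
        # Covers both a missing element (None) and an empty string.
        return default
    if 'Conference' in pub_type or pub_type == 'Congresses':
        return 'conference paper'
    return default
```

If IOP turns out to use its own values in this field, only the conditions inside the helper would need to change.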
authors = self.root.xpath('./AuthorList/Author')
authors_in_collaborations = self.root.xpath(
    './GroupList/Group'
    '[GroupName/text()=../../AuthorList/Author/CollectiveName/text()]'
what's the purpose of this?
<CollectiveName> inside the <Author> acts as a sort of pointer to the <Group> of the same name, where the actual people of the group are listed, like here: https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Can_Collaborator_Names_be. So this gets all the people from groups referenced in <Authors>. Though now that I think about it, maybe it is too strict: I don't think there is a use case for an "unreferenced" group?
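The pointer-like relationship described here can be sketched with the standard library. This is not the code from the PR: it uses `xml.etree.ElementTree` rather than Scrapy selectors, and because ElementTree's XPath subset cannot express the cross-referencing predicate, a plain Python set lookup replaces it. The sample XML, including the `<IndividualName>` layout, and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; names and structure are illustrative only.
SAMPLE = """
<Article>
  <AuthorList>
    <Author><CollectiveName>Example Collaboration</CollectiveName></Author>
  </AuthorList>
  <GroupList>
    <Group>
      <GroupName>Example Collaboration</GroupName>
      <IndividualName>
        <FirstName>Ada</FirstName><LastName>Lovelace</LastName>
      </IndividualName>
    </Group>
    <Group>
      <GroupName>Unreferenced Group</GroupName>
      <IndividualName>
        <FirstName>Alan</FirstName><LastName>Turing</LastName>
      </IndividualName>
    </Group>
  </GroupList>
</Article>
"""


def members_of_referenced_groups(article):
    """Return (first, last, group) for people in groups named by an Author."""
    referenced = {
        (name.text or '').strip()
        for name in article.findall('./AuthorList/Author/CollectiveName')
    }
    members = []
    for group in article.findall('./GroupList/Group'):
        group_name = (group.findtext('GroupName') or '').strip()
        if group_name not in referenced:
            continue  # skip groups that no <CollectiveName> points to
        for person in group.findall('IndividualName'):
            members.append(
                (person.findtext('FirstName'), person.findtext('LastName'), group_name)
            )
    return members


members = members_of_referenced_groups(ET.fromstring(SAMPLE))
```

Dropping the `if group_name not in referenced` check gives the less strict variant raised in the comment, where every `<Group>` contributes its members.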
return self.root.xpath('./Journal/Volume/text()').extract_first()

@property
def material(self):
PublicationType may also contain Published Erratum, which maps to erratum. Don't know how this relates to the NLM field you are reading here. Maybe you should link to https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Object_O, here or close to the NLM_OBJECT_TYPE_TO_HEP_MAP definition.
NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
'Republished': 'reprint' also
Check in PublicationType for `Published Erratum` too, if `<Object>` check didn't return any matches. Add references to NLM docs. Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
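The fallback described in this commit message can be sketched as a pure function over already-extracted values. This is an illustration, not the merged code: `material_from` and `NLM_PUBLICATION_TYPE_TO_HEP_MAP` are hypothetical names, and the object-type map contains only the entries mentioned in this review.

```python
# Entries taken from the review thread; the real map may contain more.
NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
    'Republished': 'reprint',
}

# Hypothetical name for the PublicationType fallback map.
NLM_PUBLICATION_TYPE_TO_HEP_MAP = {
    'Published Erratum': 'erratum',
}


def material_from(object_types, publication_types):
    """Return the HEP material, preferring <Object> types over PublicationType."""
    for object_type in object_types:
        if object_type in NLM_OBJECT_TYPE_TO_HEP_MAP:
            return NLM_OBJECT_TYPE_TO_HEP_MAP[object_type]
    # Only consult PublicationType when no <Object> type matched.
    for publication_type in publication_types:
        if publication_type in NLM_PUBLICATION_TYPE_TO_HEP_MAP:
            return NLM_PUBLICATION_TYPE_TO_HEP_MAP[publication_type]
    return None
```

Keeping the two lookups in separate maps makes the precedence explicit: an `<Object>` match always wins, and `Published Erratum` only matters when there is none.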
Description
This is an implementation of a parser for the NLM format. It takes a very similar approach to the JATS parser we already have, using LiteratureBuilder to build HEP records.
Related Issue
This is a step towards refreshing the IOP spider (#205)
Motivation and Context
IOP uses the NLM format to publish their citation records. Currently the IOP spider relies on web scraping; we will move to using OAI-PMH together with this parser instead.