michamos
left a comment
didn't have time to look at everything before going home, I'll continue tomorrow, but here are already a few comments.
hepcrawl/parsers/osti.py
Outdated
    """
    return [t[4:] if t.startswith(u'The ') else t for t in
            [c.replace(u'Collaboration', '').strip() for c in
             self.record.get(u'contributing_org', '').split(u';')]]
There was a problem hiding this comment.
The stripping of "The" and "Collaboration" is already done in the builder (which calls https://github.com/inspirehep/inspire-schemas/blob/965302b1062f1fc10a046a1ab99fcd08084b0439/inspire_schemas/utils.py#L719 ), so there is no need to do it here. Splitting on ; could be added there too if needed.
There was a problem hiding this comment.
agreed. _RE_AND could be augmented to split on ; in addition.
however, in things like
'LUX Collaboration; Nuclear Science and Security Consortium'
the string should not be split on "and".
it's the only outlier I see in the 3k records I just checked, though.
another problem I see with splitting at ; in general is that it might mess up HTML escapes like &amp;. Is there a step to clean those before splitting? are you going to augment schema utils?
There was a problem hiding this comment.
There is definitely mixed content, with both ; and "and" as separators:
The DES Collaboration; The LIGO Scientific Collaboration and the Virgo Collaboration
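A minimal sketch of the splitting discussed in this thread, not the `inspire_schemas` implementation. It assumes that ; always separates entries and that "and" separates entries only when every resulting part still mentions "Collaboration". The function name `split_collaborations` and that heuristic are hypothetical.

```python
import re

# Hypothetical heuristic: "and" is treated as a separator only if every
# resulting part still mentions "Collaboration".
_AND = re.compile(r'\s+and\s+', re.IGNORECASE)


def split_collaborations(raw):
    """Split a contributing_org string into individual entries."""
    entries = []
    for chunk in raw.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = [p.strip() for p in _AND.split(chunk)]
        if len(parts) > 1 and all('collaboration' in p.lower() for p in parts):
            entries.extend(parts)
        else:
            # Keep names such as "Nuclear Science and Security Consortium" whole.
            entries.append(chunk)
    return entries
```

With this heuristic the LUX example stays at two entries, while the DES/LIGO/Virgo example yields three. Names that contain "and" next to the word "Collaboration" on both sides would still be split, so the rule would need checking against real records.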
There was a problem hiding this comment.
There is also repetition after splitting and removing things, e.g. in
LIGO Scientific Collaboration; Virgo Collaboration; Fermi GBM; INTEGRAL; IceCube Collaboration; AstroSat Cadmium Zinc Telluride Imager Team; IPN Collaboration; The Insight-Hxmt Collaboration; ANTARES Collaboration; The Swift Collaboration; AGILE Team; The 1M2H Team; The Dark Energy Camera GW-EM Collaboration; the DES Collaboration; The DLT40 Collaboration; LIGO Scientific Collaboration and Virgo Collaboration; The Insight-HXMT Collaboration; The Dark Energy Camera GW-EM Collaboration and the DES Collaboration
are the schema/utils deduping the list ?
There was a problem hiding this comment.
> another problem I see with splitting at ; in general is that it might mess up HTML escapes like &amp;. Is there a step to clean those before splitting? are you going to augment schema utils?
I don't think dealing with various text encodings and escaping schemes should be part of the schema utils, honestly. The crawler should know what format it expects and convert to unescaped unicode.
> are the schema/utils deduping the list ?
Not currently, but it could be added (there are utils for it in inspire_utils.dedupers).
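A rough sketch of what "convert to unescaped unicode, then split, then dedupe" could look like on the crawler side. It assumes Python 3 (for `html.unescape`) and uses a plain case-insensitive, order-preserving dedupe as a stand-in. It does not call `inspire_utils.dedupers`, and the function name `clean_and_dedupe` is hypothetical.

```python
import html


def clean_and_dedupe(raw):
    """Unescape HTML entities, split on ';', drop case-insensitive repeats."""
    # Unescape first so the ';' that ends an entity such as "&amp;"
    # is gone before the string is split.
    text = html.unescape(raw)
    seen = set()
    result = []
    for entry in text.split(';'):
        entry = entry.strip()
        key = entry.casefold()
        if entry and key not in seen:
            seen.add(key)
            result.append(entry)  # first spelling encountered wins
    return result
```

Comparing on a case-folded key treats "The Insight-Hxmt Collaboration" and "The Insight-HXMT Collaboration" as the same entry. Whether that is the desired behaviour is a design choice for the builder or the crawler.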
    author_re = re.compile(r"""
        ^(?:(?P<surname>[\w.']+(?:\s*[\w.'-]+)*)(?:\s*,\s*
        (?P<given_names>\w+(\s*[\w.'-]+)*))?\s*
There was a problem hiding this comment.
I'm not sure you actually need to parse the name here. The builder already performs name normalization, so whatever name you throw at it should work. If it doesn't work correctly, it would probably be worthwhile to improve name normalization in https://github.com/inspirehep/inspire-utils/blob/master/inspire_utils/name.py.
There was a problem hiding this comment.
the incoming data is unreliable. there are unmatched ] or missing [
I could leave the name part alone and separate out [affiliations] and (ORCID:1234567890123456)
either way, a firstname(s) or initial(s), lastname(s) split on a comma appears to be the best to go by
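One way the "leave the name alone, separate out [affiliations] and (ORCID:...)" idea might be sketched. The regexes, the function name `split_author`, and the sample strings are assumptions for illustration, not the parser from this pull request. The name is returned untouched so that the builder's normalization can handle it.

```python
import re

# 16 characters without hyphens, as in "(ORCID:1234567890123456)";
# the last character may be a checksum "X".
_ORCID = re.compile(r'\(ORCID:\s*([0-9]{15}[0-9Xx])\)')
# Tolerate a missing closing bracket: match up to "]" or end of string.
_AFFILIATION = re.compile(r'\[([^\]\[]*)(?:\]|$)')


def split_author(raw):
    """Return (name, affiliations, orcid) from one raw author string."""
    orcid_match = _ORCID.search(raw)
    orcid = orcid_match.group(1) if orcid_match else None
    rest = _ORCID.sub('', raw)
    affiliations = [a.strip() for a in _AFFILIATION.findall(rest) if a.strip()]
    name = _AFFILIATION.sub('', rest)
    # A stray unmatched "]" is dropped; the text before it stays in the name.
    name = ' '.join(name.replace(']', '').split())
    return name, affiliations, orcid


# Hypothetical input in the shape described above.
print(split_author('Smith, John [SLAC] (ORCID:1234567890123456)'))
```

A missing opening bracket is ambiguous (there is no way to tell where the affiliation starts), so this sketch does not try to recover it.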
very good comments @michamos thanks
1a8705f to a7bbd97
michamos
left a comment
Some more comments. I think it's important to add tests for schema validation, because some of the things you're doing seem not to be valid.
    Returns:
        str:
    """
    return self.__product_types
There was a problem hiding this comment.
the __ is intentional here, see e.g. https://www.python-course.eu/python3_properties.php
because I want to enforce checks on setting
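For readers unfamiliar with the pattern being defended here, a minimal sketch of a property whose setter validates its input, backed by a double-underscore (name-mangled) attribute. The class name `OstiRecord` and the allowed values are hypothetical and are not taken from the pull request.

```python
class OstiRecord(object):  # hypothetical class name
    ALLOWED = ('Journal Article', 'Technical Report')  # assumed values

    def __init__(self, product_types=()):
        self.product_types = product_types  # goes through the setter below

    @property
    def product_types(self):
        return self.__product_types

    @product_types.setter
    def product_types(self, values):
        bad = [v for v in values if v not in self.ALLOWED]
        if bad:
            raise ValueError('unsupported product types: %r' % (bad,))
        # Stored as _OstiRecord__product_types because of name mangling.
        self.__product_types = list(values)
```

Name mangling only makes accidental access from outside the class harder; it does not prevent it. The validation itself comes from the setter, so a single leading underscore would enforce the same checks.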
    Returns:
        str:
    """
    return self.__journal_types
hepcrawl/parsers/osti.py
Outdated
    """
    return [t[4:] if t.startswith(u'The ') else t for t in
            [c.replace(u'Collaboration', '').strip() for c in
             self.record.get(u'contributing_org', '').split(u';')]]
right, I agree that schema_utils shouldn't deal with encoding issues, which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding; it's random crap in the remote metadata, so the crawler should understand the quirks of the source. On the other hand, you advocate for collaboration splitting and normalization in the utils, but then there is no deduping!? So I think LiteratureBuilder should ensure deduping of lists like I don't feel strongly about
* use API at OSTI to harvest records associated with SLAC
Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>
5b838e3 to 80d34e4
Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>
Signed-off-by: Thorsten Schwander thorsten.schwander@gmail.com
Description
This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.