hepcrawl: add crawler for OSTI by tsgit · Pull Request #276 · inspirehep/hepcrawl

tsgit · 2019-09-16T03:55:27Z

* use API at OSTI to harvest records associated with SLAC

Signed-off-by: Thorsten Schwander thorsten.schwander@gmail.com

Description

This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.

Related Issue

Motivation and Context

Checklist:

[x ] I have all the information that I need (if not, move to RFC and look for it).
I linked the related issue(s) in the corresponding commit logs.
[ x] I wrote good commit log messages.
My code follows the code style of this project.
I've added any new docs if API/utils methods were added.
I have updated the existing documentation accordingly.
[x ] I have added tests to cover my changes.
[ x] All new and existing tests passed.

michamos

didn't have time to look at everything before going home, I'll continue tomorrow, but here are already a few comments.

hepcrawl/parsers/osti.py

michamos · 2019-09-16T15:58:54Z

hepcrawl/parsers/osti.py

+        """
+        return [t[4:] if t.startswith(u'The ') else t for t in
+                [c.replace(u'Collaboration', '').strip() for c in
+                 self.record.get(u'contributing_org', '').split(u';')]]


The stripping of the and collaboration is already done in the builder (which calls https://github.com/inspirehep/inspire-schemas/blob/965302b1062f1fc10a046a1ab99fcd08084b0439/inspire_schemas/utils.py#L719 ), so no need to do it here. Splitting on ; could be added there too if needed.

agreed. _RE_AND could be augmented to split on ; in addition.
however on things like
'LUX Collaboration; Nuclear Science and Security Consortium'
the string should not be split on and

it's the only outlier I see in 3k records I just checked, though.

another problem I see with splitting at ; in general is that it might mess up HTML escapes like & Is there a step to clean those before splitting? are you going to augment schema utils?

There is definitely mixed content with ; and and as separator
The DES Collaboration; The LIGO Scientific Collaboration and the Virgo Collaboration

There is also repetition after splitting and removing things, e.g. in

LIGO Scientific Collaboration; Virgo Collaboration; Fermi GBM; INTEGRAL; IceCube Collaboration; AstroSat Cadmium Zinc Telluride Imager Team; IPN Collaboration; The Insight-Hxmt Collaboration; ANTARES Collaboration; The Swift Collaboration; AGILE Team; The 1M2H Team; The Dark Energy Camera GW-EM Collaboration; the DES Collaboration; The DLT40 Collaboration; LIGO Scientific Collaboration and Virgo Collaboration; The Insight-HXMT Collaboration; The Dark Energy Camera GW-EM Collaboration and the DES Collaboration

are the schema/utils deduping the list ?

another problem I see with splitting at ; in general is that it might mess up HTML escapes like & Is there a step to clean those before splitting? are you going to augment schema utils?

I don't think dealing with various text encodings and escaping schemes should be part of the schema utils honestly. The crawler should know what format it expects and convert to unescaped unicode.

are the schema/utils deduping the list ?

Not currently, but it could be added (there's utils for it in inspire_utils.dedupers).

hepcrawl/parsers/osti.py

michamos · 2019-09-16T16:18:00Z

hepcrawl/parsers/osti.py

+
+        author_re = re.compile(r"""
+        ^(?:(?P<surname>[\w.']+(?:\s*[\w.'-]+)*)(?:\s*,\s*
+        (?P<given_names>\w+(\s*[\w.'-]+)*))?\s*


I'm not sure you actually need to parse the name here. The builder already performs name normalization, so whatever name you throw at it should work. If it doesn't work correctly, it would probably be worthwile to improve name normalization in https://github.com/inspirehep/inspire-utils/blob/master/inspire_utils/name.py.

the incoming data is unreliable. there are unmatched ] or missing [
I could leave the name part alone and separate out [affiliations] and (ORCID:1234567890123456)
either way, a firstname(s) or initial(s), lastname(s) split on a comma appears to be the best to go by

hepcrawl/parsers/osti.py

tsgit · 2019-09-16T23:45:09Z

very good comments @michamos thanks

michamos

Some more comments. I think it's important to add tests for schema validation, because some of the things you're doing seem not to be valid.

tests/unit/test_osti_single.py

hepcrawl/spiders/osti_spider.py

michamos · 2019-09-17T08:12:17Z

hepcrawl/spiders/osti_spider.py

+        Returns:
+            str:
+        """
+        return self.__product_types


one _ is probably enough

the __ is intentional here, see e.g. https://www.python-course.eu/python3_properties.php

because I want to enforce checks on setting

michamos · 2019-09-17T08:12:52Z

hepcrawl/spiders/osti_spider.py

+        Returns:
+            str:
+        """
+        return self.__journal_types


hepcrawl/spiders/osti_spider.py

tests/unit/test_osti_single.py

hepcrawl/parsers/osti.py

michamos · 2019-09-17T08:18:38Z

hepcrawl/parsers/osti.py

+        """
+        return [t[4:] if t.startswith(u'The ') else t for t in
+                [c.replace(u'Collaboration', '').strip() for c in
+                 self.record.get(u'contributing_org', '').split(u';')]]


another problem I see with splitting at ; in general is that it might mess up HTML escapes like & Is there a step to clean those before splitting? are you going to augment schema utils?

I don't think dealing with various text encodings and escaping schemes should be part of the schema utils honestly. The crawler should know what format it expects and convert to unescaped unicode.

are the schema/utils deduping the list ?

Not currently, but it could be added (there's utils for it in inspire_utils.dedupers).

tsgit · 2019-09-18T04:07:05Z

right, I agree that schema_utils shouldn't deal with encoding issues -- which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding, it's random crap in the remote metadata -- so the crawler should understand the quirks of the source.

on the other hand you advocate for collaboration splitting and normalization in the utils, but then there is no deduping !?
if the input data has Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations then the collaborations end up replicated

So I think LiteratureBuilder should ensure deduping of lists like collection and collaborations among others. That's beyond this PR, though.

I don't feel strongly about __method vs. _method, but I did actually follow advice from some python coding resources online about encapsulation. The one I linked above isn't the one I used, but it's comparable, and I think it makes a decent argument. It'll always be a problem when encapsulation is enforced by naming convention and not by code, though.

* use API at OSTI to harvest records associated with SLAC Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>

Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>

tsgit requested a review from michamos September 16, 2019 03:55

tsgit force-pushed the osti4 branch from 0b9dfde to e10aff0 Compare September 16, 2019 15:00

michamos reviewed Sep 16, 2019

View reviewed changes

tsgit force-pushed the osti4 branch 2 times, most recently from 1a8705f to a7bbd97 Compare September 17, 2019 05:00

michamos reviewed Sep 17, 2019

View reviewed changes

tsgit force-pushed the osti4 branch from a7bbd97 to 449bf79 Compare September 17, 2019 19:17

tsgit force-pushed the osti4 branch from 449bf79 to 2ec5a3c Compare September 18, 2019 04:10

hepcrawl: add crawler for OSTI

9b0bdce

* use API at OSTI to harvest records associated with SLAC Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>

tsgit force-pushed the osti4 branch 3 times, most recently from 5b838e3 to 80d34e4 Compare October 22, 2019 23:02

docker/testing: pin py3 image

97414f0

Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>

tsgit force-pushed the osti4 branch from 80d34e4 to 97414f0 Compare October 23, 2019 15:40

Merge branch 'master' into osti4

d41c7f5

Conversation

tsgit commented Sep 16, 2019

Description

Related Issue

Motivation and Context

Checklist:

Uh oh!

michamos left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tsgit commented Sep 16, 2019

Uh oh!

michamos left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tsgit commented Sep 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tsgit commented Sep 18, 2019 •

edited

Loading