Hi again :)
I was checking my own implementation of the factCC scoring you described in the paper against your data, and noticed that in 90 cases we derived different scores.
I suspect this is due to a difference in how we split summaries into individual sentences prior to classification and scoring.
How did you split summary sentences for factCC scoring?
(I use NLTK's sent_tokenize function.)
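
To illustrate what I mean, here is a minimal sketch of how the splitting choice alone can change a summary's score. It assumes the summary score is the fraction of sentences the classifier labels as consistent, which is my reading and may not match your setup. The two splitters, `toy_classifier`, and the example summary are all hypothetical stand-ins, not your code or the real factCC model.

```python
import re
from typing import Callable, List


def split_on_newlines(summary: str) -> List[str]:
    # Hypothetical splitter: treat each non-empty line as one sentence.
    return [line.strip() for line in summary.split("\n") if line.strip()]


def split_on_punctuation(summary: str) -> List[str]:
    # Hypothetical splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [part for part in parts if part]


def summary_score(
    summary: str,
    splitter: Callable[[str], List[str]],
    is_consistent: Callable[[str], bool],
) -> float:
    # Assumed aggregation: fraction of sentences labeled consistent.
    sentences = splitter(summary)
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if is_consistent(s)) / len(sentences)


def toy_classifier(sentence: str) -> bool:
    # Stand-in for the real classifier: flags any sentence containing "wrong".
    return "wrong" not in sentence


summary = "The cat sat. It was wrong about the dog.\nThe bird sang."

score_by_line = summary_score(summary, split_on_newlines, toy_classifier)
score_by_punct = summary_score(summary, split_on_punctuation, toy_classifier)
print(score_by_line, score_by_punct)
```

The same summary gets two sentences under one splitter and three under the other, so the scores differ even though the classifier is identical. In my case the splitter would be `nltk.sent_tokenize` in place of the two toy functions above.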