Stopwords being ignored #70

@chaturv3di

I am passing a set of English stopwords, built from yake/StopwordsList/stopwords_en.txt, to yake.KeywordExtractor.

text = "YAKE! is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, neither it depends on dictionaries, external-corpus, size of the text, language or domain. To demonstrate the merits and the significance of our proposal, we compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank), and one supervised method (KEA). Experimental results carried out on top of twenty datasets (see Benchmark section below) show that our methods significantly outperform state-of-the-art methods under a number of collections of different sizes, languages or domains. In addition to the python package here described, we also make available a demo, an API and a mobile app."

import os

import yake

language = "en"
max_ngram_size = 5
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 5

# Location of the file downloaded from https://github.com/LIAAD/yake/blob/master/yake/StopwordsList/stopwords_en.txt
# (home_dir is assumed to be defined earlier in the script)
stopwords_file = os.path.join(home_dir, "data_txt", "yake_stopwords_en.txt")
with open(stopwords_file, 'r') as sw_f:
    # One stopword per line; skip blank lines so "" does not end up in the set
    yake_stopwords = {line.strip().lower() for line in sw_f if line.strip()}

yake_kw_extractor = yake.KeywordExtractor(lan=language,
                                          n=max_ngram_size,
                                          dedupLim=deduplication_threshold,
                                          dedupFunc=deduplication_algo,
                                          windowsSize=windowSize,
                                          top=numOfKeywords,
                                          features=None,
                                          stopwords=yake_stopwords)

yake_kw_extractor.extract_keywords(text)

The results still contain stopwords such as of, a, and from:

[('trained on a particular set', -60.326928913747196),
 ('keywords of a text', -0.665864990295941),
 ('important keywords of a text', -0.31206738772455755),
 ('light-weight unsupervised automatic keyword extraction', 0.00029233948201177757),
 ('statistical features extracted from single', 0.0008477866813335354)]
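To make the observation concrete, here is a minimal sketch (plain Python, not part of YAKE) that reports where stopwords occur inside each returned keyword. The helper names `stopword_positions` and `has_boundary_stopword` are hypothetical, and `sample_stopwords` is only the three words mentioned above, not the full stopwords_en.txt list.

```python
# Illustrative subset: only the stopwords named in this report,
# not the full contents of stopwords_en.txt.
sample_stopwords = {"of", "a", "from"}

# Keyword strings copied from the output above (scores omitted).
keywords = [
    "trained on a particular set",
    "keywords of a text",
    "important keywords of a text",
    "light-weight unsupervised automatic keyword extraction",
    "statistical features extracted from single",
]


def stopword_positions(keyword, stopwords):
    """Return (index, token) pairs for whitespace tokens that are stopwords."""
    tokens = keyword.lower().split()
    return [(i, tok) for i, tok in enumerate(tokens) if tok in stopwords]


def has_boundary_stopword(keyword, stopwords):
    """True if the first or last token of the keyword is a stopword."""
    tokens = keyword.lower().split()
    return bool(tokens) and (tokens[0] in stopwords or tokens[-1] in stopwords)


for kw in keywords:
    print(kw,
          stopword_positions(kw, sample_stopwords),
          has_boundary_stopword(kw, sample_stopwords))
```

With this subset, every stopword hit falls at an interior position and none of the five keywords starts or ends with a stopword. That might mean the list is only applied at n-gram boundaries, but I have not confirmed this in the YAKE source.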

If I construct the extractor with stopwords=None instead, the results don't change. Am I doing something silly here?

Thanks a lot.
