Question about running extract_docs_from_index.py  #37

@yiyaxiaozhi

I tried to run extract_docs_from_index.py with the command below, using the pre-built index provided by Pyserini:
awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py lucene index-robust04-20191213/ > data/robust/documents.tsv

but I get an error:
[image: screenshot of the error]
I have not changed any code in the file.

My Java version is:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
Do I have the correct Java version?

Could you give some advice on this error?
Thanks a lot!


I indexed the Robust04 document files myself and then ran extract_docs_from_index.py successfully!
I then checked the documents.tsv file with the pandas package and found 73855 records. I don't know how many records there should be, and I would appreciate it if you could tell me the correct number!
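
One way to sanity-check the record count without knowing the reference number is to compare documents.tsv against the run files that fed the extraction. The sketch below is only an illustration under stated assumptions: the run files are in TREC format with the document ID in the third whitespace-separated column (which matches the `awk '{print $3}'` in the command above), documents.tsv holds one record per line with the document ID in the first tab-separated column, and the script writes each requested document once. None of these is confirmed by the repository, and the function names are hypothetical.

```python
import glob


def unique_run_doc_ids(lines):
    """Collect distinct doc IDs from the third whitespace-separated column."""
    doc_ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            doc_ids.add(fields[2])
    return doc_ids


def tsv_doc_ids(lines):
    """Collect doc IDs from the first tab-separated column (assumed layout)."""
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            ids.append(line.split("\t", 1)[0])
    return ids


def compare(run_ids, tsv_ids):
    """Summarise how the TSV rows line up with the doc IDs requested."""
    tsv_set = set(tsv_ids)
    return {
        "expected": len(run_ids),                   # distinct IDs in the runs
        "rows": len(tsv_ids),                       # non-empty TSV lines
        "missing": sorted(run_ids - tsv_set),       # requested but not extracted
        "duplicates": len(tsv_ids) - len(tsv_set),  # repeated IDs in the TSV
    }


def check_paths(run_glob, tsv_path):
    """Run the comparison on real files (the paths are supplied by the caller)."""
    run_ids = set()
    for path in glob.glob(run_glob):
        with open(path, encoding="utf-8") as f:
            run_ids |= unique_run_doc_ids(f)
    with open(tsv_path, encoding="utf-8") as f:
        return compare(run_ids, tsv_doc_ids(f))
```

Calling `check_paths("data/robust/*.run", "data/robust/documents.tsv")` would then show whether the 73855 rows match the number of distinct document IDs in the runs. When counting with pandas, note that `read_csv` treats the first line as a header unless `header=None` is passed, so the reported row count can be one lower than the number of lines in the file.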
