Incomplete commit to Elasticsearch #3

@wolverline

@essiembre Hi Pascal, I have run into a couple of issues with the Elasticsearch Committer. First, the commit count is way off, as seen in the log segment below. The site was crawled previously and committed to the file system. Later I added the Elasticsearch Committer and ran the crawl again after removing the output folder.

  • How does the reference count relate to the actual commit count? Shouldn't they match?
  • It seems the previous crawl was cached somewhere. How can I clear it?

big-site: 2018-02-07 03:39:29 INFO - big-site: Crawler finishing: committing documents.
big-site: 2018-02-07 03:39:29 INFO - Committing 181 files
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 31 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Elasticsearch RestClient closed.
big-site: 2018-02-07 03:39:29 INFO - big-site: 10195 reference(s) processed.
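
To make the mismatch concrete, here is a small sketch that tallies the numbers in the log segment above. It is plain Python written for this report, not part of Norconex; the `summarize` helper and its regular expressions are my own and assume the log line wording shown here.

```python
import re

# Relevant lines copied from the log segment above.
LOG = """\
big-site: 2018-02-07 03:39:29 INFO - Committing 181 files
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 31 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - big-site: 10195 reference(s) processed.
"""

def summarize(log):
    """Hypothetical helper: pull the three counts out of the log text."""
    # Sum every "Sending N commit operations" batch.
    sent = sum(int(n) for n in re.findall(r"Sending (\d+) commit operations", log))
    # Number of queued files reported at the final commit.
    queued = int(re.search(r"Committing (\d+) files", log).group(1))
    # Number of references the crawler reports as processed.
    refs = int(re.search(r"(\d+) reference\(s\) processed", log).group(1))
    return {"queued": queued, "sent": sent, "references": refs}

print(summarize(LOG))
# {'queued': 181, 'sent': 181, 'references': 10195}
```

So within this segment the operations sent add up to the 181 queued files, while the crawler reports 10195 references processed. That gap is what I am asking about.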

The committer config:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>somewhere in the jungle</nodes>
  <indexName>big-site-index</indexName>
  <queueDir>$workdir/commit</queueDir>
  <connectionTimeout>5 minutes</connectionTimeout>
  <socketTimeout>5 minutes</socketTimeout>
  <typeName>Documents</typeName>
  <commitBatchSize>50</commitBatchSize>
  <maxRetries>1</maxRetries>
</committer>
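
For reference, the batch sizes in the log are what I would expect from `commitBatchSize` set to 50 with 181 queued files. A minimal sketch of that arithmetic, assuming the queue is simply split into consecutive batches (the `batch_sizes` function is hypothetical and is not how the committer is necessarily implemented):

```python
def batch_sizes(total, batch_size):
    """Split `total` queued documents into consecutive batches of at most `batch_size`."""
    full, rest = divmod(total, batch_size)
    # Full batches first, then one smaller batch for any remainder.
    return [batch_size] * full + ([rest] if rest else [])

print(batch_sizes(181, 50))
# [50, 50, 50, 31]
```

So the batching itself looks fine; the open question is why only 181 files were queued in the first place.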
