Crawler getting stuck (lots of "Still waiting to process downloaded pages..." msgs) #199

@Stanxy

After a while (about half an hour), Ache stops crawling and repeatedly logs "Still waiting to process downloaded pages..." messages. I checked the load on all CPUs with htop and found no busy worker.

I'm experimenting with Ache. I wrote my Ache config following the guide, and I'm using the config file at ./config/config__website_crawl/ache.yml. I changed only two properties:

target_storage.visited_page_limit: 50

crawler_manager.downloader.download_thread_pool_size: 4

I've played around with -XX:+UseG1GC and -Xmx4g to give the JVM enough capacity for my project. The crawler runs on a Unix server that limits each user to a maximum of 20 processes.
My JDK version:
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
JVM:
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)

So after about half an hour (I think; I only notice when I reattach to the screen session), I see lots of the messages pasted below, and the crawler seems to be stuck in an infinite loop.

[2021-04-11 04:03:02,565] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:07,613] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:12,661] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:17,708] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:22,757] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:27,805] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:32,853] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:37,901] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:42,948] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:47,997] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:53,045] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:03:58,092] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:03,139] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:08,187] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:13,236] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:18,284] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:23,332] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:28,379] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:33,426] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:38,474] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:43,522] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:48,568] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:53,616] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:04:58,664] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:03,711] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:08,759] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:13,806] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:18,854] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:23,902] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:28,949] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:33,997] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:39,043] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:44,091] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:49,139] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:05:54,186] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
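
To put a number on how long the crawler sits in this state, a small sketch like the following could summarize a saved copy of the log. It only relies on the line format shown above; the class name `StallSummary`, its method names, and passing the log path as the first argument are assumptions made for illustration and are not part of Ache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Ache): summarizes "Still waiting" log lines.
public final class StallSummary {
    // Matches lines such as:
    // [2021-04-11 04:03:02,565] INFO [AsyncCrawler] (...) - Still waiting to process downloaded pages...
    private static final Pattern STALL = Pattern.compile(
        "^\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})\\]"
        + ".*Still waiting to process downloaded pages");

    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Returns the timestamps of all stall messages, in input order.
    static List<LocalDateTime> stallTimes(List<String> lines) {
        List<LocalDateTime> times = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STALL.matcher(line);
            if (m.find()) {
                times.add(LocalDateTime.parse(m.group(1), TS));
            }
        }
        return times;
    }

    // Time between the first and the last stall message (zero if fewer than two).
    static Duration stallSpan(List<String> lines) {
        List<LocalDateTime> times = stallTimes(lines);
        if (times.size() < 2) {
            return Duration.ZERO;
        }
        return Duration.between(times.get(0), times.get(times.size() - 1));
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("stall messages: " + stallTimes(lines).size());
        System.out.println("stall span:     " + stallSpan(lines));
    }
}
```

Lines that do not start with a bracketed timestamp (for example the JVM warnings printed after Ctrl+C) are simply skipped, so the summary reflects only the downloader's own messages.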

I tried to use Ctrl+C to send SIGINT, but got an OutOfMemoryError:

^C^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2021-04-11 04:06:04,281] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2021-04-11 04:06:09,329] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...
[2021-04-11 04:06:14,376] INFO [AsyncCrawler] (HttpDownloader.java:232) - Still waiting to process downloaded pages...

Has anyone seen this before?

Thanks,

Stan
