Consumer status is unpredictable when multiple topics are consumed #796

@ashi009

Description

```go
for topic, partitions := range topics {
	for partitionID, partition := range partitions {
		partitionStatus := evaluatePartitionStatus(partition, module.minimumComplete, module.allowedLag)
		partitionStatus.Topic = topic
		partitionStatus.Partition = int32(partitionID)
		partitionStatus.Owner = partition.Owner
		partitionStatus.ClientID = partition.ClientID

		if partitionStatus.Status > status.Status {
			// If the partition status is greater than StatusError, we just mark it as StatusError
			if partitionStatus.Status > protocol.StatusError {
				status.Status = protocol.StatusError
			} else {
				status.Status = partitionStatus.Status
			}
		}

		if (status.Maxlag == nil) || (partitionStatus.CurrentLag > status.Maxlag.CurrentLag) {
			status.Maxlag = partitionStatus
		}
		if partitionStatus.Complete == 1.0 {
			completePartitions++
		}
		status.Partitions[count] = partitionStatus
		count++
	}
}
```

This piece of code loops over a map of topics, and if the last topic's last partition reports OK, the consumer status will be OK.

Given that map iteration in Go is randomized, the consumer status is unpredictable.
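To illustrate the point, here is a minimal, hypothetical sketch (the names and constants are ours, not Burrow's) of an order-independent aggregation: taking the maximum partition status yields the same consumer status no matter which order Go happens to iterate the map in.

```go
package main

import "fmt"

// Hypothetical status constants, loosely mirroring Burrow's protocol package.
const (
	StatusOK    = 1
	StatusWarn  = 2
	StatusError = 3
)

// worstStatus aggregates per-partition statuses order-independently:
// the result is the maximum status seen, capped at StatusError, so it
// does not depend on Go's randomized map iteration order.
func worstStatus(partitions map[int32]int) int {
	worst := StatusOK
	for _, s := range partitions {
		if s > StatusError {
			s = StatusError
		}
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	partitions := map[int32]int{0: StatusOK, 1: StatusWarn, 2: StatusOK}
	fmt.Println(worstStatus(partitions)) // prints 2 (StatusWarn)
}
```

Any aggregation where the final value depends on which partition was visited last will instead flip between runs, which is the behavior described below.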

The following are real-world effects of this:

  1. The metric from Burrow for a consumer when scraping at a 2m interval: [screenshot]

  2. The metric from burrow-exporter, which queries Burrow at a 30s interval and is then scraped at a 2m interval: [screenshot]

The more frequently we query (Burrow uses a 30s cache expiration by default), the more likely we are to see a non-OK consumer status.
