If Graylog stops showing message streams, it could be an issue with the Elasticsearch indices.
The cluster health status will be reported as red in the web UI or via the API:
# curl -X GET "localhost:9200/_cluster/health?pretty"
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 97,
"active_shards" : 97,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.9795918367347
}
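To see which indices are affected you can list only the red ones (assuming Elasticsearch on the default localhost:9200 as in the commands above):
# curl -X GET "localhost:9200/_cat/indices?v&health=red"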
Check the shard status on the server with:
# curl -X GET "localhost:9200/_cat/shards?v" index shard prirep state docs store ip node ... myindex_154 2 p UNASSIGNED ...
Check the reason for the unassigned shard:
# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
"index" : "myindex_154",
"shard" : 2,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2022-01-12T13:15:29.713Z",
"last_allocation_status" : "no_valid_shard_copy"
},
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
"node_allocation_decisions" : [
{
"node_id" : "yusko5YtSRSs9QOtVjHutg",
"node_name" : "yusko5Y",
"transport_address" : "<some_public_ip>:9300",
"node_decision" : "no",
"store" : {
"in_sync" : true,
"allocation_id" : "Mr0BOW0RQqKdh-iTqDkjBw",
"store_exception" : {
"type" : "corrupt_index_exception",
"reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
"caused_by" : {
"type" : "i_o_exception",
"reason" : "failed engine (reason: [merge failed])",
"caused_by" : {
"type" : "corrupt_index_exception",
"reason" : "checksum failed (hardware problem?) : expected=6fb91e47 actual=9406d419 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/Qlh2XztFSIWu65B7nhmktQ/2/index/_3nhz.cfs\") [slice=_3nhz.fdt]))"
}
}
}
}
}
]
}
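If the explanation does not indicate corruption but only that allocation failed and gave up (e.g. reason ALLOCATION_FAILED), you can ask Elasticsearch to retry the allocation before resorting to deleting anything; this will not help with the corrupt shard copy shown above:
# curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"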
You could delete the index, but the messages stored in it will be lost!
curl -XDELETE 'localhost:9200/myindex_154'
Afterwards you might have to recalculate the index ranges in the Graylog web UI (System > Indices > index set > Maintenance > Recalculate index ranges) and/or manually rotate the active write index (System > Indices > index set > Maintenance > Rotate active write index).
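The index range recalculation can also be triggered through the Graylog REST API; a sketch, assuming the API is reachable on localhost:9000 and admin:$GLPASS credentials (both placeholders; the X-Requested-By header is required by Graylog for state-changing requests):
curl -u admin:$GLPASS -H "X-Requested-By: cli" -X POST "http://localhost:9000/api/system/indices/ranges/rebuild"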
If GET requests to Elasticsearch work but POST requests do not, the issue could be that Elasticsearch has put the indices into read-only mode. It does this when free disk space on the server gets low. In that case you'll get an error like:
{
"error" : {
"root_cause" : [
{
"type" : "cluster_block_exception",
"reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
}
],
"type" : "cluster_block_exception",
"reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
},
"status" : 403
}
If you are running in Docker you might only see a less useful log message like:
[2022-04-21T13:26:04,269][INFO ][o.e.c.r.a.DiskThresholdMonitor] [ddbAopn] low disk watermark [85%] exceeded on [ddbAopnMTL2VKLZs_zM6bQ][ddbAopn][/usr/share/elasticsearch/data/nodes/0] free: 117.5gb[12.9%], replicas will not be assigned to this node
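To confirm how much disk space Elasticsearch thinks each node has, check the allocation stats, which include used and free disk per node:
curl -X GET "localhost:9200/_cat/allocation?v"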
Free some disk space, for example by deleting an old index (see the howto for Graylog index management):
curl -X DELETE -u undefined:$ESPASS "localhost:9200/my-index?pretty"
and then remove the read-only block:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
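To verify the block is gone, check that the setting no longer appears in the index settings (empty output means it was removed):
curl -s "localhost:9200/_all/_settings?flat_settings=true&pretty" | grep read_only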
You can also change the watermark thresholds, e.g.:
curl -X PUT -u undefined:$ESPASS "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "100gb",
"cluster.routing.allocation.disk.watermark.high": "50gb",
"cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
"cluster.info.update.interval": "1m"
}
}'
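To verify the new values took effect, read the cluster settings back (add the same -u credentials as above if authentication is enabled):
curl -X GET "localhost:9200/_cluster/settings?pretty"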
Check the docs for more info.
If you get an error like:
"snapshot_missing_exception"
Delete the snapshot repository:
curl -X DELETE -u undefined:$ESPASS "localhost:9200/_snapshot/es_backup?pretty"
and try listing again.
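To list the registered snapshot repositories afterwards (again, add -u credentials if needed):
curl -X GET "localhost:9200/_snapshot?pretty"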
When trying to delete an index like
curl -XDELETE 'localhost:9200/.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001?pretty'
you get
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "index [.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001] is the write index for data stream [.logs-deprecation.elasticsearch-default] and cannot be deleted"
}
],
"type" : "illegal_argument_exception",
"reason" : "index [.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001] is the write index for data stream [.logs-deprecation.elasticsearch-default] and cannot be deleted"
},
"status" : 400
}
you need to roll the data stream over to a new write index first, e.g.
curl -s -X POST "localhost:9200/.logs-deprecation.elasticsearch-default/_rollover"
and then run the delete command again.
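You can check which backing index is now the write index of the data stream (typically the last backing index listed in the response) with:
curl -s -X GET "localhost:9200/_data_stream/.logs-deprecation.elasticsearch-default?pretty"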
This happened with an OpenSearch docker compose installation when trying:
curl -u admin:Antekante_1 -XGET "http://localhost:9200/_cluster/health?pretty"
With the security plugin enabled the request needs HTTPS and the certificate file in the command, but if you are just testing, the easiest fix is to disable SSL. Add the following line in docker-compose.yml
- plugins.security.ssl.http.enabled=false
and rerun
docker-compose up -d
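For context, the setting goes under the OpenSearch service's environment list in docker-compose.yml; a minimal sketch (the service name, image and the other environment entries are assumptions, keep whatever your compose file already has):
  opensearch:
    image: opensearchproject/opensearch:latest
    environment:
      - discovery.type=single-node
      - plugins.security.ssl.http.enabled=false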