{{tag>troubleshooting elasticsearch opensearch graylog}}

====== Elasticsearch/OpenSearch troubleshooting ======
If Graylog stops showing messages in its streams, the cause may be a problem with the indices.

===== checksum failed =====


The cluster health status is reported as **red** in the web UI and by the API:
<code>
# curl -X GET "localhost:9200/_cluster/health?pretty"

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 97,
  "active_shards" : 97,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.9795918367347
}
</code>
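
For scripted checks it helps to pull just the status out of the response. A minimal sketch, assuming a POSIX shell with ''sed''; the function name ''health_status'' is made up for this page:

<code bash>
# Print the value of the "status" field from a _cluster/health response on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}
</code>

Usage, against the same endpoint as above:

  curl -s "localhost:9200/_cluster/health?pretty" | health_status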

Check the status of the shards on the server with:

  # curl -X GET "localhost:9200/_cat/shards?v"
  
  index              shard prirep state         docs   store ip              node
  ...
  myindex_154        2     p      UNASSIGNED                                 
  ...
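
On a node with many indices the listing is long. The sketch below keeps only the unassigned shards; ''unassigned_shards'' is a hypothetical helper name, and the column positions are assumed to match the default ''_cat/shards?v'' layout shown above:

<code bash>
# Print index, shard number and primary/replica flag for every UNASSIGNED
# shard found in _cat/shards output read from stdin.
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}
</code>

Usage:

  curl -s "localhost:9200/_cat/shards?v" | unassigned_shards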

Check why the shard is unassigned:

<code>
# curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

{
  "index" : "myindex_154",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2022-01-12T13:15:29.713Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "yusko5YtSRSs9QOtVjHutg",
      "node_name" : "yusko5Y",
      "transport_address" : "<some_public_ip>:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "Mr0BOW0RQqKdh-iTqDkjBw",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum failed (hardware problem?) : expected=6fb91e47 actual=9406d419 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/Qlh2XztFSIWu65B7nhmktQ/2/index/_3nhz.cfs\") [slice=_3nhz.fdt]))"
            }
          }
        }
      }
    }
  ]
}
</code>
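
Called without a request body, the explain API picks an unassigned shard on its own. To ask about one specific shard, send its coordinates as JSON. A sketch, with ''explain_body'' as a made-up helper name and the values taken from the output above:

<code bash>
# Build the JSON request body for _cluster/allocation/explain.
# $1 = index name, $2 = shard number, $3 = true for a primary, false for a replica
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}\n' "$1" "$2" "$3"
}
</code>

Usage:

  curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d "$(explain_body myindex_154 2 true)"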

The simplest fix is to delete the affected index, but <wrap em>the messages stored in it will be lost!</wrap>
  curl -XDELETE 'localhost:9200/myindex_154'

Afterwards you might have to recalculate the index ranges (''System > Indices > index set > Maintenance > Recalculate index ranges'') and/or manually rotate the write index (''System > Indices > index set > Maintenance > Rotate active write index'').


===== Tested on =====
  * Graylog 3.3.16
  * Debian 9.13 Stretch

===== Unable to write to elasticsearch =====
''GET'' requests to Elasticsearch work but ''POST'' requests fail. Elasticsearch has probably put the indices into read-only mode, which it does when free disk space on the server runs low (the flood-stage watermark is exceeded). In that case write requests return this error:

<code>
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_block_exception",
        "reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
      }
    ],
    "type" : "cluster_block_exception",
    "reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
  },
  "status" : 403
}
</code>

If you are running in Docker, the container log might only show a less useful message like:
<code>
[2022-04-21T13:26:04,269][INFO ][o.e.c.r.a.DiskThresholdMonitor] [ddbAopn] low disk watermark [85%] exceeded on [ddbAopnMTL2VKLZs_zM6bQ][ddbAopn][/usr/share/elasticsearch/data/nodes/0] free: 117.5gb[12.9%], replicas will not be assigned to this node 
</code>
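
The ''85%'' in that log line is the low watermark. To see how close each node is to it, the ''_cat/allocation'' API reports disk usage per node. A sketch that lists the nodes at or above a threshold; ''nodes_over'' is a hypothetical helper, and it assumes node names without spaces:

<code bash>
# Print nodes whose disk usage is at or above a threshold.
# stdin: output of _cat/allocation?h=node,disk.percent
# $1:    threshold in percent (integer)
nodes_over() {
  awk -v limit="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 { print $1, $2 "%" }'
}
</code>

Usage:

  curl -s "localhost:9200/_cat/allocation?h=node,disk.percent" | nodes_over 85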


Free some disk space, for example by deleting an old index (see the how-to on [[wiki:graylog_troubleshooting#elasticsearch_nodes_disk_usage_above_low_watermark|Graylog index management]]):

  curl -X DELETE -u undefined:$ESPASS "localhost:9200/my-index?pretty"

Then remove the read-only block from all indices:
  curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
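
To confirm that the block is gone, the same setting can be read back. A sketch, assuming the settings API accepts the setting name as a filter in your version; ''still_blocked'' is a made-up helper that counts how many indices still report the block:

<code bash>
# Count occurrences of read_only_allow_delete = "true" in a settings response on stdin.
still_blocked() {
  awk '{ n += gsub(/"read_only_allow_delete"[ \t]*:[ \t]*"true"/, "&") } END { print n + 0 }'
}
</code>

Usage:

  curl -s "localhost:9200/_all/_settings/index.blocks.read_only_allow_delete?pretty" | still_blocked

A count above zero shortly after clearing the block usually means the disk is still too full and Elasticsearch has set the block again.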

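You can also change the watermark thresholds. Absolute values such as ''100gb'' are amounts of free space, so the ''low'' value is the largest and ''flood_stage'' the smallest, e.g.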


  curl -X PUT -u undefined:$ESPASS "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
  {
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "100gb",
      "cluster.routing.allocation.disk.watermark.high": "50gb",
      "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
      "cluster.info.update.interval": "1m"
    }
  }'

See the [[https://www.elastic.co/guide/en/elasticsearch/reference/6.2/disk-allocator.html|disk-based shard allocation docs]] for more info.
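
With absolute values the order ''low'' > ''high'' > ''flood_stage'' is easy to get wrong. A small sanity check to run before sending the request, assuming whole gigabytes; ''watermarks_ordered'' is a hypothetical helper:

<code bash>
# Exit 0 when low > high > flood_stage, all given as free space in GB (integers).
# $1 = low, $2 = high, $3 = flood_stage
watermarks_ordered() {
  [ "$1" -gt "$2" ] && [ "$2" -gt "$3" ]
}
</code>

Usage:

  watermarks_ordered 100 50 10 && echo "ok to apply"

Note that ''transient'' settings do not survive a full cluster restart; put the same keys under ''persistent'' to keep them.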

===== snapshot missing exception =====
If you get an error like:

  "snapshot_missing_exception"

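Delete the snapshot repository. This only unregisters the repository in Elasticsearch; the snapshot files in it are left in place: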

  curl -X DELETE -u undefined:$ESPASS "localhost:9200/_snapshot/es_backup?pretty"

and try listing again.

===== index ... is the write index for the datastream =====
When you try to delete the backing index of a data stream, e.g.

  curl -XDELETE 'localhost:9200/.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001?pretty'

you get:

<code>
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "index [.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001] is the write index for data stream [.logs-deprecation.elasticsearch-default] and cannot be deleted"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "index [.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001] is the write index for data stream [.logs-deprecation.elasticsearch-default] and cannot be deleted"
  },
  "status" : 400
}

</code>

You first need to roll the data stream over to a new write index, e.g.

  curl -s -X POST "localhost:9200/.logs-deprecation.elasticsearch-default/_rollover"

and then run the delete command again.
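
The data stream name needed for the rollover can be derived from the backing index name. A sketch, assuming the naming seen in the error above: a ''.ds-'' prefix, the data stream name, a date and a six-digit generation (other versions may name backing indices differently). ''ds_name_from_index'' is a made-up helper:

<code bash>
# Print the data stream name for a backing index name given as $1.
# Strips the leading ".ds-" and the trailing "-yyyy.MM.dd-NNNNNN".
ds_name_from_index() {
  printf '%s\n' "$1" | sed -E 's/^\.ds-//; s/-[0-9]{4}\.[0-9]{2}\.[0-9]{2}-[0-9]{6}$//'
}
</code>

Usage:

  idx='.ds-.logs-deprecation.elasticsearch-default-2022.11.15-000001'
  curl -s -X POST "localhost:9200/$(ds_name_from_index "$idx")/_rollover"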

===== curl (52) empty reply from server =====
This happened with an OpenSearch Docker Compose installation when running:

  curl -u admin:Antekante_1 -XGET "http://localhost:9200/_cluster/health?pretty"
  
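The security plugin serves the REST API over HTTPS, so a plain ''http:%%//%%'' request gets an empty reply. Either call the ''https:%%//%%'' URL and pass the certificate file with ''--cacert'' (or ''-k'' to skip verification), or, if you are only testing, disable SSL on the HTTP layer. To do that, add the following line to the environment of the OpenSearch service in ''docker-compose.yml'':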
<code>
  - plugins.security.ssl.http.enabled=false
</code> 

and rerun:
  docker-compose up -d

===== Tested on =====
  * Debian 10
  * Elasticsearch Docker container version 6.8.16

====== See also ======
  * [[wiki:graylog_troubleshooting|Graylog troubleshooting]]
  * [[wiki:elasticsearch_commands|Elasticsearch commands]]
  * [[wiki:kibana_troubleshooting|Kibana troubleshooting]]

====== References ======
  * https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/
  * https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html#reason-unassigned
  * https://discuss.elastic.co/t/restore-of-elasticsearch-data-fails-with-corruptindexexception-checksum-failed-hardware-problem/261619/3
  * https://stackoverflow.com/questions/50609417/elasticsearch-error-cluster-block-exception-forbidden-12-index-read-only-all
  * https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html