Common Problems

How to diagnose and resolve common problems in Seldon Deploy.

Insufficient ephemeral storage in EKS clusters

When using eksctl, the volume size for each node defaults to 20GB. With large images this may not be enough. This is discussed at length in this thread on the eksctl repository.

When this happens, pods usually start to get evicted. If you run kubectl describe on any of these pods, you should see errors about insufficient ephemeral storage. You should also see DiskPressure events in the output of kubectl describe nodes.
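For instance, a quick check along these lines can confirm the problem (the pod and namespace names below are placeholders):

# Look for eviction messages mentioning ephemeral storage on a suspect pod
kubectl describe pod <pod-name> -n <namespace>

# Look for DiskPressure conditions and events on the nodes
kubectl describe nodes | grep -i -B2 -A2 DiskPressure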

To fix it, increase the available space. With eksctl, you can do so by tweaking the nodeGroups config and adding volumeSize and volumeType keys. For instance, to change the volume to 100GB you could do the following in your ClusterConfig spec:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

...

nodeGroups:
  - volumeSize: 100
    volumeType: gp2
    ...
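As a sketch, assuming the config above is saved as cluster.yaml (the file name is an assumption), it could be applied with eksctl. Note that an existing nodegroup keeps its original volume size, so for a running cluster a new nodegroup is typically needed:

# New cluster: create it with the larger node volumes
eksctl create cluster -f cluster.yaml

# Existing cluster: add a new nodegroup with the larger volumes
eksctl create nodegroup --config-file=cluster.yaml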

Elastic Queue Capacity

If request logging is used with high throughput, it is possible to hit a rejected execution of processing error in the logger, together with a message about queue capacity. To address this, thread_pool.write.queue_size needs to be increased. For example, with the Elasticsearch helm chart this could be:

esConfig:
  elasticsearch.yml: |
    thread_pool.write.queue_size: 2000
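As a sketch, assuming the snippet above is saved in a values file named elastic-values.yaml (the file, release and namespace names here are assumptions), it could be applied to the elastic Helm chart along these lines:

# Apply the larger write queue size via the chart's esConfig values
helm upgrade --install elasticsearch elastic/elasticsearch \
  -f elastic-values.yaml -n <namespace>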
