Heartbeat failure over and over again (failover does not help since they die both at the same time)

Discussion:

Brecht De Rooms

2017-08-17 08:40:14 UTC

Hi,

I am trying to fix a problem with heartbeat failures. We are running on AWS
with DynamoDB.
We have set things up so that the old machine terminates and a new
VM starts whenever the Java process fails.
As you can see in the CloudWatch graph below, we have quite a few restarts,
and they seem to happen at regular intervals.

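<https://lh3.googleusercontent.com/-Sbx_tEV6wF4/WZVQPHUidnI/AAAAAAAADcM/TgSGPYpvoO070NXIOY9JJ9nShgXbwk7DACLcBGAs/s1600/heartbeatfailure.png>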
We do see some read throttling in DynamoDB (see image below), but
it seems to happen *after* the restart, which makes sense since I assume the
Datomic peer cache is invalidated (does that actually happen on a
transactor restart?).
There are no write throttles whatsoever. We tried increasing the DynamoDB
limits, but this did not change anything.

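<https://lh3.googleusercontent.com/-_vb82sUjyfI/WZVUh_vrMiI/AAAAAAAADcY/YU1leIfJ6FssOsgCtvLa6v2k53Ylk9K8QCLcBGAs/s1600/RESTART.png>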

In the logs we always see the same errors coming back

2017-08-17 05:40:18.942 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 0, :attempts 0, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.006 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 50, :attempts 1, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.119 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 100, :attempts 2, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.283 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 150, :attempts 3, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.497 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 200, :attempts 4, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.761 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 250, :attempts 5, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:20.075 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 300, :attempts 6, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:20.438 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 350, :attempts 7, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:20.852 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 400, :attempts 8, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:21.315 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 450, :attempts 9, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:21.829 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 500, :attempts 10, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:22.392 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 550, :attempts 11, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:23.020 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 600, :attempts 12, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:23.684 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 650, :attempts 13, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:24.397 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 700, :attempts 14, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:25.161 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 750, :attempts 15, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:25.975 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 800, :attempts 16, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:26.838 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 850, :attempts 17, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:27.751 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 900, :attempts 18, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:28.715 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 950, :attempts 19, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:28.908 INFO default datomic.lifecycle - {:event :transactor/heartbeat-failed, :cause :timeout, :pid 9386, :tid 17}
2017-08-17 05:40:28.910 ERROR default datomic.process - {:message "Critical failure, cannot continue: Heartbeat failed", :pid 9386, :tid 394}
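The retry entries above show a backoff that grows by 50 ms per attempt. As a rough sanity check, the Python sketch below sums the logged backoff values and compares the total with the wall-clock gap between the first retry and the heartbeat-failed event. All numbers are copied from the log. Reading the result as "the heartbeat timed out at about the moment the retries were exhausted" is only an observation from these timestamps, not documented Datomic behaviour.

```python
from datetime import datetime

# Backoff values (ms) as they appear in the :StoragePutBackoffMsec fields:
# 0, 50, 100, ..., 950 for attempts 0 through 19.
backoffs_ms = [50 * attempt for attempt in range(20)]
total_backoff_ms = sum(backoffs_ms)

# Timestamps copied from the first retry entry and the heartbeat-failed entry.
fmt = "%Y-%m-%d %H:%M:%S.%f"
first_retry = datetime.strptime("2017-08-17 05:40:18.942", fmt)
heartbeat_failed = datetime.strptime("2017-08-17 05:40:28.908", fmt)
elapsed_ms = round((heartbeat_failed - first_retry).total_seconds() * 1000)

print("total logged backoff (ms):", total_backoff_ms)
print("first retry to heartbeat failure (ms):", elapsed_ms)
```

The two figures land within about half a second of each other, so nearly the whole window before the failure was spent waiting between retries of the same storage put.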

We currently use datomic-pro-0.9.5561.50 with an AMI we created
ourselves, on a t2.medium with 4 GB of RAM.

- Xmx and Xms are both set to 2625m.
- transactor_memory_index_max = 512m
- transactor_memory_index_threshold = 32m
- transactor_object_cache_max = 1g

*Note:* failover does not help. We have configured failover, and when we run 2
instances they both fail at the same time.
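To make the memory settings above easier to reason about, here is a small Python sketch of the arithmetic. It only restates the numbers from the list. Treating "4 GB" as 4096 MiB and "1g" as 1024 MiB is an assumption, and the sketch does not claim that these settings cause the heartbeat failures.

```python
# Values from the settings listed above, in MiB.
instance_ram_mb = 4096       # assumption: 4 GB of RAM taken as 4096 MiB
heap_mb = 2625               # Xmx and Xms
memory_index_max_mb = 512    # transactor_memory_index_max
object_cache_max_mb = 1024   # transactor_object_cache_max (1g)

# Heap explicitly assigned to the memory index and the object cache.
configured_mb = memory_index_max_mb + object_cache_max_mb
# Heap left over for everything else the transactor does.
heap_headroom_mb = heap_mb - configured_mb
# RAM left outside the JVM heap for the OS and non-heap JVM memory.
outside_heap_mb = instance_ram_mb - heap_mb

print("configured index + cache (MiB):", configured_mb)
print("remaining heap (MiB):", heap_headroom_mb)
print("RAM outside the heap (MiB):", outside_heap_mb)
```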


Adam Lewis

2017-08-17 14:01:13 UTC


It seems like you've narrowed the problem down to DDB. Since the heartbeat is
written to and read from DDB, any persistent I/O errors are going to lead
to this sort of issue. 19 out of 20 times when we see this sort of issue,
it is due to DDB under-provisioning; however, once in a while we see
systematic DDB issues. AWS support may be one route, but in the self-serve
category, I've had success scaling DDB down to minimal provisioning,
then scaling it up to something like 2-4x our production throughput, and
then back down to normal. I believe this forces DDB to re-shard your data
internally and can help migrate data away from problematic hosts. There is
no guarantee it will solve your problem, but it is easy enough to try.
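The scale-down, scale-up, scale-back sequence described above could be scripted roughly as follows. This is a minimal sketch, not something taken from this thread: the function names, the factor of 3 (picked from the 2-4x range mentioned), and the example capacities are all hypothetical. `apply_plan` assumes boto3 is installed and AWS credentials are configured. Dropping a production table to minimal capacity will throttle live traffic, and DynamoDB limits how often provisioned throughput can be decreased, so treat this as an outline only.

```python
def resharding_plan(normal_read, normal_write, factor=3, minimum=1):
    """Return the (read, write) capacity steps: minimal, scaled up, back to normal."""
    return [
        (minimum, minimum),
        (normal_read * factor, normal_write * factor),
        (normal_read, normal_write),
    ]


def apply_plan(table_name, plan):
    """Apply each step with boto3, waiting for the table to become ACTIVE in between."""
    import boto3  # imported lazily so resharding_plan works without boto3 installed

    client = boto3.client("dynamodb")
    waiter = client.get_waiter("table_exists")
    for read_units, write_units in plan:
        # update_table rejects a request whose values equal the current settings.
        client.update_table(
            TableName=table_name,
            ProvisionedThroughput={
                "ReadCapacityUnits": read_units,
                "WriteCapacityUnits": write_units,
            },
        )
        # Polls DescribeTable until the table status is ACTIVE again.
        waiter.wait(TableName=table_name)


# Hypothetical normal capacities, for illustration only.
plan = resharding_plan(normal_read=100, normal_write=50)
print(plan)
```

Only `resharding_plan` runs here. Calling `apply_plan("your-table-name", plan)` would make the actual changes.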


Brecht De Rooms

2017-08-25 08:27:55 UTC


Thanks for the answer, Adam.

I tried rescaling the DDB table with no success and opened an AWS Support case, which
I am waiting on. I will let the Datomic community know what they respond.
Is it absolutely certain that this is a DDB issue, or could it have something
to do with the latest Datomic version? Do you have any idea whether
more information is available anywhere about this "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException"
error (a specific message from DynamoDB or anything)?
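On where more detail might come from: the transactor log only records the exception class, but AWS service errors generally carry an error code, a message, an HTTP status and a request ID (the request ID is what AWS Support usually asks for). One way to surface them is a separate probe against the same table from the transactor host. The Python sketch below shows only the extraction step, on a dict shaped like botocore's `ClientError.response`. The function name and the example values are hypothetical and are not the error from this thread.

```python
def summarize_error(response):
    """Pull the fields that identify a failed AWS call out of an error response dict."""
    error = response.get("Error", {})
    meta = response.get("ResponseMetadata", {})
    return {
        "code": error.get("Code"),
        "message": error.get("Message"),
        "http_status": meta.get("HTTPStatusCode"),
        "request_id": meta.get("RequestId"),
    }


# Hypothetical response for illustration, shaped like botocore's ClientError.response.
example = {
    "Error": {
        "Code": "ProvisionedThroughputExceededException",
        "Message": "example message",
    },
    "ResponseMetadata": {"HTTPStatusCode": 400, "RequestId": "EXAMPLE-REQUEST-ID"},
}
summary = summarize_error(example)
print(summary)
```

In a real probe, the dict would come from catching `botocore.exceptions.ClientError` around a DynamoDB call and reading its `response` attribute.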


Brecht De Rooms

2017-08-25 11:04:07 UTC


*Update*

AWS support is looking into this. They decided to try and set up a
transactor to reproduce the problem and see how they can offer better
support for such services (which is quite awesome I think).
They promised to get back to me with advice.
