Brecht De Rooms
2017-08-17 08:40:14 UTC
Hi,
I am trying to fix a problem with heartbeat failures. We are running on AWS
with DynamoDB.
We have set it up so that the old machine terminates and a new
VM starts whenever the Java process fails.
As you can see in the CloudWatch graph below, we have quite a few restarts,
and they seem to happen at regular intervals.
[Image: CloudWatch graph of the restarts]
We do see some throttling on the reads in DynamoDB (see image below), but
it seems to happen *after* the restart, which makes sense since I assume the
Datomic peer cache is invalidated (does that actually happen on a
transactor restart?).
There are no write throttles whatsoever. We tried increasing the DynamoDB
limit, but this did not change anything.
[Image: DynamoDB read throttling graph]
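To line the throttling up against the restart times more precisely than a graph allows, the same CloudWatch metrics could be pulled as numbers. This is only a sketch: the table name `my-datomic-table`, the time window and the helper `throttle_query` are placeholders, not anything from our setup. It builds the arguments for boto3's `get_metric_statistics` call, using the `ReadThrottleEvents` metric from the `AWS/DynamoDB` namespace.

```python
from datetime import datetime, timedelta

def throttle_query(table_name, metric_name, around, minutes_each_side=10, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics (hypothetical helper)."""
    margin = timedelta(minutes=minutes_each_side)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": around - margin,
        "EndTime": around + margin,
        "Period": period,          # seconds per datapoint
        "Statistics": ["Sum"],
    }

# Time of the heartbeat failure taken from the transactor log (assumed to be UTC).
failure = datetime(2017, 8, 17, 5, 40, 28)
params = throttle_query("my-datomic-table", "ReadThrottleEvents", failure)

# With AWS credentials configured, the query could be sent as:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

Swapping in `WriteThrottleEvents` as the metric name would give the matching view for writes.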
In the logs we always see the same errors coming back:
2017-08-17 05:40:18.942 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 0, :attempts 0, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.006 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 50, :attempts 1, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.119 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 100, :attempts 2, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.283 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 150, :attempts 3, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.497 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 200, :attempts 4, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:19.761 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 250, :attempts 5, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:20.075 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 300, :attempts 6, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:20.438 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 350, :attempts 7, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:20.852 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 400, :attempts 8, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:21.315 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 450, :attempts 9, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:21.829 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 500, :attempts 10, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:22.392 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 550, :attempts 11, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:23.020 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 600, :attempts 12, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:23.684 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 650, :attempts 13, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:24.397 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 700, :attempts 14, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:25.161 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 750, :attempts 15, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:25.975 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 800, :attempts 16, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:26.838 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 850, :attempts 17, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:27.751 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 900, :attempts 18, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:28.715 INFO default datomic.kv-cluster - {:event :kv-cluster/retry, :StoragePutBackoffMsec 950, :attempts 19, :max-retries 20, :cause "com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException", :pid 9386, :tid 355}
2017-08-17 05:40:28.908 INFO default datomic.lifecycle - {:event :transactor/heartbeat-failed, :cause :timeout, :pid 9386, :tid 17}
2017-08-17 05:40:28.910 ERROR default datomic.process - {:message "Critical failure, cannot continue: Heartbeat failed", :pid 9386, :tid 394}
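The retry lines above can be summarized with a short script. This is a minimal sketch, assuming each log entry is passed in as one string (wrapped entries joined first); the function name `summarize_retries` and the regular expression are illustrative and not part of Datomic. It adds up the logged `:StoragePutBackoffMsec` values and measures the wall-clock time between the first and the last retry.

```python
import re
from datetime import datetime

# Illustrative pattern for the retry entries shown above; not a Datomic API.
RETRY_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r":kv-cluster/retry, :StoragePutBackoffMsec (?P<backoff>\d+), "
    r":attempts (?P<attempts>\d+)"
)

def summarize_retries(entries):
    """Return (retry count, total logged backoff in ms, elapsed seconds)."""
    stamps, backoffs = [], []
    for entry in entries:
        # Collapse line wraps and repeated spaces so a re-joined entry still matches.
        m = RETRY_RE.search(" ".join(entry.split()))
        if m:
            stamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"))
            backoffs.append(int(m.group("backoff")))
    elapsed = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return len(stamps), sum(backoffs), elapsed
```

For the 20 retry entries above, the logged backoff values add up to 9500 ms, and the first and last retry are 9.773 s apart. The heartbeat failure is logged 0.193 s after the last retry.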
We currently use datomic-pro-0.9.5561.50 on an AMI we created
ourselves, running on a t2.medium with 4 GB RAM.
- -Xmx and -Xms are both set to 2625m
- transactor_memory_index_max = 512m
- transactor_memory_index_threshold = 32m
- transactor_object_cache_max = 1g
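For a quick look at how these settings divide the machine's memory, here is a small arithmetic sketch. The helper `to_mib` is hypothetical and only understands the `m`/`g` suffixes used above, and the 4 GB of RAM is treated as 4096 MiB. It makes no claim about what Datomic recommends; it only adds up the numbers given.

```python
def to_mib(size):
    """Convert a size such as '512m' or '1g' to MiB (illustrative helper)."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]

heap = to_mib("2625m")         # -Xmx / -Xms
index_max = to_mib("512m")     # transactor_memory_index_max
object_cache = to_mib("1g")    # transactor_object_cache_max
ram = to_mib("4g")             # instance RAM, assumed to mean 4096 MiB

heap_left = heap - index_max - object_cache   # heap not claimed by index or cache
ram_left = ram - heap                         # RAM outside the JVM heap
print(heap_left, ram_left)                    # prints "1089 1471"
```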
*Note:* failover does not help. We have failover configured, and when we run 2
instances for failover they both fail at the same time.
--
You received this message because you are subscribed to the Google Groups "Datomic" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datomic+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.