Discussion:
steps to recover from transactor failure
Tijs Mallaerts
2015-11-24 09:15:02 UTC
Permalink
We are running Datomic 0.9.5206 on an Ubuntu 14.04 VPS with a MySQL
database as the storage service, and are experiencing regular critical
failures of the transactor:
INFO default datomic.lifecycle - {:tid 18, :pid 8396, :event
:transactor/heartbeat-failed, :cause :timeout}
ERROR default datomic.process - {:tid 93, :pid 8396, :message "Critical
failure, cannot continue: Heartbeat failed"}

We have not yet been able to identify the cause of these failures.

What are the correct steps to recover from such a failure?
At the moment we just restart the transactor. Is this enough, or could it
cause data loss?

Thanks in advance!
Tijs
Linus Ericsson
2015-11-25 16:33:42 UTC
Permalink
There is no risk of data loss (except that you have to make sure you don't
miss that transactions can be cancelled and throw exceptions under certain
circumstances). The transactor will never return a result that is not
persisted to the storage system.

The heartbeat system uses the backing storage. The transactor in charge
updates a timestamp at regular* intervals. The Peers and other transactors
read the timestamp and take action accordingly should the transactor in
charge fail to update it.
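
Just to illustrate the mechanism (this is not Datomic's actual
implementation, only a minimal Clojure sketch of a storage-backed heartbeat):

;; The active transactor writes a timestamp at a fixed interval; observers
;; treat it as dead once the timestamp is older than some timeout.
(def storage (atom {:heartbeat 0}))   ; stand-in for the real storage service

(defn beat! []
  (swap! storage assoc :heartbeat (System/currentTimeMillis)))

(defn alive? [timeout-ms]
  (< (- (System/currentTimeMillis) (:heartbeat @storage)) timeout-ms))

;; A long GC pause in the transactor delays beat!, so alive? eventually
;; returns false even though the process is still running.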

This relies on the system clocks being reasonably well in sync. They should
differ by no more than a few seconds*.

The reason for heartbeat failures (according to previous posts on this
list) is almost always the GC settings (and the resulting pauses) in the
transactor.

How much memory (-Xmx) does your transactor's JVM process have available?
And how much memory does the environment really have available? And do the
transactor and the Peer(s) run on different machines? Otherwise two
processes could start to GC at the same time.

Often more memory gives longer full-GC pauses, so much more memory is not
necessarily better.
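
If you want to confirm that GC pauses are the cause, one thing you can do is
start the transactor with GC logging turned on. Something like this
(illustrative Java 8 flags and example paths; as far as I know bin/transactor
passes JVM flags through, but check the docs for your version):

bin/transactor -Xms4g -Xmx4g -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/datomic/gc.log config/transactor.properties

Then correlate any multi-second pauses in gc.log with the heartbeat-failed
events in the transactor log.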

You can also raise the transactor's timeout settings. This won't have any
particular effect on the system except that it takes longer to notice that a
transactor is down "for real" and that some other fail-over action should occur.
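
If I remember right the relevant property is heartbeat-interval-msec in the
transactor properties file (check the docs for the exact name and the allowed
minimum); for example:

# transactor.properties (value is just an example)
heartbeat-interval-msec=10000

A longer interval makes the system more tolerant of pauses, at the cost of
slower detection of a transactor that really is down.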

Does your application load suggest that the system could be overloaded?

* = afaik.

/Linus
Tijs Mallaerts
2015-11-25 22:36:34 UTC
Permalink
Hello Linus,

Thank you for the feedback!

The VPS has 16 GB of RAM, and Xmx for the transactor is set to 4 GB. The
transactor and the peers are running on the same server.
For the moment the system is only used for some smaller tests, so there are
no signs of heavy load.

Thanks again!
Tijs
Brecht De Rooms
2017-05-15 08:28:05 UTC
Permalink
@Tijs: did you resolve your issue? I am currently experiencing a similar
problem.
Tijs Mallaerts
2017-05-15 19:54:23 UTC
Permalink
Yes, part of the failures could be explained by unattended package upgrades
on the server, which caused MySQL to restart and the transactor to lose its
connection to storage.
After disabling unattended upgrades the failures became less frequent, and
the transactor has been running without any problems for a long time now.
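
For reference, on our Ubuntu server disabling them came down to the apt
periodic configuration, roughly like this (file name and exact values may
differ per setup):

# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";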

Best regards,
Tijs
Daniel Compton
2017-05-21 22:32:17 UTC
Permalink
Does the transactor retry connecting to the storage backend with exponential
backoff, or does it need a full restart if it loses connectivity with (say)
a MySQL database? The docs (http://docs.datomic.com/deployment.html) don't
explicitly say, but they seem to imply that transactor HA failover is the
intended method of handling failures.
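
On the peer side, one option is to wrap the connection in a retry loop with
exponential backoff; a rough sketch (connect-with-backoff is a hypothetical
helper, nothing Datomic-specific beyond d/connect):

(require '[datomic.api :as d])

(defn connect-with-backoff
  "Retry d/connect after a storage/transactor outage, doubling the wait
  between attempts. Purely illustrative; tune attempts and delays."
  [uri max-attempts]
  (loop [attempt 1]
    (let [conn (try
                 (d/connect uri)
                 (catch Exception e
                   (when (= attempt max-attempts)
                     (throw e))
                   nil))]
      (or conn
          (do (Thread/sleep (* 1000 (long (Math/pow 2 attempt))))
              (recur (inc attempt)))))))

But that still leaves the question of what the transactor itself does when
storage goes away.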
--
Daniel