Regarding memory pressure around 10 billion datoms... Is that an issue that
arises fairly suddenly around 10 billion datoms? Or is this memory pressure
fairly linear with respect to the number of datoms in the database? In
other words, if I take a large database with 18 billion datoms that needs 9
GB of active memory per peer, and divide it into 3 databases with 6 billion
datoms each, do I still need 9 GB/3 + 9 GB/3 + 9 GB/3 = 9 GB of active memory,
or do I save a lot of peer memory by performing that division?

A related question regarding memory pressure... I imagine that the amount
of active memory would be much higher on a peer that is performing queries
that involve a large portion of a database's datoms, but that some minimal
amount of active memory will be required just to connect to the database
and fetch a few entities. In the context of memory pressure caused by large
databases, are we talking primarily about the memory pressure associated
with connecting and fetching a few entities?

Of course it would be great if someone could provide ballpark numbers for
the memory needs of peers and transactors at various database sizes. The
capacity doc talks about using 1 GB for development and 4 GB for production
on both the transactor and peers. If 4 GB turns out to be the expected
memory usage for a 10 billion datom database, then 200 billion datoms could
be split across 20 databases, requiring 20 * 4 GB = 80 GB of memory. That
would call for an r3.4xlarge instance at roughly $1k per month (per peer or
transactor). Yes, that's a lot to pay per peer, but considering the extreme
scale we're discussing... that price seems reasonable enough.
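To make the back-of-the-envelope math above concrete, here is a minimal Python sketch. It assumes that memory scales linearly with the number of databases and that 4 GB per 10 billion datoms holds, neither of which is confirmed. The name `shard_plan` and its parameters are hypothetical, chosen for illustration only, and are not part of any Datomic API.

```python
def shard_plan(total_datoms, datoms_per_db, gb_per_db):
    """Estimate shard count and total memory under a naive linear model.

    Assumption (unverified): each database of `datoms_per_db` datoms
    needs `gb_per_db` GB, and the needs of separate databases simply add up.
    """
    # Integer ceiling division: round up to a whole number of databases.
    n_dbs = -(-total_datoms // datoms_per_db)
    return n_dbs, n_dbs * gb_per_db


# 200 billion datoms, 10 billion per database, 4 GB per database.
n_dbs, total_gb = shard_plan(200_000_000_000, 10_000_000_000, 4)
print(n_dbs, total_gb)  # prints: 20 80
```

If memory pressure turns out to grow faster than linearly near 10 billion datoms, the per-database figure would drop for smaller shards and this estimate would be too pessimistic.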

I watched the NuBank video that Mike Haney linked. It is overall a very
glowing review of Datomic, a small part of which is about how well Datomic
handles sharding. The comments about sharding are in two places, around
33:43 (superpowers 8 and 9) and a short comment at the end. The video gives
me a lot of hope that Datomic is usable at scale, but also raises some
questions.

http://youtu.be/7lm3K8zVOdY

NuBank's reason for sharding was to overcome the write throughput
limitations of DynamoDB, which they used as Datomic's storage backend. They
sharded into multiple Datomic databases, each using a separate DynamoDB on
a separate EC2 instance as a backend. All of these databases shared the
same transactor. There was also a claim about "inter-shard ACID
transactions", which seems to be partially explained by using one transactor
across all of these databases. They also had multiple transactors for high
availability, so I guess one of those transactors served all of those
databases while another just sat idle to take over if needed.

How can a transactor be configured to serve multiple databases?

Since the usual way to achieve high availability is to point two
transactors at the same database, how can we ensure that one of the
transactors handles all of the databases, rather than the two transactors
dividing up the databases randomly?

Since the transactor is only performing one transaction at a time, across
all those databases, I expect that only one of the databases would actually
be writing to DynamoDB at any given time. So how does this configuration
lead to higher write throughput if only one DynamoDB is actually being
written to at a time?

Transaction functions typically take "the database" as an argument. How
does this work in the context of "inter-shard ACID transactions"? Do they
end up taking multiple databases as arguments but only inserting into one
of them? Do they take multiple databases as arguments and somehow choose which
of those databases will receive various inserted datoms? Or do they take one
database as an argument and explicitly connect to a few others just to peek?

Followup possibility, inspired by the video... If I have five Datomic
databases, can I assign one transactor to three of them for inter-shard
transactions between those shards, and give the other two their own
transactors for increased throughput? Can I also change my mind later about
which shards should be served by a single transactor? That would be an
amazing option to have, if it is really possible.