Patrick Rusk
2013-10-07 21:39:02 UTC
I am new to Datomic, but I have a strong interest in a topic previously
discussed at length (
https://groups.google.com/forum/#!topic/datomic/zDoFsxKgARQ): the fact
that Datomic's "as of" functionality is based on transaction times, not on
when a datum becomes "official" ("t_recorded" vs. "t_fact", in the
terminology of that thread). I can also appreciate that it would be very
hard to support both concepts. I have a thought on bridging the two, and
would be curious about the reactions of people who have worked extensively
with Datomic.
In the financial industry, we are constantly dealing with reference data
that has date granularity. In practice, for security reference data, this
generally means that Bloomberg sends a file every "evening" (for each
region in the world) listing the official data for a universe of
securities. Naturally, the closing price changes daily, but many attributes
are long-lived (company name) and even more are medium-lived (shares
outstanding). Only a few change daily. The data is generally regarded as
read-only, except for corrections.
Generally, the storage of this data in relational databases involves rows
with start_date/end_date values, or as_of_date if the datum is known to
change daily. The schema design ends up being a series of compromises over
how to break things up based on cardinalities and how frequently values
change. The resulting schema is generally best described as "a mess", and
there are enormous amounts of duplicate data, especially when a record with
30 values needs to be copied with a new date range because 1 value changed.
The data is then duplicated even more massively into star schemas for
analysis and reporting.
It would be a dream to model the data very naturally and just have the
database take care of tracking the dates on which a datum changes its
official value. It *almost* seems like Datomic does that, until you realize
that it only cares (implicitly) about the exact moment you commit a value.
If it instead recorded each datum with the date on which it became
official, one could imagine having all kinds of fun with Datomic's time
travel features. But I get that it would be hard to support that.
Now imagine, for a moment, that we lived in a weird universe where the
correct, official reference data for a security was known and recorded at
exactly midnight at the *start* of the date on which it was official. Then,
if you wanted to look back and see what the official data was on 9/1, you
would set your time viewpoint back to 9/1 and ask for the data. Since
t_fact == t_recorded in this case, you would get the answers you want and
life would be wonderful. But, clearly, that would be a weird universe.
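In that universe, the 9/1 lookup is just an as-of view keyed by date,
something like this (reusing the made-up attributes from the sketch above):

    ;; The database as the weird universe recorded it at the start of 9/1.
    (def db-on-sep-1 (d/as-of (d/db conn) #inst "2013-09-01"))

    (d/q '[:find ?shares
           :where [?s :security/cusip "123456789"]
                  [?s :security/shares-outstanding ?shares]]
         db-on-sep-1)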
Along the way in the forums here, though, I've seen indications that you
can prep a new Datomic database with historical data by feeding it data
with explicit transaction times, which I gather need to be monotonically
increasing. If that's true, then it suggests you could take a set of
historical reference data and populate a fresh Datomic database as though
the data had arrived just as it would in the weird universe (t_fact ==
t_recorded).
On any given morning you could potentially have such a database fully
current as of the prior end-of-day, which is a sweet spot for financial
reference data.
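If I've read the forum posts correctly, the bootstrap amounts to asserting
:db/txInstant on the transaction entity itself during the import. A sketch
(Datomic rejects an explicit txInstant unless it is later than every
existing transaction and no later than the transactor's clock, which is
where the monotonicity requirement comes from):

    (defn import-official-day!
      "Transacts one day's reference data stamped with its official date.
       facts is a vector of ordinary tx-data; official-date a java.util.Date."
      [conn official-date facts]
      @(d/transact conn
         (conj facts
               {:db/id (d/tempid :db.part/tx)
                :db/txInstant official-date})))

    ;; Replay history oldest-first so the txInstants stay monotonic:
    (import-official-day! conn #inst "2013-09-01"
      [{:db/id (d/tempid :db.part/user)
        :security/cusip "123456789"
        :security/shares-outstanding 5100000}])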
How might you handle corrections? Suppose that each day we snapshot the
official copy of the Datomic database, so that we have complete official
copies for each day. If a correction comes in for three days ago, we take
the snapshot from four days ago and replay into it the transactions for the
last three days, correcting the offending piece of data in the transaction
stream just beforehand. So, we might have a live copy of Datomic, and a
cold-standby that receives corrections and takes over as live once it is
up-to-date.
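The replay itself looks tractable with Datomic's Log API. A rough sketch of
the standby's correction loop, with the big caveat that it naively reuses
the original entity ids; that only lines up because the standby is a
restored copy of the same database, and entities first created after the
snapshot might need tempid remapping in a real version:

    (defn replay-corrected!
      "Replays live-conn's log from start-inst onward into fixed-conn,
       dropping datoms matching bad?. To amend rather than drop, conj the
       corrected assertion into that transaction's facts instead."
      [live-conn fixed-conn start-inst bad?]
      (let [tx-instant (d/entid (d/db live-conn) :db/txInstant)]
        (doseq [{:keys [data]} (d/tx-range (d/log live-conn) start-inst nil)]
          (let [official (:v (first (filter #(= tx-instant (:a %)) data)))
                facts    (for [dtm data
                               :when (and (not= tx-instant (:a dtm))
                                          (not (bad? dtm)))]
                           [(if (:added dtm) :db/add :db/retract)
                            (:e dtm) (:a dtm) (:v dtm)])]
            ;; Re-assert with the original official txInstant preserved.
            @(d/transact fixed-conn
               (conj (vec facts)
                     {:db/id (d/tempid :db.part/tx)
                      :db/txInstant official}))))))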
If this is viable, then it means Datomic could be used as-is right now for
this purpose. The principal compromises are that the data is read-only and
that a correction would take N minutes to be available to end users, with N
probably being something like 15+. There are many situations in which that
would be acceptable.
Thoughts?
By the way, there are times (like when the SEC comes calling) when you want
to know "At 9/3/2013 2:47 PM, what did this portfolio manager think the
short-term rating was for So-And-So MBS on 9/1?", but those times are very
rare. That's more in Datomic's usual sweet spot. Our nasty
start_date/end_date schemas have separate audit schemas populated via
triggers to capture the intraday changes that a record might undergo,
leading to even more duplication. However, under the scheme I outline
above, we could easily say, "At 9/3/2013 2:47 PM, the live Datomic database
was composed of this snapshot and these transaction logs. Let's recompose
it and see what the PM saw."
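Concretely, that recomposition would be: restore the snapshot that was live
on 9/3, replay the transaction logs up through 2:47 PM (as in the sketch
above, minus the correction), and then ask the as-of question through that
copy's eyes (attribute names again made up):

    ;; audit-conn points at the recomposed copy: snapshot plus logs
    ;; replayed up through 9/3/2013 2:47 PM.
    (def pm-view (d/as-of (d/db audit-conn) #inst "2013-09-01"))

    (d/q '[:find ?rating
           :where [?s :security/cusip "123456789"]
                  [?s :security/short-term-rating ?rating]]
         pm-view)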