Patrick Rusk
2013-10-07 21:39:02 UTC
I am new to Datomic, but I have a strong interest in a topic previously
discussed at length (
https://groups.google.com/forum/#!topic/datomic/zDoFsxKgARQ): the fact
that Datomic's "as of" functionality is based on transaction times, not on
when a datum becomes "official" ("t_recorded" vs. "t_fact", in the
terminology of that thread). I can also appreciate that it would be very
hard to support both concepts. I have a thought on bridging the two, and
would be curious about the reactions of people who have worked extensively
with Datomic.
In the financial industry, we are constantly dealing with reference data
that has date granularity. In practice, for security reference data, this
generally means that Bloomberg sends a file every "evening" (for each
region in the world) listing the official data for a universe of
securities. Naturally, the closing price changes daily, but many attributes
are long-lived (company name) and even more are medium-lived (shares
outstanding). Only a few change daily. The data is generally regarded as
read-only, except for corrections.
Generally, the storage of this data in relational databases involves rows
with start_date/end_date values, or as_of_date if the datum is known to
change daily. The schema design ends up being a series of compromises over
how to break things up based on cardinalities and how frequently values
change. The resulting schema is generally best described as "a mess", and
there are enormous amounts of duplicate data, especially when a record with
30 values needs to be copied with a new date range because 1 value changed.
The data is then duplicated even more massively into star schemas for
analysis and reporting.
It would be a dream to model the data very naturally and just have the
database take care of tracking the dates on which a datum changes its
official value. It *almost* seems like Datomic does that, until you realize
that it only cares (implicitly) about the exact moment you commit a value.
If it instead recorded each datum with the date on which it became
official, one could imagine having all kinds of fun with Datomic's time
travel features. But I get that it would be hard to support that.
Now imagine, for a moment, that we lived in a weird universe where the
correct, official reference data for a security was known and recorded at
exactly midnight at the *start* of the date on which it was official. Then,
if you wanted to look back and see what the official data was on 9/1, you
would set your time viewpoint back to 9/1 and ask for the data. Since
t_fact == t_recorded in this case, you would get the answers you want and
life would be wonderful. But, clearly, that would be a weird universe.
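In that universe, the 9/1 lookup is just an as-of view keyed by date,
something like this (reusing the made-up attributes from the sketch above):

    ;; The database as the weird universe recorded it at the start of 9/1.
    (def db-on-sep-1 (d/as-of (d/db conn) #inst "2013-09-01"))

    (d/q '[:find ?shares
           :where [?s :security/cusip "123456789"]
                  [?s :security/shares-outstanding ?shares]]
         db-on-sep-1)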
Along the way in the forums here, though, I've seen indications that you
can prep a new Datomic database with historical data by feeding it data
with explicit transaction times, which I gather need to be monotonically
increasing. If that's true, then it suggests you could take a set of
historical reference data and populate a fresh Datomic database as though
the data had arrived just as it would in the weird universe (t_fact ==
t_recorded).
On any given morning you could potentially have such a database fully
current as of the prior end-of-day, which is a sweet spot for financial
reference data.
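If I've read the forum posts correctly, the bootstrap amounts to asserting
:db/txInstant on the transaction entity itself during the import. A sketch
(Datomic rejects an explicit txInstant unless it is later than every
existing transaction and no later than the transactor's clock, which is
where the monotonicity requirement comes from):

    (defn import-official-day!
      "Transacts one day's reference data stamped with its official date.
       facts is a vector of ordinary tx-data; official-date a java.util.Date."
      [conn official-date facts]
      @(d/transact conn
         (conj facts
               {:db/id (d/tempid :db.part/tx)
                :db/txInstant official-date})))

    ;; Replay history oldest-first so the txInstants stay monotonic:
    (import-official-day! conn #inst "2013-09-01"
      [{:db/id (d/tempid :db.part/user)
        :security/cusip "123456789"
        :security/shares-outstanding 5100000}])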
How might you handle corrections? Suppose that each day we snapshot the
official copy of the Datomic database, so that we have complete official
copies for each day. If a correction comes in for three days ago, we take
the snapshot from four days ago and replay into it the transactions for the
last three days, correcting the offending piece of data in the transaction
stream just beforehand. So, we might have a live copy of Datomic, and a
cold-standby that receives corrections and takes over as live once it is
up-to-date.
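The replay itself looks tractable with Datomic's Log API. A rough sketch of
the standby's correction loop, with the big caveat that it naively reuses
the original entity ids; that only lines up because the standby is a
restored copy of the same database, and entities first created after the
snapshot might need tempid remapping in a real version:

    (defn replay-corrected!
      "Replays live-conn's log from start-inst onward into fixed-conn,
       dropping datoms matching bad?. To amend rather than drop, conj the
       corrected assertion into that transaction's facts instead."
      [live-conn fixed-conn start-inst bad?]
      (let [tx-instant (d/entid (d/db live-conn) :db/txInstant)]
        (doseq [{:keys [data]} (d/tx-range (d/log live-conn) start-inst nil)]
          (let [official (:v (first (filter #(= tx-instant (:a %)) data)))
                facts    (for [dtm data
                               :when (and (not= tx-instant (:a dtm))
                                          (not (bad? dtm)))]
                           [(if (:added dtm) :db/add :db/retract)
                            (:e dtm) (:a dtm) (:v dtm)])]
            ;; Re-assert with the original official txInstant preserved.
            @(d/transact fixed-conn
               (conj (vec facts)
                     {:db/id (d/tempid :db.part/tx)
                      :db/txInstant official}))))))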
If this is viable, then it means Datomic could be used as-is right now for
this purpose. The principal compromises are that the data is read-only and
that a correction would take N minutes to be available to end users, with N
probably being something like 15+. There are many situations in which that
would be acceptable.
Thoughts?
By the way, there are times (like when the SEC comes calling) when you want
to know "At 9/3/2013 2:47 PM, what did this portfolio manager think the
short-term rating was for So-And-So MBS on 9/1?", but those times are very
rare. That's more in Datomic's usual sweet spot. Our nasty
start_date/end_date schemas have separate audit schemas populated via
triggers to capture the intraday changes that a record might undergo,
leading to even more duplication. However, under the scheme I outline
above, we could easily say, "At 9/3/2013 2:47 PM, the live Datomic database
was composed of this snapshot and these transaction logs. Let's recompose
it and see what the PM saw."
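Concretely, that recomposition would be: restore the snapshot that was live
on 9/3, replay the transaction logs up through 2:47 PM (as in the sketch
above, minus the correction), and then ask the as-of question through that
copy's eyes (attribute names again made up):

    ;; audit-conn points at the recomposed copy: snapshot plus logs
    ;; replayed up through 9/3/2013 2:47 PM.
    (def pm-view (d/as-of (d/db audit-conn) #inst "2013-09-01"))

    (d/q '[:find ?rating
           :where [?s :security/cusip "123456789"]
                  [?s :security/short-term-rating ?rating]]
         pm-view)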