This is a follow-on to the Datum database schema redesign post, where I outlined a redesign of the main datum database schema used by SolarNetwork.
Another big part of what SolarNetwork does is take all the raw data posted by SolarNode devices and calculate aggregate summaries of common time periods (hours, days, and months) and store them so they don’t need to be calculated every time someone wants to view their data at one of those aggregate levels. This enables SolarNetwork to respond very quickly to queries like “show me the average power per day generated by my solar panels over the past month” or “show me the energy used per hour by my house over the past week”.
What that means is SolarNetwork continuously computes and stores hourly, daily, and monthly aggregate records for every datum stream which contain averages, minimum and maximum values, and other statistical data for each aggregate level.
For example, an hourly node datum computed from a power meter datum stream might have properties like this:
Property | Value |
---|---|
ts_start | 2021-01-01 12:00:00+13 |
node_id | 123 |
source_id | /meter/1 |
watts | 2400 |
watts_min | 350 |
watts_max | 13100 |
watts_count | 60 |
wattHours | 2411 |
wattHours_start | 340983098 |
wattHours_end | 340985509 |
status | OK |
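
As a rough illustration of the statistics involved, computing one hour of instantaneous values could look like the following. The raw_datum table and its plain watts column are hypothetical, used only to show the shape of the calculation rather than the actual SolarNetwork schema:

```sql
-- Sketch of computing one hour of instantaneous statistics for a single stream.
SELECT
    date_trunc('hour', ts) AS ts_start,
    avg(watts)             AS watts,        -- hourly average
    min(watts)             AS watts_min,    -- smallest value seen in the hour
    max(watts)             AS watts_max,    -- largest value seen in the hour
    count(watts)           AS watts_count   -- number of values in the average
FROM raw_datum
WHERE node_id = 123
  AND source_id = '/meter/1'
  AND ts >= '2021-01-01 12:00:00+13'
  AND ts <  '2021-01-01 13:00:00+13'
GROUP BY date_trunc('hour', ts);
```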
Until now, SolarNetwork stored this data as a set of JSON objects, based on a simple classification of instantaneous vs accumulating vs status property types. Along with columns for the timestamp, object ID, and source ID of the datum, the database table had JSON columns for each set of properties, and several JSON columns for statistical metadata, sort of like this:
Column | Datatype | Value |
---|---|---|
ts_start | timestamp | 2021-01-01 12:00:00+13 |
node_id | bigint | 123 |
source_id | text | /meter/1 |
jdata_i | JSON | {"watts":2400,"watts_min":350,"watts_max":13100} |
jdata_a | JSON | {"wattHours":2411} |
jdata_s | JSON | {"status":"OK"} |
jmeta | JSON | {"i":{"watts":{"min":350,"max":13100,"count":60}}} |
jdata_as | JSON | {"wattHours":340983098} |
jdata_af | JSON | {"wattHours":340985509} |
jdata_ad | JSON | {"wattHours":2411} |
The jdata_i column stored the average of the instantaneous properties. The jmeta column stored the minimum and maximum values seen for those properties, and the count of values contributing to each average. The jdata_a column stored the difference of the accumulating properties. The jdata_as, jdata_af, and jdata_ad columns stored the starting, final, and difference values of the accumulating properties, using the reading aggregation logic (which is slightly different from the logic used to compute the jdata_a column). The jdata_s column stored the last seen status property values.
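
For reference, the old hourly aggregate table had roughly this shape. The table name, key ordering, and use of jsonb are assumptions for illustration; the columns follow the description above:

```sql
-- Approximate shape of the old JSON-based hourly aggregate table (illustrative).
CREATE TABLE agg_datum_hourly_old (
    ts_start  TIMESTAMP WITH TIME ZONE NOT NULL,  -- start of the aggregate hour
    node_id   BIGINT                   NOT NULL,
    source_id TEXT                     NOT NULL,
    jdata_i   JSONB,   -- averaged instantaneous properties (plus _min/_max copies)
    jdata_a   JSONB,   -- accumulating property differences
    jdata_s   JSONB,   -- "last seen" status property values
    jmeta     JSONB,   -- instantaneous statistics: min/max/count per property
    jdata_as  JSONB,   -- reading aggregation: starting values
    jdata_af  JSONB,   -- reading aggregation: final values
    jdata_ad  JSONB,   -- reading aggregation: differences
    PRIMARY KEY (node_id, ts_start, source_id)
);
```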
Although quite flexible (and having served SolarNetwork well for many years), the database design had some notable drawbacks:
* Renaming a source ID was expensive: the source ID was stored on every row of the aggregate tables, so every row in a stream had to be updated.
* Every property name was repeated in the JSON documents of every row, increasing the amount of storage required to hold the data.
* The _min and _max values were duplicated in the jdata_i and jmeta columns. Not only that, since they are part of the jdata_i JSON document, every instantaneous property effectively had 3 property names added. This duplication has the same adverse effects as detailed in the previous point.
* The local_date column was used to be able to show the "local date" of the associated data, without having to join to another table to do so. Storing it here increased the amount of storage required to hold the data.
* The jdata_s column captured only the "last seen" status property values within each aggregate period.

To overcome these design problems, we've developed a new database design that preserves the general flexibility of how datum aggregates are stored but solves the design issues. Building on the datum stream concept introduced previously, a new table design was devised.
A diagram makes this easier to take in, showing just the hourly aggregate tables:
The new agg_datm_hourly table has a primary key defined by a stream ID and timestamp. The datum property values are stored as numeric[] and text[] array columns. The instantaneous property statistic values are stored in the stat_i numeric[] array column. The accumulating property reading values are stored in the read_a numeric[] array column.
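
A rough sketch of those two tables follows. The column names and types match the descriptions in this post, but details such as constraints and schema qualification are assumptions and may differ from the actual SolarNetwork DDL:

```sql
-- Sketch of the stream metadata and hourly aggregate tables (illustrative).
CREATE TABLE da_datm_meta (
    stream_id UUID   NOT NULL PRIMARY KEY,
    node_id   BIGINT NOT NULL,
    source_id TEXT   NOT NULL,
    time_zone TEXT   NOT NULL,  -- node-local time zone, e.g. Pacific/Auckland
    names_i   TEXT[],           -- instantaneous property names
    names_a   TEXT[],           -- accumulating property names
    names_s   TEXT[]            -- status property names
);

CREATE TABLE agg_datm_hourly (
    stream_id UUID                     NOT NULL REFERENCES da_datm_meta (stream_id),
    ts_start  TIMESTAMP WITH TIME ZONE NOT NULL,  -- start of the aggregate hour
    data_i    NUMERIC[],  -- instantaneous averages, in names_i order
    data_a    NUMERIC[],  -- accumulating differences, in names_a order
    data_s    TEXT[],     -- status values, in names_s order
    stat_i    NUMERIC[],  -- per-property [count, min, max] statistics
    read_a    NUMERIC[],  -- per-property [difference, start, final] readings
    PRIMARY KEY (stream_id, ts_start)
);
```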
Using the same example datum as shown earlier, in the new database design there would be one row in the da_datm_meta "stream metadata" table like this:
Column | Datatype | Value |
---|---|---|
stream_id | UUID | a4840264-2b5b-4673-83f7-66a404d6faf0 |
node_id | bigint | 123 |
source_id | text | /meter/1 |
time_zone | text | Pacific/Auckland |
names_i | text[] | [watts] |
names_a | text[] | [wattHours] |
names_s | text[] | [status] |
Then there would be one row in the agg_datm_hourly table like this:
Column | Datatype | Value |
---|---|---|
stream_id | UUID | a4840264-2b5b-4673-83f7-66a404d6faf0 |
ts_start | timestamp | 2021-01-01 12:00:00+13 |
data_i | numeric[] | [2400] |
data_a | numeric[] | [2411] |
data_s | text[] | [OK] |
stat_i | numeric[] | [[60,350,13100]] |
read_a | numeric[] | [[2411,340983098,340985509]] |
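
One way to read a labelled view of that row back out is to join the aggregate row to its stream metadata and pair each array element with the property name at the same position. This is only an illustrative sketch, not necessarily how SolarNetwork's own queries are written:

```sql
-- Pair each instantaneous value with its property name, by array position.
SELECT d.ts_start,
       m.node_id,
       m.source_id,
       p.prop_name,
       p.prop_value
FROM agg_datm_hourly d
JOIN da_datm_meta m ON m.stream_id = d.stream_id
CROSS JOIN LATERAL unnest(m.names_i, d.data_i) AS p(prop_name, prop_value)
WHERE d.stream_id = 'a4840264-2b5b-4673-83f7-66a404d6faf0'
  AND d.ts_start = '2021-01-01 12:00:00+13';
```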
The new design solves the first problem of renaming source IDs quite nicely. Now a source ID is just a bit of metadata for a datum stream: updating a single row in the da_datm_meta table is all it takes to update the source ID for an entire stream.
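
For example, renaming the example stream's source ID could be as small as this (the new /meter/main value is hypothetical):

```sql
-- Rename the source ID for an entire datum stream by updating one metadata row.
UPDATE da_datm_meta
SET source_id = '/meter/main'  -- hypothetical new source ID
WHERE stream_id = 'a4840264-2b5b-4673-83f7-66a404d6faf0';
```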
The new design solves the second problem of duplicate property names by turning the property names into metadata on a datum stream as well. As more and more datum are collected for a given stream, there will still be only one row in the da_datm_meta table. SolarNetwork is also flexible enough that property names can be added over time to a given stream, and NULL (empty) values are allowed (and stored efficiently) if a datum doesn't always contain values for all properties.
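
As a sketch of that flexibility, a newly seen property name could simply be appended to the stream metadata (the frequency property here is hypothetical):

```sql
-- Register a new instantaneous property name for the stream; older rows simply
-- have no corresponding array element for it.
UPDATE da_datm_meta
SET names_i = array_append(names_i, 'frequency')
WHERE stream_id = 'a4840264-2b5b-4673-83f7-66a404d6faf0';
```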
The third problem of duplicate statistics is solved by simply not storing the statistics twice. Additionally, the property names are not stored with the statistics, so they are not duplicated there either.
The fourth problem of the duplicate date column is solved by dropping the local date column and relying on joining to the da_datm_meta.time_zone stream metadata column to translate the timestamp values into node-local dates.
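
A minimal sketch of that translation, assuming the illustrative tables above:

```sql
-- Derive a node-local date from the timestamp and the stream's time zone.
SELECT d.ts_start,
       (d.ts_start AT TIME ZONE m.time_zone)::date AS local_date
FROM agg_datm_hourly d
JOIN da_datm_meta m ON m.stream_id = d.stream_id;
```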
The fifth problem of the “last seen” status property values is solved by calculating the “most seen” value instead.
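
For example, PostgreSQL's mode() ordered-set aggregate can produce a "most seen" value; a sketch against a hypothetical raw_datum table with a plain status column might look like:

```sql
-- Compute the most frequently seen status value within the hour, rather than
-- simply keeping whatever value happened to arrive last.
SELECT mode() WITHIN GROUP (ORDER BY status) AS status
FROM raw_datum
WHERE node_id = 123
  AND source_id = '/meter/1'
  AND ts >= '2021-01-01 12:00:00+13'
  AND ts <  '2021-01-01 13:00:00+13';
```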
Overall, this change significantly reduced the size of the datum aggregate tables in SolarNetwork, and has resulted in notably faster query times.