Column HWM#
What is Column HWM?#
Sometimes it’s necessary to read only changed rows from a table.
For example, there is a table with the following schema:
CREATE TABLE mydata (
id integer PRIMARY KEY DEFAULT nextval('serial'),
data varchar(40)
);
id
column is an autoincremental field, that’s why its values are increasing all the time
when new rows are added into the table.
For example, yesterday table contained rows:
1000, "some data"
1001, "some data1"
1002, "some data2"
...
1009, "some data9"
1010, "some data10"
Maximum value in id
column is 1010
.
Today it contains some new rows:
...
1009, "some data9"
1010, "some data10"
----- new rows ----
1011, "some data11"
1012, "some data12"
...
1020, "some data20"
So we can use value of id
column to detect rows added since yesterday, e.g.
using query like `SELECT * FROM table WHERE id > 1010
.
The id
column here is called High WaterMark
or HWM
for short.
It is used by different strategies to implement some complex logic
of filtering source data.
Supported types#
HWM column have to be one of the following types:
Integer (of any length, like
INTEGER
,SHORT
,LONG
).Decimal without fractional part (Oracle specific, because
INTEGER
type here isNUMERIC
)Date
Datetime (a.k.a
TIMESTAMP
)
Other column types (like DOUBLE
or VARCHAR
) are not supported.
But it is possible to use some expression as HWM, like CAST(column as TYPE)
or func(column) as hwm
.
Restrictions#
HWM column values cannot decrease over time, they can only increase
HWM column cannot contain
NULL
values because they cannot be tracked properly, and thus will be skipped
Recommendations#
It is highly recommended for HWM column values to be unique, like a table primary key.
Otherwise, new rows with the same column value as stored in HWM will be skipped, because they will not match the WHERE clause.
It is recommended to add an index for HWM column.
Filtering by non-indexed rows requires sequential scan for all rows in a table, with complexity
O(n)
, which can take a lot of time on large tables, especially if RDBMS is not distributed.Filtering by row index has complexity
O(log n)
(for B-tree), which is very effective.Note
Filtering performance depends on the index implementation and internal RDBMS optimization engine.
Note
Use indexes which support
<
and>
operations. Hash-based indexes, which support only=
andIN
operations, cannot be used