File HWM
What is File HWM?
Sometimes it’s necessary to read/download only new files from a source folder.
For example, there is a folder with files:
$ hdfs dfs -ls /path
2MB 2023-09-09 10:13 /path/my/file123
4Mb 2023-09-09 10:15 /path/my/file234
When a new file is added to this folder:
$ hdfs dfs -ls /path
2MB 2023-09-09 10:13 /path/my/file123
4Mb 2023-09-09 10:15 /path/my/file234
5Mb 2023-09-09 10:20 /path/my/file345 # new one
To download only new files, it is necessary to track them somehow, and then filter the listing using the information saved by a previous run.
This technique is called High WaterMark, or HWM for short.
It is used by different strategies to implement some complex logic of filtering source data.
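The idea can be illustrated with a minimal sketch in plain Python. It is not the library's API: the function `select_new_files` and the in-memory `hwm` set are hypothetical names used only for illustration, and a real implementation would persist the HWM value between runs instead of keeping it in a variable.

```python
from typing import Iterable, List, Set


def select_new_files(listing: Iterable[str], hwm: Set[str]) -> List[str]:
    """Return paths from ``listing`` that are not recorded in ``hwm`` yet."""
    return sorted(path for path in listing if path not in hwm)


# Hypothetical HWM store: the set of file paths handled by previous runs.
hwm: Set[str] = set()

# First run: the HWM is empty, so every file counts as new.
first_listing = ["/path/my/file123", "/path/my/file234"]
new_files = select_new_files(first_listing, hwm)
hwm.update(new_files)  # remember what has been handled

# Second run: one more file has appeared in the folder.
second_listing = first_listing + ["/path/my/file345"]
new_files = select_new_files(second_listing, hwm)
hwm.update(new_files)

print(new_files)  # ['/path/my/file345']
```

Only the file added between the two runs is selected on the second run, because the two older paths are already recorded in the HWM.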
Supported types
There are several ways to track an HWM value:
Save the entire file list, and then select only files not present in this list (file_list)
Save the max modified time of all files, and then select only files with modified_time higher than this value
If the file name contains some incrementing value, e.g. id or datetime, parse it and save the max value of all files, then select only files with a higher value
and so on
Currently the only HWM type implemented for files is file_list. Other ones can be implemented on demand.
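For comparison, the two alternatives that are not implemented could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `FileInfo` class, the helper functions, and the rule that the incrementing value is the run of digits at the end of the file name are hypothetical and not part of the library.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical listing entry: a path and its modification time."""
    path: str
    modified_time: datetime


def newer_than(files: Iterable[FileInfo], hwm: Optional[datetime]) -> List[FileInfo]:
    """Select files modified strictly after the saved max modified time."""
    return [f for f in files if hwm is None or f.modified_time > hwm]


def name_id(path: str) -> int:
    """Parse the trailing digits of a path (assumed naming scheme)."""
    match = re.search(r"(\d+)$", path)
    if match is None:
        raise ValueError(f"no incrementing value in {path!r}")
    return int(match.group(1))


def with_id_above(files: Iterable[FileInfo], hwm: Optional[int]) -> List[FileInfo]:
    """Select files whose parsed value is higher than the saved max value."""
    return [f for f in files if hwm is None or name_id(f.path) > hwm]


files = [
    FileInfo("/path/my/file123", datetime(2023, 9, 9, 10, 13)),
    FileInfo("/path/my/file234", datetime(2023, 9, 9, 10, 15)),
    FileInfo("/path/my/file345", datetime(2023, 9, 9, 10, 20)),
]

# HWM values as they would have been saved after the first two files.
time_hwm = datetime(2023, 9, 9, 10, 15)
id_hwm = 234

by_time = [f.path for f in newer_than(files, time_hwm)]
by_id = [f.path for f in with_id_above(files, id_hwm)]
print(by_time)  # ['/path/my/file345']
print(by_id)    # ['/path/my/file345']
```

Both variants store a single value instead of a whole list, but the strict comparison means a file that arrives later with a modified time or id equal to the saved maximum would be skipped. A file list does not have this limitation, at the cost of a larger HWM value.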