File HWM
What is File HWM?
Sometimes it’s necessary to read/download only new files from a source folder.
For example, there is a folder with files:
$ hdfs dfs -ls /path
2MB 2023-09-09 10:13 /path/my/file123
4Mb 2023-09-09 10:15 /path/my/file234
When a new file is added to this folder:
$ hdfs dfs -ls /path
2MB 2023-09-09 10:13 /path/my/file123
4Mb 2023-09-09 10:15 /path/my/file234
5Mb 2023-09-09 10:20 /path/my/file345 # new one
To download only new files, it is necessary to track them somehow, and then filter the folder content using the information saved by a previous run.
This technique is called High WaterMark or HWM for short.
Different strategies use it to implement more complex logic
for filtering source data.
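The idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the state from the previous run is a set of already seen paths kept in a JSON file, and the names `select_new_files` and `state_file` are hypothetical.

```python
import json
from pathlib import Path


def select_new_files(listing, state_file):
    """Return paths from `listing` that no previous run has seen, and save the new state.

    `select_new_files` and `state_file` are hypothetical names used for illustration.
    """
    state_file = Path(state_file)
    # State saved by the previous run; empty on the very first run.
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    # Keep only paths that were not present in the saved state.
    new_files = [path for path in listing if path not in seen]
    # Save the updated state so the next run can filter against it.
    state_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files
```

With the folder above, the first call would return both `file123` and `file234`, and a second call made after `file345` appears would return only `file345`.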
Supported types
There are several ways to track the HWM value:
* Save the list of file paths, and then select only files not present in this list - FileListHWM
* Save the max modified time (file.stat().st_mtime) of all files, and then select only files with a modified time higher than this value - FileModifiedTimeHWM
* If the file name contains some incrementing value, e.g. id or datetime, parse it and save the max value of all files, then select only files with a higher value - not implemented for now.
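The modified-time variant can be sketched for a local folder as follows. This is a simplified illustration, not the actual FileModifiedTimeHWM class: the function name `select_modified_after` is hypothetical, and the HWM value is assumed to be a plain float timestamp.

```python
from pathlib import Path


def select_modified_after(folder, hwm_value):
    """Return files in `folder` modified after `hwm_value`, plus the new max modified time.

    `select_modified_after` is a hypothetical helper; `hwm_value` is a float timestamp.
    """
    new_files = []
    new_hwm = hwm_value
    for file in sorted(Path(folder).iterdir()):
        if not file.is_file():
            continue
        mtime = file.stat().st_mtime
        # Select only files strictly newer than the value saved by the previous run.
        if mtime > hwm_value:
            new_files.append(file)
            new_hwm = max(new_hwm, mtime)
    return new_files, new_hwm
```

Unlike a saved list of paths, this keeps a single number between runs. The strict comparison in this sketch means that a file whose modified time equals the saved value is skipped.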