File HWM

What is File HWM?

Sometimes it’s necessary to read/download only new files from a source folder.

For example, there is a folder with files:

$ hdfs dfs -ls /path

2MB 2023-09-09 10:13 /path/my/file123
4MB 2023-09-09 10:15 /path/my/file234

When a new file is added to this folder:

$ hdfs dfs -ls /path

2MB 2023-09-09 10:13 /path/my/file123
4MB 2023-09-09 10:15 /path/my/file234
5MB 2023-09-09 10:20 /path/my/file345  # new one

To download only new files, it is necessary to track which files have already been handled, and then filter the listing using the information saved by the previous run.

This technique is called High WaterMark, or HWM for short. Different strategies use it to implement more complex filtering of source data.
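As a rough illustration of the idea (a sketch, not the library's actual implementation), the snippet below remembers the paths seen during a previous run and returns only the paths that are new. The function name `select_new_files` and the in-memory `seen` set are hypothetical and exist only for this example; a real setup would persist the state between runs.

```python
def select_new_files(current_paths, seen_paths):
    """Return paths present in the current listing but absent from the saved state."""
    return sorted(set(current_paths) - set(seen_paths))


# State saved after the first run (hypothetical in-memory stand-in for a persistent store)
seen = {"/path/my/file123", "/path/my/file234"}

# Listing obtained during the second run
listing = ["/path/my/file123", "/path/my/file234", "/path/my/file345"]

new_files = select_new_files(listing, seen)
print(new_files)

# After a successful download, remember the new files for the next run
seen.update(new_files)
```

Running the same selection again with the updated state yields no files, which is the behavior the tracking is meant to provide.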

Supported types

There are several ways to track an HWM value:

  • Save a list of file paths, and then select only files not present in this list - FileListHWM

  • Save the max modified time of all files, and then select only files with a modified time (file.stat().st_mtime) higher than this value - FileModifiedTimeHWM

  • If the file name contains an incrementing value, e.g. an id or datetime, parse it and save the max value across all files, then select only files with a higher value - not implemented for now.
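The modified-time approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the FileModifiedTimeHWM implementation: `FileInfo`, `filter_by_mtime` and the `ts` helper are hypothetical names, and modification times are plain Unix timestamps comparable to `st_mtime`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical container: a file path plus its modification time."""
    path: str
    mtime: float  # Unix timestamp, comparable to st_mtime


def filter_by_mtime(files, hwm_mtime):
    """Keep files modified strictly after the saved maximum.

    Returns the selected files and the HWM value to save for the next run.
    A saved value of None means there was no previous run, so all files pass.
    """
    new_files = [f for f in files if hwm_mtime is None or f.mtime > hwm_mtime]
    new_hwm = max((f.mtime for f in new_files), default=hwm_mtime)
    return new_files, new_hwm


def ts(hour, minute):
    """Helper for the example: timestamp on 2023-09-09 (UTC)."""
    return datetime(2023, 9, 9, hour, minute, tzinfo=timezone.utc).timestamp()


files = [
    FileInfo("/path/my/file123", ts(10, 13)),
    FileInfo("/path/my/file234", ts(10, 15)),
    FileInfo("/path/my/file345", ts(10, 20)),
]

# Max modified time saved by the previous run, which saw the first two files
saved_hwm = ts(10, 15)

new_files, new_hwm = filter_by_mtime(files, saved_hwm)
print([f.path for f in new_files])
```

Note that with a strict comparison, a file whose modified time equals the saved value is skipped. In this sketch that means a file written in the same instant as the last processed one would be missed, a trade-off that the path-list approach does not have.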