File HWM

What is File HWM?

Sometimes it’s necessary to read/download only new files from a source folder.

For example, there is a folder with files:

$ hdfs dfs -ls /path

2MB 2023-09-09 10:13 /path/my/file123
4MB 2023-09-09 10:15 /path/my/file234

When a new file is added to this folder:

$ hdfs dfs -ls /path

2MB 2023-09-09 10:13 /path/my/file123
4MB 2023-09-09 10:15 /path/my/file234
5MB 2023-09-09 10:20 /path/my/file345  # new one

To download only new files, it is necessary to track which files have already been seen, and then filter the folder listing using the information saved by the previous run.

This technique is called High WaterMark, or HWM for short. Different strategies use it to implement complex filtering logic for source data.
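The idea can be illustrated with a minimal plain-Python sketch. It does not use the library's API: the function name `select_new_files` and the in-memory `hwm` set are hypothetical, and a real setup would persist the saved state between runs.

```python
def select_new_files(current_files, hwm):
    """Return files absent from the saved HWM, plus the updated HWM.

    `hwm` is a set of file paths seen in previous runs (hypothetical storage).
    """
    new_files = sorted(set(current_files) - hwm)
    updated_hwm = hwm | set(new_files)
    return new_files, updated_hwm


# State saved after the first run
hwm = {"/path/my/file123", "/path/my/file234"}

# Folder listing at the time of the second run
listing = ["/path/my/file123", "/path/my/file234", "/path/my/file345"]

new_files, hwm = select_new_files(listing, hwm)
print(new_files)  # ['/path/my/file345']
```

Only `/path/my/file345` is selected, and the updated set is what the next run would compare against.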

Supported types

There are several ways to track an HWM value:

  • Save the entire file list, and then select only files not present in this list (file_list)

  • Save the maximum modified time across all files, and then select only files with a modified time later than this value

  • If the file name contains an incrementing value, e.g. an id or datetime, parse it and save the maximum value across all files, then select only files with a higher value

  • and so on

Currently the only HWM type implemented for files is file_list. Other types can be implemented on demand.
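For comparison, the two alternatives from the list above, which are not implemented, could look roughly like the sketch below. This is plain Python written only for illustration: the function names, the `path -> modified time` mapping, and the assumption that the incrementing value is a run of digits at the end of the file name are all hypothetical.

```python
import re
from datetime import datetime


def select_by_modified_time(files, hwm):
    """files: dict of path -> modified time; hwm: last saved max time or None."""
    new_paths = sorted(p for p, mtime in files.items() if hwm is None or mtime > hwm)
    candidates = list(files.values()) + ([hwm] if hwm is not None else [])
    return new_paths, max(candidates, default=None)


def select_by_name_id(paths, hwm):
    """Assumes the incrementing value is the trailing digits of the path."""
    ids = {}
    for path in paths:
        match = re.search(r"(\d+)$", path)
        if match:
            ids[path] = int(match.group(1))
    new_paths = sorted(p for p, value in ids.items() if hwm is None or value > hwm)
    candidates = list(ids.values()) + ([hwm] if hwm is not None else [])
    return new_paths, max(candidates, default=None)


files = {
    "/path/my/file123": datetime(2023, 9, 9, 10, 13),
    "/path/my/file234": datetime(2023, 9, 9, 10, 15),
    "/path/my/file345": datetime(2023, 9, 9, 10, 20),
}

# HWM values as they would be saved after the first run (two files present)
by_time, time_hwm = select_by_modified_time(files, datetime(2023, 9, 9, 10, 15))
by_name, name_hwm = select_by_name_id(list(files), 234)
print(by_time, time_hwm)  # ['/path/my/file345'] 2023-09-09 10:20:00
print(by_name, name_hwm)  # ['/path/my/file345'] 345
```

Both variants store a single value instead of the whole file list. The trade-off in this sketch is that a strict "greater than" comparison skips any file whose modified time or id equals the saved value.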