-- Scheduler - Polling for files
Author Nigel Rivett
Scheduler and Polling for files
A lot of systems are built around files being available from a source system.
Note that this applies equally to the availability of data outside files e.g. due to a process completing.
Often this is implemented by liaising with the source system to agree a time for daily files.
The process is then kicked off to process that data at some time after this.
The process will complete all processing that involves that data, and there may be other time-scheduled processes following.
This causes a number of problems:
The source file can be late and so will not be processed. Hopefully this will be presented as an error.
Dependent processes may run not realising the data is incomplete.
If the data is urgent then all processes must be run manually.
If the file is not processed one day it may be overwritten the next and incremental data lost.
For more frequent processes the file may be locked for processing when the next file upload is attempted.
A lot of processing is included in a single control batch.
If there is an error, all the processing is repeated, and the whole batch needs to be coded to be repeatable, possibly incorporating out-of-order files.
Most of these issues can be avoided by separating out the process control.
Instead of a scheduled process to process a file, we control processing via an independent scheduler which just schedules tasks for a batch type.
One of these tasks is to detect a file and create a batch for that file.
Further processing is controlled by the scheduler for that batch and batch type.
Processing for a batch type is halted on failure and so files are always processed in order.
Alerting is implemented in the scheduler. This means that the called tasks just need to report an error; the alerting is central and easily changed.
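As a rough illustration of the control flow described above, here is a minimal Python sketch of a scheduler step for one batch type. It is not the author's implementation: the names run_batches, Task, alert and halted are hypothetical, and the real scheduler would keep its state in database tables rather than in a dictionary.

```python
from typing import Callable, Dict, List

# A task receives the batch identifier and raises an exception on error (assumption).
Task = Callable[[str], None]

def run_batches(batch_type: str,
                pending_batches: List[str],
                tasks: List[Task],
                alert: Callable[[str], None],
                halted: Dict[str, bool]) -> List[str]:
    """Run every task for each pending batch, in order.

    Returns the batches that completed. On the first failure the batch
    type is marked as halted, so later batches are not processed out of order.
    """
    completed: List[str] = []
    if halted.get(batch_type):
        return completed  # a previous failure has not been cleared yet
    for batch in pending_batches:
        try:
            for task in tasks:
                task(batch)
        except Exception as exc:
            # The task only reports the error; alerting is done centrally here.
            alert(f"{batch_type}/{batch} failed: {exc}")
            halted[batch_type] = True
            break
        completed.append(batch)
    return completed
```

Because alerting is a single function passed to the scheduler, changing how alerts are delivered does not touch any task.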
Generic tasks (most likely a few) are built to handle the file polling.
These are parameterised with the folder location and file mask.
These then poll the source folder for the presence of a file.
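A generic polling task of this kind might look something like the following Python sketch. The function name poll_folder is hypothetical, and shell-style wildcards (as handled by the standard fnmatch module) are assumed for the file mask.

```python
import fnmatch
import os
from typing import List

def poll_folder(folder: str, file_mask: str) -> List[str]:
    """Return the names of files in folder that match file_mask, sorted by name."""
    matches: List[str] = []
    with os.scandir(folder) as entries:
        for entry in entries:
            # Ignore sub-folders; only plain files are candidates for a batch.
            if entry.is_file() and fnmatch.fnmatch(entry.name, file_mask):
                matches.append(entry.name)
    return sorted(matches)
```

The same task can then serve any source by changing only the two parameters, for example poll_folder(folder, "sales_*.csv").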
If the file has a unique name (preferably including a timestamp) then a batch is created for that file.
If the file does not include a timestamp then it should be copied or renamed to include a timestamp.
I recommend suffixing yyyymmdd_hhmmss to the name; this is the filename which is used to create a batch.
I would suggest polling at intervals of no more than 5 minutes, and usually more frequently.
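The renaming step could be sketched in Python as below. The function names are hypothetical, and placing the timestamp before the file extension (rather than after it) is an assumption made for this example.

```python
from datetime import datetime
from pathlib import Path

def timestamped_name(filename: str, when: datetime) -> str:
    """Build a name with a yyyymmdd_hhmmss suffix placed before the extension."""
    p = Path(filename)
    return f"{p.stem}_{when.strftime('%Y%m%d_%H%M%S')}{p.suffix}"

def rename_with_timestamp(folder: str, filename: str, when: datetime) -> str:
    """Rename the file in place and return the new name used to create the batch."""
    new_name = timestamped_name(filename, when)
    (Path(folder) / filename).rename(Path(folder) / new_name)
    return new_name
```

For example, timestamped_name("sales.csv", datetime(2024, 3, 5, 14, 30, 7)) gives "sales_20240305_143007.csv".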
This gives a good indication of when the file is available independent of when it is processed.
To create a batch an entry is added to a table. The structure of this table is something like:
Create table FileProcess
(
FileProcess_id int identity primary key ,
Entity varchar(100) not null ,
Filename varchar(1000) null ,
Status varchar(100) not null ,
Data varchar(max) null ,
z_inserted datetime not null default getdate() ,
z_updated datetime not null default getdate()
)
Now this file can be processed by a separate process at any time.
Often this involves a time window but the actual task that carries out the process does not need any knowledge of this.
Note that we have added the file with an entity attribute rather than a batch type.
This is to allow several batch types to be triggered by the same file source.
Often the entity will correspond to a batch type and be given the same name.
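To show why the entity is kept separate from the batch type, here is a small Python sketch. The mapping ENTITY_BATCH_TYPES, its contents, and the function batches_for_file are all hypothetical; in practice this mapping would most likely live in a configuration table.

```python
from typing import Dict, List, Tuple

# Hypothetical configuration: one file source (entity) can trigger several batch types.
ENTITY_BATCH_TYPES: Dict[str, List[str]] = {
    "Sales": ["Sales", "SalesReconciliation"],
}

def batches_for_file(entity: str, filename: str,
                     mapping: Dict[str, List[str]] = ENTITY_BATCH_TYPES
                     ) -> List[Tuple[str, str]]:
    """Return one (batch_type, filename) pair per batch type triggered by the entity.

    An entity with no explicit mapping is treated as a batch type of the same name.
    """
    return [(batch_type, filename) for batch_type in mapping.get(entity, [entity])]
```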
There are a number of different situations to consider with processing source files.
Especially with FTP, a file is not necessarily complete when it is detected, even if it is unlocked.
It is better for the delivery system to either rename the file after delivery or to create a control file when the source file is available.
Those that are included here are:
Processing file when detected
Processing file after a delay period
Processing file when a control file is detected
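The three situations listed above could be expressed as a single readiness check, sketched here in Python. The function name, the mode names, and the ".ctl" control-file suffix are hypothetical. Measuring the delay from the file's last modification time is also an assumption; it could equally be measured from the time of first detection.

```python
import os
import time
from typing import Optional

def is_ready(path: str, mode: str, delay_seconds: float = 0.0,
             control_suffix: str = ".ctl", now: Optional[float] = None) -> bool:
    """Decide whether a detected file can be handed over for processing.

    mode is one of:
      "immediate" - ready as soon as the file is detected
      "delay"     - ready once the file has been unchanged for delay_seconds
      "control"   - ready once a matching control file exists
    """
    if not os.path.isfile(path):
        return False
    if mode == "immediate":
        return True
    if mode == "delay":
        current = time.time() if now is None else now
        return current - os.path.getmtime(path) >= delay_seconds
    if mode == "control":
        return os.path.isfile(path + control_suffix)
    raise ValueError(f"unknown mode: {mode}")
```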
An SP polls the source folder with a filemask and, for each file detected, calls an SP s_FileProcess for insert.
When complete, it again calls the SP s_FileProcess for a file list.
This then returns all filenames that need to be processed, in order.
In this way the logic for processing the files is in s_FileProcess and can be reused by source type.
This method is also useful if the file detection is via SSIS packages as it moves control out of the package.
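The insert and list calls can be illustrated with the following Python sketch, which uses an in-memory SQLite table as a stand-in for the SQL Server FileProcess table. It is not the author's s_FileProcess procedure: the function names, the 'Pending' status value, the duplicate check, and ordering by filename (which is chronological when names carry a timestamp suffix) are all assumptions.

```python
import sqlite3
from typing import List

def create_table(conn: sqlite3.Connection) -> None:
    """Create a SQLite approximation of the FileProcess table."""
    conn.execute("""
        CREATE TABLE FileProcess (
            FileProcess_id INTEGER PRIMARY KEY AUTOINCREMENT,
            Entity     TEXT NOT NULL,
            Filename   TEXT NULL,
            Status     TEXT NOT NULL,
            Data       TEXT NULL,
            z_inserted TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            z_updated  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )""")

def file_process_insert(conn: sqlite3.Connection, entity: str, filename: str) -> bool:
    """Record a detected file once; return True if a new row was added."""
    exists = conn.execute(
        "SELECT 1 FROM FileProcess WHERE Entity = ? AND Filename = ?",
        (entity, filename)).fetchone()
    if exists:
        return False  # already recorded by an earlier poll
    conn.execute(
        "INSERT INTO FileProcess (Entity, Filename, Status) VALUES (?, ?, 'Pending')",
        (entity, filename))
    return True

def file_process_list(conn: sqlite3.Connection, entity: str) -> List[str]:
    """Return the filenames still to be processed for an entity, in order."""
    rows = conn.execute(
        "SELECT Filename FROM FileProcess "
        "WHERE Entity = ? AND Status = 'Pending' ORDER BY Filename",
        (entity,)).fetchall()
    return [row[0] for row in rows]
```

Because polling may see the same file on several passes, the insert is written so that repeating it is harmless.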