Batch jobs are like unsung heroes. They run in the middle of the night when no one (well, almost no one, except that one support guy) is watching them. They take all the load on the system's resources and do the hard work, but people remember them only when they fail (oops, too cheesy... let's come to the point). While writing any batch job there are certain aspects that any system analyst needs to take care of. Following are a few that I found quite important. Note that these are in addition to having a proper document written about the job.
- Transactions: First and foremost, always plan your transactions properly. When you are loading records by the million you need to make a choice: do you commit after every record, or do you insert all records and commit once (or commit in batches somewhere in between)? If you are using Hibernate, set its properties for batch operations correctly. If it is plain JDBC, use addBatch and executeBatch properly (a minimal JDBC sketch follows this list).
- Logging: Logging is always required. Ensure that all important steps generate enough information in the log file, and that the proper logging level (trace, debug, info, warn or error) is used. If there are exception conditions, ensure that the message logged provides specific context, such as which file, record or step failed, instead of something generic like "Error occurred while getting data from server."
- Log file for each run: Ensure that the job creates a separate log file for each run. This makes it easy to compare the logs between runs, and any abnormality can be noticed at a quick glance (the logging sketch after this list shows one way to do this, along with the levels above).
- Externalising configurations: Changes are bound to happen, so try to externalise anything that can vary over time, e.g. the IP address/hostname/ID/password of the database connection, the URL to be hit for retrieving some information, etc. Since a job written once runs for years, the last thing anyone would expect is to recompile the code just to change a configuration (see the properties-file sketch after this list).
- Multiple entry points for the job: If your job contains some information that may need to be regenerated, or a certain piece of code needs to be run stand-alone, ensure that the code has an entry point for it and that this entry point is documented properly. For example: the main job, a part of the job that generates an encrypted password, and a part that just checks whether all configurations are correct (see the entry-point sketch after this list).
- Flexibility to rerun the job: There are chances that while the job is running some exceptional condition occurs and a certain part of the job is not executed, e.g. data is loaded from the legacy system into the relational DB, but the reports that need to be generated from the DB are not generated because the file system ran out of space. In such a scenario you do not want to run the entire job again; you want to run only a part of it. At the same time, as an architect you should also consider the case where some tables were loaded only partially: what best serves the business case? Do you want to delete all data loaded by the failed run, or do you want to start from the last point of failure? This is one of the most important aspects of any batch job (the checkpoint sketch after this list shows one simple approach).
- Flexibility to run the job for a particular period: There will be a time in the life of a batch job when it did not run for a certain period (say it failed for a couple of days and no one was able to fix it), and hence there is a gap in the data for that period. Considering this scenario, always ensure that your job has the flexibility to run for a specific start date to end date. No one would like to run the job manually once per missed date, date after date (see the date-range sketch after this list).
- Housekeeping of log files and data: If your job generates a log for each run, half the battle is won, but when your job runs for months and years it keeps creating logs, and they keep eating space on your NAS or your servers. You need to clean them up, so ensure that you have a proper mechanism to archive your logs; at the very least, move them into a zip file in a month-end job. The same holds true for data loaded into tables (or tables generated in the DB): once you don't need it, move it or archive it (see the archiving sketch after this list).
- Status update of the job: In most cases there will be another job depending on the status returned by this one. Even if there isn't when you are designing your system, it is highly possible that at a later stage of the project this will be needed, so always ensure that your job returns (notifies) the status of the run, and raises a clear flag when it fails. A few options: generating a zero-byte success or failure file, sending an MQ message, updating a flag in the DB, and so on (see the status-file sketch after this list).
- Where the data comes from and where it goes: The person designing this job needs to understand the end-to-end flow. Who is providing this data? In what formats? What happens if data sent once is lost; can we get the same message/file again, and if so, how? Who consumes the data this job has loaded? What sanity checks does the downstream system run on the data this job has stored?
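Below are a few quick sketches for the points above. They are illustrative, not production code: every table, file name, path and parameter in them is an assumption made up for this post. First, the transactions point in plain JDBC: insert with addBatch/executeBatch and commit once per batch instead of once per record or once for millions of rows.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchLoader {
    private static final int BATCH_SIZE = 1000; // assumed; tune for your DB and row size

    // The CUSTOMER table and the shape of 'records' are hypothetical
    static void load(Connection conn, List<String[]> records) throws SQLException {
        conn.setAutoCommit(false); // we decide the transaction boundaries, not the driver
        String sql = "INSERT INTO CUSTOMER (ID, NAME) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] rec : records) {
                ps.setString(1, rec[0]);
                ps.setString(2, rec[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // send the accumulated batch to the DB
                    conn.commit();     // one commit per batch, the middle ground
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // discard whatever the failed batch had staged
            throw e;
        }
    }
}
```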
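For the two logging points, a sketch using java.util.logging (any logging framework offers the same knobs): a separate, timestamped log file per run, levels used deliberately, and an error message that names the file and record instead of a generic phrase. The file names and row numbers are invented for illustration.

```java
import java.io.IOException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class JobLogging {
    static Logger newRunLogger() throws IOException {
        // A timestamp in the file name gives one log per run, easy to diff
        String stamp = LocalDateTime.now()
                .format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"));
        FileHandler fh = new FileHandler("logs/nightly-load-" + stamp + ".log"); // dir must exist
        fh.setFormatter(new SimpleFormatter());
        Logger log = Logger.getLogger("nightly-load");
        log.addHandler(fh);
        log.setLevel(Level.FINE); // FINE is roughly "debug"; INFO and up for normal runs
        return log;
    }

    static void sampleMessages(Logger log) {
        log.info("Step 2/5: loading CUSTOMER extract");           // progress at info
        log.fine("Parsed 15000 rows from customer_20240101.dat"); // detail at debug
        // Specific, actionable error, not "Error occurred while getting data from server."
        log.log(Level.SEVERE,
                "Row 4711 of customer_20240101.dat rejected: NAME exceeds 80 chars");
    }
}
```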
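For externalised configuration, the lowest-tech option that works: a .properties file shipped next to the jar so ops can edit it without a rebuild. The file name and keys below are assumptions.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class JobConfig {
    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        // job.properties lives outside the jar, editable without recompiling
        try (FileInputStream in = new FileInputStream("job.properties")) {
            cfg.load(in);
        }
        String dbUrl   = cfg.getProperty("db.url");   // e.g. jdbc:oracle:thin:@host:1521:SID
        String dbUser  = cfg.getProperty("db.user");
        String feedUrl = cfg.getProperty("feed.url"); // the URL the job pulls data from
        System.out.printf("DB %s as %s, feed %s%n", dbUrl, dbUser, feedUrl);
    }
}
```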
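For multiple entry points, one simple pattern is a mode argument on the single main class; the mode names below are hypothetical, matching the examples in the bullet.

```java
public class NightlyJob {
    public static void main(String[] args) {
        String mode = args.length > 0 ? args[0] : "run";
        switch (mode) {
            case "run":              runFullJob();          break; // the normal nightly run
            case "encrypt-password": encryptPassword(args); break; // stand-alone utility
            case "check-config":     checkConfig();         break; // verify config, then exit
            default:
                System.err.println("Usage: NightlyJob [run|encrypt-password|check-config]");
                System.exit(2);
        }
    }

    static void runFullJob()                { /* ... the whole job ... */ }
    static void encryptPassword(String[] a) { /* ... regenerate the encrypted password ... */ }
    static void checkConfig()               { /* ... try the DB connection, ping the URL ... */ }
}
```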
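For rerunnability, one approach among several (a sketch, assuming restart-from-last-failure is acceptable for the business case): record each completed step in a checkpoint file so a rerun skips work already done. A DB status table serves the same purpose.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class CheckpointedJob {
    private static final Path CHECKPOINT = Path.of("run.checkpoint"); // assumed location

    static boolean alreadyDone(String step) throws IOException {
        return Files.exists(CHECKPOINT)
                && Files.readAllLines(CHECKPOINT).contains(step);
    }

    static void markDone(String step) throws IOException {
        Files.writeString(CHECKPOINT, step + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Rerunning after the report step failed skips the (expensive) load step
        for (String step : List.of("load-legacy-data", "generate-reports")) {
            if (alreadyDone(step)) continue;
            // ... perform the step; if it throws, no checkpoint is written for it ...
            markDone(step);
        }
        Files.deleteIfExists(CHECKPOINT); // clean finish: the next run starts fresh
    }
}
```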
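For running over a particular period, a sketch that accepts an optional start and end date on the command line and replays the job once per business date; with no arguments it defaults to yesterday, the usual nightly case.

```java
import java.time.LocalDate;

public class DateRangeRun {
    public static void main(String[] args) {
        LocalDate start = args.length > 0 ? LocalDate.parse(args[0])
                                          : LocalDate.now().minusDays(1);
        LocalDate end   = args.length > 1 ? LocalDate.parse(args[1]) : start;
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            runForDate(d); // the whole job, scoped to one business date
        }
    }

    static void runForDate(LocalDate businessDate) {
        System.out.println("Processing business date " + businessDate);
        // ... load, transform and report for this single date ...
    }
}
```

So `java DateRangeRun 2024-01-03 2024-01-05` replays three missed days in one invocation instead of three manual runs.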
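For housekeeping, a sketch of a month-end step that zips log files older than roughly a month and deletes the originals; the directory, file pattern and cutoff are assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class LogArchiver {
    public static void main(String[] args) throws IOException {
        Path logDir = Path.of("logs");                             // assumed location
        Instant cutoff = Instant.now().minus(30, ChronoUnit.DAYS); // assumed retention
        Path zip = logDir.resolve("archive-" + Instant.now().getEpochSecond() + ".zip");

        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip));
             DirectoryStream<Path> logs = Files.newDirectoryStream(logDir, "*.log")) {
            for (Path f : logs) {
                if (Files.getLastModifiedTime(f).toInstant().isBefore(cutoff)) {
                    out.putNextEntry(new ZipEntry(f.getFileName().toString()));
                    Files.copy(f, out); // stream the log into the zip
                    out.closeEntry();
                    Files.delete(f);    // reclaim the space once archived
                }
            }
        }
    }
}
```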
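Finally, for the status update, the simplest of the options listed: a zero-byte success or failure marker file plus a process exit code, so a scheduler (or the next job) can branch on the outcome. The marker names are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class JobStatus {
    public static void main(String[] args) {
        try {
            runJob();
            touch("nightly-load.SUCCESS");
            System.exit(0); // schedulers also key off the exit code
        } catch (Exception e) {
            e.printStackTrace();
            try { touch("nightly-load.FAILURE"); } catch (IOException ignored) { }
            System.exit(1); // non-zero: failed, downstream jobs should hold off
        }
    }

    static void touch(String name) throws IOException {
        Path p = Path.of(name);
        Files.deleteIfExists(p);
        Files.createFile(p); // the zero-byte flag file
    }

    static void runJob() { /* ... the actual batch work ... */ }
}
```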