See Reproducible builds.
Task developers should strive to make their results reproducible. To achieve that, we recommend the following practices.
When a task runs incrementally, make sure to delete any previously created output resources that are no longer part of the result. This can happen when an input file or data is modified or removed, so an output file generated by an earlier run would no longer be generated by the task. The build system will not delete such output files automatically; the task must handle their removal itself.
You can use the SakerFile.remove() function to delete a file, and make sure to synchronize the parent directory to actually persist the deletion.
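As a sketch of the underlying idea using only plain JDK APIs (inside a saker.build task you would use SakerFile.remove() and synchronize the parent as described above), stale outputs can be removed by comparing the files that exist in the output directory against the set of outputs the current run produced. The class and method names below are hypothetical, and the sketch is non-recursive:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

public class StaleOutputCleaner {
    // Deletes entries in outputDir that are not part of the expected outputs
    // of the current run. Non-recursive sketch: nested directories with
    // contents would need their own recursive handling.
    public static void deleteStaleOutputs(Path outputDir, Set<Path> expectedOutputs)
            throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir)) {
            for (Path existing : stream) {
                if (!expectedOutputs.contains(existing)) {
                    Files.delete(existing);
                }
            }
        }
    }
}
```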
While file system directories usually don't convey any data themselves, it is still recommended to delete any stale empty directories.
In general, tasks should only delete files from explicit output locations or from the build directory. It is recommended not to delete stale outputs that reside outside the build directory of the build execution.
Tasks should not put any time-related information in the build output. If the build time, file modification times, or any other time-related information is conveyed in the result files, the build may not be bit-by-bit reproducible.
While in some cases this might be acceptable, we still recommend not to use timestamps, or use them judiciously.
An important use-case of timestamps is creating zip archives. Zip files may contain last modification time, access time, and creation time information, which can easily cause the archive bytes to differ when the build is executed at different times.
When creating zip files, make sure to set the time-related attributes of zip entries to a specific time. (Epoch 0, or any user-specified static value, should be acceptable.)
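A minimal sketch using the standard java.util.zip API: setting each entry's time to a constant value makes the archive bytes independent of when the build runs. The class name is illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ReproducibleZip {
    // Creates a single-entry zip archive where the entry has a fixed
    // modification time, so the output bytes do not depend on the build time.
    public static byte[] createZip(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(baos)) {
            ZipEntry entry = new ZipEntry(entryName);
            // Constant timestamp (epoch 0) instead of the current time.
            entry.setTime(0L);
            zos.putNextEntry(entry);
            zos.write(content);
            zos.closeEntry();
        }
        return baos.toByteArray();
    }
}
```

Creating the same archive twice, at different times, yields identical bytes.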
Tasks are strongly recommended to use collections that have a predictable iteration order. Predictable ordering can mitigate accidental reproducibility issues that surface when data is serialized in an unpredictable order. This may be the case when serializing streams, archiving zip files, and in other scenarios.
This means that instead of HashSet or related collections, task implementations should strive to use ConcurrentSkipListSet or other sorted collections. This requires the elements to be comparable, but that is usually achievable.
If the elements cannot be made comparable, the LinkedHashSet collection may provide a sufficient implementation, as it iterates in insertion order.
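A small illustration of the difference: a sorted set yields the same iteration order no matter what order the elements were inserted in, while a HashSet makes no such guarantee. The class and method names below are illustrative:

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

public class PredictableOrder {
    // Joins the elements in iteration order; with a sorted set the result
    // is independent of the insertion order.
    public static String join(Collection<String> c) {
        return String.join(",", c);
    }

    public static void main(String[] args) {
        // Two different insertion orders, identical iteration order.
        Set<String> s1 = new ConcurrentSkipListSet<>(List.of("b", "c", "a"));
        Set<String> s2 = new ConcurrentSkipListSet<>(List.of("c", "a", "b"));
        System.out.println(join(s1)); // a,b,c
        System.out.println(join(s2)); // a,b,c
    }
}
```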
When executing work in parallel and collecting the output into a common collection, make sure the elements are always iterated over in the same order. This often requires sorting the result collection in some way.
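A sketch of this pattern using standard Java concurrency utilities: the results arrive in completion order, which varies between runs, so the collection is sorted before it is returned. The class name and the per-element work are placeholders:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSorted {
    // Runs per-element work on a thread pool. Results are added in
    // completion order, which differs between runs, so the collection is
    // sorted at the end to make the final order deterministic.
    public static List<String> process(List<String> inputs) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (String in : inputs) {
                // Stand-in for the real per-element work of the task.
                pool.execute(() -> results.add(in.toUpperCase(Locale.ROOT)));
            }
        } finally {
            pool.shutdown();
        }
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<String> sorted = new ArrayList<>(results);
        Collections.sort(sorted);
        return sorted;
    }
}
```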
Using sorted collections can also significantly improve performance, as they can be deserialized more efficiently (see SerialUtils).
Working with collections sorted by path can be more efficient as well. The SakerPathFiles utility class provides various functions that present a view of a collection based on a given condition. E.g. finding the children of a directory path can be done much more efficiently in a sorted map, as the whole map doesn't need to be iterated over.
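The exact SakerPathFiles API is not shown here; the following generic sketch illustrates the underlying idea with a TreeMap of string paths. Because the map is sorted by path, all entries under a directory form a contiguous key range that can be extracted as a view without scanning the whole map. The class and method names are hypothetical:

```java
import java.util.NavigableMap;
import java.util.SortedMap;

public class SortedPathLookup {
    // Returns a view of all entries whose path lies under the given
    // directory. In a map sorted lexicographically by path, these keys are
    // contiguous, so the lookup is a range query instead of a full scan.
    public static SortedMap<String, String> entriesUnder(NavigableMap<String, String> files,
            String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        // '\uffff' sorts after any practical path character, closing the range.
        return files.subMap(prefix, true, prefix + '\uffff', false);
    }
}
```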