Hunting down and exterminating uncompressed TIFFs

It seems that some of our constituients have not been paying overly much attention to the settings on their scanners. We have over 40Gb of black-and-white, text-only documents scanned at 24 BPP, uncompressed, consuming 10 Mb each!

This happened once before. My colleague Warren licensed a product called “2TIFF” to shrink the files in question. This works well, except in his case ALL of the images in an Application folder needed to be compressed. I only need to shrink SOME of them.

After much fooling around and wasting of time, I was able to use a win32 port of the UNIX “find” command to hunt down all of the large files, dump the list to a file, and then use this file as a source for 2TIFF. The big mess of images now occupies only about 30 Mb of space.

Here are the sommand syntax details:
> find.exe “I:OBJECTSPURCHASE_ORDERS” -size +3M -fprint bigfiles.txt
(searches the PURCHASE_ORDERS document tree for all files larger then 3 Mb, dumps results to the text file “bigfiles.txt)

> FOR /f %F in (bigfiles.txt) DO ( “C:Program Files2TIFF2tiff” s=%F d=%~dFshrink%~pF -namegen=”[name].[srcext]” -quantize8 -ct4 -cd4 -keepexif)
(Perform a loop operation. For each loop, set the next line in bigfiles.txt to the variable %F. Run the 2Tiff program using %F as the source file. Use shrink as the output directory (example: when %F=”c:objectsprocurement1163.bin, the output directory will be “c:shrinkProcurement1″).)

Here is what the 2Tiff arguments mean:
-namegen=”[name].[srcext]” -> The name of the destination file is the same as that of the source ([name] is a built in variable equal to the source file name. [srcext] equals the source file’s extention)
-quantize=8 -> sets the “quantization” level of the TIFF. This value effects the “sampling rate” and affects image quality. Eight is the maximum value, for best quality.
-ct4 -> Compression type “LZW” is used. This is the default type for color scans. We are using LZW rather than the standard “type 3” for B/W documents because tests showed that reducing these images to monochrome yielded very low quality in some cases. We are keeping some color information to allow anti-aliasing and thus better letter quality.
-cd4 -> Sets the color depth down to 4 BPP from the source 24 BPP. CD1 would be better, but as mentioned above, this results in poor readability of the destination TIFF.
-keepexif -> preserves EXIF tags in the destination file from the source. Probably there is no EXIF info in these files, but I thought we would keep it in case I am wrong.

Warren had used the “dither” switch, but IMNSHO this makes the target document look worse and also results in larger files.

Advertisements