The problem is often the diversity of problems, and you can really only fix that by looking at them one by one and seeing what the possible fixes are. Some thoughts:
- Random page sizes can be detected automatically, but not necessarily fixed automatically, as it depends on how you want to deal with them. Is it fine to take what's on the page and scale it to fit on a letter-sized page, for example? Does it need to be cut to size? If you can formulate a strategy, it's likely that it can be implemented both manually and as a batch process.
- Things outside of the margins can be deleted, again if you can determine what the margins should be.
- You can't really do much with low-quality scans (well, you can upscale them, but the results aren't always going to be much better). You should be able to remove the invisible OCR'd text from the document, though.
The bottom line is that a number of these things can be detected in preflight, whether in Acrobat, pdfToolbox or PitStop, and some of them could be fixed automatically if you can come up with a strategy.
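To make the "scale to letter" idea concrete, here is a minimal sketch of the arithmetic only, assuming you have already read each page's width and height in PDF points (1/72 inch) from whatever tool you use. The function names and the 1-point tolerance are my own illustrative choices, not part of any of the products mentioned.

```python
# US Letter in PDF points (1 point = 1/72 inch): 8.5 x 11 inches.
LETTER = (612.0, 792.0)


def is_letter(width, height, tolerance=1.0):
    """True if the page is Letter-sized in either orientation.

    The tolerance (in points) is an assumed value to absorb rounding.
    """
    short, long_ = sorted((width, height))
    return (abs(short - LETTER[0]) <= tolerance
            and abs(long_ - LETTER[1]) <= tolerance)


def fit_to_letter(width, height):
    """Return (scale, dx, dy) to fit a page onto portrait Letter.

    The page is scaled uniformly (no distortion) and centred;
    dx and dy are the offsets in points after scaling.
    """
    if width <= 0 or height <= 0:
        raise ValueError("page dimensions must be positive")
    scale = min(LETTER[0] / width, LETTER[1] / height)
    dx = (LETTER[0] - width * scale) / 2
    dy = (LETTER[1] - height * scale) / 2
    return scale, dx, dy
```

For a page that is exactly twice Letter size (1224 x 1584 points), `fit_to_letter` gives a scale of 0.5 with no offsets. Whether scaling is acceptable at all, or whether the page should be cropped instead, is still the strategy decision described above.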
What you can't do in Acrobat, but can do in pdfToolbox, is set up a strategy where you go from solution to solution until the problem is solved. pdfToolbox is used by some people who need a "good-enough" file, for example to put in an archive. In those cases something like the following can be used:
- See if the file is good with preflight. Good? We're done.
- Re-distill the file (convert to PostScript and back to PDF).
- See if the file is good with preflight. Good? We're done.
- Convert each page in the PDF file to an image and put it back in the PDF.
- See if the file is good with preflight. Good? We're done.
- Not good now? Have to look at it manually.
This kind of thing can be automated and batched, so it works as a fallback scenario that keeps trying fixes until it finds one that works.
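The control flow of that fallback chain can be sketched roughly as follows. This is not pdfToolbox's API; the preflight check and the individual fixes are passed in as plain callables that you would have to supply yourself (for example as wrappers around whatever command-line tools you have). I am also assuming each fix is applied to the original file rather than to the output of the previous fix, which is a choice you may want to make differently.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical signatures: a check takes a file path and says whether the
# file passes preflight; a fix takes a path and returns the path of a new file.
Check = Callable[[str], bool]
Fix = Callable[[str], str]


def repair_with_fallbacks(
    path: str,
    is_good: Check,
    fixes: Iterable[Tuple[str, Fix]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each fix in order until the preflight check passes.

    Returns (path_of_good_file, name_of_fix_used).
    Returns (None, None) when nothing worked and the file
    needs to be looked at manually.
    """
    if is_good(path):
        return path, "unchanged"
    for name, fix in fixes:
        # Assumption: every fix starts from the original file.
        candidate = fix(path)
        if is_good(candidate):
            return candidate, name
    return None, None
```

You would call it with the steps in order of preference, e.g. `[("redistill", redistill), ("rasterize", rasterize)]`, where both functions are your own wrappers. Files that come back as `(None, None)` go on the pile for manual inspection.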
But essentially you would need to start looking at the individual files with problems and work out the best way to solve each particular problem. Given that they're legal documents you might not be able to share anything, but if you can, I'm certainly willing to take a look and see what is going wrong and how it can be fixed.