Easiest way to fix broken/corrupt/weird PDFs?

akalaray · Jun 29, 2020

bcr said:
Hi folks,

We receive lots of 'office' documents which we have to print, either through Command Workstation or a KM controller/driver.

We frequently get weird PDFs, stuff like: scanned documents with irregular page sizes, comment boxes sitting outside the page dimensions, weird fonts in foreign languages. This often makes the Fiery or the printer driver throw a wobbly and refuse to process them.

For our new printers I am looking for a simple way to 'fix' these PDFs so that they will print, and would welcome suggestions. Ricoh suggested PitStop Server, but it's quite expensive for something that will probably only be used once every few weeks or even months. It should also be easy to use. In the past I've sometimes been able to re-print them as PDFs, or print on a basic office printer and scan them back in. Quality is not an issue, as long as the documents are legible.

Try PitStop Pro - Single seat, much more affordable

Magnus59 · Jun 30, 2020

Get PitStop Pro. If your documents are coming from Office, use the built-in Preflight profile "Just make my Office documents work". This will fix 95% of your Office doc problems.

jwheeler · Jun 30, 2020

bcr said:
Hi folks,

We receive lots of 'office' documents which we have to print, either through Command Workstation or a KM controller/driver.

We frequently get weird PDFs, stuff like: scanned documents with irregular page sizes, comment boxes sitting outside the page dimensions, weird fonts in foreign languages. This often makes the Fiery or the printer driver throw a wobbly and refuse to process them.

For our new printers I am looking for a simple way to 'fix' these PDFs so that they will print, and would welcome suggestions. Ricoh suggested PitStop Server, but it's quite expensive for something that will probably only be used once every few weeks or even months. It should also be easy to use. In the past I've sometimes been able to re-print them as PDFs, or print on a basic office printer and scan them back in. Quality is not an issue, as long as the documents are legible.

We get these from our customers too...and we have KM printers. I have two quick fixes that work for 99% of the jobs and they don't require PitStop or any other third-party software, just Acrobat Pro. Go to the Advanced menu and select "Preflight". Assuming you're running digitally, select the "Digital Printing" category, then "Digital BW" or "Digital Color". Then select "Analyze and Fix". This usually does the trick. If it still doesn't work, then in the Adobe Acrobat Print dialogue box, click on "Advanced" (bottom left), then check the "Print As Image" box at the top. For small text, I usually change it to 600 dpi. It will take a bit longer to RIP this way, but it's essentially the same as rasterizing/flattening in Photoshop.
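To put a rough number on "a bit longer to RIP": the sketch below only does the arithmetic for how many pixels one page becomes at a given resolution. The A4 page size and the 8-bit grayscale assumption are illustrative choices, not something stated in the post, and real files are compressed, so treat the megabyte figure as an upper bound on raw data rather than a file size.

```python
# Rough arithmetic only: how big a page gets when it is printed as an image.
MM_PER_INCH = 25.4


def raster_size(width_mm, height_mm, dpi):
    """Return (width_px, height_px) for a page rendered at the given dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def uncompressed_megabytes(width_px, height_px, bytes_per_pixel=1):
    """Raw bitmap size before compression, in megabytes (10**6 bytes)."""
    return width_px * height_px * bytes_per_pixel / 1_000_000


if __name__ == "__main__":
    # Assumption: A4 (210 x 297 mm), 1 byte per pixel (8-bit grayscale).
    for dpi in (300, 600):
        w, h = raster_size(210, 297, dpi)
        print(dpi, "dpi:", w, "x", h, "px,",
              round(uncompressed_megabytes(w, h), 1), "MB raw")
```

Doubling the resolution from 300 to 600 dpi quadruples the pixel count per page, which is why the setting helps small text but slows processing on long jobs.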

tngcas · Jun 30, 2020

jwheeler said:
We get these from our customers too...and we have KM printers. I have 2 quick fixes that work for 99% of the jobs and they don't require PitStop or any other 3rd party software, just Acrobat Pro. Go to the Advanced menu and select "Preflight". Assuming you're running digitally, select the "Digital Printing" category, then "Digital BW" or "Digital Color". Then select "Analyze and Fix". This usually does the trick. If it still doesn't work, then in the Adobe Acrobat Print dialogue box, click on "Advanced" (bottom left), then check the "Print As Image" box at the top. For small text, I usually change it to 600dpi. It will take a bit longer to RIP this way, but it's essentially the same as rasterizing/flattening in photoshop.

That's helpful. Thanks!

bcr · Jul 1, 2020

jwheeler said:
We get these from our customers too...and we have KM printers. I have 2 quick fixes that work for 99% of the jobs and they don't require PitStop or any other 3rd party software, just Acrobat Pro. Go to the Advanced menu and select "Preflight". Assuming you're running digitally, select the "Digital Printing" category, then "Digital BW" or "Digital Color". Then select "Analyze and Fix". This usually does the trick. If it still doesn't work, then in the Adobe Acrobat Print dialogue box, click on "Advanced" (bottom left), then check the "Print As Image" box at the top. For small text, I usually change it to 600dpi. It will take a bit longer to RIP this way, but it's essentially the same as rasterizing/flattening in photoshop.

thanks!

michaelejahn · Jul 2, 2020

or, you can educate your customer. If they want a reliable printed result, they need to make a reliable PDF.

Save as PDF/X

bcr · Jul 2, 2020

michaelejahn said:
or, you can educate your customer. If they want a reliable printed result, they need to make a reliable PDF.

Save as PDF/X

Not everyone works in a commercial print environment. I mentioned above that we are printing evidence in lawsuits. The evidence comes how it comes.

Shawn · Jul 6, 2020

For our customers with legal documents we always find it helpful to remind them how a missing comma recently affected a nearby $5 million legal case ("Oxford Comma Dispute Is Settled as Maine Drivers Get $5 Million").

Then we suggest PDF/A as the industry-standard format for a legal document, to avoid any missing-font issues.

That usually does the trick. Lawyers are funny about language that way...

EEM · Jul 7, 2020

Puch said:
Callas pdfToolbox Desktop can be a solution. It was around 500 EUR/USD last time I checked.

I can endorse Puch's reply.

I have also been an Affinity user since the beginning. Affinity products do ask for fonts used in PDFs that most of the time open correctly in Acrobat/Reader and Illustrator without any warnings. Callas pdfToolbox Desktop is a good solution for you.

michaelejahn · Jul 9, 2020

bcr said:
Not everyone works in a commercial print environment. I mentioned above that we are printing evidence in lawsuits. The evidence comes how it comes.

If you want reliable PDFs that print, well, you might need to open them in an app that can save them out as PDF/X. I am quite familiar with scanned documents and horrible Word documents. You've gotta make 'em reliable.

PitStop is one way to accomplish that.

anthonyminchinton@yahoo.c · Jul 12, 2020

Either in Acrobat or Command Workstation there is an option to print as image (a check box).
Does this have the desired effect on output?

Joe · Jul 12, 2020

anthonyminchinton@yahoo.c said:
Either in Acrobat or Command Workstation there is an option to print as image (a check box).
Does this have the desired effect on output?

If the desired effect is lower quality then yes it will have that effect.

banksie_fourpees · Sep 10, 2020

@Puch you're absolutely correct. More information about this can be found here: Requirements for conversions to PDF. I've attached screenshots where it can be found. For a free 30-day trial of pdfToolbox Desktop or Server please click this link: callas | pdfToolbox | Desktop

bcr · Mar 22, 2021

jwheeler said:
We get these from our customers too...and we have KM printers. I have 2 quick fixes that work for 99% of the jobs and they don't require PitStop or any other 3rd party software, just Acrobat Pro. Go to the Advanced menu and select "Preflight". Assuming you're running digitally, select the "Digital Printing" category, then "Digital BW" or "Digital Color". Then select "Analyze and Fix". This usually does the trick. If it still doesn't work, then in the Adobe Acrobat Print dialogue box, click on "Advanced" (bottom left), then check the "Print As Image" box at the top. For small text, I usually change it to 600dpi. It will take a bit longer to RIP this way, but it's essentially the same as rasterizing/flattening in photoshop.

Hi Jwheeler,

Thanks for this. It has worked with some files, but many not.

What seems to be working now is that I am using Print Conductor (a batch printing application) to re-print them as images, but then using the Microsoft Print to PDF Driver instead of the Adobe Driver.

The Adobe driver was throwing up distiller errors such as "%%[ Error: ioerror; OffendingCommand: imageDistiller ]%%"

But so far, the basic MS driver seems to be doing the trick. The frustrating part is that I can't get it to automatically name the files when saving. I have to manually type in the filename each time it moves to the next file, which kinda defeats the object of doing it in a batch.

I've tried loads of different preflight tricks in Acrobat, and also experimented a bit with the trial of Pitstop Pro, but honestly there are so many errors thrown up by these documents, that I don't know where to begin really in selecting the right preflight profiles. I also can't afford to mess around too much with things that might introduce errors into the documents by converting fonts etc, as these documents are legal evidence and must be printed with as little modification as possible.

The main issues seem to be around:

- random page sizes (scanned files)
- things outside of the margins of the page (auto headers etc inserted by document management systems)
- OCR errors - the files are low quality scans and OCR has been applied before we get them, there is no actual text there though, just garbage and it confuses the hell out of Adobe.
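The second item in the list above (content sitting outside the page) is the kind of thing that can be detected mechanically once you have the rectangles. A minimal sketch, assuming some PDF library has already handed you the page box and the bounding boxes of annotations or stamped headers as (x0, y0, x1, y1) tuples in points; how you extract those is not shown, and the function names are made up for illustration.

```python
def normalize(rect):
    """Order the corners so that x0 <= x1 and y0 <= y1."""
    x0, y0, x1, y1 = rect
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)


def is_inside(page_box, rect):
    """True if rect lies entirely within page_box (both in points)."""
    px0, py0, px1, py1 = normalize(page_box)
    rx0, ry0, rx1, ry1 = normalize(rect)
    return px0 <= rx0 and py0 <= ry0 and rx1 <= px1 and ry1 <= py1


def objects_outside(page_box, rects):
    """Return the rectangles that stick out of (or sit beyond) the page."""
    return [r for r in rects if not is_inside(page_box, r)]


if __name__ == "__main__":
    a4 = (0, 0, 595, 842)            # approximate A4 in points
    boxes = [
        (50, 800, 545, 830),         # header inside the page
        (600, 700, 700, 750),        # comment box fully off the page
        (500, 10, 620, 40),          # stamp straddling the right edge
    ]
    print(objects_outside(a4, boxes))
```

This only flags candidates. Whether to delete, move or leave them is a separate decision, which matters when the documents are evidence.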

bcr · Mar 22, 2021

The other issue is that it takes a LONG time for the files to re-print using the MS driver, even though I am using a beast of a PC to do it.

Purple Penguin · Mar 22, 2021

bcr said:
Hi Jwheeler,

The main issues seem to be around:

- random page sizes (scanned files)
- things outside of the margins of the page (auto headers etc inserted by document management systems)
- OCR errors - the files are low quality scans and OCR has been applied before we get them, there is no actual text there though, just garbage and it confuses the hell out of Adobe.

The problem often is the diversity of problems, and you can really only fix that by looking at them one by one and seeing what the possible fixes are. Some thoughts:

- Random page sizes can be detected automatically, but not necessarily fixed automatically as it depends on how you want to deal with them. Is it fine to look at what's on the page and then scale that to be visible on a letter-sized document for example? Does it need to be cut to size? If you can formulate a strategy, it's likely that it can be implemented in a manual and batch way.
- Things outside of the margins can be deleted, again if you can determine what the margins should be.
- You can't really do much with low-quality scans (well, you can upscale them, but the results aren't always going to be much better). You should be able to remove the invisible OCR'd text from the document though.
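For the page-size point, one possible strategy is "rotate if the orientation is wrong, then shrink to fit". The sketch below shows only that arithmetic. The A4 target, the no-upscaling default and the function name are assumptions for illustration; applying the resulting scale and rotation to a real PDF would be done by whatever tool you use.

```python
A4_POINTS = (595.0, 842.0)  # approximate A4 portrait in points (1/72 inch)


def fit_to_sheet(page_w, page_h, sheet=A4_POINTS, margin=0.0,
                 allow_upscale=False):
    """Return (scale, rotate) so the page fits the sheet inside the margin.

    rotate is True when the page should be turned 90 degrees first because
    its orientation differs from the sheet's.
    """
    avail_w = sheet[0] - 2 * margin
    avail_h = sheet[1] - 2 * margin
    if avail_w <= 0 or avail_h <= 0 or page_w <= 0 or page_h <= 0:
        raise ValueError("page and printable area must have positive size")

    # Rotate when one is landscape and the other is portrait.
    rotate = (page_w > page_h) != (avail_w > avail_h)
    if rotate:
        page_w, page_h = page_h, page_w

    scale = min(avail_w / page_w, avail_h / page_h)
    if not allow_upscale:
        scale = min(scale, 1.0)  # never enlarge small pages
    return scale, rotate


if __name__ == "__main__":
    print(fit_to_sheet(1190, 1684))  # oversized scan, twice A4
    print(fit_to_sheet(842, 595))    # A4 landscape
```

Leaving small pages at their original size is a deliberate choice here, since enlarging a low-quality scan rarely improves it.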

The bottom line is that a number of these things can be detected in preflight, whether it's Acrobat, pdfToolbox or PitStop and some of them could be fixed automatically if you can come up with a strategy.
What you can't do in Acrobat, but can do in pdfToolbox, is set up a strategy where you go from solution to solution until the problem is solved. pdfToolbox is used by some people who need a "good-enough" file, for example to put in an archive. In those cases something like the following can be used:

- See if the file is good with preflight. Good? We're done.
- Re-distill the file (convert to PostScript and back to PDF).
- See if the file is good with preflight. Good? We're done.
- Convert each page in the PDF file to an image and put it back in the PDF.
- See if the file is good with preflight. Good? We're done.
- Not good now? Have to look at it manually.

This kind of thing can be automated / batched so it makes for kind of a fall back scenario that takes care of things until it finds something that works.
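The fall-back idea can be sketched as a small loop. This is not pdfToolbox's own mechanism, just a generic illustration in Python: `passes_preflight` and the entries in `fixes` are hypothetical callables you would wire up to whatever tools you actually have. One assumption to note: each fix is applied to the original file rather than to the output of the previous fix.

```python
def repair_with_fallbacks(pdf, passes_preflight, fixes):
    """Try progressively heavier fixes until one result passes preflight.

    pdf              -- the input (a path, bytes, whatever your callables take)
    passes_preflight -- callable returning True when a file is good enough
    fixes            -- ordered list of (name, callable) pairs, mildest first
    Returns (result, step_name); result is None if nothing worked.
    """
    if passes_preflight(pdf):
        return pdf, "original"

    for name, fix in fixes:
        try:
            candidate = fix(pdf)       # always start from the original
        except Exception:
            continue                   # this fix failed outright; try the next
        if passes_preflight(candidate):
            return candidate, name

    return None, "manual review"


if __name__ == "__main__":
    # Stand-ins for real tools, purely to show the control flow.
    def fake_check(doc):
        return doc.endswith("+raster")

    steps = [
        ("redistill", lambda doc: doc + "+ps"),
        ("rasterize", lambda doc: doc + "+raster"),
    ]
    print(repair_with_fallbacks("evidence.pdf", fake_check, steps))
```

Ordering the fixes from least to most destructive means a file is only rasterized when the gentler step did not help, and anything that still fails is handed back for a manual look.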

But essentially you would need to start looking at individual files with problems and then seeing what the best way is to solve that particular problem. Given that they're legal documents you might not be able to share anything, but if you can I'm certainly willing to take a look and see what is going wrong and how it can be fixed.

bcr · Mar 25, 2021

Purple Penguin said:
The problem often is the diversity of problems and you can really only fix that by looking at them one by one and seeing what the possible fixes are. Some thoughts:

Random page sizes can be detected automatically, but not necessarily fixed automatically as it depends on how you want to deal with them. Is it fine to look at what's on the page and then scale that to be visible on a letter-sized document for example? Does it need to be cut to size? If you can formulate a strategy, it's likely that it can be implemented in a manual and batch way.

Things outside of the margins can be deleted - again if you can determine what the margins should be.

You can't really do much with low-quality scans (well, you can upscale them and the results aren't always going to be much better). You should be able to remove the invisible OCR'd text from the document though.

The bottom line is that a number of these things can be detected in preflight, whether it's Acrobat, pdfToolbox or PitStop and some of them could be fixed automatically if you can come up with a strategy.
What you can't do in Acrobat for example, but you can do in pdfToolbox for example is a strategy where you go from solution to solution in order to solve the problem. pdfToolbox is used by some people who have to have a "good-enough" file for example to put in an archive. In those cases something like the following can be used:

See if the file is good with preflight. Good? We're done.

Re-distill the file (convert to PostScript and back to PDF).

See if the file is good with preflight. Good? We're done.

Convert each page in the PDF file to an image and put it back in the PDF.

See if the file is good with preflight. Good? We're done.

Not good now? Have to look at it manually.

This kind of thing can be automated / batched so it makes for kind of a fall back scenario that takes care of things until it finds something that works.

But essentially you would need to start looking at individual files with problems and then seeing what the best way is to solve that particular problem. Given that they're legal documents you might not be able to share anything, but if you can I'm certainly willing to take a look and see what is going wrong and how it can be fixed.

Thanks for this comprehensive response!

We use Ricoh Total Flow Prep and dump pdfs straight into it.

Most of what we print are binders of legal evidence with tabs. So we import the pdfs into Total Flow, and then scale and auto rotate to either A4 or A5, add some extra binding margin, and then add chapter headers and consecutive page numbering.

Most of the time this works fine. When we get problematic files it generally either goes wrong when processing in CWS or mid-print.

So I then try to work backwards to figure out the problems... Usually when we get one bad file from a case, we get multiple with similar problems.

The page size issue is only when we get a scanned document with freakishly huge dimensions unexpectedly.
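Catching those oversized pages before they reach the RIP is a simple comparison once you have the page dimensions. A minimal sketch, assuming the sizes have already been read from the PDF as (width, height) in points; the A3 threshold is an arbitrary illustration, not a recommendation.

```python
A3_POINTS = (842.0, 1191.0)  # approximate A3 in points; assumed cut-off


def oversized_pages(page_sizes, limit=A3_POINTS):
    """Return 1-based page numbers whose size exceeds the limit.

    Orientation is ignored: a landscape page is compared against the
    limit as if it were turned to match.
    """
    limit_short, limit_long = sorted(limit)
    flagged = []
    for number, (width, height) in enumerate(page_sizes, start=1):
        short, long_ = sorted((width, height))
        if short > limit_short or long_ > limit_long:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sizes = [(595, 842), (842, 595), (3000, 4200), (600, 1300)]
    print(oversized_pages(sizes))
```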

Biggest issue I think is corrupt OCR.

I've done some further testing this week on the problem files and have found:

- almost all problem docs will print as image successfully using the Microsoft Print to PDF driver out of Print Conductor
- it seems that they also convert OK using the MS driver without printing as images, but I have to try printing the files first to see if they will actually process
- some of the docs will print as image using the Adobe PDF print driver but not all.

- Today I tried "remove hidden information" in Acrobat on one of the problem files to try and remove the problematic OCR. It seemed to process OK afterwards, but I guess I'm worried about potentially losing some of the actual text of the document.

- Print as image is the 'safest' way of fixing the documents, I guess, from my point of view. I cannot afford to accidentally alter the text or content of the documents as it could have legal ramifications. But print as image makes the resulting hundreds or thousands of pages extremely slow to process in TF Prep and CWS.

Any further suggestions?

Purple Penguin · Mar 26, 2021

bcr said:
- Today I tried "remove hidden information" in Acrobat on one of the problem files to try and remove the problematic OCR. It seemed to process OK afterwards, but I guess I'm worried about potentially losing some of the actual text of the document.

- Print as image is the 'safest' way of fixing the documents, I guess, from my point of view. I cannot afford to accidentally alter the text or content of the documents as it could have legal ramifications. But print as image makes the resulting hundreds or thousands of pages extremely slow to process in TF Prep and CWS.

Any further suggestions?

Two things I think.

First of all, there are probably safer ways than "print as image". Acrobat Preflight and callas pdfToolbox give you a built-in fixup to convert certain or all pages in a PDF to an image, while it remains a PDF file ("Convert page content into image").

The second tip would be that both Acrobat Preflight and callas pdfToolbox have a preflight check called "Result file different from original (visual comparison)". If you add that into a preflight profile, the fixes you have in this profile will be done, but the tool will also compare the "before" and "after" file and will give you an error if there are visual differences. So you can use this to put an additional safeguard into a profile, and this might allow you to remove the invisible OCR text, for example, while making sure you're not losing any of the visible elements from the document.
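To illustrate the idea behind a visual comparison (this is not how Acrobat or pdfToolbox implement their check), here is a minimal sketch. It assumes both versions of a page have already been rendered at the same resolution to grids of 0-255 grayscale values by some renderer; the tolerance value and function name are arbitrary choices for the example.

```python
def visual_difference(before, after, tolerance=8):
    """Return the fraction of pixels that differ by more than tolerance.

    before/after -- equal-sized lists of rows of 0-255 gray values
    """
    same_shape = len(before) == len(after) and all(
        len(row_b) == len(row_a) for row_b, row_a in zip(before, after)
    )
    if not same_shape:
        raise ValueError("renders must have identical dimensions")

    total = sum(len(row) for row in before)
    changed = sum(
        1
        for row_b, row_a in zip(before, after)
        for b, a in zip(row_b, row_a)
        if abs(b - a) > tolerance
    )
    return changed / total if total else 0.0


if __name__ == "__main__":
    original = [[255, 255], [0, 0]]
    fixed = [[255, 255], [0, 255]]   # one pixel went from black to white
    print(visual_difference(original, fixed))
```

A fix such as removing invisible OCR text should give a difference of zero, so any non-zero result is a reason to keep the original and look at the file by hand.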

bcr · Mar 26, 2021

Purple Penguin said:
Two things I think.

First of all, there are probably safer ways than "print as image". Acrobat Preflight and callas pdfToolbox give you a built-in fixup to convert certain or all pages in a PDF to an image, while it remains a PDF file ("Convert page content into image").

The second tip would be that both Acrobat Preflight and callas pdfToolbox have a preflight check called "Result file different from original (visual comparison)". If you add that into a preflight profile, the fixes you have in this profile will be done, but the tool will also compare the "before" and "after" file and will give you an error if there are visual differences. So you can use this to put an additional safeguard into a profile, and this might allow you to remove the invisible OCR text, for example, while making sure you're not losing any of the visible elements from the document.

thanks so much! this is all VERY helpful, I'll go and do some testing!
