How to locate italics in PDF files?

  • How to locate italics in PDF files?

    Does anyone know of a way to locate all italics in a PDF? We have a large number of legacy files that we're converting to digital: we OCR them, save as .rtf, place into InDesign, and format. That works well, but all formatting has to be re-applied to prep the files for import into our system, and the most time-consuming part is manually scanning every page for italics and applying them back to the text. The italics are scattered throughout, so you have to look closely, and that takes a lot of time.

    Or any other ideas? It might be possible to locate the italics in the .rtf files somehow, but those aren't the originals; the originals are PDFs in this case. For background, these are old PageMaker files. We were able to export most of them to .rtf and convert them in a fairly straightforward way, but for some that would not work because of font issues. We're working with many languages, Russian being this particular one.

    Thanks for any input

  • #2
    There is no simple way of locating (i.e., searching for) italicized text in a PDF file. Unlike an editable document (such as a Word document), PDF doesn't have attributes such as italic, bold, etc. associated with text. For that matter, unless you have a tagged PDF file, there isn't even any information about the document's logical structure in the PDF either!

    Conceivably, one could write software that scans the PDF file and looks for text formatted in fonts known to be italic faces (the font might have an italic attribute in its definition's header, or the word “italic” in its name, but neither is guaranteed). This would be a very non-trivial task! I know of no such existing application (at least not yet).

    - Dov
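
    As a rough illustration of the heuristic described above, here is a minimal Python sketch of the classification step only. It assumes you have already pulled each font's name and, optionally, its FontDescriptor /Flags value out of the PDF with whatever library you use; that extraction step is not shown. The function names are hypothetical. The italic flag is bit 7 of /Flags (value 64) in the PDF specification, and the six-letter prefix is the tag PDF producers add to subsetted fonts. As noted above, neither signal is guaranteed to be present.

```python
import re
from typing import Optional

# Bit 7 of the FontDescriptor /Flags entry (value 64) marks an italic face.
ITALIC_FLAG = 1 << 6

# Subsetted fonts carry a tag of six uppercase letters plus '+', e.g. 'ABCDEF+'.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# Name hints only; faces named 'Kursiv', 'Slanted', '-It', etc. are missed.
NAME_HINT = re.compile(r"italic|oblique", re.IGNORECASE)


def strip_subset_prefix(font_name: str) -> str:
    """Drop the subset tag so it cannot be mistaken for part of the name."""
    return SUBSET_PREFIX.sub("", font_name)


def looks_italic(font_name: str, flags: Optional[int] = None) -> bool:
    """Heuristic: True if the descriptor flag or the font name suggests italic."""
    if flags is not None and flags & ITALIC_FLAG:
        return True
    return bool(NAME_HINT.search(strip_subset_prefix(font_name)))
```

    Stripping the subset tag first matters because the tag is arbitrary uppercase letters, so a tag such as `ITALIC+` on an upright font would otherwise produce a false hit.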

    • #3
      Enfocus PitStop Pro to the rescue, with its support for finding fonts using a regular expression:



      One possible pair of regexes that tries to allow for leading/trailing text and an upper- or lowercase first letter (how robust these expressions are will depend on the font names):


      .*(O|o)blique.*

      .*(I|i)talic.*
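
      To see what these two patterns do and do not catch, they can be tried against a few font names outside PitStop. The Python sketch below assumes the pattern has to match the whole font name (which is what the leading and trailing `.*` suggest, though PitStop's exact matching rules are not confirmed here), and the font names are made-up examples.

```python
import re

# The two patterns from the post, applied as whole-name matches.
PATTERNS = [
    re.compile(r".*(O|o)blique.*"),
    re.compile(r".*(I|i)talic.*"),
]


def matches_post_patterns(font_name: str) -> bool:
    """True if either pattern matches the complete font name."""
    return any(p.fullmatch(font_name) for p in PATTERNS)


# Hypothetical font names, for illustration only.
names = ["Minion-Italic", "Helvetica-Oblique", "Arial-BoldMT", "FOO-ITALIC"]
hits = [n for n in names if matches_post_patterns(n)]
print(hits)  # ['Minion-Italic', 'Helvetica-Oblique']
```

      Only the case of the first letter is covered, so an all-caps name such as `FOO-ITALIC` slips through. If the tool's regex dialect supports a case-insensitive option, that would be one way to widen the match.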



      [Attached screenshots: regex.png, found.png]

      Stephen Marsh
      Last edited by Stephen Marsh; 03-18-2017, 08:25 PM.
      Comments are personal and my views may not be shared by my employer or partners.

      • #4
        OK, thanks. I did find a workaround of sorts: if I export to Word from Acrobat, I can search for italic formatting and locate it that way. Even though the fonts aren't right in our case, it still locates the italics without my having to visually scan every line.
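
        If the Word export is saved as .docx, the same search could in principle be scripted. The sketch below is one possible approach using only the Python standard library: it opens the .docx as a zip archive, reads `word/document.xml`, and lists runs that carry direct italic formatting (`w:i`). The function name is hypothetical, and it is an assumption that Acrobat's export marks italics as direct run formatting; italics applied through a character style would not be found this way.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def italic_runs(docx_path):
    """Return (paragraph_index, text) for each run with direct italic formatting."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []
    for index, para in enumerate(root.iter(f"{{{W}}}p")):
        for run in para.iter(f"{{{W}}}r"):
            italic = run.find("w:rPr/w:i", NS)
            if italic is None:
                continue
            # <w:i w:val="0"/> (or false/off) explicitly switches italic off.
            if italic.get(f"{{{W}}}val") in ("0", "false", "off"):
                continue
            text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
            if text:
                results.append((index, text))
    return results
```

        Calling `italic_runs("export.docx")` (a placeholder file name) gives a list of paragraph numbers and the italic text found in each, which can then be matched back to the layout.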
