Class PDFMarkedContentExtractor


public class PDFMarkedContentExtractor extends LegacyPDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
  • Field Details

    • suppressDuplicateOverlappingText

      private boolean suppressDuplicateOverlappingText
    • markedContents

      private final List<PDMarkedContent> markedContents
    • currentMarkedContents

      private final Deque<PDMarkedContent> currentMarkedContents
    • characterListMapping

      private final Map<String,List<TextPosition>> characterListMapping
  • Constructor Details

    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor() throws IOException
      Instantiates a new PDFMarkedContentExtractor object.
      Throws:
      IOException
    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor(String encoding) throws IOException
      Constructor. Will apply encoding-specific conversions to the output text.
      Parameters:
      encoding - The encoding that the output will be written in.
      Throws:
      IOException
  • Method Details

    • isSuppressDuplicateOverlappingText

      public boolean isSuppressDuplicateOverlappingText()
      Returns:
      the suppressDuplicateOverlappingText setting.
    • setSuppressDuplicateOverlappingText

      public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
      By default, the class attempts to remove duplicate text that overlaps other text. Word, for example, paints the same character several times to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but extraction will be faster.
      Parameters:
      suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
    • within

      private boolean within(float first, float second, float variance)
      Determines whether two floating-point numbers are within a specified variance of each other.
      Parameters:
      first - The first number to compare.
      second - The second number to compare.
      variance - The allowed variance.
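The variance check described above can be sketched as a plain comparison. This is a minimal illustration, not the library's confirmed implementation: the class name VarianceCheck is hypothetical, and treating the boundary as exclusive (strict comparisons) is an assumption that the documentation does not settle.

```java
final class VarianceCheck {

    // Hypothetical stand-in for the private within(...) helper.
    // Assumption: "within" means strictly inside (first - variance, first + variance).
    static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    public static void main(String[] args) {
        // 10.4 lies inside the open interval (9.5, 10.5).
        System.out.println(within(10.0f, 10.4f, 0.5f));
        // 11.0 lies outside that interval.
        System.out.println(within(10.0f, 11.0f, 0.5f));
    }
}
```

A tolerance check of this kind is what makes it possible to treat two glyphs drawn at almost the same coordinates as one character.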
    • beginMarkedContentSequence

      public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
      Description copied from class: PDFStreamEngine
      Called when a marked content group begins.
      Overrides:
      beginMarkedContentSequence in class PDFStreamEngine
      Parameters:
      tag - indicates the role or significance of the sequence
      properties - optional properties
    • endMarkedContentSequence

      public void endMarkedContentSequence()
      Description copied from class: PDFStreamEngine
      Called when a marked content group ends.
      Overrides:
      endMarkedContentSequence in class PDFStreamEngine
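The begin and end callbacks, together with the markedContents list and the currentMarkedContents deque, suggest a stack-based way of tracking nested groups. The following is a sketch of that idea under stated assumptions: Node is a hypothetical stand-in for PDMarkedContent, tags are plain strings instead of COSName values, and the bookkeeping shown is one plausible arrangement, not the confirmed behavior of this class.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class MarkedContentNesting {

    // Hypothetical stand-in for PDMarkedContent: a tag plus nested children.
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    // Top-level groups, in the order they were opened.
    private final List<Node> topLevel = new ArrayList<>();
    // Groups that have been opened but not yet closed; the innermost is on top.
    private final Deque<Node> open = new ArrayDeque<>();

    // Called when a group begins: attach it to its parent, then make it current.
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            topLevel.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    // Called when a group ends: the innermost open group is closed.
    // An unbalanced end is ignored instead of throwing.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Node> topLevel() {
        return topLevel;
    }

    public static void main(String[] args) {
        MarkedContentNesting tracker = new MarkedContentNesting();
        tracker.begin("P");
        tracker.begin("Span");
        tracker.end();
        tracker.end();
        tracker.begin("H1");
        tracker.end();
        System.out.println(tracker.topLevel().size());
    }
}
```

Keeping only top-level groups in the result list, with nested groups reachable through their parents, means a caller can walk the structure as a tree.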
    • xobject

      public void xobject(PDXObject xobject)
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Overrides:
      processTextPosition in class LegacyPDFStreamEngine
      Parameters:
      text - The text to process.
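To illustrate what taking care of overlapping text can look like, here is a self-contained sketch of duplicate suppression keyed by character, loosely modeled on the Map<String,List<TextPosition>> shape of characterListMapping. Everything in it is an assumption made for illustration: Glyph is a hypothetical stand-in for TextPosition, and the fixed tolerance is an arbitrary choice, not the value the library uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OverlapFilter {

    // Hypothetical stand-in for TextPosition: a character and its coordinates.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Previously accepted glyphs, grouped by their character.
    private final Map<String, List<Glyph>> seen = new HashMap<>();
    // Glyphs that survived the filter, in arrival order.
    private final List<Glyph> kept = new ArrayList<>();
    // Assumed tolerance: how close two glyphs must be to count as the same one.
    private final float tolerance;

    OverlapFilter(float tolerance) {
        this.tolerance = tolerance;
    }

    // Returns true if the glyph was kept, false if it duplicates an earlier
    // glyph with the same character at nearly the same position.
    boolean offer(Glyph glyph) {
        List<Glyph> sameText = seen.computeIfAbsent(glyph.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            boolean sameX = Math.abs(other.x - glyph.x) < tolerance;
            boolean sameY = Math.abs(other.y - glyph.y) < tolerance;
            if (sameX && sameY) {
                return false;
            }
        }
        sameText.add(glyph);
        kept.add(glyph);
        return true;
    }

    List<Glyph> kept() {
        return kept;
    }

    public static void main(String[] args) {
        OverlapFilter filter = new OverlapFilter(1.0f);
        filter.offer(new Glyph("A", 10.0f, 20.0f));
        // Same character, slightly shifted: treated as a bold-effect repaint.
        filter.offer(new Glyph("A", 10.2f, 20.1f));
        System.out.println(filter.kept().size());
    }
}
```

Grouping by character keeps each lookup limited to earlier occurrences of the same character, but the per-glyph scan is still extra work, which is consistent with the note that disabling suppression improves performance.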
    • getMarkedContents

      public List<PDMarkedContent> getMarkedContents()