Class PDFMarkedContentExtractor


  • public class PDFMarkedContentExtractor
    extends LegacyPDFStreamEngine
    This is a stream engine that extracts the marked content of a PDF.
    • Field Detail

      • suppressDuplicateOverlappingText

        private final boolean suppressDuplicateOverlappingText
        See Also:
        Constant Field Values
      • markedContents

        private final java.util.List<PDMarkedContent> markedContents
      • currentMarkedContents

        private final java.util.Deque<PDMarkedContent> currentMarkedContents
      • characterListMapping

        private final java.util.Map<java.lang.String,java.util.List<TextPosition>> characterListMapping
    • Constructor Detail

      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor()
                                  throws java.io.IOException
        Instantiate a new PDFMarkedContentExtractor object.
        Throws:
        java.io.IOException
      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor(java.lang.String encoding)
                                  throws java.io.IOException
        Constructor. Will apply encoding-specific conversions to the output text.
        Parameters:
        encoding - The encoding that the output will be written in.
        Throws:
        java.io.IOException
    • Method Detail

      • within

        private boolean within(float first,
                               float second,
                               float variance)
        This will determine if two floating point numbers are within a specified variance of each other.
        Parameters:
        first - The first number to compare.
        second - The second number to compare.
        variance - The allowed variance.
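
A minimal sketch of the kind of check that within describes, assuming an open interval around the first number. The class name WithinSketch is hypothetical, and the actual private method may treat the boundaries differently.

```java
class WithinSketch {
    // Assumption: "within" means second lies strictly inside
    // the interval (first - variance, first + variance).
    static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    public static void main(String[] args) {
        System.out.println(within(10.0f, 10.4f, 0.5f)); // true
        System.out.println(within(10.0f, 11.0f, 0.5f)); // false
    }
}
```

With a strict comparison, a variance of zero never matches, even for identical values. A caller that wants exact matches to count would need an inclusive comparison instead.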
      • xobject

        public void xobject(PDXObject xobject)
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It also handles overlapping text, such as a character drawn more than once at the same position.
        Overrides:
        processTextPosition in class LegacyPDFStreamEngine
        Parameters:
        text - The text to process.
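
The overlap handling described above can be illustrated with a small standalone sketch. It assumes that a duplicate is a glyph with the same string as an earlier one, drawn within a fixed tolerance of its position. Glyph, OverlapSketch, process and the tolerance parameter are all hypothetical stand-ins; the real method works on TextPosition objects and may derive its tolerance differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OverlapSketch {
    // Hypothetical stand-in for TextPosition: a string and its position.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Glyphs seen so far, grouped by their string (mirrors the idea of characterListMapping).
    private final Map<String, List<Glyph>> seen = new HashMap<>();
    private final List<Glyph> accepted = new ArrayList<>();
    private final float tolerance; // assumed fixed; hypothetical parameter

    OverlapSketch(float tolerance) {
        this.tolerance = tolerance;
    }

    private static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    // Returns true if the glyph was kept, false if it was dropped as a duplicate.
    boolean process(Glyph glyph) {
        List<Glyph> sameText = seen.computeIfAbsent(glyph.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            if (within(other.x, glyph.x, tolerance) && within(other.y, glyph.y, tolerance)) {
                return false; // same string at nearly the same position
            }
        }
        sameText.add(glyph);
        accepted.add(glyph);
        return true;
    }

    List<Glyph> accepted() {
        return accepted;
    }
}
```

Grouping by string keeps the comparison cheap: a new glyph is checked only against earlier glyphs with the same text, not against every glyph on the page.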
      • getMarkedContents

        public java.util.List<PDMarkedContent> getMarkedContents()