Class PDFTextStripper
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.LegacyPDFStreamEngine
      - org.apache.pdfbox.text.PDFTextStripper
- Direct Known Subclasses: AngleCollector, FilteredTextStripper, PDFText2HTML, PDFTextStripperByArea
public class PDFTextStripper extends LegacyPDFStreamEngine
This class takes a PDF document and strips out all of the text, ignoring the formatting. Please note: it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. The basic flow of this process is that we get a document and use a series of processXXX() functions that work on smaller and smaller chunks of the page. Eventually, we fully process each page and then print it.
-
-
Nested Class Summary
Nested Classes (modifier and type, class: description)
- private static class PDFTextStripper.LineItem: Internal marker class.
- private static class PDFTextStripper.PositionWrapper: Wrapper of TextPosition that adds flags to track status as line start and paragraph start positions.
- private static class PDFTextStripper.WordWithTextPositions: Internal class that maps strings to lists of TextPosition arrays.
-
Field Summary
Fields (modifier and type, field: description)
- private boolean addMoreFormatting
- private java.lang.String articleEnd
- private java.lang.String articleStart
- private float averageCharTolerance
- private java.util.List<PDRectangle> beadRectangles
- private java.util.Map<java.lang.String,java.util.TreeMap<java.lang.Float,java.util.TreeSet<java.lang.Float>>> characterListMapping
- protected java.util.ArrayList<java.util.List<TextPosition>> charactersByArticle: The charactersByArticle is used to extract text by article divisions.
- private int currentPageNo
- private static float defaultDropThreshold
- private static float defaultIndentThreshold
- protected PDDocument document
- private float dropThreshold
- private static float END_OF_LAST_TEXT_X_RESET_VALUE
- private PDOutlineItem endBookmark
- private int endBookmarkPageNumber
- private int endPage
- private static float EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
- private float indentThreshold
- private boolean inParagraph: True if we started a paragraph but haven't ended it yet.
- private static float LAST_WORD_SPACING_RESET_VALUE
- protected java.lang.String LINE_SEPARATOR: The platform's line separator.
- private java.lang.String lineSeparator
- private static java.lang.String[] LIST_ITEM_EXPRESSIONS: A list of regular expressions that match commonly used list item formats, e.g. bullets, numbers, letters, Roman numerals, etc.
- private java.util.List<java.util.regex.Pattern> listOfPatterns
- private static org.apache.commons.logging.Log LOG
- private static float MAX_HEIGHT_FOR_LINE_RESET_VALUE
- private static float MAX_Y_FOR_LINE_RESET_VALUE
- private static float MIN_Y_TOP_FOR_LINE_RESET_VALUE
- private static java.util.Map<java.lang.Character,java.lang.Character> MIRRORING_CHAR_MAP
- protected java.io.Writer output
- private java.lang.String pageEnd
- private java.lang.String pageStart
- private java.lang.String paragraphEnd
- private java.lang.String paragraphStart
- private boolean shouldSeparateByBeads
- private boolean sortByPosition
- private float spacingTolerance
- private PDOutlineItem startBookmark
- private int startBookmarkPageNumber
- private int startPage
- private boolean suppressDuplicateOverlappingText
- private static boolean useCustomQuickSort
- private java.lang.String wordSeparator
-
Constructor Summary
Constructors (constructor: description)
- PDFTextStripper(): Instantiate a new PDFTextStripper object.
-
Method Summary
Methods (modifier and type, method: description)
- private PDFTextStripper.WordWithTextPositions createWord(java.lang.String word, java.util.List<TextPosition> wordPositions): Used within normalize(List) to create a single PDFTextStripper.WordWithTextPositions entry.
- protected void endArticle(): End an article.
- protected void endDocument(PDDocument document): This method is available for subclasses of this class.
- protected void endPage(PDPage page): End a page.
- private void fillBeadRectangles(PDPage page)
- boolean getAddMoreFormatting(): This will tell if the text stripper should add some more text formatting.
- java.lang.String getArticleEnd(): Returns the string which will be used at the end of an article.
- java.lang.String getArticleStart(): Returns the string which will be used at the beginning of an article.
- float getAverageCharTolerance(): Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added.
- protected java.util.List<java.util.List<TextPosition>> getCharactersByArticle(): Character strings are grouped by articles.
- protected int getCurrentPageNo(): Get the current page number that is being processed.
- float getDropThreshold(): Returns the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.
- PDOutlineItem getEndBookmark(): Get the bookmark where text extraction should end, inclusive.
- int getEndPage(): This will get the last page that will be extracted.
- float getIndentThreshold(): Returns the multiple of whitespace character widths by which the current line start may be indented from the previous line start; beyond that, the current line start is considered to be a paragraph start.
- java.lang.String getLineSeparator(): This will get the line separator.
- protected java.util.List<java.util.regex.Pattern> getListItemPatterns(): Returns a list of regular expression Patterns representing different common list item formats.
- protected java.io.Writer getOutput(): The output stream that is being written to.
- java.lang.String getPageEnd(): Returns the string which will be used at the end of a page.
- java.lang.String getPageStart(): Returns the string which will be used at the beginning of a page.
- java.lang.String getParagraphEnd(): Returns the string which will be used at the end of a paragraph.
- java.lang.String getParagraphStart(): Returns the string which will be used at the beginning of a paragraph.
- boolean getSeparateByBeads(): This will tell if the text stripper should separate by beads.
- boolean getSortByPosition(): This will tell if the text stripper should sort the text tokens before writing to the stream.
- float getSpacingTolerance(): Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added.
- PDOutlineItem getStartBookmark(): Get the bookmark where text extraction should start, inclusive.
- int getStartPage(): This is the page that the text extraction will start on.
- boolean getSuppressDuplicateOverlappingText()
- java.lang.String getText(PDDocument doc): This will return the text of a document.
- java.lang.String getWordSeparator(): This will get the word separator.
- private java.lang.String handleDirection(java.lang.String word): Handles the LTR and RTL direction of the given words.
- private PDFTextStripper.PositionWrapper handleLineSeparation(PDFTextStripper.PositionWrapper current, PDFTextStripper.PositionWrapper lastPosition, PDFTextStripper.PositionWrapper lastLineStartPosition, float maxHeightForLine): Handles the line separator for a new line given the specified current and previous TextPositions.
- private void isParagraphSeparation(PDFTextStripper.PositionWrapper position, PDFTextStripper.PositionWrapper lastPosition, PDFTextStripper.PositionWrapper lastLineStartPosition, float maxHeightForLine): Tests the relationship between the last text position, the current text position and the last text position that followed a line separator to decide if the gap represents a paragraph separation.
- private java.util.regex.Pattern matchListItemPattern(PDFTextStripper.PositionWrapper pw): Returns the list item Pattern object that matches the text at the specified PositionWrapper or null if the text does not match such a pattern.
- protected static java.util.regex.Pattern matchPattern(java.lang.String string, java.util.List<java.util.regex.Pattern> patterns): Iterates over the specified list of Patterns until it finds one that matches the specified string.
- private float multiplyFloat(float value1, float value2)
- private java.util.List<PDFTextStripper.WordWithTextPositions> normalize(java.util.List<PDFTextStripper.LineItem> line): Normalize the given list of TextPositions.
- private java.lang.StringBuilder normalizeAdd(java.util.List<PDFTextStripper.WordWithTextPositions> normalized, java.lang.StringBuilder lineBuilder, java.util.List<TextPosition> wordPositions, PDFTextStripper.LineItem item): Used within normalize(List) to handle a TextPosition.
- private java.lang.String normalizeWord(java.lang.String word): Normalize certain Unicode characters.
- private boolean overlap(float y1, float height1, float y2, float height2)
- private static void parseBidiFile(java.io.InputStream inputStream): This method parses the bidi file provided as inputstream.
- void processPage(PDPage page): This will process the contents of a page.
- protected void processPages(PDPageTree pages): This will process all of the pages and the text that is in them.
- protected void processTextPosition(TextPosition text): This will process a TextPosition object and add the text to the list of characters on a page.
- private void resetEngine()
- void setAddMoreFormatting(boolean newAddMoreFormatting): Some additional text formatting will be added if addMoreFormatting is set to true.
- void setArticleEnd(java.lang.String articleEndValue): Sets the string which will be used at the end of an article.
- void setArticleStart(java.lang.String articleStartValue): Sets the string which will be used at the beginning of an article.
- void setAverageCharTolerance(float averageCharToleranceValue): Set the character width-based tolerance value that is used to estimate where spaces in text should be added.
- void setDropThreshold(float dropThresholdValue): Sets the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.
- void setEndBookmark(PDOutlineItem aEndBookmark): Set the bookmark where the text extraction should stop.
- void setEndPage(int endPageValue): This will set the last page to be extracted by this class.
- void setIndentThreshold(float indentThresholdValue): Sets the multiple of whitespace character widths by which the current line start may be indented from the previous line start; beyond that, the current line start is considered to be a paragraph start.
- void setLineSeparator(java.lang.String separator): Set the desired line separator for output text.
- protected void setListItemPatterns(java.util.List<java.util.regex.Pattern> patterns): Use to supply a different set of regular expression patterns for matching list item starts.
- void setPageEnd(java.lang.String pageEndValue): Sets the string which will be used at the end of a page.
- void setPageStart(java.lang.String pageStartValue): Sets the string which will be used at the beginning of a page.
- void setParagraphEnd(java.lang.String s): Sets the string which will be used at the end of a paragraph.
- void setParagraphStart(java.lang.String s): Sets the string which will be used at the beginning of a paragraph.
- void setShouldSeparateByBeads(boolean aShouldSeparateByBeads): Set if the text stripper should group the text output by a list of beads.
- void setSortByPosition(boolean newSortByPosition): The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen.
- void setSpacingTolerance(float spacingToleranceValue): Set the space width-based tolerance value that is used to estimate where spaces in text should be added.
- void setStartBookmark(PDOutlineItem aStartBookmark): Set the bookmark where text extraction should start, inclusive.
- void setStartPage(int startPageValue): This will set the first page to be extracted by this class.
- void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue): By default the text stripper will attempt to remove text that overlaps.
- void setWordSeparator(java.lang.String separator): Set the desired word separator for output text.
- protected void startArticle(): Start a new article, which is typically defined as a column on a single page (also referred to as a bead).
- protected void startArticle(boolean isLTR): Start a new article, which is typically defined as a column on a single page (also referred to as a bead).
- protected void startDocument(PDDocument document): This method is available for subclasses of this class.
- protected void startPage(PDPage page): Start a new page.
- private boolean within(float first, float second, float variance): Determines whether two floating point numbers are within a specified variance of each other.
- protected void writeCharacters(TextPosition text): Write the string in TextPosition to the output stream.
- private void writeLine(java.util.List<PDFTextStripper.WordWithTextPositions> line): Write a list of strings containing a whole line of a document.
- protected void writeLineSeparator(): Write the line separator value to the output stream.
- protected void writePage(): This will print the text of the processed page to "output".
- protected void writePageEnd(): Write something (if defined) at the end of a page.
- protected void writePageStart(): Write something (if defined) at the start of a page.
- protected void writeParagraphEnd(): Write something (if defined) at the end of a paragraph.
- protected void writeParagraphSeparator(): Writes the paragraph separator string to the output.
- protected void writeParagraphStart(): Write something (if defined) at the start of a paragraph.
- protected void writeString(java.lang.String text): Write a Java string to the output stream.
- protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions): Write a Java string to the output stream.
- void writeText(PDDocument doc, java.io.Writer outputStream): This will take a PDDocument and write the text of that document to the print writer.
- protected void writeWordSeparator(): Write the word separator value to the output stream.
-
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, showGlyph
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Detail
-
defaultIndentThreshold
private static float defaultIndentThreshold
-
defaultDropThreshold
private static float defaultDropThreshold
-
useCustomQuickSort
private static final boolean useCustomQuickSort
-
LOG
private static final org.apache.commons.logging.Log LOG
-
LINE_SEPARATOR
protected final java.lang.String LINE_SEPARATOR
The platform's line separator.
-
lineSeparator
private java.lang.String lineSeparator
-
wordSeparator
private java.lang.String wordSeparator
-
paragraphStart
private java.lang.String paragraphStart
-
paragraphEnd
private java.lang.String paragraphEnd
-
pageStart
private java.lang.String pageStart
-
pageEnd
private java.lang.String pageEnd
-
articleStart
private java.lang.String articleStart
-
articleEnd
private java.lang.String articleEnd
-
currentPageNo
private int currentPageNo
-
startPage
private int startPage
-
endPage
private int endPage
-
startBookmark
private PDOutlineItem startBookmark
-
startBookmarkPageNumber
private int startBookmarkPageNumber
-
endBookmarkPageNumber
private int endBookmarkPageNumber
-
endBookmark
private PDOutlineItem endBookmark
-
suppressDuplicateOverlappingText
private boolean suppressDuplicateOverlappingText
-
shouldSeparateByBeads
private boolean shouldSeparateByBeads
-
sortByPosition
private boolean sortByPosition
-
addMoreFormatting
private boolean addMoreFormatting
-
indentThreshold
private float indentThreshold
-
dropThreshold
private float dropThreshold
-
spacingTolerance
private float spacingTolerance
-
averageCharTolerance
private float averageCharTolerance
-
beadRectangles
private java.util.List<PDRectangle> beadRectangles
-
charactersByArticle
protected java.util.ArrayList<java.util.List<TextPosition>> charactersByArticle
The charactersByArticle is used to extract text by article divisions. For example, in a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of the charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are shown below:
- text before first article
- first article text
- text between first article and second article
- second article text
- text after second article
Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
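The division layout described above can be modeled in a few lines. This is only a sketch under the assumption that a page gets one division per bead plus one for the text before, between, and after the beads; the class and method names are hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models the article divisions of a page.
final class ArticleDivisionsSketch {

    // One division per bead, plus one for the text before, between, and after the beads.
    static int divisionCount(int beadCount) {
        return beadCount * 2 + 1;
    }

    // Human-readable label for each division, in list order.
    static List<String> describeDivisions(int beadCount) {
        List<String> labels = new ArrayList<>();
        if (beadCount == 0) {
            // No beads: everything falls into the single entry.
            labels.add("all text on the page");
            return labels;
        }
        int total = divisionCount(beadCount);
        for (int i = 0; i < total; i++) {
            if (i % 2 == 1) {
                labels.add("article " + (i / 2 + 1) + " text");
            } else if (i == 0) {
                labels.add("text before article 1");
            } else if (i == total - 1) {
                labels.add("text after article " + beadCount);
            } else {
                labels.add("text between article " + (i / 2) + " and article " + (i / 2 + 1));
            }
        }
        return labels;
    }
}
```

For the two-column newspaper case, `describeDivisions(2)` yields the five divisions listed above, and `describeDivisions(0)` yields the single entry that most PDFs end up with.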
-
characterListMapping
private java.util.Map<java.lang.String,java.util.TreeMap<java.lang.Float,java.util.TreeSet<java.lang.Float>>> characterListMapping
-
document
protected PDDocument document
-
output
protected java.io.Writer output
-
inParagraph
private boolean inParagraph
True if we started a paragraph but haven't ended it yet.
-
END_OF_LAST_TEXT_X_RESET_VALUE
private static final float END_OF_LAST_TEXT_X_RESET_VALUE
- See Also:
- Constant Field Values
-
MAX_Y_FOR_LINE_RESET_VALUE
private static final float MAX_Y_FOR_LINE_RESET_VALUE
- See Also:
- Constant Field Values
-
EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
private static final float EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
- See Also:
- Constant Field Values
-
MAX_HEIGHT_FOR_LINE_RESET_VALUE
private static final float MAX_HEIGHT_FOR_LINE_RESET_VALUE
- See Also:
- Constant Field Values
-
MIN_Y_TOP_FOR_LINE_RESET_VALUE
private static final float MIN_Y_TOP_FOR_LINE_RESET_VALUE
- See Also:
- Constant Field Values
-
LAST_WORD_SPACING_RESET_VALUE
private static final float LAST_WORD_SPACING_RESET_VALUE
- See Also:
- Constant Field Values
-
LIST_ITEM_EXPRESSIONS
private static final java.lang.String[] LIST_ITEM_EXPRESSIONS
A list of regular expressions that match commonly used list item formats, e.g. bullets, numbers, letters, Roman numerals, etc. Not meant to be comprehensive.
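To illustrate how a list of such expressions can be applied, here is a minimal sketch. The four patterns are invented for illustration and are not the actual contents of LIST_ITEM_EXPRESSIONS; the class name is hypothetical, and the lookup simply returns the first pattern that matches the whole string, or null if none does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not PDFBox code: first-match lookup over list item patterns.
final class ListItemPatternSketch {

    // Illustrative expressions only; the real array may differ.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("\\d+\\."),        // numbered item such as "1." or "12."
            Pattern.compile("\\(\\d+\\)"),     // parenthesized number such as "(3)"
            Pattern.compile("[a-zA-Z]\\."),    // lettered item such as "a." or "B."
            Pattern.compile("[\\u2022*\\-]")); // bullet character, asterisk, or hyphen

    // Returns the first pattern that matches the whole string, or null if none matches.
    static Pattern matchPattern(String string, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(string).matches()) {
                return pattern;
            }
        }
        return null;
    }
}
```

Because the first match wins, the order of the list matters when two expressions could match the same text.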
-
listOfPatterns
private java.util.List<java.util.regex.Pattern> listOfPatterns
-
MIRRORING_CHAR_MAP
private static java.util.Map<java.lang.Character,java.lang.Character> MIRRORING_CHAR_MAP
-
-
Method Detail
-
getText
public java.lang.String getText(PDDocument doc) throws java.io.IOException
This will return the text of a document. See writeText.
NOTE: The document must not be encrypted when coming into this method.
IMPORTANT: By default, text extraction is done in the same sequence as the text in the PDF page content stream. PDF is a graphic format, not a text format, and unlike HTML, it has no requirement that text on one page be rendered in a certain order. The order is the one that was determined by the software that created the PDF. To get text sorted from left to right and top to bottom, use setSortByPosition(boolean).
- Parameters:
doc - The document to get the text from.
- Returns:
- The text of the PDF document.
- Throws:
java.io.IOException - if the doc state is invalid or it is encrypted.
-
resetEngine
private void resetEngine()
-
writeText
public void writeText(PDDocument doc, java.io.Writer outputStream) throws java.io.IOException
This will take a PDDocument and write the text of that document to the print writer.- Parameters:
doc
- The document to get the data from.outputStream
- The location to put the text.- Throws:
java.io.IOException
- If the doc is in an invalid state.
-
processPages
protected void processPages(PDPageTree pages) throws java.io.IOException
This will process all of the pages and the text that is in them.- Parameters:
pages
- The pages object in the document.- Throws:
java.io.IOException
- If there is an error parsing the text.
-
startDocument
protected void startDocument(PDDocument document) throws java.io.IOException
This method is available for subclasses of this class. It will be called before processing of the document starts.
- Parameters:
document - The PDF document that is being processed.
- Throws:
java.io.IOException - If an IO error occurs.
-
endDocument
protected void endDocument(PDDocument document) throws java.io.IOException
This method is available for subclasses of this class. It will be called after processing of the document finishes.- Parameters:
document
- The PDF document that is being processed.- Throws:
java.io.IOException
- If an IO error occurs.
-
processPage
public void processPage(PDPage page) throws java.io.IOException
This will process the contents of a page.- Overrides:
processPage
in classLegacyPDFStreamEngine
- Parameters:
page
- The page to process.- Throws:
java.io.IOException
- If there is an error processing the page.
-
fillBeadRectangles
private void fillBeadRectangles(PDPage page)
-
startArticle
protected void startArticle() throws java.io.IOException
Start a new article, which is typically defined as a column on a single page (also referred to as a bead). This assumes that the primary direction of text is left to right. Default implementation is to do nothing. Subclasses may provide additional information.- Throws:
java.io.IOException
- If there is any error writing to the stream.
-
startArticle
protected void startArticle(boolean isLTR) throws java.io.IOException
Start a new article, which is typically defined as a column on a single page (also referred to as a bead). Default implementation is to do nothing. Subclasses may provide additional information.- Parameters:
isLTR
- true if primary direction of text is left to right.- Throws:
java.io.IOException
- If there is any error writing to the stream.
-
endArticle
protected void endArticle() throws java.io.IOException
End an article. Default implementation is to do nothing. Subclasses may provide additional information.- Throws:
java.io.IOException
- If there is any error writing to the stream.
-
startPage
protected void startPage(PDPage page) throws java.io.IOException
Start a new page. Default implementation is to do nothing. Subclasses may provide additional information.- Parameters:
page
- The page we are about to process.- Throws:
java.io.IOException
- If there is any error writing to the stream.
-
endPage
protected void endPage(PDPage page) throws java.io.IOException
End a page. Default implementation is to do nothing. Subclasses may provide additional information.
- Parameters:
page - The page that has just been processed.
- Throws:
java.io.IOException - If there is any error writing to the stream.
-
writePage
protected void writePage() throws java.io.IOException
This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.- Throws:
java.io.IOException
- If there is an error writing the text.
-
overlap
private boolean overlap(float y1, float height1, float y2, float height2)
-
writeLineSeparator
protected void writeLineSeparator() throws java.io.IOException
Write the line separator value to the output stream.- Throws:
java.io.IOException
- If there is a problem writing out the line separator to the document.
-
writeWordSeparator
protected void writeWordSeparator() throws java.io.IOException
Write the word separator value to the output stream.- Throws:
java.io.IOException
- If there is a problem writing out the word separator to the document.
-
writeCharacters
protected void writeCharacters(TextPosition text) throws java.io.IOException
Write the string in TextPosition to the output stream.- Parameters:
text
- The text to write to the stream.- Throws:
java.io.IOException
- If there is an error when writing the text.
-
writeString
protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions) throws java.io.IOException
Write a Java string to the output stream. The default implementation ignores the textPositions and just calls writeString(String).
- Parameters:
text - The text to write to the stream.
textPositions - The TextPositions belonging to the text.
- Throws:
java.io.IOException - If there is an error when writing the text.
-
writeString
protected void writeString(java.lang.String text) throws java.io.IOException
Write a Java string to the output stream.- Parameters:
text
- The text to write to the stream.- Throws:
java.io.IOException
- If there is an error when writing the text.
-
within
private boolean within(float first, float second, float variance)
Determines whether two floating point numbers are within a specified variance of each other.
- Parameters:
first - The first number to compare to.
second - The second number to compare to.
variance - The allowed variance.
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.- Overrides:
processTextPosition
in classLegacyPDFStreamEngine
- Parameters:
text
- The text to process.
-
getStartPage
public int getStartPage()
This is the page that the text extraction will start on. The pages start at page 1. For example in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.- Returns:
- Value of property startPage.
-
setStartPage
public void setStartPage(int startPageValue)
This will set the first page to be extracted by this class.- Parameters:
startPageValue
- New value of 1-based startPage property.
-
getEndPage
public int getEndPage()
This will get the last page that will be extracted. This is inclusive: for example, in a 5 page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE such that all pages of the PDF will be extracted.
- Returns:
- Value of property endPage.
-
setEndPage
public void setEndPage(int endPageValue)
This will set the last page to be extracted by this class.- Parameters:
endPageValue
- New value of 1-based endPage property.
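The 1-based, inclusive page range described for startPage and endPage can be checked with a small model. This is a sketch only; the class and method names are hypothetical and it does not touch a real document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not PDFBox code: which pages a startPage/endPage pair selects.
final class PageRangeSketch {

    // Pages are numbered from 1; both startPage and endPage are inclusive.
    static List<Integer> extractedPages(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (page >= startPage && page <= endPage) {
                pages.add(page);
            }
        }
        return pages;
    }
}
```

With the documented defaults (startPage 1, endPage Integer.MAX_VALUE) every page of a 5 page document is selected; a start page of 4 selects pages 4 and 5, and an end page of 2 selects pages 1 and 2.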
-
setLineSeparator
public void setLineSeparator(java.lang.String separator)
Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.- Parameters:
separator
- The desired line separator string.
-
getLineSeparator
public java.lang.String getLineSeparator()
This will get the line separator.- Returns:
- The desired line separator string.
-
getWordSeparator
public java.lang.String getWordSeparator()
This will get the word separator.- Returns:
- The desired word separator string.
-
setWordSeparator
public void setWordSeparator(java.lang.String separator)
Set the desired word separator for output text. The PDFBox text extraction algorithm will output the word separator if there is enough space between two words. By default a space character is used. If you need an accurate count of the characters that are found in a PDF document, you might want to set the word separator to the empty string.
- Parameters:
separator - The desired word separator string.
-
getSuppressDuplicateOverlappingText
public boolean getSuppressDuplicateOverlappingText()
- Returns:
- Returns the suppressDuplicateOverlappingText.
-
getCurrentPageNo
protected int getCurrentPageNo()
Get the current page number that is being processed.- Returns:
- A 1 based number representing the current page.
-
getOutput
protected java.io.Writer getOutput()
The output stream that is being written to.- Returns:
- The stream that output is being written to.
-
getCharactersByArticle
protected java.util.List<java.util.List<TextPosition>> getCharactersByArticle()
Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects, the inner lists will contain TextPosition objects.- Returns:
- A double List of TextPositions for all text strings on the page.
-
setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
By default the text stripper will attempt to remove text that overlaps. Word, for example, paints the same character several times in order to make it look bold. By setting this to false all text will be extracted, which means that certain sections will be duplicated, but extraction will be faster.
- Parameters:
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
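The effect of this setting can be illustrated with a simplified model. The sketch below assumes a fixed position tolerance and a plain list scan; PDFBox's own bookkeeping is not shown on this page and may differ, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not PDFBox code: dropping glyphs that repeat at nearly the same spot.
final class DuplicateSuppressionSketch {

    static final class Glyph {
        final String character;
        final float x;
        final float y;

        Glyph(String character, float x, float y) {
            this.character = character;
            this.x = x;
            this.y = y;
        }
    }

    // "Bo" painted twice with a small horizontal offset to fake a bold face.
    static List<Glyph> fakeBoldSample() {
        return Arrays.asList(
                new Glyph("B", 10f, 10f), new Glyph("o", 18f, 10f),
                new Glyph("B", 10.2f, 10f), new Glyph("o", 18.2f, 10f));
    }

    // True if second lies strictly inside (first - variance, first + variance).
    static boolean within(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    static String extract(List<Glyph> glyphs, boolean suppressDuplicates, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph glyph : glyphs) {
            boolean duplicate = false;
            if (suppressDuplicates) {
                // Same character at (nearly) the same position counts as a duplicate.
                for (Glyph earlier : kept) {
                    if (earlier.character.equals(glyph.character)
                            && within(earlier.x, glyph.x, tolerance)
                            && within(earlier.y, glyph.y, tolerance)) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(glyph);
                out.append(glyph.character);
            }
        }
        return out.toString();
    }
}
```

With suppression on, the sample yields the text once; with it off, the second pass of glyphs is emitted as well, which is the duplication the description warns about. Skipping the inner scan is also where the speed gain comes from in this model.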
-
getSeparateByBeads
public boolean getSeparateByBeads()
This will tell if the text stripper should separate by beads.- Returns:
- If the text will be grouped by beads.
-
setShouldSeparateByBeads
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
Set if the text stripper should group the text output by a list of beads. The default value is true.
- Parameters:
aShouldSeparateByBeads - The new grouping of beads.
-
getEndBookmark
public PDOutlineItem getEndBookmark()
Get the bookmark where text extraction should end, inclusive. Default is null.- Returns:
- The ending bookmark.
-
setEndBookmark
public void setEndBookmark(PDOutlineItem aEndBookmark)
Set the bookmark where the text extraction should stop.- Parameters:
aEndBookmark
- The ending bookmark.
-
getStartBookmark
public PDOutlineItem getStartBookmark()
Get the bookmark where text extraction should start, inclusive. Default is null.- Returns:
- The starting bookmark.
-
setStartBookmark
public void setStartBookmark(PDOutlineItem aStartBookmark)
Set the bookmark where text extraction should start, inclusive.- Parameters:
aStartBookmark
- The starting bookmark.
-
getAddMoreFormatting
public boolean getAddMoreFormatting()
This will tell if the text stripper should add some more text formatting.- Returns:
- true if some more text formatting will be added
-
setAddMoreFormatting
public void setAddMoreFormatting(boolean newAddMoreFormatting)
Some additional text formatting will be added if addMoreFormatting is set to true. Default is false.
- Parameters:
newAddMoreFormatting - Tell PDFBox to add some more text formatting.
-
getSortByPosition
public boolean getSortByPosition()
This will tell if the text stripper should sort the text tokens before writing to the stream.- Returns:
- true If the text tokens will be sorted before being written.
-
setSortByPosition
public void setSortByPosition(boolean newSortByPosition)
The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, so all bold or larger text, then make a second pass and write out the normal text.
The default is to not sort by position.
A PDF writer could choose to write each character in a different order. By default PDFBox does not sort the text tokens before processing them, for performance reasons.
- Parameters:
newSortByPosition - Tell PDFBox to sort the text positions.
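The difference between content stream order and visual order can be shown with a toy model. The sketch assumes each token carries the position of its start, with y growing downward from the top of the page, and sorts top to bottom and then left to right. The comparator PDFBox actually uses is not described on this page and is likely more involved; all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model, not PDFBox code: content stream order versus visual order.
final class SortByPositionSketch {

    // A text token and the position of its start; y is assumed to grow downward.
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Tokens in the order a writer might emit them: bold headings first, normal text second.
    static List<Token> sample() {
        return Arrays.asList(
                new Token("Heading", 10f, 10f),
                new Token("Summary", 10f, 50f),
                new Token("first", 10f, 30f),
                new Token("paragraph", 40f, 30f));
    }

    // Joins the tokens with spaces, optionally sorting them top to bottom, then left to right.
    static String join(List<Token> tokens, boolean sortByPosition) {
        List<Token> ordered = new ArrayList<>(tokens);
        if (sortByPosition) {
            ordered.sort(Comparator.comparingDouble((Token t) -> t.y)
                    .thenComparingDouble(t -> t.x));
        }
        StringBuilder out = new StringBuilder();
        for (Token token : ordered) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token.text);
        }
        return out.toString();
    }
}
```

Without sorting, the sample reads "Heading Summary first paragraph", which mirrors the two-pass writer from the description; with sorting, the second heading moves below the paragraph it visually follows.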
-
getSpacingTolerance
public float getSpacingTolerance()
Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.- Returns:
- The current tolerance / scaling factor
-
setSpacingTolerance
public void setSpacingTolerance(float spacingToleranceValue)
Set the space width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.- Parameters:
spacingToleranceValue
- tolerance / scaling factor to use
-
getAverageCharTolerance
public float getAverageCharTolerance()
Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.- Returns:
- The current tolerance / scaling factor
-
setAverageCharTolerance
public void setAverageCharTolerance(float averageCharToleranceValue)
Set the character width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.- Parameters:
averageCharToleranceValue
- average tolerance / scaling factor to use
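As a rough sketch of how the two tolerance values might interact, the example below inserts a space when the horizontal gap between two glyphs exceeds a threshold derived from the space width and from the average character width. Taking the smaller of the two thresholds is an assumption of this sketch, as are the class name, the method name and the numbers in `main`; only the idea that larger tolerances yield fewer spaces comes from the descriptions above.

```java
final class SpaceEstimateSketch {
    /**
     * Decides whether a space should be emitted between two adjacent glyphs.
     * All parameters are hypothetical inputs for this sketch, in the same units.
     */
    static boolean needsSpace(float endOfLastX, float startOfNextX,
                              float spaceWidth, float spacingTolerance,
                              float averageCharWidth, float averageCharTolerance) {
        float bySpaceWidth = spaceWidth * spacingTolerance;
        float byCharWidth = averageCharWidth * averageCharTolerance;
        // Assumption: the stricter (smaller) threshold decides.
        float allowedGap = Math.min(bySpaceWidth, byCharWidth);
        return startOfNextX - endOfLastX > allowedGap;
    }

    public static void main(String[] args) {
        // A gap of 3 units with small tolerances: treated as a word break.
        System.out.println(needsSpace(10f, 13f, 4f, 0.5f, 5f, 0.3f));
        // The same gap with larger tolerances: no space is added.
        System.out.println(needsSpace(10f, 13f, 4f, 1.0f, 5f, 1.0f));
    }
}
```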
-
getIndentThreshold
public float getIndentThreshold()
Returns the multiple of whitespace character widths, for the current text, by which the current line start may be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start.- Returns:
- the number of whitespace character widths to use when detecting paragraph indents.
-
setIndentThreshold
public void setIndentThreshold(float indentThresholdValue)
Sets the multiple of whitespace character widths, for the current text, by which the current line start may be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start. The default value is 2.0.- Parameters:
indentThresholdValue
- the number of whitespace character widths to use when detecting paragraph indents.
-
getDropThreshold
public float getDropThreshold()
Returns the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.- Returns:
- the character height multiple for max allowed whitespace between lines in the same paragraph.
-
setDropThreshold
public void setDropThreshold(float dropThresholdValue)
Sets the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start. The default value is 2.5.- Parameters:
dropThresholdValue
- the character height multiple for max allowed whitespace between lines in the same paragraph.
-
getParagraphStart
public java.lang.String getParagraphStart()
Returns the string which will be used at the beginning of a paragraph.- Returns:
- the paragraph start string
-
setParagraphStart
public void setParagraphStart(java.lang.String s)
Sets the string which will be used at the beginning of a paragraph.- Parameters:
s
- the paragraph start string
-
getParagraphEnd
public java.lang.String getParagraphEnd()
Returns the string which will be used at the end of a paragraph.- Returns:
- the paragraph end string
-
setParagraphEnd
public void setParagraphEnd(java.lang.String s)
Sets the string which will be used at the end of a paragraph.- Parameters:
s
- the paragraph end string
-
getPageStart
public java.lang.String getPageStart()
Returns the string which will be used at the beginning of a page.- Returns:
- the page start string
-
setPageStart
public void setPageStart(java.lang.String pageStartValue)
Sets the string which will be used at the beginning of a page.- Parameters:
pageStartValue
- the page start string
-
getPageEnd
public java.lang.String getPageEnd()
Returns the string which will be used at the end of a page.- Returns:
- the page end string
-
setPageEnd
public void setPageEnd(java.lang.String pageEndValue)
Sets the string which will be used at the end of a page.- Parameters:
pageEndValue
- the page end string
-
getArticleStart
public java.lang.String getArticleStart()
Returns the string which will be used at the beginning of an article.- Returns:
- the article start string
-
setArticleStart
public void setArticleStart(java.lang.String articleStartValue)
Sets the string which will be used at the beginning of an article.- Parameters:
articleStartValue
- the article start string
-
getArticleEnd
public java.lang.String getArticleEnd()
Returns the string which will be used at the end of an article.- Returns:
- the article end string
-
setArticleEnd
public void setArticleEnd(java.lang.String articleEndValue)
Sets the string which will be used at the end of an article.- Parameters:
articleEndValue
- the article end string
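The start and end strings above act as markers wrapped around the extracted text. The following JDK-only sketch models that idea with a hypothetical `MarkerSketch` class; the field names mirror the setters above, but the nesting order shown here (page markers around paragraph markers) is an assumption for illustration, and article markers are left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of marker strings wrapped around extracted text; not PDFBox code.
final class MarkerSketch {
    String pageStart = "";
    String pageEnd = "";
    String paragraphStart = "";
    String paragraphEnd = "";

    // Emits one page: page markers outside, paragraph markers around each paragraph.
    String writePage(List<String> paragraphs) {
        StringBuilder out = new StringBuilder();
        out.append(pageStart);
        for (String paragraph : paragraphs) {
            out.append(paragraphStart).append(paragraph).append(paragraphEnd);
        }
        out.append(pageEnd);
        return out.toString();
    }

    // Configures HTML-like markers and renders a two-paragraph page.
    static String demo() {
        MarkerSketch sketch = new MarkerSketch();
        sketch.pageStart = "<div>";
        sketch.pageEnd = "</div>";
        sketch.paragraphStart = "<p>";
        sketch.paragraphEnd = "</p>";
        return sketch.writePage(Arrays.asList("one", "two"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```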
-
handleLineSeparation
private PDFTextStripper.PositionWrapper handleLineSeparation(PDFTextStripper.PositionWrapper current, PDFTextStripper.PositionWrapper lastPosition, PDFTextStripper.PositionWrapper lastLineStartPosition, float maxHeightForLine) throws java.io.IOException
Handles the line separator for a new line given the specified current and previous TextPositions.- Parameters:
current
- the current text position
lastPosition
- the previous text position
lastLineStartPosition
- the last text position that followed a line separator.
maxHeightForLine
- max height for positions since lastLineStartPosition
- Returns:
- start position of the last line
- Throws:
java.io.IOException
- if something went wrong
-
isParagraphSeparation
private void isParagraphSeparation(PDFTextStripper.PositionWrapper position, PDFTextStripper.PositionWrapper lastPosition, PDFTextStripper.PositionWrapper lastLineStartPosition, float maxHeightForLine)
Tests the relationship between the last text position, the current text position and the last text position that followed a line separator to decide if the gap represents a paragraph separation. This should only be called for consecutive text positions that first pass the line separation test.
This base implementation tests to see if the lastLineStartPosition is null OR if the current vertical position has dropped below the last text vertical position by at least 2.5 times the current text height OR if the current horizontal position is indented by at least 2 times the current width of a space character.
This also attempts to identify text that is indented under a hanging indent.
This method sets the isParagraphStart and isHangingIndent flags on the current position object.
- Parameters:
position
- the current text position. This may have its isParagraphStart or isHangingIndent flags set upon return.
lastPosition
- the previous text position (should not be null).
lastLineStartPosition
- the last text position that followed a line separator, or null.
maxHeightForLine
- max height for text positions since lastLineStartPosition.
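The three conditions described for the base implementation can be sketched as a small predicate. This is a simplified illustration, not the PDFBox code: the `Pos` class is a hypothetical stand-in for a text position, y is assumed to grow downward, the thresholds are hard-coded to the documented defaults of 2.5 and 2.0, and hanging-indent detection is omitted.

```java
final class ParagraphHeuristicSketch {
    static final float DROP_THRESHOLD = 2.5f;   // multiple of the text height
    static final float INDENT_THRESHOLD = 2.0f; // multiple of the space width

    // Hypothetical stand-in for a text position; not a PDFBox type.
    static final class Pos {
        final float x;
        final float y;          // assumed to grow downward
        final float height;     // text height at this position
        final float spaceWidth; // width of a space character at this position

        Pos(float x, float y, float height, float spaceWidth) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.spaceWidth = spaceWidth;
        }
    }

    static boolean isParagraphStart(Pos current, Pos last, Pos lastLineStart) {
        if (lastLineStart == null) {
            return true; // nothing to compare against: treat as a paragraph start
        }
        float drop = Math.abs(current.y - last.y);
        if (drop >= DROP_THRESHOLD * current.height) {
            return true; // large vertical gap
        }
        float indent = current.x - lastLineStart.x;
        return indent >= INDENT_THRESHOLD * current.spaceWidth; // indented line start
    }

    public static void main(String[] args) {
        Pos lineStart = new Pos(10f, 100f, 10f, 3f);
        Pos last = new Pos(80f, 100f, 10f, 3f);
        System.out.println(isParagraphStart(new Pos(10f, 112f, 10f, 3f), last, lineStart));
        System.out.println(isParagraphStart(new Pos(10f, 130f, 10f, 3f), last, lineStart));
        System.out.println(isParagraphStart(new Pos(20f, 112f, 10f, 3f), last, lineStart));
    }
}
```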
-
multiplyFloat
private float multiplyFloat(float value1, float value2)
-
writeParagraphSeparator
protected void writeParagraphSeparator() throws java.io.IOException
writes the paragraph separator string to the output.- Throws:
java.io.IOException
- if something went wrong
-
writeParagraphStart
protected void writeParagraphStart() throws java.io.IOException
Write something (if defined) at the start of a paragraph.- Throws:
java.io.IOException
- if something went wrong
-
writeParagraphEnd
protected void writeParagraphEnd() throws java.io.IOException
Write something (if defined) at the end of a paragraph.- Throws:
java.io.IOException
- if something went wrong
-
writePageStart
protected void writePageStart() throws java.io.IOException
Write something (if defined) at the start of a page.- Throws:
java.io.IOException
- if something went wrong
-
writePageEnd
protected void writePageEnd() throws java.io.IOException
Write something (if defined) at the end of a page.- Throws:
java.io.IOException
- if something went wrong
-
matchListItemPattern
private java.util.regex.Pattern matchListItemPattern(PDFTextStripper.PositionWrapper pw)
Returns the list item Pattern object that matches the text at the specified PositionWrapper, or null if the text does not match such a pattern. The list of Patterns tested against is given by the getListItemPatterns() method. To add to the list, simply override that method (if sub-classing) or explicitly supply your own list using setListItemPatterns(List).- Parameters:
pw
- position- Returns:
- the matching pattern
-
setListItemPatterns
protected void setListItemPatterns(java.util.List<java.util.regex.Pattern> patterns)
Use this to supply a different set of regular expression patterns for matching list item starts.- Parameters:
patterns
- list of patterns
-
getListItemPatterns
protected java.util.List<java.util.regex.Pattern> getListItemPatterns()
Returns a list of regular expression Patterns representing different common list item formats. For example, numbered items of the form:
- some text
- more text
- some text
- more text
This method returns a list of such regular expression Patterns.
- Returns:
- a list of Pattern objects.
-
matchPattern
protected static java.util.regex.Pattern matchPattern(java.lang.String string, java.util.List<java.util.regex.Pattern> patterns)
Iterates over the specified list of Patterns until it finds one that matches the specified string, then returns that Pattern.
The order of the supplied list of patterns is important, as the most common patterns should come first. Patterns should be strict in general, and all will be used with case sensitivity on.
- Parameters:
string
- the string to be searched
patterns
- list of patterns
- Returns:
- matching pattern
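A first-match lookup of this kind can be sketched with `java.util.regex` alone. The pattern list below is illustrative and is not claimed to be the set PDFBox ships with; using `Matcher.matches()` (a full-string match) and returning null when nothing matches are also assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ListItemMatchSketch {
    // Illustrative patterns only, most common first; all are case sensitive.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("\\d+\\."),    // "1.", "12."
            Pattern.compile("\\[\\d+\\]"), // "[1]", "[2]"
            Pattern.compile("[a-z]\\)"));  // "a)", "b)"

    // Returns the first pattern that fully matches the string, or null if none does.
    static Pattern firstMatch(String string, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(string).matches()) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("12.", PATTERNS));
        System.out.println(firstMatch("[3]", PATTERNS));
        System.out.println(firstMatch("A)", PATTERNS)); // upper case is not covered by these patterns
    }
}
```

Because the loop stops at the first hit, placing the most frequent formats at the front of the list keeps the common case cheap.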
-
writeLine
private void writeLine(java.util.List<PDFTextStripper.WordWithTextPositions> line) throws java.io.IOException
Write a list of strings containing a whole line of a document.- Parameters:
line
- a list with the words of the given line- Throws:
java.io.IOException
- if something went wrong
-
normalize
private java.util.List<PDFTextStripper.WordWithTextPositions> normalize(java.util.List<PDFTextStripper.LineItem> line)
Normalize the given list of TextPositions.- Parameters:
line
- list of TextPositions- Returns:
- a list of strings, one string for every word
-
handleDirection
private java.lang.String handleDirection(java.lang.String word)
Handles the LTR and RTL direction of the given words. The whole implementation stands and falls with the given word. If the word is a full line, the results will be the best. If the word consists of single words or characters, the order of the characters in a word or of the words in a line may be wrong, due to RTL and LTR marks and characters! Based on http://www.nesterovsky-bros.com/weblog/2013/07/28/VisualToLogicalConversionInJava.aspx- Parameters:
word
- The word that shall be processed- Returns:
- new word with the correct direction of the containing characters
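To give a feel for direction handling, here is a deliberately crude sketch built on `java.text.Bidi`. It only covers the simplest case, a word made up entirely of right-to-left characters stored in visual order, which it reverses to obtain logical order. Mixed-direction text, directional marks and mirrored characters are left untouched. The class and method names are hypothetical, and this is not the algorithm from the article referenced above.

```java
import java.text.Bidi;

final class DirectionSketch {
    // Reverses words that consist purely of right-to-left text; leaves everything else as is.
    static String visualToLogical(String word) {
        if (word.isEmpty()) {
            return word;
        }
        Bidi bidi = new Bidi(word, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isRightToLeft()) {
            // StringBuilder.reverse() keeps surrogate pairs intact.
            return new StringBuilder(word).reverse().toString();
        }
        return word; // left-to-right or mixed: unchanged in this sketch
    }

    public static void main(String[] args) {
        String hebrewVisual = "\u05D1\u05D0"; // BET, ALEF in visual order
        System.out.println(visualToLogical(hebrewVisual).equals("\u05D0\u05D1"));
        System.out.println(visualToLogical("abc"));
    }
}
```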
-
parseBidiFile
private static void parseBidiFile(java.io.InputStream inputStream) throws java.io.IOException
This method parses the bidi file provided as an input stream.- Parameters:
inputStream
- The bidi file as an input stream- Throws:
java.io.IOException
- if any line could not be read by the LineNumberReader
-
createWord
private PDFTextStripper.WordWithTextPositions createWord(java.lang.String word, java.util.List<TextPosition> wordPositions)
Used within normalize(List) to create a single PDFTextStripper.WordWithTextPositions entry.
-
normalizeWord
private java.lang.String normalizeWord(java.lang.String word)
Normalize certain Unicode characters. For example, convert the single "fi" ligature to "f" and "i". Also normalises Arabic and Hebrew presentation forms.- Parameters:
word
- Word to normalize- Returns:
- Normalized word
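One way to obtain this kind of normalization with the JDK alone is Unicode compatibility normalization (NFKC) via `java.text.Normalizer`. The sketch below applies it only to characters in the presentation-form blocks (U+FB00 to U+FDFF and U+FE70 to U+FEFF); restricting it to those ranges, and the class and method names, are assumptions of this example, not a statement about PDFBox internals.

```java
import java.text.Normalizer;

final class NormalizeWordSketch {
    // True for Alphabetic Presentation Forms and Arabic Presentation Forms A and B.
    static boolean isPresentationForm(char c) {
        return (c >= '\uFB00' && c <= '\uFDFF') || (c >= '\uFE70' && c <= '\uFEFF');
    }

    // Replaces presentation forms by their NFKC equivalents; other characters are copied as is.
    static String normalizeWord(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (isPresentationForm(c)) {
                sb.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FB01 is the single "fi" ligature.
        System.out.println(normalizeWord("\uFB01nd")); // prints "find"
    }
}
```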
-
normalizeAdd
private java.lang.StringBuilder normalizeAdd(java.util.List<PDFTextStripper.WordWithTextPositions> normalized, java.lang.StringBuilder lineBuilder, java.util.List<TextPosition> wordPositions, PDFTextStripper.LineItem item)
Used within normalize(List) to handle a TextPosition.- Returns:
- The StringBuilder that must be used when calling this method.
-
-