Package org.apache.pdfbox.text
Class TextPosition
- java.lang.Object
-
- org.apache.pdfbox.text.TextPosition
-
public final class TextPosition extends java.lang.Object
Represents a string of characters and the position of those characters on the screen.
-
-
Field Summary
Modifier and Type | Field | Description
private int[]
charCodes
private static java.util.Map<java.lang.Integer,java.lang.String>
DIACRITICS
private float
direction
private float
endX
private float
endY
private PDFont
font
private float
fontSize
private int
fontSizePt
private static org.apache.commons.logging.Log
LOG
private float
maxHeight
private float
pageHeight
private float
pageWidth
private int
rotation
private Matrix
textMatrix
private java.lang.String
unicode
private float
widthOfSpace
private float[]
widths
private float
x
private float
y
-
Constructor Summary
Constructor | Description
TextPosition(int pageRotation, float pageWidth, float pageHeight, Matrix textMatrix, float endX, float endY, float maxHeight, float individualWidth, float spaceWidth, java.lang.String unicode, int[] charCodes, PDFont font, float fontSize, int fontSizeInPt)
Constructor.
-
Method Summary
Modifier and Type | Method | Description
private java.lang.String
combineDiacritic(java.lang.String str)
Combine the diacritic, for example, convert non-combining diacritic characters to their combining counterparts.
boolean
contains(TextPosition tp2)
Determine if this TextPosition logically contains another (i.e. they overlap and should be rendered on top of each other).
private static java.util.Map<java.lang.Integer,java.lang.String>
createDiacritics()
boolean
equals(java.lang.Object o)
int[]
getCharacterCodes()
Return the internal PDF character codes of the glyphs in this text.
float
getDir()
Return the direction/orientation of the string in this object based on its text matrix.
float
getEndX()
This will get the x coordinate of the end position.
float
getEndY()
This will get the y coordinate of the end position.
PDFont
getFont()
This will get the font for the text being drawn.
float
getFontSize()
This will get the font size that has been set with the "Tf" operator (Set text font and size).
float
getFontSizeInPt()
This will get the font size in pt.
float
getHeight()
This will get the maximum height of all characters in this string.
float
getHeightDir()
This will get the maximum height of all characters in this string.
float[]
getIndividualWidths()
Get the widths of each individual character.
float
getPageHeight()
This will get the height of the page that the text is located in.
float
getPageWidth()
This will get the width of the page that the text is located in.
int
getRotation()
This will get the rotation of the page that the text is located in.
Matrix
getTextMatrix()
The matrix containing the starting text position and scaling.
java.lang.String
getUnicode()
Return the string of characters stored in this object.
float
getWidth()
This will get the width of the string when page rotation adjusted coordinates are used.
float
getWidthDirAdj()
This will get the width of the string when text direction adjusted coordinates are used.
float
getWidthOfSpace()
This will get the width of a space character.
private float
getWidthRot(float rotation)
Get the length or width of the text, based on a given rotation.
float
getX()
This will get the page rotation adjusted x position of the character.
float
getXDirAdj()
This will get the text direction adjusted x position of the character.
private float
getXRot(float rotation)
Return the X starting coordinate of the text, adjusted by the given rotation amount.
float
getXScale()
This will get the X scaling factor.
float
getY()
This will get the page rotation adjusted y position of the character.
float
getYDirAdj()
This will get the y position of the text, adjusted so that 0,0 is upper left and it is adjusted based on the text direction.
private float
getYLowerLeftRot(float rotation)
This will get the y position of the character with 0,0 in lower left.
float
getYScale()
This will get the Y scaling factor.
int
hashCode()
private void
insertDiacritic(int i, TextPosition diacritic)
Inserts the diacritic TextPosition to the str of this TextPosition and updates the widths array to include the extra character width.
boolean
isDiacritic()
void
mergeDiacritic(TextPosition diacritic)
Merge a single character TextPosition into the current object.
java.lang.String
toString()
Show the string data for this text position.
-
-
-
Field Detail
-
LOG
private static final org.apache.commons.logging.Log LOG
-
DIACRITICS
private static final java.util.Map<java.lang.Integer,java.lang.String> DIACRITICS
-
textMatrix
private final Matrix textMatrix
-
endX
private final float endX
-
endY
private final float endY
-
maxHeight
private final float maxHeight
-
rotation
private final int rotation
-
x
private final float x
-
y
private final float y
-
pageHeight
private final float pageHeight
-
pageWidth
private final float pageWidth
-
widthOfSpace
private final float widthOfSpace
-
charCodes
private final int[] charCodes
-
font
private final PDFont font
-
fontSize
private final float fontSize
-
fontSizePt
private final int fontSizePt
-
widths
private float[] widths
-
unicode
private java.lang.String unicode
-
direction
private float direction
-
-
Constructor Detail
-
TextPosition
public TextPosition(int pageRotation, float pageWidth, float pageHeight, Matrix textMatrix, float endX, float endY, float maxHeight, float individualWidth, float spaceWidth, java.lang.String unicode, int[] charCodes, PDFont font, float fontSize, int fontSizeInPt)
Constructor.
- Parameters:
pageRotation
- rotation of the page that the text is located in
pageWidth
- width of the page that the text is located in
pageHeight
- height of the page that the text is located in
textMatrix
- text rendering matrix for start of text (in display units)
endX
- x coordinate of the end position
endY
- y coordinate of the end position
maxHeight
- Maximum height of text (in display units)
individualWidth
- The width of the given character/string. (in text units)
spaceWidth
- The width of the space character. (in display units)
unicode
- The string of Unicode characters to be displayed.
charCodes
- An array of the internal PDF character codes for the glyphs in this text.
font
- The current font for this text position.
fontSize
- The new font size.
fontSizeInPt
- The font size in pt units (see getFontSizeInPt() for details).
-
-
Method Detail
-
createDiacritics
private static java.util.Map<java.lang.Integer,java.lang.String> createDiacritics()
-
getUnicode
public java.lang.String getUnicode()
Return the string of characters stored in this object. The length can differ from the length of the array returned by getCharacterCodes(), e.g. if ligatures are used ("fi", "fl", "ffl"), where one glyph represents several Unicode characters.
- Returns:
- The string on the screen.
-
getCharacterCodes
public int[] getCharacterCodes()
Return the internal PDF character codes of the glyphs in this text.
- Returns:
- an array of internal PDF character codes
-
getTextMatrix
public Matrix getTextMatrix()
The matrix containing the starting text position and scaling. Despite the name, it is not the text matrix set by the "Tm" operator; it is the effective text rendering matrix, which depends on the current transformation matrix (set by the "cm" operator), the text matrix (set by the "Tm" operator), the font size (set by the "Tf" operator) and the page cropbox.
- Returns:
- The Matrix containing the starting text position
-
getDir
public float getDir()
Return the direction/orientation of the string in this object based on its text matrix. Only angles of 0, 90, 180, or 270 are supported. To get other angles, use this code:
    TextPosition text = ...
    Matrix m = text.getTextMatrix().clone();
    m.concatenate(text.getFont().getFontMatrix());
    int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
- Returns:
- The direction of the text (0, 90, 180, or 270).
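As a rough illustration of the arithmetic in the snippet above, the following sketch applies the same atan2 formula to plain numbers instead of a PDFBox Matrix. The class TextAngleSketch, its method name and the normalization of the result to the range 0 to 359 are assumptions made for this example; they are not part of PDFBox.

```java
// Hypothetical stand-alone sketch; not part of PDFBox.
class TextAngleSketch {

    /**
     * Angle in degrees computed from the shearY and scaleY components of a
     * matrix, using the same atan2 formula as the snippet above. atan2 yields
     * values between -180 and 180 degrees, so the result is normalized to
     * 0..359 here (an assumption of this sketch).
     */
    static int angleDegrees(double shearY, double scaleY) {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
        return ((angle % 360) + 360) % 360;
    }

    public static void main(String[] args) {
        double theta = Math.toRadians(30);
        // For a pure rotation by theta, shearY is sin(theta) and scaleY is cos(theta).
        System.out.println(angleDegrees(Math.sin(theta), Math.cos(theta)));
        // A negative quarter turn is reported as 270 after normalization.
        System.out.println(angleDegrees(-1.0, 0.0));
    }
}
```

Without the normalization step, the second call would yield -90, which is what the unmodified formula returns for that input.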
-
getXRot
private float getXRot(float rotation)
Return the X starting coordinate of the text, adjusted by the given rotation amount. The rotation adjusts where the 0,0 location is relative to the text.
- Parameters:
rotation
- Rotation to apply (0, 90, 180, or 270). 0 will perform no adjustments.
- Returns:
- X coordinate
-
getX
public float getX()
This will get the page rotation adjusted x position of the character. This is adjusted based on page rotation so that the upper left is 0,0, unlike PDF coordinates, which start at the bottom left. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
- Returns:
- The x coordinate of the character.
-
getXDirAdj
public float getXDirAdj()
This will get the text direction adjusted x position of the character. This is adjusted based on text direction so that the first character in that direction is in the upper left at 0,0. This method ignores the page rotation but takes the text rotation (see getDir()) and adjusts the coordinates to AWT. This is useful when doing text extraction, to compare the glyph positions when imagining these to be horizontal. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
- Returns:
- The x coordinate of the text.
-
getYLowerLeftRot
private float getYLowerLeftRot(float rotation)
This will get the y position of the character with 0,0 in lower left. This will be adjusted by the given rotation.
- Parameters:
rotation
- Rotation to apply to text to adjust the 0,0 location (0, 90, 180, 270)
- Returns:
- The y coordinate of the text
-
getY
public float getY()
This will get the page rotation adjusted y position of the character. This is adjusted based on page rotation so that the upper left is 0,0, unlike PDF coordinates, which start at the bottom left. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
- Returns:
- The adjusted y coordinate of the character.
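The change of origin described above can be sketched for the simplest case, an unrotated page. This is a minimal illustration only: the class UpperLeftOriginSketch and the method toUpperLeftY are hypothetical names, and the sketch deliberately says nothing about how rotations of 90, 180 or 270 degrees are handled.

```java
// Hypothetical stand-alone sketch; not PDFBox code. Covers page rotation 0 only.
class UpperLeftOriginSketch {

    /**
     * Converts a y coordinate measured upwards from the bottom edge of the
     * page (PDF convention) into one measured downwards from the top edge.
     */
    static float toUpperLeftY(float pdfY, float pageHeight) {
        return pageHeight - pdfY;
    }

    public static void main(String[] args) {
        // On a page 842 units high, a point 100 units above the bottom edge
        // lies 742 units below the top edge.
        System.out.println(toUpperLeftY(100f, 842f));
    }
}
```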
-
getYDirAdj
public float getYDirAdj()
This will get the y position of the text, adjusted so that 0,0 is upper left and it is adjusted based on the text direction. This method ignores the page rotation but takes the text rotation and adjusts the coordinates to AWT. This is useful when doing text extraction, to compare the glyph positions when imagining these to be horizontal. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
- Returns:
- The adjusted y coordinate of the character.
-
getWidthRot
private float getWidthRot(float rotation)
Get the length or width of the text, based on a given rotation.
- Parameters:
rotation
- Rotation that was used to determine coordinates (0, 90, 180, 270)
- Returns:
- Width of text in display units
-
getWidth
public float getWidth()
This will get the width of the string when page rotation adjusted coordinates are used.
- Returns:
- The width of the text in display units.
-
getWidthDirAdj
public float getWidthDirAdj()
This will get the width of the string when text direction adjusted coordinates are used.
- Returns:
- The width of the text in display units.
-
getHeight
public float getHeight()
This will get the maximum height of all characters in this string.
- Returns:
- The maximum height of all characters in this string.
-
getHeightDir
public float getHeightDir()
This will get the maximum height of all characters in this string.
- Returns:
- The maximum height of all characters in this string.
-
getFontSize
public float getFontSize()
This will get the font size that has been set with the "Tf" operator (Set text font and size). When the text is rendered, it may appear bigger or smaller depending on the current transformation matrix (set by the "cm" operator) and the text matrix (set by the "Tm" operator).
- Returns:
- The font size.
-
getFontSizeInPt
public float getFontSizeInPt()
This will get the font size in pt. To get this size, the font size from getFontSize() is multiplied by the horizontal scaling factor of the text matrix (set by the "Tm" operator) and the result is truncated to an integer. The actual rendering may appear bigger or smaller depending on the current transformation matrix (set by the "cm" operator). To get the size in rendering, use getXScale().
- Returns:
- The font size in pt.
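The calculation described above (multiply, then truncate) can be sketched as follows. The class FontSizeInPtSketch and its parameter names are hypothetical; the sketch only mirrors the wording of this description and is not taken from the library source.

```java
// Hypothetical stand-alone sketch of "multiply, then truncate to integer".
class FontSizeInPtSketch {

    /**
     * @param fontSize          the size given to the "Tf" operator
     * @param horizontalScaling the horizontal scaling factor of the text matrix
     */
    static int fontSizeInPt(float fontSize, float horizontalScaling) {
        // The cast discards the fractional part; it does not round.
        return (int) (fontSize * horizontalScaling);
    }

    public static void main(String[] args) {
        System.out.println(fontSizeInPt(12f, 1.5f));
        System.out.println(fontSizeInPt(10f, 0.99f));
    }
}
```

Because the value is truncated rather than rounded, a product such as 9.9 is reported as 9, so callers that need sub-point precision should not rely on this value alone.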
-
getFont
public PDFont getFont()
This will get the font for the text being drawn.
- Returns:
- The font.
-
getWidthOfSpace
public float getWidthOfSpace()
This will get the width of a space character. This is useful for some algorithms such as the text stripper, that need to know the width of a space character.
- Returns:
- The width of a space character.
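One way an extraction algorithm might use the space width is to decide whether the gap between two neighbouring glyphs is large enough to count as a word break. The sketch below is an assumption-laden illustration: the class WordGapSketch, the method isWordGap and the tolerance factor are invented for this example and are not described on this page.

```java
// Hypothetical stand-alone sketch; names and the tolerance factor are assumptions.
class WordGapSketch {

    /**
     * Returns true if the horizontal gap between the end of one glyph and the
     * start of the next exceeds the given fraction of the space width.
     */
    static boolean isWordGap(float prevEndX, float nextStartX,
                             float widthOfSpace, float tolerance) {
        float gap = nextStartX - prevEndX;
        return gap > widthOfSpace * tolerance;
    }

    public static void main(String[] args) {
        // With a space width of 5 and a tolerance of 0.5, the threshold is 2.5.
        System.out.println(isWordGap(100f, 103f, 5f, 0.5f)); // gap of 3
        System.out.println(isWordGap(100f, 101f, 5f, 0.5f)); // gap of 1
    }
}
```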
-
getXScale
public float getXScale()
This will get the X scaling factor. This is dependent on the current transformation matrix (set by the "cm" operator), the text matrix (set by the "Tm" operator) and the font size (set by the "Tf" operator).
- Returns:
- The X scaling factor.
-
getYScale
public float getYScale()
This will get the Y scaling factor. This is dependent on the current transformation matrix (set by the "cm" operator), the text matrix (set by the "Tm" operator) and the font size (set by the "Tf" operator).
- Returns:
- The Y scaling factor.
-
getIndividualWidths
public float[] getIndividualWidths()
Get the widths of each individual character.
- Returns:
- An array that has the same length as the CharacterCodes array.
-
contains
public boolean contains(TextPosition tp2)
Determine if this TextPosition logically contains another (i.e. they overlap and should be rendered on top of each other).
- Parameters:
tp2
- The other TextPosition to compare against
- Returns:
- True if tp2 is contained in the bounding box of this text.
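The general idea of testing whether two glyph boxes overlap can be sketched with a plain rectangle intersection test. This is a simplified illustration under stated assumptions: the class OverlapSketch and its methods are hypothetical, and the criteria actually applied by contains() (for example, how much overlap is required) are not given on this page and may differ.

```java
// Hypothetical stand-alone sketch of an axis-aligned overlap test; not PDFBox code.
class OverlapSketch {

    /** True if the intervals [aStart, aEnd] and [bStart, bEnd] share more than one point. */
    static boolean intervalsOverlap(float aStart, float aEnd, float bStart, float bEnd) {
        return aStart < bEnd && bStart < aEnd;
    }

    /** True if two boxes, each given as x, y, width, height, overlap on both axes. */
    static boolean boxesOverlap(float ax, float ay, float aw, float ah,
                                float bx, float by, float bw, float bh) {
        return intervalsOverlap(ax, ax + aw, bx, bx + bw)
            && intervalsOverlap(ay, ay + ah, by, by + bh);
    }

    public static void main(String[] args) {
        // A base glyph box and a small accent box placed over it.
        System.out.println(boxesOverlap(10f, 10f, 6f, 8f, 12f, 9f, 3f, 3f));
        // The same base glyph and a box further to the right.
        System.out.println(boxesOverlap(10f, 10f, 6f, 8f, 20f, 10f, 3f, 3f));
    }
}
```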
-
mergeDiacritic
public void mergeDiacritic(TextPosition diacritic)
Merge a single character TextPosition into the current object. This is to be used only for cases where we have a diacritic that overlaps an existing TextPosition. In a graphical display, we could overlay them, but for text extraction we need to merge them. Use the contains() method to test if two objects overlap.
- Parameters:
diacritic
- TextPosition to merge into the current TextPosition.
-
insertDiacritic
private void insertDiacritic(int i, TextPosition diacritic)
Inserts the diacritic TextPosition into the string of this TextPosition and updates the widths array to include the extra character width.
- Parameters:
i
- current character
diacritic
- The diacritic TextPosition
-
combineDiacritic
private java.lang.String combineDiacritic(java.lang.String str)
Combine the diacritic, for example, convert non-combining diacritic characters to their combining counterparts.
- Parameters:
str
- String to normalize
- Returns:
- Normalized string
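To make the idea of "combining counterparts" concrete, the sketch below maps a few spacing accent characters to the corresponding Unicode combining marks and then composes the result with a base letter. The class CombiningDiacriticSketch, the four map entries and the use of java.text.Normalizer are assumptions chosen for illustration; the contents of the DIACRITICS map are not listed on this page.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical stand-alone sketch; the mapping below is a small illustrative subset.
class CombiningDiacriticSketch {

    // Spacing accent code point -> combining mark.
    private static final Map<Integer, String> SPACING_TO_COMBINING = Map.of(
            0x0060, "\u0300",   // grave accent
            0x00B4, "\u0301",   // acute accent
            0x005E, "\u0302",   // circumflex accent
            0x00A8, "\u0308");  // diaeresis

    /** Returns the combining form of a non-empty diacritic string, or the input if unmapped. */
    static String toCombining(String diacritic) {
        String mapped = SPACING_TO_COMBINING.get(diacritic.codePointAt(0));
        return mapped != null ? mapped : diacritic;
    }

    /** Appends the combining mark to the base letter and composes them (NFC). */
    static String merge(String base, String diacritic) {
        return Normalizer.normalize(base + toCombining(diacritic), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" followed by a spacing acute accent becomes the single code point U+00E9.
        String merged = merge("e", "\u00B4");
        System.out.println(Integer.toHexString(merged.codePointAt(0)));
    }
}
```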
-
isDiacritic
public boolean isDiacritic()
- Returns:
- True if the current character is a diacritic char.
-
toString
public java.lang.String toString()
Show the string data for this text position.
- Overrides:
toString
in class java.lang.Object
- Returns:
- A human readable form of this object.
-
getEndX
public float getEndX()
This will get the x coordinate of the end position. This is the unadjusted value passed into the constructor.
- Returns:
- The unadjusted x coordinate of the end position
-
getEndY
public float getEndY()
This will get the y coordinate of the end position. This is the unadjusted value passed into the constructor.
- Returns:
- The unadjusted y coordinate of the end position
-
getRotation
public int getRotation()
This will get the rotation of the page that the text is located in. This is the unadjusted value passed into the constructor.
- Returns:
- The unadjusted rotation of the page that the text is located in
-
getPageHeight
public float getPageHeight()
This will get the height of the page that the text is located in. This is the unadjusted value passed into the constructor.
- Returns:
- The unadjusted height of the page that the text is located in
-
getPageWidth
public float getPageWidth()
This will get the width of the page that the text is located in. This is the unadjusted value passed into the constructor.
- Returns:
- The unadjusted width of the page that the text is located in
-
equals
public boolean equals(java.lang.Object o)
- Overrides:
equals
in class java.lang.Object
-
hashCode
public int hashCode()
- Overrides:
hashCode
in class java.lang.Object
-
-