Package org.apache.pdfbox.pdfparser
Class BaseParser
- java.lang.Object
-
- org.apache.pdfbox.pdfparser.BaseParser
-
- Direct Known Subclasses:
COSParser, PDFObjectStreamParser, PDFStreamParser, PDFXrefStreamParser
public abstract class BaseParser extends java.lang.Object
This class contains parsing logic that is shared by the PDFParser and the COSStreamParser.
-
-
Field Summary
Fields
Modifier and Type, Field, Description
protected static int
A
protected static byte
ASCII_CR
ASCII code for carriage return.
protected static byte
ASCII_LF
ASCII code for line feed.
private static byte
ASCII_NINE
private static byte
ASCII_SPACE
private static byte
ASCII_ZERO
protected static int
B
protected static int
D
static java.lang.String
DEF
This is a string constant that will be used for comparisons.
protected COSDocument
document
This is the document that will be parsed.
protected static int
E
protected static java.lang.String
ENDOBJ_STRING
This is a string constant that will be used for comparisons.
protected static java.lang.String
ENDSTREAM_STRING
This is a string constant that will be used for comparisons.
private static java.lang.String
FALSE
This is a string constant that will be used for comparisons.
private static long
GENERATION_NUMBER_THRESHOLD
protected static int
J
private static org.apache.commons.logging.Log
LOG
Log instance.
protected static int
M
(package private) static int
MAX_LENGTH_LONG
protected static int
N
private static java.lang.String
NULL
This is a string constant that will be used for comparisons.
protected static int
O
private static long
OBJECT_NUMBER_THRESHOLD
protected static int
R
protected static int
S
(package private) SequentialSource
seqSource
This is the stream that will be read from.
protected static java.lang.String
STREAM_STRING
This is a string constant that will be used for comparisons.
protected static int
T
private static java.lang.String
TRUE
This is a string constant that will be used for comparisons.
private java.nio.charset.CharsetDecoder
utf8Decoder
-
Constructor Summary
Constructors
Constructor, Description
BaseParser(SequentialSource pdfSource)
Default constructor.
-
Method Summary
Modifier and Type, Method, Description
private int
checkForEndOfString(int bracesParameter)
This is really a bug in the document creator's code, but it caused a crash in PDFBox. The first bug was in this format: /Title ( (5) /Creator, which was patched in one place.
private COSBase
getObjectFromPool(COSObjectKey key)
protected boolean
isClosing()
This will tell if the next character is a closing bracket (close of a PDF array).
protected boolean
isClosing(int c)
This will tell if the given character is a closing bracket (close of a PDF array).
private boolean
isCR(int c)
protected boolean
isDigit()
This will tell if the next byte is a digit or not.
protected static boolean
isDigit(int c)
This will tell if the given value is a digit or not.
protected boolean
isEndOfName(int ch)
Determine if a character terminates a PDF name.
protected boolean
isEOL()
This will tell if the next byte to be read is an end of line byte.
protected boolean
isEOL(int c)
This will tell if the given byte is an end of line byte.
private static boolean
isHexDigit(char ch)
private boolean
isLF(int c)
protected boolean
isSpace()
This will tell if the next byte is a space or not.
protected boolean
isSpace(int c)
This will tell if the given value is a space or not.
private boolean
isValidUTF8(byte[] input)
Returns true if a byte sequence is valid UTF-8.
protected boolean
isWhitespace()
This will tell if the next byte is whitespace or not.
protected boolean
isWhitespace(int c)
This will tell if a character is whitespace or not.
protected COSBoolean
parseBoolean()
This will parse a boolean object from the stream.
protected COSArray
parseCOSArray()
This will parse a PDF array object.
protected COSDictionary
parseCOSDictionary()
This will parse a PDF dictionary.
private void
parseCOSDictionaryNameValuePair(COSDictionary obj)
private COSBase
parseCOSDictionaryValue()
This will parse a PDF dictionary value.
private COSString
parseCOSHexString()
This will parse a PDF hex string with fail-fast semantics, meaning that parsing stops as soon as a disallowed character is found.
protected COSName
parseCOSName()
This will parse a PDF name from the stream.
private COSNumber
parseCOSNumber()
protected COSString
parseCOSString()
This will parse a PDF string.
protected COSBase
parseDirObject()
This will parse a directory object from the stream.
protected void
readExpectedChar(char ec)
Read one char and throw an exception if it is not the expected value.
protected void
readExpectedString(char[] expectedString, boolean skipSpaces)
Reads the given pattern from seqSource.
protected void
readExpectedString(java.lang.String expectedString)
Read one String and throw an exception if it is not the expected value.
protected int
readGenerationNumber()
This will read an integer from the stream and throw an IllegalArgumentException if the integer value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
protected int
readInt()
This will read an integer from the stream.
protected java.lang.String
readLine()
This will read bytes until the first end of line marker occurs.
protected long
readLong()
This will read a long from the stream.
protected long
readObjectNumber()
This will read a long from the stream and throw an IOException if the long value is negative or has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
protected java.lang.String
readString()
This will read the next string from the stream.
protected java.lang.String
readString(int length)
This will read the next string from the stream up to a certain length.
protected java.lang.StringBuilder
readStringNumber()
This method is used by readInt() and readLong() to read a token.
private boolean
readUntilEndOfCOSDictionary()
Keep reading until the end of the dictionary object or the file has been hit, or until a '/' has been found.
protected void
skipSpaces()
This will skip all spaces and comments that are present.
protected void
skipWhiteSpaces()
-
-
-
Field Detail
-
OBJECT_NUMBER_THRESHOLD
private static final long OBJECT_NUMBER_THRESHOLD
- See Also:
- Constant Field Values
-
GENERATION_NUMBER_THRESHOLD
private static final long GENERATION_NUMBER_THRESHOLD
- See Also:
- Constant Field Values
-
MAX_LENGTH_LONG
static final int MAX_LENGTH_LONG
-
utf8Decoder
private final java.nio.charset.CharsetDecoder utf8Decoder
-
LOG
private static final org.apache.commons.logging.Log LOG
Log instance.
-
E
protected static final int E
- See Also:
- Constant Field Values
-
N
protected static final int N
- See Also:
- Constant Field Values
-
D
protected static final int D
- See Also:
- Constant Field Values
-
S
protected static final int S
- See Also:
- Constant Field Values
-
T
protected static final int T
- See Also:
- Constant Field Values
-
R
protected static final int R
- See Also:
- Constant Field Values
-
A
protected static final int A
- See Also:
- Constant Field Values
-
M
protected static final int M
- See Also:
- Constant Field Values
-
O
protected static final int O
- See Also:
- Constant Field Values
-
B
protected static final int B
- See Also:
- Constant Field Values
-
J
protected static final int J
- See Also:
- Constant Field Values
-
DEF
public static final java.lang.String DEF
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
ENDOBJ_STRING
protected static final java.lang.String ENDOBJ_STRING
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
ENDSTREAM_STRING
protected static final java.lang.String ENDSTREAM_STRING
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
STREAM_STRING
protected static final java.lang.String STREAM_STRING
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
TRUE
private static final java.lang.String TRUE
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
FALSE
private static final java.lang.String FALSE
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
NULL
private static final java.lang.String NULL
This is a string constant that will be used for comparisons.
- See Also:
- Constant Field Values
-
ASCII_LF
protected static final byte ASCII_LF
ASCII code for line feed.
- See Also:
- Constant Field Values
-
ASCII_CR
protected static final byte ASCII_CR
ASCII code for carriage return.
- See Also:
- Constant Field Values
-
ASCII_ZERO
private static final byte ASCII_ZERO
- See Also:
- Constant Field Values
-
ASCII_NINE
private static final byte ASCII_NINE
- See Also:
- Constant Field Values
-
ASCII_SPACE
private static final byte ASCII_SPACE
- See Also:
- Constant Field Values
-
seqSource
final SequentialSource seqSource
This is the stream that will be read from.
-
document
protected COSDocument document
This is the document that will be parsed.
-
-
Constructor Detail
-
BaseParser
BaseParser(SequentialSource pdfSource)
Default constructor.
-
-
Method Detail
-
isHexDigit
private static boolean isHexDigit(char ch)
-
parseCOSDictionaryValue
private COSBase parseCOSDictionaryValue() throws java.io.IOException
This will parse a PDF dictionary value.
- Returns:
- The parsed Dictionary object.
- Throws:
java.io.IOException
- If there is an error parsing the dictionary object.
-
getObjectFromPool
private COSBase getObjectFromPool(COSObjectKey key) throws java.io.IOException
- Throws:
java.io.IOException
-
parseCOSDictionary
protected COSDictionary parseCOSDictionary() throws java.io.IOException
This will parse a PDF dictionary.
- Returns:
- The parsed dictionary, never null.
- Throws:
java.io.IOException
- If there is an error reading the stream.
-
readUntilEndOfCOSDictionary
private boolean readUntilEndOfCOSDictionary() throws java.io.IOException
Keep reading until the end of the dictionary object or the file has been hit, or until a '/' has been found.
- Returns:
- true if the end of the object or the file has been found, false otherwise, meaning that the caller can continue to parse the dictionary at the current position.
- Throws:
java.io.IOException
- if there is a reading error.
-
parseCOSDictionaryNameValuePair
private void parseCOSDictionaryNameValuePair(COSDictionary obj) throws java.io.IOException
- Throws:
java.io.IOException
-
skipWhiteSpaces
protected void skipWhiteSpaces() throws java.io.IOException
- Throws:
java.io.IOException
-
checkForEndOfString
private int checkForEndOfString(int bracesParameter) throws java.io.IOException
This is really a bug in the document creator's code, but it caused a crash in PDFBox. The first bug was in this format: /Title ( (5) /Creator, which was patched in one place. However, that patch missed the case where the number of opening and closing parentheses isn't balanced. The second bug was in this format: /Title (c:\) /Producer. This patch moves the code out of the parseCOSString method so that it can be used twice.
- Parameters:
bracesParameter
- the number of braces currently open.
- Returns:
- the corrected value of the brace counter
- Throws:
java.io.IOException
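To see why the two inputs quoted above are a problem, the following minimal sketch scans a PDF literal string the strict way: nested parentheses are counted and a backslash escapes the next character. It is not PDFBox's implementation and does not show the workaround itself; the class name LiteralStringSketch and the method findEnd are hypothetical.

```java
class LiteralStringSketch {
    /**
     * Returns the index just past the ')' that closes the literal string
     * starting at position open, or -1 if the string never closes.
     * Hypothetical helper, for illustration only.
     */
    static int findEnd(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '\\') {
                i++; // a backslash escapes the next character, so skip it
            } else if (ch == '(') {
                depth++;
            } else if (ch == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        return -1; // unbalanced: a strict reader runs to the end of the input
    }
}
```

With strict counting, both "( (5) /Creator" and "(c:\) /Producer" never close, so a parser without a recovery heuristic would swallow the dictionary keys that follow.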
-
parseCOSString
protected COSString parseCOSString() throws java.io.IOException
This will parse a PDF string.
- Returns:
- The parsed PDF string.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
parseCOSHexString
private COSString parseCOSHexString() throws java.io.IOException
This will parse a PDF hex string with fail-fast semantics, meaning that parsing stops as soon as a disallowed character is found. This is necessary in order to detect malformed input and to be able to skip to the start of the next object. The starting '<' is assumed to have been read already.
- Returns:
- The parsed PDF string.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
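The fail-fast idea can be sketched in isolation as follows. This is an illustration under stated assumptions, not the PDFBox code: HexStringSketch and parseHexBody are hypothetical names, the sketch reports a disallowed character with an IllegalArgumentException, and it does not model how PDFBox recovers afterwards. Padding an odd number of digits with a trailing 0 follows the PDF specification's rule for hex strings.

```java
class HexStringSketch {
    private static boolean isHexDigit(char ch) {
        return (ch >= '0' && ch <= '9')
                || (ch >= 'a' && ch <= 'f')
                || (ch >= 'A' && ch <= 'F');
    }

    private static boolean isWhitespace(char ch) {
        return ch == 0 || ch == 9 || ch == 10 || ch == 12 || ch == 13 || ch == 32;
    }

    /**
     * Parses the text that follows the opening '<' up to the closing '>'.
     * Stops with an exception at the first character that is neither a
     * hex digit nor whitespace (the fail-fast part).
     */
    static byte[] parseHexBody(String body) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char ch = body.charAt(i);
            if (ch == '>') {
                if (digits.length() % 2 != 0) {
                    digits.append('0'); // a missing final digit counts as 0
                }
                byte[] out = new byte[digits.length() / 2];
                for (int j = 0; j < out.length; j++) {
                    out[j] = (byte) Integer.parseInt(digits.substring(2 * j, 2 * j + 2), 16);
                }
                return out;
            }
            if (isHexDigit(ch)) {
                digits.append(ch);
            } else if (!isWhitespace(ch)) {
                throw new IllegalArgumentException("Disallowed character at index " + i);
            }
        }
        throw new IllegalArgumentException("Missing closing '>'");
    }
}
```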
-
parseCOSArray
protected COSArray parseCOSArray() throws java.io.IOException
This will parse a PDF array object.
- Returns:
- The parsed PDF array.
- Throws:
java.io.IOException
- If there is an error parsing the stream.
-
isEndOfName
protected boolean isEndOfName(int ch)
Determine if a character terminates a PDF name.
- Parameters:
ch
- The character
- Returns:
- true if the character terminates a PDF name, otherwise false.
-
parseCOSName
protected COSName parseCOSName() throws java.io.IOException
This will parse a PDF name from the stream.
- Returns:
- The parsed PDF name.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
isValidUTF8
private boolean isValidUTF8(byte[] input)
Returns true if a byte sequence is valid UTF-8.
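One common way to make such a check with a java.nio.charset.CharsetDecoder, the type of the utf8Decoder field, is to decode strictly and treat a decoding error as "not valid". The sketch below assumes that approach; whether PDFBox does exactly this is not stated on this page, and the class name Utf8CheckSketch is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8CheckSketch {
    static boolean isValidUtf8(byte[] input) {
        // REPORT makes decode() throw instead of silently replacing bad bytes.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```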
-
parseBoolean
protected COSBoolean parseBoolean() throws java.io.IOException
This will parse a boolean object from the stream.
- Returns:
- The parsed boolean object.
- Throws:
java.io.IOException
- If an IO error occurs during parsing.
-
parseDirObject
protected COSBase parseDirObject() throws java.io.IOException
This will parse a directory object from the stream.
- Returns:
- The parsed object.
- Throws:
java.io.IOException
- If there is an error during parsing.
-
parseCOSNumber
private COSNumber parseCOSNumber() throws java.io.IOException
- Throws:
java.io.IOException
-
readString
protected java.lang.String readString() throws java.io.IOException
This will read the next string from the stream.
- Returns:
- The string that was read from the stream, never null.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
readExpectedString
protected void readExpectedString(java.lang.String expectedString) throws java.io.IOException
Read one String and throw an exception if it is not the expected value.
- Parameters:
expectedString
- the String value that is expected.
- Throws:
java.io.IOException
- if the String that was read is not the expected value or if an I/O error occurs.
-
readExpectedString
protected final void readExpectedString(char[] expectedString, boolean skipSpaces) throws java.io.IOException
Reads the given pattern from seqSource, skipping whitespace at start and end if wanted.
- Parameters:
expectedString
- pattern to be skipped
skipSpaces
- if set to true, spaces before and after the string will be skipped
- Throws:
java.io.IOException
- if pattern could not be read
-
readExpectedChar
protected void readExpectedChar(char ec) throws java.io.IOException
Read one char and throw an exception if it is not the expected value.
- Parameters:
ec
- the char value that is expected.
- Throws:
java.io.IOException
- if the read char is not the expected value or if an I/O error occurs.
-
readString
protected java.lang.String readString(int length) throws java.io.IOException
This will read the next string from the stream up to a certain length.
- Parameters:
length
- The length to stop reading at.
- Returns:
- The string that was read from the stream, between 0 and length characters long.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
isClosing
protected boolean isClosing() throws java.io.IOException
This will tell if the next character is a closing bracket (close of a PDF array).
- Returns:
- true if the next byte is ']', false otherwise.
- Throws:
java.io.IOException
- If an IO error occurs.
-
isClosing
protected boolean isClosing(int c)
This will tell if the given character is a closing bracket (close of a PDF array).
- Parameters:
c
- The character to check against the closing bracket
- Returns:
- true if the character is ']', false otherwise.
-
readLine
protected java.lang.String readLine() throws java.io.IOException
This will read bytes until the first end of line marker occurs. NOTE: The EOL marker may consist of 1 (CR or LF) or 2 (CR and LF) bytes, which is an important detail if one wants to unread the line.
- Returns:
- The characters between the current position and the end of the line.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
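The note about the marker length matters because unreading a line means pushing back the line's bytes plus either one or two marker bytes. A minimal sketch of measuring the marker, assuming the input is available as a byte array (EolSketch and eolLength are hypothetical names, not part of BaseParser):

```java
class EolSketch {
    static boolean isEOL(int c) {
        return c == 0x0A || c == 0x0D; // LF or CR
    }

    /**
     * Returns how many bytes the end of line marker at position pos occupies:
     * 0 if there is no marker there, 1 for a lone CR or LF, 2 for CR followed by LF.
     */
    static int eolLength(byte[] data, int pos) {
        if (pos >= data.length || !isEOL(data[pos])) {
            return 0;
        }
        boolean crLf = data[pos] == 0x0D
                && pos + 1 < data.length
                && data[pos + 1] == 0x0A;
        return crLf ? 2 : 1;
    }
}
```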
-
isEOL
protected boolean isEOL() throws java.io.IOException
This will tell if the next byte to be read is an end of line byte.
- Returns:
- true if the next byte is 0x0A or 0x0D.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
isEOL
protected boolean isEOL(int c)
This will tell if the given byte is an end of line byte.
- Parameters:
c
- The character to check against end of line
- Returns:
- true if the byte is 0x0A or 0x0D.
-
isLF
private boolean isLF(int c)
-
isCR
private boolean isCR(int c)
-
isWhitespace
protected boolean isWhitespace() throws java.io.IOException
This will tell if the next byte is whitespace or not.
- Returns:
- true if the next byte in the stream is a whitespace character.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
isWhitespace
protected boolean isWhitespace(int c)
This will tell if a character is whitespace or not. These values are specified in table 1 (page 12) of ISO 32000-1:2008.
- Parameters:
c
- The character to check against whitespace
- Returns:
- true if the character is a whitespace character.
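Table 1 of ISO 32000-1:2008 lists six white-space characters: NUL (0), horizontal tab (9), line feed (10), form feed (12), carriage return (13) and space (32). A check against that table can be sketched as below; the class name PdfWhitespaceSketch is hypothetical and the code is an illustration of the table, not a copy of the PDFBox method.

```java
class PdfWhitespaceSketch {
    /** True for the six white-space characters of ISO 32000-1:2008, table 1. */
    static boolean isWhitespace(int c) {
        return c == 0    // NUL
            || c == 9    // horizontal tab
            || c == 10   // line feed
            || c == 12   // form feed
            || c == 13   // carriage return
            || c == 32;  // space
    }
}
```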
-
isSpace
protected boolean isSpace() throws java.io.IOException
This will tell if the next byte is a space or not.
- Returns:
- true if the next byte in the stream is a space character.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
isSpace
protected boolean isSpace(int c)
This will tell if the given value is a space or not.
- Parameters:
c
- The character to check against space
- Returns:
- true if the given value is a space character.
-
isDigit
protected boolean isDigit() throws java.io.IOException
This will tell if the next byte is a digit or not.
- Returns:
- true if the next byte in the stream is a digit.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
isDigit
protected static boolean isDigit(int c)
This will tell if the given value is a digit or not.
- Parameters:
c
- The character to be checked
- Returns:
- true if the given value is a digit.
-
skipSpaces
protected void skipSpaces() throws java.io.IOException
This will skip all spaces and comments that are present.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
readObjectNumber
protected long readObjectNumber() throws java.io.IOException
This will read a long from the stream and throw an IOException if the long value is negative or has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
- Returns:
- the object number being read.
- Throws:
java.io.IOException
- if an I/O error occurs
-
readGenerationNumber
protected int readGenerationNumber() throws java.io.IOException
This will read an integer from the stream and throw an IllegalArgumentException if the integer value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
- Returns:
- the generation number being read.
- Throws:
java.io.IOException
- if an I/O error occurs
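The range checks described for readObjectNumber() and readGenerationNumber() can be sketched as plain predicates. The constant values below are assumptions, since this page does not list them: 10,000,000,000 is the smallest value with more than 10 digits, and 65,535 is the maximum generation number in the PDF specification. Rejecting negative generation numbers is also an assumption. NumberThresholdSketch and its methods are hypothetical names.

```java
class NumberThresholdSketch {
    // Assumed values, not quoted from this page.
    static final long OBJECT_NUMBER_THRESHOLD = 10_000_000_000L;
    static final long GENERATION_NUMBER_THRESHOLD = 65_535L;

    /** Accepts non-negative object numbers with at most 10 digits. */
    static boolean isValidObjectNumber(long value) {
        return value >= 0 && value < OBJECT_NUMBER_THRESHOLD;
    }

    /** Accepts generation numbers from 0 up to the assumed maximum revision. */
    static boolean isValidGenerationNumber(int value) {
        return value >= 0 && value <= GENERATION_NUMBER_THRESHOLD;
    }
}
```

In the parser itself a failed check is reported by throwing an exception rather than by returning false.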
-
readInt
protected int readInt() throws java.io.IOException
This will read an integer from the stream.
- Returns:
- The integer that was read from the stream.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
readLong
protected long readLong() throws java.io.IOException
This will read a long from the stream.
- Returns:
- The long that was read from the stream.
- Throws:
java.io.IOException
- If there is an error reading from the stream.
-
readStringNumber
protected final java.lang.StringBuilder readStringNumber() throws java.io.IOException
This method is used by readInt() and readLong() to read a token. Any non-digit value is a valid delimiter.
- Returns:
- the token to parse as integer or long by the calling method.
- Throws:
java.io.IOException
- thrown by the seqSource methods.
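The tokenizing rule, collect digits and stop at the first non-digit, can be sketched on a plain string as follows. This is an illustration only: NumberTokenSketch and readDigits are hypothetical names, and the real method reads from seqSource rather than from a String.

```java
class NumberTokenSketch {
    /** Returns the run of ASCII digits that starts at pos; may be empty. */
    static String readDigits(String s, int pos) {
        int end = pos;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') {
            end++; // any non-digit value ends the token
        }
        return s.substring(pos, end);
    }
}
```

A caller would then convert the token, for example with Long.parseLong(NumberTokenSketch.readDigits("12 0 obj", 0)), and must handle the empty token that results when the input does not start with a digit.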
-
-