Parser implementation for PDF documents.
The parser takes a PDTwinStreamRef on creation, and uses PDScannerRef instances to interpret the input PDF and to generate the output PDF document. It provides support for fetching objects from the PDF in a number of ways, inserting objects, and finding things out about the input PDF, such as encryption state, or about a given object, such as whether it is still mutable or not.
versioning and indirect stream lengths
[1]: An intriguing problem arises with PDFs that have data appended to them, where multiple instances of the same object with the same generation ID exist, and whose streams have lengths that are indirect references rather than sizes. Whether the PDF spec permits this is unclear, but Adobe InDesign does it, so presumably yes, odd as it seems. In any case, the problem is demonstrated by the following PDF fragments:

    1 0 obj
    <</Length 2 0 R>>
    stream
    ...
    endstream
    endobj
    2 0 obj
    123
    endobj
    xref
    0 3
    0000000000 65535 f
    0000000005 00000 n  <-- obj 1
    0000000150 00000 n  <-- obj 2, defining obj 1 stream length
    trailer
    ...
    %%EOF
    % here, appended PDF kicks in
    1 0 obj
    <</Length 2 0 R>>
    stream
    ...
    endstream
    endobj
    2 0 obj
    457
    endobj
    xref
    0 3
    0000000000 65535 f
    0000000400 00000 n  <-- new obj 1
    0000000950 00000 n  <-- new obj 2, defining obj 1 stream length
    ...

When Pajdeg sets up the XREF table, it assumes that appended PDFs behave in a sane fashion, i.e. each append results in a set of replacements for given objects, which is indeed the case. However, since Pajdeg iterates over all objects rather than jumping to objects on demand, it encounters objects that are deprecated, such as the first 1 0 obj above. When Pajdeg hits the (old) 1 0 obj, it tries to move beyond it, but since its Length is a reference, Pajdeg looks up the reference in the XREF table, which has been overridden, and would incorrectly presume that the length of the (old) 1 0 obj's stream is 457 bytes, when in reality it is 123 bytes. (This is fixed, but at the cost of multi-layer XREF tables.)

Additionally, since each PDF append is in fact a patch operation, each XREF table not only has to be kept separate, but also has to be properly patched with the previous tables' content.
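The layering described above can be modeled in a few lines of C. This is only an illustrative sketch, not Pajdeg's actual implementation: the types `xref_layer` and `xref_stack`, the functions `xref_stack_push` and `xref_lookup`, and the fixed capacities are all hypothetical names chosen for this example. Each revision gets its own layer, created by copying the previous layer and patching in the entries the appended section redefines, so an old object can still be resolved against the table that was current when it was written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OBJECTS 8   /* hypothetical fixed capacity for this sketch */
#define MAX_LAYERS  4   /* hypothetical maximum number of revisions */

/* One XREF layer: the byte offset of every object as seen by one revision
   of the PDF (0 means "not defined"). */
typedef struct {
    long offset[MAX_OBJECTS];
} xref_layer;

/* All layers, oldest first. */
typedef struct {
    xref_layer layers[MAX_LAYERS];
    size_t     count;
} xref_stack;

/* Add a revision. The new layer starts as a copy of the previous one and is
   then patched with the entries from the appended section. Returns the index
   of the new layer. */
size_t xref_stack_push(xref_stack *xs, const int *ids, const long *offsets, size_t n)
{
    assert(xs->count < MAX_LAYERS);
    xref_layer *layer = &xs->layers[xs->count];
    if (xs->count > 0) {
        *layer = xs->layers[xs->count - 1];   /* inherit older entries */
    } else {
        memset(layer, 0, sizeof *layer);
    }
    for (size_t i = 0; i < n; i++) {
        assert(ids[i] >= 0 && ids[i] < MAX_OBJECTS);
        layer->offset[ids[i]] = offsets[i];   /* patch redefined entries */
    }
    return xs->count++;
}

/* Resolve an object ID against the table of one specific revision. */
long xref_lookup(const xref_stack *xs, size_t revision, int id)
{
    assert(revision < xs->count);
    assert(id >= 0 && id < MAX_OBJECTS);
    return xs->layers[revision].offset[id];
}
```

With the offsets from the fragments above, looking up object 2 in the first layer yields offset 150 (the object holding 123), while the same lookup in the second layer yields 950 (the object holding 457). An object that the append does not redefine keeps its original offset in the second layer.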
Construct a PDObjectRef for the current object.
- Parameters
-
- Note
- Subsequent calls to PDParserConstructObject() do nothing if the parser has not iterated since the previous call.
Create a new object with an appropriate object id (determined via the XREF table) for appending to the end of the output document.
- Note
- Appended objects are pushed onto a stack until the end of the input PDF is reached. Multiple appended objects therefore appear in the output PDF file in the reverse of the order in which they were created.
- Parameters
-
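The ordering effect mentioned in the note follows directly from last-in, first-out behavior. The sketch below models it with a plain array-backed stack; `append_stack` and its functions are hypothetical names for illustration and are not part of the Pajdeg API.

```c
#include <assert.h>
#include <stddef.h>

#define APPEND_CAPACITY 16   /* hypothetical limit for this sketch */

/* Object IDs waiting to be written once the input PDF has ended. */
typedef struct {
    int    obids[APPEND_CAPACITY];
    size_t count;
} append_stack;

/* Queue an object ID for appending. Returns 0 on success, -1 if full. */
int append_stack_push(append_stack *s, int obid)
{
    assert(s != NULL);
    if (s->count == APPEND_CAPACITY) return -1;
    s->obids[s->count++] = obid;
    return 0;
}

/* Pop the most recently queued ID into *obid.
   Returns 0 on success, -1 if the stack is empty. */
int append_stack_pop(append_stack *s, int *obid)
{
    assert(s != NULL && obid != NULL);
    if (s->count == 0) return -1;
    *obid = s->obids[--s->count];
    return 0;
}

/* Drain the stack into out[], as would happen at the end of the input.
   Returns the number of IDs written. */
size_t append_stack_flush(append_stack *s, int *out, size_t out_capacity)
{
    size_t n = 0;
    int obid;
    while (n < out_capacity && append_stack_pop(s, &obid) == 0) {
        out[n++] = obid;
    }
    return n;
}
```

Pushing IDs 10, 11, and 12 and then flushing yields 12, 11, 10. Code that depends on the relative order of appended objects in the output would need to create them in reverse.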
Create a new object with an appropriate object id (determined via the XREF table), for insertion.
The object is inserted after the current object in the stream, or at the current position if in between objects.
- Parameters
-
Set up a parser with a twin stream.
- Parameters
-
Write remaining objects, XREF table, trailer, and end fluff to output PDF.
- Parameters
-
Fetch the current object's stream, reading from the input source if necessary.
Once fetched, this will simply return the stream buffer as is.
- Parameters
-
parser | The parser. |
obid | The object ID of the current object. An assertion fails if it does not match the parser's expected ID, or if the current object is not in the original PDF (e.g. it came from PDParserCreateNewObject()). |
- Returns
- Stream buffer.
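The "read if necessary, then return the buffer as is" behavior is a fetch-once cache. The following sketch shows the general pattern only. It does not reflect Pajdeg's internals, and `lazy_stream`, `lazy_stream_fetch`, and `lazy_stream_release` are hypothetical names, with an in-memory string standing in for the input file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A stream whose bytes are read from the source only on first request. */
typedef struct {
    const char *source;   /* stands in for the input file */
    size_t      offset;   /* where the stream data starts in the source */
    size_t      length;   /* number of bytes in the stream */
    char       *buffer;   /* NULL until the first fetch */
    int         reads;    /* how many times the source was actually read */
} lazy_stream;

/* Return the stream buffer, reading it from the source on the first call.
   Later calls return the same buffer without touching the source.
   Returns NULL if allocation fails. */
const char *lazy_stream_fetch(lazy_stream *ls)
{
    assert(ls != NULL);
    if (ls->buffer == NULL) {
        ls->buffer = malloc(ls->length + 1);
        if (ls->buffer == NULL) return NULL;
        memcpy(ls->buffer, ls->source + ls->offset, ls->length);
        ls->buffer[ls->length] = '\0';
        ls->reads++;
    }
    return ls->buffer;
}

/* Free the cached buffer. */
void lazy_stream_release(lazy_stream *ls)
{
    assert(ls != NULL);
    free(ls->buffer);
    ls->buffer = NULL;
}
```

Two consecutive fetches return the same pointer and the source is read only once, which is why repeated calls are cheap after the first.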
Set up (if necessary) and return the PDCatalog object for the current PDF.
- Parameters
-
- Returns
- The catalog containing information about the pages in the PDF.
Determine the object id of the object stream containing the object with the given id, if any.
- Parameters
-
parser | The parser. |
obid | The object whose container object ID is to be determined. |
- Returns
- The object ID of the containing object stream, or -1 if the object is not inside an object stream.
Determine if the PDF is encrypted or not.
- Parameters
-
- Returns
- true if encrypted, false if unencrypted.
Get an immutable reference to the info object for the input PDF, or NULL if the input PDF does not contain an info object.
- Parameters
-
- Returns
- NULL or the info object reference
Get an immutable reference to the root object for the input PDF.
- Parameters
-
Get the total number of objects in the input stream.
- Parameters
-
Get a mutable reference to the trailer object for the PDF.
- Parameters
-
- Returns
- The trailer object, which is written at the very end and is thus mutable until the PDF is finalized.
Iterate to the next (living) object.
- Parameters
-
- Returns
- true if the parser moved on to another object, false if there are no more objects.
Fetch the definition (as a pd_stack) of the object with the given id.
- Note
- Use of PDParserLocateAndCreateObject is recommended, as it is generally faster and will not get confused about objects inserted this session.
- Warning
- This is an expensive operation that requires setting up a sufficiently large temporary buffer, seeking to the object in the input file, reading the definition, then seeking back.
-
This is no longer the recommended way to obtain random access objects. Use PDParserLocateAndCreateObject instead.
- Parameters
-
parser | The parser. |
obid | The object ID |
master | If true, the master PDX ref is referenced, otherwise the current PDX ref is used. Generally speaking, you always want to use the master (non-master is used internally to determine the deprecated length of a stream for a multi-part PDF). |
Fetch an object reference of the object with the given id.
- Warning
- This is an expensive operation that requires setting up a sufficiently large temporary buffer, seeking to the object in the input file, reading the definition, then seeking back.
- Note
- The object is returned as a retained object and must be PDRelease()d or it will leak.
- Parameters
-
parser | The parser. |
obid | The object ID |
master | If true, the master PDX ref is referenced, otherwise the current PDX ref is used. Generally speaking, you always want to use the master (non-master is used internally to determine the deprecated length of a stream for a multi-part PDF). |
Fetch the object stream of the given object.
Once fetched, this will simply return the stream buffer as is.
- Parameters
-
parser | The parser. |
object | The object whose stream is to be fetched. An assertion fails if it is not in the original PDF. |
- Returns
- Stream buffer.