Parser implementation for PDF documents.

Files
file	PDParser.h

file	PDParserAttachment.h
	An attachment between two parsers.

Typedefs
typedef struct PDParser *	PDParserRef

Functions
PDParserRef	PDParserCreateWithStream (PDTwinStreamRef stream)

PDBool	PDParserIterate (PDParserRef parser)

PDObjectRef	PDParserConstructObject (PDParserRef parser)

PDObjectRef	PDParserCreateNewObject (PDParserRef parser)

PDObjectRef	PDParserCreateAppendedObject (PDParserRef parser)

char *	PDParserFetchCurrentObjectStream (PDParserRef parser, PDInteger obid)

char *	PDParserLocateAndFetchObjectStreamForObject (PDParserRef parser, PDObjectRef object)

PDBool	PDParserGetEncryptionState (PDParserRef parser)

pd_stack	PDParserLocateAndCreateDefinitionForObject (PDParserRef parser, PDInteger obid, PDBool master)

PDObjectRef	PDParserLocateAndCreateObject (PDParserRef parser, PDInteger obid, PDBool master)

void	PDParserDone (PDParserRef parser)

PDBool	PDParserIsObjectStillMutable (PDParserRef parser, PDInteger obid)

PDInteger	PDParserGetContainerObjectIDForObject (PDParserRef parser, PDInteger obid)

PDObjectRef	PDParserGetRootObject (PDParserRef parser)

PDObjectRef	PDParserGetInfoObject (PDParserRef parser)

PDObjectRef	PDParserGetTrailerObject (PDParserRef parser)

PDCatalogRef	PDParserGetCatalog (PDParserRef parser)

PDInteger	PDParserGetTotalObjectCount (PDParserRef parser)

Detailed Description

Parser implementation for PDF documents.

The parser takes a PDTwinStreamRef on creation, and uses PDScannerRef instances to interpret the input PDF and to generate the output PDF document. It provides support for fetching objects from the PDF in a number of ways, inserting objects, and finding things out about the input PDF, such as encryption state, or about a given object, such as whether it is still mutable or not.

versioning and indirect stream lengths

[1]: An intriguing problem arises with PDFs that have data appended to them, where multiple instances of the same object with the same generation ID exist, and whose streams have lengths that are indirect references rather than sizes. Is this accepted by the PDF spec, you wonder? Adobe InDesign does it, so I can only presume yes, but it seems rather odd. In any case, the problem is demonstrated by the following PDF fragments:

    1 0 obj
    <</Length 2 0 R>>
    stream
    ...
    endstream
    endobj
    2 0 obj
    123
    endobj
    xref
    0 3
    0000000000 65535 n
    0000000005 00000 n    <-- obj 1
    0000000150 00000 n    <-- obj 2, defining obj 1 stream length
    trailer
    ...
    %EOF
    % here, appended PDF kicks in
    1 0 obj
    <</Length 2 0 R>>
    stream
    ...
    endstream
    2 0 obj
    457
    endobj
    xref
    0 3
    0000000000 65535 n
    0000000400 00000 n    <-- new obj 1
    0000000950 00000 n    <-- new obj 2, defining obj 1 stream length
    ...

When Pajdeg sets up the XREF table, it assumes that appended PDFs behave in a sane fashion, i.e. each append results in a set of replacements for given objects, which is indeed the case. However, since Pajdeg iterates over all objects rather than jumping to objects on demand, it encounters objects that are deprecated, such as the first 1 0 obj above. When Pajdeg hits the (old) 1 0 obj, it tries to move beyond it, but since its Length is a reference, Pajdeg looks up the reference in the XREF table, which has been overridden, and would incorrectly presume that the length of the (old) 1 0 obj's stream is 457 bytes, when in reality it is 123 bytes. (This is fixed, but at the cost of multi-layer XREF tables.)

Additionally, since each PDF append is in fact a patch operation, each XREF table has to not only be kept separate, but also be properly patched with the previous tables' content.
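The layered lookup described above can be sketched in plain C. This is only an illustration of the idea, not Pajdeg's implementation: `xref_layer`, `xref_layer_init`, `xref_layer_patch`, `xref_layer_for` and `demo_length_offset` are hypothetical names, and the offsets are the ones from the fragments (obj 1 at 5 and 400, obj 2 at 150 and 950). Each layer starts as a copy of the previous one and is then patched, and an object found at a given offset is resolved against the newest layer that still lists it at that offset.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 4   /* illustrative cap, large enough for the fragments */
#define ABSENT (-1L)

/* One XREF layer per revision: the original file plus each append.
   Hypothetical type, not part of Pajdeg. */
typedef struct {
    long offset[MAX_OBJECTS];   /* byte offset of object i, or ABSENT */
} xref_layer;

/* Start a layer from the previous layer's content (or empty if none). */
void xref_layer_init(xref_layer *layer, const xref_layer *previous)
{
    for (size_t i = 0; i < MAX_OBJECTS; i++)
        layer->offset[i] = previous ? previous->offset[i] : ABSENT;
}

/* Apply one replacement from an appended XREF section. */
void xref_layer_patch(xref_layer *layer, size_t obid, long offset)
{
    assert(obid < MAX_OBJECTS);
    layer->offset[obid] = offset;
}

/* Index of the newest layer that lists `obid` at `found_at`, or -1.
   A deprecated object only matches the older layers. */
int xref_layer_for(const xref_layer *layers, int count, size_t obid, long found_at)
{
    assert(obid < MAX_OBJECTS);
    for (int i = count - 1; i >= 0; i--)
        if (layers[i].offset[obid] == found_at)
            return i;
    return -1;
}

/* Rebuild the two revisions from the fragments and return the offset of
   the Length object (obj 2) that applies to obj 1 found at `found_at`. */
long demo_length_offset(long found_at)
{
    xref_layer layers[2];

    xref_layer_init(&layers[0], NULL);
    xref_layer_patch(&layers[0], 1, 5);
    xref_layer_patch(&layers[0], 2, 150);

    xref_layer_init(&layers[1], &layers[0]);
    xref_layer_patch(&layers[1], 1, 400);
    xref_layer_patch(&layers[1], 2, 950);

    int layer = xref_layer_for(layers, 2, 1, found_at);
    return layer < 0 ? ABSENT : layers[layer].offset[2];
}
```

With these tables, the old obj 1 at offset 5 resolves its Length through the first layer (offset 150, value 123), while the new obj 1 at offset 400 resolves through the second layer (offset 950, value 457). A single merged table would send both to offset 950.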

Typedef Documentation

typedef struct PDParser* PDParserRef

A parser.

Function Documentation

PDObjectRef PDParserConstructObject ( PDParserRef parser )

Construct a PDObjectRef for the current object.

Parameters

parser The parser.

Note: Calling PDParserConstructObject() again before the parser has iterated does nothing.

PDObjectRef PDParserCreateAppendedObject ( PDParserRef parser )

Create a new object with an appropriate object id (determined via the XREF table) for appending to the end of the output document.

Note: Appended objects are put into a stack awaiting the end of the input PDF. Multiple appended objects will thus appear in the output PDF file in the reverse of the order in which they were created.

Parameters

parser The parser.
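The ordering mentioned in the note for PDParserCreateAppendedObject() follows from ordinary stack behaviour. The sketch below is a stand-alone illustration, not Pajdeg code: `pending_stack`, `pending_push`, `pending_flush` and `demo_written_at` are hypothetical names, and the object ids 10, 11 and 12 are made up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16   /* illustrative capacity */

/* Hypothetical stack of object ids waiting for the end of the input. */
typedef struct {
    int obids[MAX_PENDING];
    size_t count;
} pending_stack;

/* Queue an appended object. */
void pending_push(pending_stack *s, int obid)
{
    assert(s->count < MAX_PENDING);
    s->obids[s->count++] = obid;
}

/* Pop every pending id into `out` in the order it would be written.
   Returns the number of ids written. */
size_t pending_flush(pending_stack *s, int *out)
{
    size_t n = 0;
    while (s->count > 0)
        out[n++] = s->obids[--s->count];
    return n;
}

/* Create objects 10, 11, 12 in that order and report which id ends up
   at `position` in the output. */
int demo_written_at(size_t position)
{
    pending_stack s = { {0}, 0 };
    int out[MAX_PENDING];

    pending_push(&s, 10);
    pending_push(&s, 11);
    pending_push(&s, 12);

    size_t n = pending_flush(&s, out);
    assert(position < n);
    return out[position];
}
```

The last object created (12) is written first and the first object created (10) is written last, so code that depends on the physical order of appended objects should create them in reverse.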

PDObjectRef PDParserCreateNewObject ( PDParserRef parser )

Create a new object with an appropriate object id (determined via the XREF table), for insertion.

The object is inserted after the current object in the stream, or at the current position if in between objects.

Parameters

parser The parser.

PDParserRef PDParserCreateWithStream ( PDTwinStreamRef stream )

Set up a parser with a twin stream.

Parameters

stream The stream to use.

void PDParserDone ( PDParserRef parser )

Write remaining objects, XREF table, trailer, and end fluff to output PDF.

Parameters

parser The parser.

char* PDParserFetchCurrentObjectStream	(	PDParserRef	parser,
		PDInteger	obid
	)

Fetch the current object's stream, reading from the input source if necessary.

Once fetched, this will simply return the stream buffer as is.

Parameters

parser	The parser.
obid	The object ID of the current object. Assertion is thrown if it does not match the parser's expected ID, or if the current object is not in the original PDF (e.g. from PDParserCreateNewObject()).

Returns: Stream buffer.
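The "read once, then return the buffer as is" behaviour can be pictured as a simple fetch-and-cache pattern. The following is a minimal sketch under stated assumptions, not the library's code: `cached_object`, `cached_fetch_stream`, `cached_object_clear` and `demo_reads_after_two_fetches` are hypothetical, and a string constant stands in for the input source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object with a lazily fetched stream. */
typedef struct {
    const char *source;   /* stand-in for the input file content */
    char *stream;         /* cached buffer, NULL until first fetch */
    int reads;            /* how many times the source was read */
} cached_object;

/* Read the stream on first use; afterwards return the same buffer. */
const char *cached_fetch_stream(cached_object *ob)
{
    if (ob->stream == NULL) {
        size_t len = strlen(ob->source);
        ob->stream = malloc(len + 1);
        assert(ob->stream != NULL);
        memcpy(ob->stream, ob->source, len + 1);
        ob->reads++;
    }
    return ob->stream;
}

/* Release the cached buffer. */
void cached_object_clear(cached_object *ob)
{
    free(ob->stream);
    ob->stream = NULL;
}

/* Fetch twice; return the number of reads if both fetches returned the
   same buffer, or -1 otherwise. */
int demo_reads_after_two_fetches(void)
{
    cached_object ob = { "stream bytes", NULL, 0 };
    const char *first = cached_fetch_stream(&ob);
    const char *second = cached_fetch_stream(&ob);
    int same = (first == second);
    int reads = ob.reads;

    cached_object_clear(&ob);
    return same ? reads : -1;
}
```

In this sketch the second fetch touches no input at all and hands back the identical pointer, which is the practical meaning of "as is": a caller sees the buffer in whatever state it was left after the first fetch.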

PDCatalogRef PDParserGetCatalog ( PDParserRef parser )

Set up (if necessary) and return the PDCatalog object for the current PDF.

Parameters

parser The parser.

Returns: The catalog containing information about the pages in the PDF.

PDInteger PDParserGetContainerObjectIDForObject	(	PDParserRef	parser,
		PDInteger	obid
	)

Determine the object id of the object stream containing the object with the given id, if any.

Parameters

parser	The parser.
obid	The object whose container object ID is to be determined.

Returns: -1 if the object is not inside an object stream.

PDBool PDParserGetEncryptionState ( PDParserRef parser )

Determine if the PDF is encrypted or not.

Parameters

parser The parser.

Returns: true if encrypted, false if unencrypted.

PDObjectRef PDParserGetInfoObject ( PDParserRef parser )

Get an immutable reference to the info object for the input PDF, or NULL if the input PDF does not contain an info object.

Parameters

parser The parser.

Returns: NULL or the info object reference.

PDObjectRef PDParserGetRootObject ( PDParserRef parser )

Get an immutable reference to the root object for the input PDF.

Parameters

parser The parser.

PDInteger PDParserGetTotalObjectCount ( PDParserRef parser )

Get the total number of objects in the input stream.

Parameters

parser The parser.

PDObjectRef PDParserGetTrailerObject ( PDParserRef parser )

Get a mutable reference to the trailer object for the PDF.

Parameters

parser The parser.

Returns: The trailer object, which is written at the very end and is thus mutable until the PDF is finalized.

PDBool PDParserIsObjectStillMutable	(	PDParserRef	parser,
		PDInteger	obid
	)

Determine if the object with the given id has already been written to the output stream (i.e. has become immutable).

Warning: This function only tells whether the object with the given ID has already been iterated past. Whether a PDObjectRef instance is actually mutable also depends on whether it is inherently immutable: objects based on PDParserLocateAndCreateDefinitionForObject() are inherently immutable, and these include the object returned from PDPipeGetRootObject().

Parameters

parser	The parser.
obid	The object ID.

See also: PDPipeGetRootObject; PDParserLocateAndCreateDefinitionForObject

PDBool PDParserIterate ( PDParserRef parser )

Iterate to the next (living) object.

Parameters

parser The parser.

Returns: false if there are no more objects.

pd_stack PDParserLocateAndCreateDefinitionForObject	(	PDParserRef	parser,
		PDInteger	obid,
		PDBool	master
	)

Fetch the definition (as a pd_stack) of the object with the given id.

Note: Use of PDParserLocateAndCreateObject is recommended, as it is generally faster and will not get confused about objects inserted this session.

Warning: This is an expensive operation that requires setting up a sufficiently large temporary buffer, seeking to the object in the input file, reading the definition, then seeking back. It is no longer the recommended way to obtain random-access objects; use PDParserLocateAndCreateObject() instead.

Parameters

parser	The parser.
obid	The object ID.
master	If true, the master PDX ref is referenced, otherwise the current PDX ref is used. Generally speaking, you always want to use the master (non-master is used internally to determine the deprecated length of a stream for a multi-part PDF).

PDObjectRef PDParserLocateAndCreateObject	(	PDParserRef	parser,
		PDInteger	obid,
		PDBool	master
	)

Fetch an object reference of the object with the given id.

Warning: This is an expensive operation that requires setting up a sufficiently large temporary buffer, seeking to the object in the input file, reading the definition, then seeking back.

Note: The object is returned as a retained object and must be PDRelease()d or it will leak.

Parameters

parser	The parser.
obid	The object ID.
master	If true, the master PDX ref is referenced, otherwise the current PDX ref is used. Generally speaking, you always want to use the master (non-master is used internally to determine the deprecated length of a stream for a multi-part PDF).

char* PDParserLocateAndFetchObjectStreamForObject	(	PDParserRef	parser,
		PDObjectRef	object
	)

Fetch the object stream of the given object.

Once fetched, this will simply return the stream buffer as is.

Parameters

parser	The parser.
object	The object whose stream is to be fetched. Assertion is thrown if it is not in the original PDF.

Returns: Stream buffer.
