Pajdeg  0.2.2

Parser implementation for PDF documents.

Files

file  PDParser.h
 
file  PDParserAttachment.h
 An attachment between two parsers.
 

Typedefs

typedef struct PDParser* PDParserRef
 

Functions

PDParserRef PDParserCreateWithStream (PDTwinStreamRef stream)
 
PDBool PDParserIterate (PDParserRef parser)
 
PDObjectRef PDParserConstructObject (PDParserRef parser)
 
PDObjectRef PDParserCreateNewObject (PDParserRef parser)
 
PDObjectRef PDParserCreateAppendedObject (PDParserRef parser)
 
char * PDParserFetchCurrentObjectStream (PDParserRef parser, PDInteger obid)
 
char * PDParserLocateAndFetchObjectStreamForObject (PDParserRef parser, PDObjectRef object)
 
PDBool PDParserGetEncryptionState (PDParserRef parser)
 
pd_stack PDParserLocateAndCreateDefinitionForObject (PDParserRef parser, PDInteger obid, PDBool master)
 
PDObjectRef PDParserLocateAndCreateObject (PDParserRef parser, PDInteger obid, PDBool master)
 
void PDParserDone (PDParserRef parser)
 
PDBool PDParserIsObjectStillMutable (PDParserRef parser, PDInteger obid)
 
PDInteger PDParserGetContainerObjectIDForObject (PDParserRef parser, PDInteger obid)
 
PDObjectRef PDParserGetRootObject (PDParserRef parser)
 
PDObjectRef PDParserGetInfoObject (PDParserRef parser)
 
PDObjectRef PDParserGetTrailerObject (PDParserRef parser)
 
PDCatalogRef PDParserGetCatalog (PDParserRef parser)
 
PDInteger PDParserGetTotalObjectCount (PDParserRef parser)
 

Detailed Description

Parser implementation for PDF documents.

The parser takes a PDTwinStreamRef on creation, and uses PDScannerRef instances to interpret the input PDF and to generate the output PDF document. It provides support for fetching objects from the PDF in a number of ways, inserting objects, and finding things out about the input PDF, such as encryption state, or about a given object, such as whether it is still mutable or not.

versioning and indirect stream lengths

[1]: An intriguing problem arises with PDFs that have data appended to them, where multiple instances of the same object with the same generation ID exist, and whose streams have (inhale) lengths which are indirect references rather than sizes (exhale). Is this accepted by the PDF spec, you wonder? Adobe InDesign does it, so I can only presume yes, but it seems rather odd. In any case, the problem is demonstrated by the following PDF fragments:

    1 0 obj
    <</Length 2 0 R>>
    stream
    ...
    endstream
    endobj
    2 0 obj
    123
    endobj
    xref
    0 3
    0000000000 65535 f
    0000000005 00000 n    <-- obj 1
    0000000150 00000 n    <-- obj 2, defining obj 1 stream length
    trailer
    ...
    %%EOF

    % here, appended PDF kicks in

    1 0 obj
    <</Length 2 0 R>>
    stream
    ...
    endstream
    endobj
    2 0 obj
    457
    endobj
    xref
    0 3
    0000000000 65535 f
    0000000400 00000 n    <-- new obj 1
    0000000950 00000 n    <-- new obj 2, defining obj 1 stream length
    ...

When Pajdeg sets up the XREF table, it assumes that appended PDFs behave in a sane fashion, i.e. that each append results in a set of replacements for given objects, which is indeed the case. However, since Pajdeg iterates over all objects rather than jumping to objects on demand, it encounters objects that are deprecated, such as the first 1 0 obj above. When Pajdeg hits the (old) 1 0 obj, it tries to move beyond it, but since its Length is a reference, Pajdeg looks up the reference in the XREF table, which has been overridden, and would incorrectly presume that the length of the (old) 1 0 obj's stream is 457 bytes, when in reality it is 123 bytes. (This is fixed, but at the cost of multi-layer XREF tables.)

Additionally, since each PDF append is in fact a patch operation, each XREF table not only has to be kept separate, but also has to be properly patched with the previous tables' content.
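The lookup problem described above can be modelled with a small self-contained sketch. This is not Pajdeg's implementation: the names XrefLayer, lookup_merged, lookup_layered and example_layers are hypothetical, and each table entry stores the integer value held by the object instead of a byte offset, purely to keep the toy short. Object 3 is also an invented addition, used to show an entry that the append leaves untouched.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 4   /* object IDs 0..3 (obj 3 is a hypothetical extra) */

/* One XREF layer per revision of the file. For each object ID it holds the
   integer value of that object in this revision, or -1 if this revision
   does not redefine the object. (A real table would hold byte offsets.) */
typedef struct {
    long value[MAX_OBJECTS];
} XrefLayer;

/* Merged lookup: search from the newest layer down, so the most recent
   definition always wins. This is the behaviour that misreads old objects. */
long lookup_merged(const XrefLayer *layers, size_t layer_count, int obid)
{
    assert(obid >= 0 && obid < MAX_OBJECTS);
    for (size_t i = layer_count; i-- > 0; ) {
        if (layers[i].value[obid] >= 0) {
            return layers[i].value[obid];
        }
    }
    return -1;  /* not defined in any layer */
}

/* Layered lookup: resolve the object as seen from revision `layer`, i.e.
   that layer patched with everything from the layers before it. */
long lookup_layered(const XrefLayer *layers, size_t layer_count,
                    size_t layer, int obid)
{
    assert(layer < layer_count);
    return lookup_merged(layers, layer + 1, obid);
}

/* The two revisions from the fragments: obj 2 is the stream length of obj 1. */
static const XrefLayer example_layers[2] = {
    { { -1, -1, 123,  9 } },  /* original PDF: 2 0 obj = 123, obj 3 = 9   */
    { { -1, -1, 457, -1 } },  /* appended PDF: 2 0 obj = 457, obj 3 kept  */
};
```

With these tables, lookup_merged(example_layers, 2, 2) yields 457 even when the caller is standing on the old 1 0 obj, while lookup_layered(example_layers, 2, 0, 2) yields the 123 that the old stream actually needs. A lookup from layer 1 for object 3 still falls back to the original definition, which is the "patched with the previous tables' content" requirement.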

Typedef Documentation

typedef struct PDParser* PDParserRef

A parser.

Function Documentation

PDObjectRef PDParserConstructObject ( PDParserRef  parser)

Construct a PDObjectRef for the current object.

Parameters
parser: The parser.
Note
Subsequent calls to PDParserConstructObject() do nothing if the parser has not iterated since the previous call.
PDObjectRef PDParserCreateAppendedObject ( PDParserRef  parser)

Create a new object with an appropriate object id (determined via the XREF table) for appending to the end of the output document.

Note
Appended objects are put into a stack awaiting the end of the input PDF. Multiple objects will thus come out in the opposite order in the output PDF file.
Parameters
parser: The parser.
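The ordering note for PDParserCreateAppendedObject() describes last-in, first-out behaviour. The toy stack below illustrates why objects created in the order A, B, C would be written as C, B, A. It is only an illustration under assumed names: PendingStack, pending_push and pending_flush are hypothetical and are not Pajdeg's pd_stack or any other Pajdeg API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical stand-in for the stack of appended objects.
   Only object IDs are stored, to keep the example small. */
typedef struct {
    int    ids[MAX_PENDING];
    size_t count;
} PendingStack;

/* Record an appended object; it waits here until the input PDF ends. */
void pending_push(PendingStack *stack, int obid)
{
    assert(stack->count < MAX_PENDING);
    stack->ids[stack->count++] = obid;
}

/* Pop every pending object into `out`, in the order it would be written.
   `out` must have room for stack->count entries.
   Returns the number of IDs written. */
size_t pending_flush(PendingStack *stack, int *out)
{
    size_t written = 0;
    while (stack->count > 0) {
        out[written++] = stack->ids[--stack->count];
    }
    return written;
}
```

Pushing IDs 10, 11, 12 and then flushing yields 12, 11, 10. Code that depends on a particular order in the output file would therefore need to create its appended objects in reverse.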
PDObjectRef PDParserCreateNewObject ( PDParserRef  parser)

Create a new object with an appropriate object id (determined via the XREF table), for insertion.

The object is inserted after the current object in the stream, or at the current position if in between objects.

Parameters
parser: The parser.
PDParserRef PDParserCreateWithStream ( PDTwinStreamRef  stream)

Set up a parser with a twin stream.

Parameters
stream: The stream to use.
void PDParserDone ( PDParserRef  parser)

Write remaining objects, XREF table, trailer, and end fluff to output PDF.

Parameters
parser: The parser.
char* PDParserFetchCurrentObjectStream ( PDParserRef  parser,
PDInteger  obid 
)

Fetch the current object's stream, reading from the input source if necessary.

Once fetched, this will simply return the stream buffer as is.

Parameters
parser: The parser.
obid: The object ID of the current object. An assertion is raised if it does not match the parser's expected ID, or if the current object is not in the original PDF (e.g. it came from PDParserCreateNewObject()).
Returns
Stream buffer.
PDCatalogRef PDParserGetCatalog ( PDParserRef  parser)

Set up (if necessary) and return the PDCatalog object for the current PDF.

Parameters
parser: The parser.
Returns
The catalog containing information about the pages in the PDF.
PDInteger PDParserGetContainerObjectIDForObject ( PDParserRef  parser,
PDInteger  obid 
)

Determine the object id of the object stream containing the object with the given id, if any.

Parameters
parser: The parser.
obid: The ID of the object whose container object ID is to be determined.
Returns
The object ID of the containing object stream, or -1 if the object is not inside an object stream.
PDBool PDParserGetEncryptionState ( PDParserRef  parser)

Determine if the PDF is encrypted or not.

Parameters
parser: The parser.
Returns
true if encrypted, false if unencrypted.
PDObjectRef PDParserGetInfoObject ( PDParserRef  parser)

Get an immutable reference to the info object for the input PDF, or NULL if the input PDF does not contain an info object.

Parameters
parser: The parser.
Returns
NULL or the info object reference.
PDObjectRef PDParserGetRootObject ( PDParserRef  parser)

Get an immutable reference to the root object for the input PDF.

Parameters
parser: The parser.
PDInteger PDParserGetTotalObjectCount ( PDParserRef  parser)

Get the total number of objects in the input stream.

Parameters
parser: The parser.
PDObjectRef PDParserGetTrailerObject ( PDParserRef  parser)

Get a mutable reference to the trailer object for the PDF.

Parameters
parser: The parser.
Returns
The trailer object, which is written at the very end and is thus mutable until the PDF is finalized.
PDBool PDParserIsObjectStillMutable ( PDParserRef  parser,
PDInteger  obid 
)

Determine whether the object with the given id has already been written to the output stream (i.e. has become immutable).

Warning
This function only tells whether the object with the given ID has already been iterated past. Whether a PDObjectRef instance is actually mutable also depends on whether it was inherently immutable: objects based on PDParserLocateAndCreateDefinitionForObject() are inherently immutable, and include the object returned from PDPipeGetRootObject().
Parameters
parser: The parser.
obid: The object ID.
See also
PDPipeGetRootObject
PDParserLocateAndCreateDefinitionForObject
PDBool PDParserIterate ( PDParserRef  parser)

Iterate to the next (living) object.

Parameters
parser: The parser.
Returns
false if there are no more objects.
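The sketch below shows one plausible call order for PDParserIterate(), PDParserConstructObject() and PDParserDone(), based only on the descriptions on this page. So that it compiles without Pajdeg, the top half consists of stand-in declarations: the PDBool typedef, the contents of struct PDParser and the three function bodies are mocks invented for this example, not Pajdeg's real definitions. The helper visit_all_objects() is likewise hypothetical. In real use the parser may be driven by higher-level code, and ownership rules for the returned object are not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ---- Stand-in declarations (NOT Pajdeg's real definitions). ----
   With Pajdeg available, drop this part and include PDParser.h instead. */
typedef bool PDBool;
typedef struct PDObject *PDObjectRef;
struct PDParser { int remaining; bool finished; };  /* mock state only */
typedef struct PDParser *PDParserRef;

/* Mock: report one more object until the counter runs out. */
static PDBool PDParserIterate(PDParserRef parser)
{
    if (parser->remaining == 0) {
        return false;
    }
    parser->remaining--;
    return true;
}

/* Mock: there are no real objects to hand out, so return NULL. */
static PDObjectRef PDParserConstructObject(PDParserRef parser)
{
    (void)parser;
    return NULL;
}

/* Mock: only records that finalization was requested. */
static void PDParserDone(PDParserRef parser)
{
    parser->finished = true;
}
/* ---- End of stand-ins. ---- */

/* Visit every living object in the input, then finalize the output.
   Returns the number of objects visited. */
int visit_all_objects(PDParserRef parser)
{
    int visited = 0;
    assert(parser != NULL);

    while (PDParserIterate(parser)) {
        PDObjectRef object = PDParserConstructObject(parser);
        (void)object;   /* inspect or modify the current object here */
        visited++;
    }

    /* Per the description of PDParserDone(): write remaining objects,
       the XREF table and the trailer. */
    PDParserDone(parser);
    return visited;
}
```

The loop relies on PDParserIterate() returning false once there are no more objects, and calls PDParserConstructObject() at most once per iteration, since repeated calls without iterating are documented to do nothing.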
pd_stack PDParserLocateAndCreateDefinitionForObject ( PDParserRef  parser,
PDInteger  obid,
PDBool  master 
)

Fetch the definition (as a pd_stack) of the object with the given id.

Note
This is no longer the recommended way to obtain random-access objects. Use PDParserLocateAndCreateObject() instead, as it is generally faster and will not get confused about objects inserted this session.
Warning
This is an expensive operation that requires setting up a sufficiently large temporary buffer, seeking to the object in the input file, reading the definition, then seeking back.
Parameters
parser: The parser.
obid: The object ID.
master: If true, the master PDX ref is referenced, otherwise the current PDX ref is used. Generally speaking, you always want to use the master (non-master is used internally to determine the deprecated length of a stream for a multi-part PDF).
PDObjectRef PDParserLocateAndCreateObject ( PDParserRef  parser,
PDInteger  obid,
PDBool  master 
)

Fetch an object reference of the object with the given id.

Warning
This is an expensive operation that requires setting up a sufficiently large temporary buffer, seeking to the object in the input file, reading the definition, then seeking back.
Note
The object is returned as a retained object and must be PDRelease()d or it will leak.
Parameters
parser: The parser.
obid: The object ID.
master: If true, the master PDX ref is referenced, otherwise the current PDX ref is used. Generally speaking, you always want to use the master (non-master is used internally to determine the deprecated length of a stream for a multi-part PDF).
char* PDParserLocateAndFetchObjectStreamForObject ( PDParserRef  parser,
PDObjectRef  object 
)

Fetch the object stream of the given object.

Once fetched, this will simply return the stream buffer as is.

Parameters
parser: The parser.
object: The object whose stream is to be fetched. An assertion is raised if it is not in the original PDF.
Returns
Stream buffer.