Pajdeg  0.2.2
Adding metadata to a PDF

From samples/add-metadata.c:

We now want to add metadata to an existing PDF. If the PDF has metadata already, we explode, but that's fine, we'll deal with that soon. The first new thing we have to do is declare a mutator task function above main.

PDTaskResult addMetadata(PDPipeRef pipe, PDTaskRef task, PDObjectRef object, void *info);

It takes four arguments: the pipe, its owning task, the object it's supposed to do its magic on, and an info object that we can use to store info in. We'll get to this function in a bit.

The next thing we have to do is create a new object. This is actually done straight off without using tasks. It may be a little confusing at first, but a mutator mutates/changes something, it never creates anything.

In any case, to create objects, we need to introduce a new friend, the parser. In our main():

// get the pipe's parser
PDParserRef parser = PDPipeGetParser(pipe);
printf("- adding metadata object\n");
// add a brand spanking new object to the PDF
PDObjectRef meta = PDParserCreateNewObject(parser);

The object has been given a unique object ID and is set up, ready to be stuffed into the output PDF as soon as we hit execute.

We want to tweak it first, of course. Here, we're setting the metadata to our own string. The three flags at the end tell the object whether it should set the Length property using our provided length, whether the passed buffer should be freed once the object is finished with it, and whether the passed content is already encrypted. If you strdup()'d a string and passed it in, you would give true as the second flag, unless you planned to free() it yourself once you knew the object was done with it.

char *metaString = "Hello World!";
PDObjectSetStream(meta, metaString, strlen(metaString), true, false, false);
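To make the "should the buffer be freed" flag more concrete, here is a minimal, self-contained sketch of how such an ownership flag can work. It does not use Pajdeg at all: SketchStream, sketch_stream_set, sketch_stream_clear, sketch_dup and sketch_free_count are hypothetical names invented for this illustration, and Pajdeg's real internals may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object holding a stream; NOT a Pajdeg type. */
typedef struct {
    char  *data;       /* stream bytes, possibly borrowed from the caller */
    size_t length;     /* number of bytes in data */
    bool   owns_data;  /* true: we free data when done; false: caller keeps it */
} SketchStream;

/* Counts how many buffers the sketch has freed, so the behavior is observable. */
static int sketch_free_count = 0;

/* Drop the current buffer, freeing it only if ownership was handed over. */
void sketch_stream_clear(SketchStream *s)
{
    assert(s != NULL);
    if (s->owns_data && s->data != NULL) {
        free(s->data);
        sketch_free_count++;
    }
    s->data = NULL;
    s->length = 0;
    s->owns_data = false;
}

/* Attach a buffer; free_after_use plays the role of the second flag. */
void sketch_stream_set(SketchStream *s, char *buf, size_t len, bool free_after_use)
{
    sketch_stream_clear(s);
    s->data = buf;
    s->length = len;
    s->owns_data = free_after_use;
}

/* Helper for the example: duplicate a string on the heap. */
char *sketch_dup(const char *text)
{
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, text, n);
    return copy;
}
```

A SketchStream must start out zero-initialized (SketchStream s = {0};). A buffer that was not heap-allocated, such as a string literal, is attached with false because it must never be passed to free(); a heap copy made with sketch_dup() is attached with true so the stream releases it.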

Adding a metadata object is fine and all, but it won't do any good unless we point the PDF's root object at the new metadata object. That's what our task from before is for.

PDTaskRef rootUpdater = PDTaskCreateMutatorForPropertyType(PDPropertyRootObject,
                                                           addMetadata);
// pass it the meta object as its info
PDTaskSetInfo(rootUpdater, meta);

We're creating a mutator task for the "root object" property type (i.e. the root object of the PDF), and we're passing our addMetadata function to it, and finally we're setting the info object to the meta object we made earlier.

Under the hood, this sets up a filter task for the root object's object ID, and attaches a mutator task to that filter task. The filter task will be pinged every time an object passes through the pipe and if the filter encounters the object whose ID matches the root object of the PDF, it will trigger its mutator task and hand it the object in question. That mutator task is our addMetadata function.

Which we will get to very soon. Before we do, though, there are a few things left: adding our task to the pipe,

PDPipeAddTask(pipe, rootUpdater);

executing the pipe,

PDInteger obcount = PDPipeExecute(pipe);
printf("- execution finished (%ld objects processed)\n", obcount);

and some clean-up.

PDRelease(pipe);
PDRelease(rootUpdater);
PDRelease(meta);

Caution: releasing the meta object before calling PDPipeExecute() will cause a crash, because addMetadata uses it and addMetadata is not called until PDPipeExecute() has been called.
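The filter-and-mutator flow described earlier, and the reason for this caution, can be sketched without Pajdeg. Everything below (SketchObject, SketchTask, sketch_pipe_execute, sketch_add_metadata) is a hypothetical model written for this page, not Pajdeg's actual implementation. The point is that the task only stores the info pointer; nothing dereferences it until the pipe runs.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this models the idea, not Pajdeg's code. */
typedef enum { SKETCH_TASK_DONE, SKETCH_TASK_ABORT } SketchResult;

typedef struct {
    long id;            /* object ID */
    long metadata_ref;  /* ID of the referenced metadata object, 0 if none */
} SketchObject;

typedef SketchResult (*SketchMutator)(SketchObject *object, void *info);

typedef struct {
    long          target_id;  /* the filter: which object ID to watch for */
    SketchMutator mutator;    /* called when the filter matches */
    void         *info;       /* borrowed pointer; must outlive the run */
} SketchTask;

/* Pass every object through the task; returns the number of objects processed. */
long sketch_pipe_execute(SketchObject *objects, size_t count, const SketchTask *task)
{
    assert(objects != NULL && task != NULL);
    long processed = 0;
    for (size_t i = 0; i < count; i++) {
        processed++;
        /* the filter is pinged for every object, but only a match triggers the mutator */
        if (objects[i].id == task->target_id) {
            if (task->mutator(&objects[i], task->info) == SKETCH_TASK_ABORT) break;
        }
    }
    return processed;
}

/* Example mutator: point the matched object at the metadata object in info. */
SketchResult sketch_add_metadata(SketchObject *object, void *info)
{
    const SketchObject *meta = info;  /* dereferenced only now, during execute */
    object->metadata_ref = meta->id;
    return SKETCH_TASK_DONE;
}
```

In this model, a task built as { 2, sketch_add_metadata, &meta } leaves every object untouched until sketch_pipe_execute() is called. If meta had been destroyed before that call, the mutator would read freed memory, which is the same hazard the caution describes.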

The last part is the actual task callback.

PDTaskResult addMetadata(PDPipeRef pipe, PDTaskRef task, PDObjectRef object, void *info)
{

We could blindly change the root object's Metadata key to point to our new object. We could, but it would be very bad. We would leave a potentially huge abandoned object in the resulting PDF. Even worse, a PDF would end up with one metadata object for every time it had gone through our pipe, since we would be adding a new one on each pass.

    printf("- task 'addMetadata' starting\n");
    // get the dictionary for the object
    PDDictionaryRef dict = PDObjectGetDictionary(object);
    // we will ruthlessly explode if the object already HAS a metadata entry
    // (see replace-metadata.c)
    if (PDDictionaryGetEntry(dict, "Metadata")) {
        // normally you would return PDTaskAbort here, instead of killing the
        // entire application
        die("error: metadata already exists! aborting!\n");
    }

Here, we are using a PDDictionary for the first time. It's simply a key/value pair container, used to represent dictionaries in PDFs. We can get the dictionary associated with a PDObject using PDObjectGetDictionary(). There is a corresponding PDObjectGetArray() for array type objects, and so on.

In any case, if PDDictionaryGetEntry() returns a non-NULL value for the "Metadata" key, we explode. With that out of the way, setting the metadata is fairly straightforward.

Our meta object is the info, passed to the task:

PDObjectRef meta = info;

We put this into the dictionary as the Metadata value:

PDDictionarySetEntry(dict, "Metadata", meta);

Note that while meta is a PDObject, by setting a PDDictionary entry's value to a PDObject, it will ultimately end up being a PDReference value. In other words, objects will translate into "<object id> <generation number> R" in a PDDictionary, when written to a PDF.
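As a small illustration of that textual form, the sketch below formats an indirect reference from an object ID and a generation number. sketch_format_reference is a hypothetical helper written for this page; it is not a Pajdeg function, and Pajdeg performs this conversion internally.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of Pajdeg: write "<id> <gen> R" into out.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small or formatting fails. */
int sketch_format_reference(char *out, size_t out_size, long object_id, long generation)
{
    assert(out != NULL);
    int n = snprintf(out, out_size, "%ld %ld R", object_id, generation);
    if (n < 0 || (size_t)n >= out_size) return -1;
    return n;
}
```

For an object with ID 12 and generation number 0, the helper produces the text 12 0 R, so the root dictionary entry in the written PDF would read /Metadata 12 0 R.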

Finally we return PDTaskDone to signal that we're finished:

return PDTaskDone;
}
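The comment in the callback notes that you would normally return PDTaskAbort rather than kill the whole application. The check-then-set shape of such a callback can be sketched on its own, using a tiny stand-in dictionary. SketchDict, sketch_dict_get, sketch_dict_set and sketch_add_metadata_checked are hypothetical names for this illustration only; they are not Pajdeg's PDDictionary API, and the sketch makes no claim about what Pajdeg does after a task aborts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes mirroring the done/abort idea; not Pajdeg's enum. */
typedef enum { SKETCH_TASK_DONE, SKETCH_TASK_ABORT } SketchResult;

#define SKETCH_DICT_CAPACITY 8

/* Tiny fixed-size key/value container standing in for a PDF dictionary. */
typedef struct {
    const char *keys[SKETCH_DICT_CAPACITY];
    const void *values[SKETCH_DICT_CAPACITY];
    size_t      count;
} SketchDict;

/* Return the value stored for key, or NULL if the key is absent. */
const void *sketch_dict_get(const SketchDict *dict, const char *key)
{
    assert(dict != NULL && key != NULL);
    for (size_t i = 0; i < dict->count; i++) {
        if (strcmp(dict->keys[i], key) == 0) return dict->values[i];
    }
    return NULL;
}

/* Insert or overwrite key; returns 0 on success, -1 if the dictionary is full. */
int sketch_dict_set(SketchDict *dict, const char *key, const void *value)
{
    assert(dict != NULL && key != NULL);
    for (size_t i = 0; i < dict->count; i++) {
        if (strcmp(dict->keys[i], key) == 0) {
            dict->values[i] = value;
            return 0;
        }
    }
    if (dict->count == SKETCH_DICT_CAPACITY) return -1;
    dict->keys[dict->count] = key;
    dict->values[dict->count] = value;
    dict->count++;
    return 0;
}

/* Refuse to overwrite existing metadata, reporting abort instead of exiting. */
SketchResult sketch_add_metadata_checked(SketchDict *root, const void *meta)
{
    if (sketch_dict_get(root, "Metadata") != NULL) return SKETCH_TASK_ABORT;
    if (sketch_dict_set(root, "Metadata", meta) != 0) return SKETCH_TASK_ABORT;
    return SKETCH_TASK_DONE;
}
```

Running sketch_add_metadata_checked() twice on the same zero-initialized dictionary succeeds the first time and reports abort the second time, leaving the first value in place. That is the behavior we want from a task that must not pile up metadata objects.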

Put together, this is what it all looks like:

#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include "../src/Pajdeg.h"
#include "../src/PDDictionary.h"
#include "../src/PDString.h"

// convenient way to scream and die
#define die(msg...) do { fprintf(stderr, msg); exit(-1); } while (0)

// a "mutator" task, which is responsible for pointing the PDF's root object
// to a new metadata object that we're creating
PDTaskResult addMetadata(PDPipeRef pipe, PDTaskRef task, PDObjectRef object, void *info);

//
// main program
//
int main(int argc, char *argv[])
{
    // want in and out files as arguments
    if (argc != 3) die("syntax: %s <input PDF file> <output PDF name>\n", argv[0]);

    printf("creating pipe\n"
           "input : %s\n"
           "output : %s\n", argv[1], argv[2]);

    // create pipe
    PDPipeRef pipe = PDPipeCreateWithFilePaths(argv[1], argv[2]);
    if (NULL == pipe) die("failed to create pipe\n");

    // the document metadata entry of a PDF is some object somewhere, which
    // is pointed to from the so called Root object, so we have to add a new
    // object, and point Root at it!

    // get the pipe's parser
    PDParserRef parser = PDPipeGetParser(pipe);

    printf("- adding metadata object\n");
    // add a brand spanking new object to the PDF
    PDObjectRef meta = PDParserCreateNewObject(parser);

    // give it our meta string as its stream (the stream is where the
    // metadata is located); the three flags at the end are
    // 1. "should the Length dictionary key be set automatically?",
    // 2. "should the buffer be freed after use", and
    // 3. "is the value sent in pre-encrypted or not?"
    char *metaString = "Hello World!";
    PDObjectSetStream(meta, metaString, strlen(metaString), true, false, false);

    // now we need to point Root at it, but we can't just pull it out
    // and change it -- we have to make a task for it
    // (the reason we can change meta directly is because we MADE it)
    printf("- creating mutator for root object\n");
    // create a mutator for the root object
    PDTaskRef rootUpdater = PDTaskCreateMutatorForPropertyType(PDPropertyRootObject,
                                                               addMetadata);
    // pass it the meta object as its info
    PDTaskSetInfo(rootUpdater, meta);
    // add it to the pipe
    PDPipeAddTask(pipe, rootUpdater);

    printf("- executing pipe operation\n");
    // execute
    PDInteger obcount = PDPipeExecute(pipe);
    printf("- execution finished (%ld objects processed)\n", obcount);

    // clean up (note that we're not releasing meta until after PDPipeExecute
    // is called, or it will end up being deallocated before the task is
    // called)
    PDRelease(pipe);
    PDRelease(rootUpdater);
    PDRelease(meta);
}

//
// task
//
PDTaskResult addMetadata(PDPipeRef pipe, PDTaskRef task, PDObjectRef object, void *info)
{
    printf("- task 'addMetadata' starting\n");

    // get the dictionary for the object
    PDDictionaryRef dict = PDObjectGetDictionary(object);

    // we will ruthlessly explode if the object already HAS a metadata entry
    // (see replace-metadata.c)
    if (PDDictionaryGetEntry(dict, "Metadata")) {
        // normally you would return PDTaskAbort here, instead of killing the
        // entire application
        die("error: metadata already exists! aborting!\n");
    }

    // meta is our info object
    PDObjectRef meta = info;

    // set the Root's metadata reference; we are setting it to a PDObject
    // but this translates to a PDF indirect reference, as dictionaries
    // cannot contain entire objects
    PDDictionarySetEntry(dict, "Metadata", meta);

    printf("- task 'addMetadata' finished updating root object\n");
    // tell task handler to continue as normal
    return PDTaskDone;
}

You can check out a dissection of a diff resulting from a tiny PDF when piped using this program on the Add metadata diff example page.

In the next part we'll be conditionally replacing or inserting metadata depending on whether it exists or not. There are a couple of ways of doing this, such as always deleting the current metadata object and putting in a new one, but we're going to replace or insert.

Next up: Replacing metadata