Howdy
So it’s been a while with no new technical post here at Tuxxedo, but fear not, I’ve dedicated this whole post to some technical theories and material that are currently being worked on.
So as the title says, this post is dedicated to uploading files. It’s a huge missing feature within the Engine API, and a very common task for many applications. This post will go over some of the concerns about implementing an API that is simple on the outside and complex on the inside, while being robust and designed with security in mind.
A little background
Some years ago, I think it was a cold and dark January night in 2006, I wrote my first basic script that could upload files from my personal computer to my website, in an attempt to venture into an area of the web world that I had not yet worked with. At the same time, at a website where I was staff, there was a huge demand for image uploaders, so the next night I turned my script into a simple program that could upload images, named KalleLoad (how original!).
Development of this program progressed rapidly and I added many new features to it. The version I was working on was never really finished by the time I stopped supporting the program, yet it is the version the same old domain (which still hosts the original files from 2006, btw) is still running: 1.5.0 Gamma 2-dev. In the 1.5 series I re-thought many of the concepts regarding the upload mechanics. I was, however, not able to implement a clean and flexible API, due to my idea at the time that KalleLoad had to support PHP4.
So what does this mean? It means this is long overdue, and it’s time to realize some of these concepts and ideas within the Engine namespace.
Features
There is a range of features that this API will support. Some of these were previously supported, but were poorly implemented or not usable without dirty hacks, which invalidated their status as an API.
Multi upload: Support was halfway implemented in KL 1.5, but never completed. By redesigning the class that handles the controls, we can implement a smart queue-like system with a clean API using the \Tuxxedo\Design\InfoAccess pattern.
...
use Tuxxedo\Upload;
/* URL to a file somewhere, in this case the Google logo */
$url = 'https://www.google.com/images/srpr/logo3w.png';
/* Create the queue, with no resource handlers */
$queue = new Upload;
/* Create the batch of files to upload (Named value/Value => Type) */
$queue['fileinput'] = Upload::TYPE_INPUT;
$queue[$url] = Upload::TYPE_URL;
/* Process the upload */
foreach($queue as $result)
{
    /* $result is now an object of \Tuxxedo\Upload\Result */
    /* Result objects contain information about the transaction */
    /* Including detailed errors and other information, such as size */
    printf('Processing file \'%s\'...', $result['filename']);
}
...
So before going too far into that example, let’s explain the basic idea of what is going on here:
- $queue is initialized to an instance of \Tuxxedo\Upload, with no resource handlers
- $queue is populated with 2 items to upload, from a form (<input type="file" />) and a URL
- Iterating $queue starts processing the queue batch, meaning the files will be uploaded
- $result from each iteration contains detailed information about a particular file, including errors and such
The first one is sort of a no-brainer. Resource handlers will be explained later in this post, once the basics are understood.
Types are a way of telling the underlying code layer how to actually process the input and what to look for. There is a relatively big difference between processing a file from a form and fetching a remote file using streams to gather the relevant information needed before a file may be accepted for transfer. In this case, inputs and URLs have very different ways in the PHP world of getting at the same information, like MIME type, size etc.
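To make that difference a bit more concrete, here is a minimal sketch of gathering the same two facts (MIME type and size) from each kind of source. The function names describeFormUpload() and describeRemoteHeaders() are made up for this example and are not part of Engine; the form variant assumes the fileinfo extension is available, and the URL variant expects the associative array that get_headers($url, true) returns.

```php
<?php
/* Hypothetical helper: metadata for a form upload, read from a $_FILES-style array */
function describeFormUpload(array $files, string $field): ?array
{
    if(!isset($files[$field]) || $files[$field]['error'] !== UPLOAD_ERR_OK)
    {
        return null;
    }

    $path = $files[$field]['tmp_name'];

    /* Only accept files that really arrived through an HTTP POST upload */
    if(!is_uploaded_file($path))
    {
        return null;
    }

    /* The client supplied 'type' cannot be trusted, so inspect the contents instead */
    $finfo = new finfo(FILEINFO_MIME_TYPE);

    return ['mime' => $finfo->file($path), 'size' => filesize($path)];
}

/* Hypothetical helper: the same metadata, taken from remote response headers */
function describeRemoteHeaders(array $headers): array
{
    /* Header names are case insensitive, so normalize them first */
    $headers = array_change_key_case($headers, CASE_LOWER);

    $pick = static function(string $name) use ($headers): ?string
    {
        if(!isset($headers[$name]))
        {
            return null;
        }

        $value = $headers[$name];

        /* After redirects there is one value per hop, the last hop is the file itself */
        if(is_array($value))
        {
            $value = end($value);
        }

        return (string) $value;
    };

    $type = $pick('content-type');
    $length = $pick('content-length');

    return [
        /* Drop parameters such as '; charset=...' from the content type */
        'mime' => $type === null ? null : strtolower(trim(explode(';', $type)[0])),
        'size' => $length !== null && preg_match('/^\d+$/', $length) ? (int) $length : null,
    ];
}
```

For a remote file this would be called along the lines of describeRemoteHeaders(get_headers($url, true)), after checking that get_headers() did not return false. Remote headers are only a claim made by the other server, so a real implementation would probably verify them again once the file has been transferred.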
The internal implementation of types, at least the URL handler, uses an abstract streaming layer, so that extensions to the Engine can make use of it to implement support for otherwise unsupported protocols, for example attachments from a mail using IMAP. However, the inner workings of this stream API are not going to be covered further in this post.
Processing is perhaps the part that’s least obvious, since you would probably expect a method or similar to tell the loop (or even the programmer) that we’re processing the queue now. Nobody wants to write code we do not really need, or that can gently be skipped in an “obvious” way once we understand the underlying base idea. While the foreach loop will cause the queue to be processed (for items not yet processed), the ->process() method will be available for single item processing, or for what I would call the “ol’ fashioned way” of doing it.
The last part of this feature is the result that is returned from processing, either from the loop or from processing the queue manually. This object not only contains useful information gathered about the file, but also controls that can be used if the generic resource handler is in use, or if a resource handler permits manually overriding actions on a file. An example of such an action could be moving the file from the temporary upload cache to a custom folder other than the one the resource handler decided on, for example if the resource handler was written for a specific application and the values were sort of ‘baked in’.
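The “looping is processing” idea can be sketched in plain PHP (8.0 or newer) with ArrayAccess and IteratorAggregate. This is only an illustration: SketchQueue is a made-up stand-in and not the real \Tuxxedo\Upload class, the result is a plain array rather than a result object, and no file is actually transferred.

```php
<?php
/* Hypothetical stand-in for the queue, only here to show the control flow */
class SketchQueue implements ArrayAccess, IteratorAggregate
{
    const TYPE_INPUT = 1;
    const TYPE_URL = 2;

    private array $pending = [];
    private array $results = [];

    public function offsetExists(mixed $name): bool
    {
        return isset($this->pending[$name]) || isset($this->results[$name]);
    }

    public function offsetGet(mixed $name): mixed
    {
        return $this->results[$name] ?? $this->pending[$name] ?? null;
    }

    /* $queue[$name] = $type adds an item to the batch */
    public function offsetSet(mixed $name, mixed $type): void
    {
        $this->pending[$name] = $type;
    }

    public function offsetUnset(mixed $name): void
    {
        unset($this->pending[$name], $this->results[$name]);
    }

    /* Process a single item manually, an item is never processed twice */
    public function process(string $name): array
    {
        if(isset($this->results[$name]))
        {
            return $this->results[$name];
        }

        if(!isset($this->pending[$name]))
        {
            throw new OutOfBoundsException('Nothing queued under this name');
        }

        /* A real implementation would transfer the file at this point */
        $this->results[$name] = ['filename' => $name, 'type' => $this->pending[$name]];

        unset($this->pending[$name]);

        return $this->results[$name];
    }

    /* foreach ends up here, so looping processes whatever is still pending */
    public function getIterator(): Generator
    {
        foreach(array_keys($this->pending) as $name)
        {
            yield $name => $this->process($name);
        }
    }
}
```

With this shape, an item that was already handled through process() is skipped by a later foreach, which matches the idea that the loop only deals with items that have not been processed yet.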
Naming: KalleLoad has used two different file naming algorithms. 1.0-1.4 used an md5 hashed string, and 1.5 used a random alphanumeric character sequence for both the file name and the upload folder (if generated). They can be expressed as:
Simple, right? I want to implement these two algorithms plus an additional new one. The ‘original’ is there more for the retro feel of KalleLoad, and will at least get some more randomness in how the hash is calculated. The ‘classic’ one is really simple and straightforward, and suits most tasks of creating a unique filename.
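As a rough illustration of the two existing approaches (an md5 hashed string and a random alphanumeric sequence), something along these lines would do. This is not the actual KalleLoad code: the function names, the default length and the extra entropy mixed into the hash are all assumptions made for this sketch.

```php
<?php
/* Hypothetical: an md5 hashed name, salted so the same file name never hashes twice alike */
function md5HashedName(string $filename): string
{
    return md5($filename . microtime() . random_int(0, PHP_INT_MAX));
}

/* Hypothetical: a random alphanumeric sequence of the given length */
function randomAlnumName(int $length = 16): string
{
    $alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $name = '';

    for($i = 0; $i < $length; ++$i)
    {
        $name .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }

    return $name;
}
```

Neither variant guarantees uniqueness on its own, so checking that the generated name is not already taken before storing the file would still be needed.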
Some people like to name their pictures, whether it is an id of some sort or just ‘gag’, and we can use this for creating a unique filename as well. This is why I want to introduce a third naming algorithm, called ‘logical’. It will use ‘logical’ portions of the original filename to name the new file. While it sounds simple to strip out certain non-ASCII characters for safekeeping and keep the name alphanumeric with casing support, there is still quite a guessing game in figuring out a suitable name. Let’s break it down a little:
- Strip out any non-ASCII characters, like Unicode, since we want a ‘simple’ file name. This creates an issue for writing systems such as Arabic, meaning files named that way will not be able to pass this first step of the test.
- Separate ‘words’ or ‘phrases’, using a space (“ ”), hyphen (“-”) and underscore (“_”) as separators
- Depending on how the internal algorithm is put together, decide whether it should include one or multiple ‘words’; if multiple, concatenate them using either a hyphen or an underscore as separator
- Prefix (most likely) or suffix the concatenated string for uniqueness/randomness
So say we’ve got a file named ‘AFUP05 209405.jpg’; the algorithm could break it down to something like ‘1685_afup05-209405.jpg’. The ‘1685’ is the prefix, there so that no naming collisions occur. I personally like this method because it is a bit more expressive than some ‘completely’ random string.
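The four steps can be sketched as one small function. Everything here is an assumption made for illustration: the name logicalName(), keeping all words rather than a selection, lowercasing the result to match the example above, joining with hyphens, and a four digit random prefix. The optional $prefix parameter only exists so the output can be reproduced.

```php
<?php
/* Hypothetical sketch of the 'logical' naming steps, returns null if nothing usable is left */
function logicalName(string $original, ?int $prefix = null): ?string
{
    /* Split off the extension by hand, to stay independent of locale settings */
    $dot = strrpos($original, '.');
    $base = $dot === false ? $original : substr($original, 0, $dot);
    $extension = $dot === false ? '' : strtolower(substr($original, $dot + 1));
    $extension = preg_replace('/[^a-z0-9]/', '', $extension);

    /* Step 1: strip everything outside printable ASCII */
    $base = preg_replace('/[^\x20-\x7E]/', '', $base);

    /* Step 2: separate 'words' on spaces, hyphens and underscores */
    $words = preg_split('/[ _-]+/', $base, -1, PREG_SPLIT_NO_EMPTY);

    /* Keep each word alphanumeric and drop the ones that end up empty */
    $words = array_map(static function(string $word): string
    {
        return strtolower(preg_replace('/[^A-Za-z0-9]/', '', $word));
    }, $words);

    $words = array_filter($words, static function(string $word): bool
    {
        return $word !== '';
    });

    if(!$words)
    {
        /* Names written entirely in a non-ASCII script do not survive step 1 */
        return null;
    }

    /* Step 3: concatenate the words, step 4: prefix for uniqueness */
    $name = ($prefix ?? random_int(1000, 9999)) . '_' . implode('-', $words);

    return $extension !== '' ? $name . '.' . $extension : $name;
}
```

Calling logicalName('AFUP05 209405.jpg', 1685) gives ‘1685_afup05-209405.jpg’, the same shape as the example above. When the function returns null, falling back to one of the other naming algorithms seems like the natural choice.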
Resource handlers: These work as a kind of add-on. They can be written for a specific application, which lets the add-on do much of the business logic. This means that each time a queue is processed, the file doesn’t need to be moved manually (like in the above code example comment).
So what can resource handlers do? Anything. A resource handler has almost complete control of each stage in processing a transaction queue. A resource handler is programmed much like datamanager adapters are, with a way of instructing Engine to only invoke the resource handler at certain defined breakpoints. For example, if we imagine a ‘KalleLoad’ resource handler being implemented, then in order to fulfill the same requirements as the current website has, it will need to be invoked at the following breakpoints:
- On file type (the MIME type) to determine whether it is considered valid
- On naming, although this could be optional due to the naming algorithms implemented
- On storage, leaving the resource handler fully in charge of moving the file into the correct location on the file system from its temporary location
With some clever hooking, this can be achieved quite easily without any overhead. I’ve not yet fully decided if I want to use the Event handling approach which is currently available in the trunk version of Engine. I fear that the way resource handlers will work, with all this control, is more than what the Event handler system was designed for, but I’m sure I can come up with a logical way to achieve this with some OOP magic.
A resource handler must be supplied per queue instance:
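One possible shape for such breakpoints is a bitmask that the handler reports, so the queue only calls the hooks that were asked for. This is purely a sketch of the idea: the interface, its constants, the ImageOnlyHandler class, the runBreakpoints() function and the storage path are all invented for this example, and the final design may well look different.

```php
<?php
/* Hypothetical interface, the names are illustrative and not part of Engine */
interface SketchResourceHandler
{
    const ON_TYPE = 1;
    const ON_NAMING = 2;
    const ON_STORAGE = 4;

    /* Bitmask of the breakpoints this handler wants to be invoked at */
    public function breakpoints(): int;
    public function onType(string $mime): bool;
    public function onNaming(string $original): string;
    public function onStorage(string $temporary, string $name): string;
}

/* Example handler: validates the MIME type and decides on storage, leaves naming alone */
class ImageOnlyHandler implements SketchResourceHandler
{
    public function breakpoints(): int
    {
        return self::ON_TYPE | self::ON_STORAGE;
    }

    public function onType(string $mime): bool
    {
        return in_array($mime, ['image/png', 'image/jpeg', 'image/gif'], true);
    }

    public function onNaming(string $original): string
    {
        /* Not invoked, since ON_NAMING is not part of breakpoints() */
        return $original;
    }

    public function onStorage(string $temporary, string $name): string
    {
        /* A real handler would move the file, this one only reports the destination */
        return '/var/uploads/images/' . $name;
    }
}

/* What the queue could do for one file, returns the final location or null if rejected */
function runBreakpoints(SketchResourceHandler $handler, string $mime, string $temporary, string $name): ?string
{
    $wanted = $handler->breakpoints();

    if(($wanted & SketchResourceHandler::ON_TYPE) && !$handler->onType($mime))
    {
        return null;
    }

    if($wanted & SketchResourceHandler::ON_NAMING)
    {
        $name = $handler->onNaming($name);
    }

    if($wanted & SketchResourceHandler::ON_STORAGE)
    {
        return $handler->onStorage($temporary, $name);
    }

    /* No storage breakpoint requested, the file stays where it is */
    return $temporary;
}
```

The cost of a breakpoint that is not requested is a single bitwise check, which is roughly what hooking without noticeable overhead would have to look like.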
...
use Tuxxedo\Upload;
$queue = new Upload;
$queue->handler('\KalleLoad\API\UploadRsrcHandler');
$queue->handler('\Application\UploadRsrcHandler');
...
I’m still not sure if it is wise to allow more than one resource handler per queue instance at a time, since it creates all sorts of complexity: state changes need to be signaled to each of the resource handlers so one doesn’t try to ‘hijack’ a file from another, it must be decided which resource handler has priority, and so on. In my mind it generally creates more issues than it solves, which is why I think that if the above example is executed, the current resource handler would be ‘\Application\UploadRsrcHandler’.
Error handling
Now that we’ve covered some of the major feature concepts behind the API, it is time to quickly delve into how we deal with errors. The API supplies just one new exception type, named ‘\Tuxxedo\Exception\Upload’. This type works the same way as that of the template compiler: by states. Each state may supply additional meta information that can be used to make error messages more expressive.
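A state based exception with meta information could look roughly like this. It is a hypothetical sketch: SketchUploadException, its two states and the meta keys ‘url’ and ‘handler’ are invented here, and the real \Tuxxedo\Exception\Upload may be structured differently.

```php
<?php
/* Hypothetical sketch of an exception that is driven by a state plus meta information */
class SketchUploadException extends Exception
{
    const STATE_NETWORK = 1;
    const STATE_HANDLER = 2;

    private int $state;
    private array $meta;

    public function __construct(int $state, array $meta = [])
    {
        $this->state = $state;
        $this->meta = $meta;

        parent::__construct(self::describe($state, $meta));
    }

    public function getState(): int
    {
        return $this->state;
    }

    public function getMeta(): array
    {
        return $this->meta;
    }

    /* The meta information is what makes the message more expressive */
    private static function describe(int $state, array $meta): string
    {
        switch($state)
        {
            case self::STATE_NETWORK:
                return sprintf('Unable to reach \'%s\'', $meta['url'] ?? 'unknown');
            case self::STATE_HANDLER:
                return sprintf('Resource handler \'%s\' could not be loaded', $meta['handler'] ?? 'unknown');
        }

        return 'Unknown upload failure';
    }
}
```

Code catching the exception can then branch on getState() instead of parsing the message, while the message itself stays readable for logs.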
Exceptions are used in places where an ‘unrecoverable’ error occurs, meaning the API cannot continue to function. This could be a programmer mistake, a network outage or something else that causes it to basically blow up. Handling errors per file transaction is another thing, and more silent, if you like. Each object returned by the processing method, whether from the loop or from calling the actual method, contains detailed information about a transaction and what state it is in. This means that unless a resource handler picks up such errors, it is up to the programmer to pick them up and deal with them:
...
$failed = Array();
foreach($queue as $result)
{
    /* Check for errors */
    if($result->state == Upload::STATE_ERROR)
    {
        $failed[] = $result;
        continue;
    }
}
if($failed)
{
    echo 'The following files failed to upload:';
    echo '<ul>';
    foreach($failed as $result)
    {
        echo '<li>' . htmlspecialchars($result['filename']) . '</li>';
    }
    echo '</ul>';
}
...
This is still subject to change, but the base idea is to leave error handling to the developer if the resource handler doesn’t take over. A more automated approach can always be looked at later, but for now I think we’re good with this approach.
Round up
I think I’ve covered many items here for now. I’ve still got some more things I want to add regarding security and how all that blends into the whole picture, along with more information about how I intend to design the stream API, including prototypes of handler interfaces and how the resource handler hooks might work internally, with more code than what was in this small blog post.
I hope you enjoyed reading this, and don’t forget there is more to come once I’ve got the grasp of a proper idea and tried out some different techniques. Stay tuned!