After dealing with web applications a lot, you realize that it's all about a series of bytes transiting from one computer to another. You can accomplish tremendous things without ever understanding how this flow works. But when you want to go really fast, when you want to unlock the ultimate power hidden deep inside your server, then you have to speak like the computers do.
Which brings me to streams. This feature is greatly overlooked in PHP, and ubiquitous in Node. I'll explain how Node streams differ from PHP I/O manipulations, but first, I have a confession to make.
Because I've always felt at ease with computers, my friends have always taken me for some kind of geek. Okay, I can use the OFFSET() function in Excel to amaze the accountants, but that doesn't make me comparable to cosplayers or government database hackers. It makes no difference to my friends, though, and so they always ask me for help when they struggle with computers.
About ten years ago, a couple of friends went to New Zealand for a year. Before leaving, they asked me if I could set up a photo blog for them to communicate with their friends (this was before Flickr and Picasa and Facebook and Instagram ever existed). So I downloaded an open-source photo album application (written in PHP, by the way) and uploaded it to a server I was renting. They took a lot of pictures, published them regularly, and we were all very jealous. Then they came back, and we all forgot about the photo blog.
A few days ago, the same couple of friends reminded me of the photo blog, and they wanted to get the pictures back. Or rather, as we're now living in a different era, they asked me if I could transfer the pictures to Flickr. Why did they think I could do that? Do they really think I'm a geek?
After thinking about it, the transfer doesn't sound too hard to do. I only need to upload a PHP script to the server that can browse the filesystem and make a POST HTTP request to the Flickr API. How hard could that be?
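A first attempt could look like the following sketch: each image is read into memory, then pushed to the API over a raw socket. The endpoint path, headers, and multipart encoding are simplified here; only the I/O pattern matters.

```php
<?php
// Naive upload: read each image entirely, then send it over a socket.
$files = glob('./photos/*.jpg');
foreach ($files as $file) {
    // 1. Load the whole image into memory
    $data = file_get_contents($file);

    // 2. Open a TCP connection to the Flickr API
    $fp = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);

    // 3. Build the HTTP POST request by hand, header by header
    $out  = "POST /services/upload/ HTTP/1.1\r\n";
    $out .= "Host: api.flickr.com\r\n";
    $out .= "Content-Type: image/jpeg\r\n";
    $out .= "Content-Length: " . strlen($data) . "\r\n";
    $out .= "Connection: Close\r\n\r\n";

    // 4. Send the headers, then the image bytes
    fwrite($fp, $out);
    fwrite($fp, $data);
    fclose($fp);
}
```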
You may wonder: why use fsockopen() instead of Guzzle, Buzz, or even ZendHttpClient? Because they wouldn't change the outcome: the script is too slow. There are tons of images in the directory, and the execution never ends.
The problem is that PHP does one thing at a time, and that Input/Output operations are blocking. In other words, when you execute a PHP I/O function, the PHP process waits until the I/O completes. And that can take a very long time. Here is what really happens in the central loop of the previous script:
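Annotated with where the process sits idle, the loop from the sketch above looks like this:

```php
foreach ($files as $file) {
    $data = file_get_contents($file);       // waits for the disk to read the whole image
    $fp = fsockopen('api.flickr.com', 80);  // waits for the TCP connection to be established
    fwrite($fp, $out);                      // waits for the network to accept the headers
    fwrite($fp, $data);                     // waits for the network to accept all the image bytes
    fclose($fp);
}
```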
The PHP process wastes a lot of time waiting. Even if you have a very fast CPU, file and network I/O make the script too slow to be really usable on a large number of files.
Tip: Of course, the Flickr API requires authentication: a token must be added to each call. It has been left out of these examples to keep your attention focused on the I/O.
To exchange data between systems, the script uses files and requests, but these concepts are too high-level to be truly efficient in heavy usage scenarios. There is another reality, at a lower level. It might be daunting at first, but once you've discovered it, you can never go back. Come on, take the red pill, and let me introduce you to streams.
A stream represents a flow of bytes between two containers. In the previous PHP script, the data flows first from the disk to memory (file_get_contents()), then from memory to the distant server (fwrite()). Wouldn't it be more efficient to start flowing bytes from memory to the distant server before the initial disk flow is finished? Using streams, it's possible: a script can send an image to Flickr while it reads this image from the filesystem.
PHP offers a Stream API with low-level I/O functions. Here is how to rewrite the Flickr upload script using streams:
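A possible rewrite, with the same simplifications as before (no authentication, no multipart body), opens the image as a read stream and forwards it to the socket in 1024-byte chunks:

```php
<?php
// Streaming upload: forward the image to the socket chunk by chunk.
$files = glob('./photos/*.jpg');
foreach ($files as $file) {
    $fp = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);

    $out  = "POST /services/upload/ HTTP/1.1\r\n";
    $out .= "Host: api.flickr.com\r\n";
    $out .= "Content-Type: image/jpeg\r\n";
    $out .= "Content-Length: " . filesize($file) . "\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);

    // Open the image as a stream instead of loading it all into memory
    $src = fopen($file, 'rb');
    while (!feof($src)) {
        // Read a 1024-byte chunk from disk and send it over HTTP right away
        fwrite($fp, fread($src, 1024));
    }
    fclose($src);
    fclose($fp);
}
```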
File reads are now less blocking: PHP only waits for chunks of 1024 bytes to be read from the disk before sending them over HTTP. Consequently, this second upload script is a bit faster than the first one.
But dealing with streams in PHP is painful. The API is purely functional, not object-oriented. There are tons of functions with opaque names, several ways to do a simple operation, and patchy documentation. There must be a better way to do this.
Node.js comes with a native asynchronous stream API. In fact, most I/O operations in Node result in a stream by default. HTTP requests are network I/O, file reads are disk I/O, so Node naturally treats both as streams.
This means that streaming data from one source to another is really a breeze. The following script is the equivalent of the previous PHP script using Node.js:
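Something like this sketch (same simplifications as before; files are uploaded one after the other, as in the PHP version):

```javascript
var fs = require('fs');
var http = require('http');

fs.readdir('./photos', function(err, files) {
  if (err) throw err;
  var uploadNext = function() {
    var file = files.shift();
    if (!file) return; // no more files: we're done
    var postOptions = {
      host: 'api.flickr.com',
      path: '/services/upload/',
      method: 'POST',
      headers: { 'Content-Type': 'image/jpeg' }
    };
    var httpStream = http.request(postOptions, function(response) {
      // the response is a stream, too; here we just check the status
      console.log(file + ': ' + response.statusCode);
      uploadNext(); // start the next upload once this one is answered
    });
    var fileStream = fs.createReadStream('./photos/' + file);
    // the output of the file stream becomes the input of the HTTP stream
    fileStream.pipe(httpStream);
  };
  uploadNext();
});
```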
As you can see, HTTP is a first-class citizen in Node. If you compare the postOptions object with the $out string of the PHP example, where each new header of the HTTP request was appended to a string with \r\n as separator, the difference is striking. The http.request() API encourages you to check the HTTP response, while the PHP functions just write to a distant resource without worrying about possible errors in the process. Node builds on the HTTP protocol and encourages its usage.
Also, Node allows you to "pipe" two streams, just like you can pipe two commands in Linux. Here, the output of the fileStream becomes the input of the httpStream, in chunks of 64kB by default. All in one simple method call:
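```javascript
// read from the file, write to the HTTP request, 64kB at a time
fileStream.pipe(httpStream);
```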
Tip: Node uses chunked transfer encoding by default on HTTP requests, so there is no need to specify the 'Content-Length' header.
The Node.js version is faster to execute, but it is not really easier to read or write than the PHP version. Let's find an even better tool for the job.
HTTP client requests with Node can be somewhat verbose. I recommend using request, an npm package that facilitates HTTP requests. This package is so generic that chances are you'll use it in all your Node projects.
With the request package, the code to initialize a POST request is simply request.post(url), so the Flickr upload script reduces to:
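A sketch of this version (still sequential, still without authentication):

```javascript
var fs = require('fs');
var request = require('request'); // npm install request

fs.readdir('./photos', function(err, files) {
  if (err) throw err;
  var uploadNext = function() {
    var file = files.shift();
    if (!file) return;
    var fileStream = fs.createReadStream('./photos/' + file);
    // request.post() returns a writable stream and takes an optional callback
    var httpStream = request.post('http://api.flickr.com/services/upload/',
      function(err, response) {
        console.log(file + ': ' + (err ? err.message : response.statusCode));
        uploadNext();
      });
    fileStream.pipe(httpStream);
  };
  uploadNext();
});
```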
If you take out console messages and temporary variables, the code becomes extremely concise:
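The heart of the script, for each file, boils down to a single chained expression:

```javascript
// for each file in the directory:
fs.createReadStream('./photos/' + file)
  .pipe(request.post('http://api.flickr.com/services/upload/'));
```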
Compare that to the first PHP script. Stunning, isn't it?
Instead of reading one file after the other, why not use the power of Node to do that asynchronously? This kind of asynchronous iteration requires a meeting point to make sure all the operations are complete. You could try redeveloping this logic from scratch, but someone already did it better (as often with Node.js), in an npm package called async. async.forEach(array, iterator) applies an iterator function to each item in an array, in parallel. Here is the Flickr upload script with parallel file reads:
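A sketch along those lines (same simplifications as before):

```javascript
var fs = require('fs');
var async = require('async');     // npm install async
var request = require('request'); // npm install request

fs.readdir('./photos', function(err, files) {
  if (err) throw err;
  // start one upload per file, all in parallel
  async.forEach(files, function(file, done) {
    fs.createReadStream('./photos/' + file)
      .pipe(request.post('http://api.flickr.com/services/upload/',
        function(err, response) {
          done(err); // signal that this upload is finished
        }));
  }, function(err) {
    // meeting point: called once every upload has completed
    console.log(err ? 'upload failed: ' + err.message : 'all photos uploaded');
  });
});
```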
But is this script really faster? The answer is yes and no. It is faster because you don't need to wait for the end of a file transfer to start the next one. It is not faster because asking a disk to read several files at the same time can be more expensive than reading them sequentially (unless you have a RAID 10 array, a NAS, or an SSD, your hard drive has a single read/write head). The boost of the parallel HTTP streams probably outweighs the slowdown of the parallel disk reads. But the real problem is that the script opens a lot of simultaneous HTTP connections to Flickr, and Flickr will probably kick you out for that. So this is a bad enhancement.
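If you still want some parallelism without hammering the remote API, the async package also offers forEachLimit(), which works like forEach() with a cap on the number of concurrent iterations. A possible middle ground (the limit of 4 is an arbitrary choice):

```javascript
// at most 4 uploads in flight at any time
async.forEachLimit(files, 4, function(file, done) {
  fs.createReadStream('./photos/' + file)
    .pipe(request.post('http://api.flickr.com/services/upload/',
      function(err) { done(err); }));
}, function(err) {
  console.log(err ? 'upload failed' : 'all photos uploaded');
});
```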
Use asynchronous streams wisely, and always check the benefit they offer. Sometimes they can be counter-productive.
Streams are a geek feature in PHP. They are for everybody in Node.js - because the stream API is much easier to use, and much closer to the core principles of Node. Even if you're not looking for high-performance I/O, you should use Node streams as much as possible: they bring to the front what used to be hidden behind a curtain of abstraction, without adding any significant complexity to your programs.
As for the Flickr upload, I ended up zipping all the photos together on the server, transferring the archive to my desktop over FTP, and then bulk-uploading the photos to Flickr with their Desktop Uploadr. No PHP, no Node.js. No need to be a geek to talk to computers these days.
Also in this series: Node.js for PHP Programmers #3: Exceptions and Errors.
Published on 17 Jun 2012, with tags: development, JavaScript, NodeJS, php