Build your own feed aggregator with symfony
With the help of the sfFeed2 plugin and the sfWebBrowser plugin, symfony makes the creation of a feed aggregator a breeze. Let's see what it would take to create the core of a Google Reader-like application.
Fetching feeds
First of all, you'll have to fetch feeds from the Internet. It is strongly recommended to fetch feeds asynchronously, i.e. not when the user requests the page showing the aggregated feeds. There are two obvious reasons why you wouldn't want a synchronous process:
Remote servers providing the feeds that you want to fetch would receive one request for every request made to your server. That's a nasty trick to play on other service providers, and it can skew the remote server's statistics.
If you have to fetch a dozen URLs per request, then the response time might exceed the server timeout.
So you have to fetch feeds, store them somewhere (in your filesystem or in a database), and keep them for later. I choose to store them on disk, which gives me an occasion to use the sfFileCache class. Here is the code that I write in a batch script (batch/fetch_feeds.php):
<?php
define('SF_ROOT_DIR', realpath(dirname(__FILE__).'/..'));
define('SF_APP', 'frontend');
define('SF_ENVIRONMENT', 'dev');
define('SF_DEBUG', true);
require_once(SF_ROOT_DIR.DIRECTORY_SEPARATOR.'apps'.DIRECTORY_SEPARATOR.SF_APP.DIRECTORY_SEPARATOR.'config'.DIRECTORY_SEPARATOR.'config.php');
// Put the URLs of the feeds you want to fetch in an array
$urls = array(
'http://api.flickr.com/services/feeds/photos_public.gne?format=rss',
'http://del.icio.us/rss/popular',
'http://feeds.feedburner.com/TechCrunch',
'http://www.symfony-project.com/weblog/rss'
);
// Fetch the feeds
$feeds = array();
foreach($urls as $url)
{
try
{
$feeds[] = sfFeedPeer::createFromWeb($url);
echo "fetched feed ".$url."\n";
}
catch(Exception $e)
{
echo "error fetching feed ".$url.": ".$e->getMessage()."\n";
}
}
// Aggregate the feeds
$aggregated_feeds = sfFeedPeer::aggregate($feeds, array('limit' => 10));
// Cache the results
$f = new sfFileCache(sfConfig::get('sf_data_dir').'/feed');
$f->set('feeds', '', serialize($aggregated_feeds));
The interesting part of the batch is the use of the sfFeed2 plugin classes, made simple by the sfFeedPeer utility methods:
sfFeedPeer::createFromWeb() takes a URL as parameter, makes a request to this URL, decodes the response, and populates an sfFeed object accordingly. It relies on the sfWebBrowser plugin for the HTTP request. It can recognize feeds of various formats (Atom1, RSS0.92, RSS1, RSS2).
sfFeedPeer::aggregate() takes an array of sfFeed objects and returns a single feed, in which all feed items are aggregated and ordered chronologically. The second parameter is an array of options, which I use here to limit the number of items in the resulting feed.
Then I serialize the sfFeed object containing the aggregated items and store it on disk (under the data/ directory, to make it environment-independent) using the sfFileCache class.
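To make the idea of aggregation concrete, here is a minimal sketch in plain PHP. It is not the plugin's code: the aggregate_items() function and the array layout are hypothetical, and sorting newest first is an assumption about what "chronologically" means for a reader page.

```php
<?php
// Hypothetical helper, not the sfFeed2 implementation: merges the items of
// several feeds, sorts them newest first, and keeps at most $limit items.
function aggregate_items(array $feeds, $limit = 10)
{
  $items = array();
  foreach ($feeds as $feed)
  {
    foreach ($feed['items'] as $item)
    {
      // Remember which feed each item came from, since a template may show it
      $item['feed'] = $feed['title'];
      $items[] = $item;
    }
  }

  // Sort by publication timestamp, most recent first (assumed ordering)
  usort($items, function ($a, $b)
  {
    return $b['pubDate'] - $a['pubDate'];
  });

  return array_slice($items, 0, $limit);
}

// Made-up sample data: two feeds with integer timestamps
$feeds = array(
  array('title' => 'Feed A', 'items' => array(
    array('title' => 'A1', 'pubDate' => 100),
    array('title' => 'A2', 'pubDate' => 300),
  )),
  array('title' => 'Feed B', 'items' => array(
    array('title' => 'B1', 'pubDate' => 200),
  )),
);

$latest = aggregate_items($feeds, 2);
foreach ($latest as $item)
{
  echo $item['title'].' ('.$item['feed'].")\n";
}
```

With a limit of 2, only the two most recent items across both feeds survive, which is the effect the 'limit' option has in the batch above.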
I execute the batch once to test it and to generate the first version of the data/feed/feeds.cache file. As it needs to run periodically, I also add the following line to my crontab, which runs it every day at 01:30:
30 1 * * * cd /path/to/my/project && php batch/fetch_feeds.php
Displaying a feed
That's it for the first part. Now, what happens when a user makes a request to my application for the page showing the aggregated feeds? If this action is called feed/show, it can look like:
public function executeShow()
{
$f = new sfFileCache(sfConfig::get('sf_data_dir').'/feed');
$this->feed = unserialize($f->get('feeds', '', true));
}
The last thing I'll do is to display the details of each item, in feed/templates/showSuccess.php:
<?php foreach($feed->getItems() as $item): ?>
<div class="post">
<h2><?php echo link_to(truncate_text(strip_tags($item->getTitle()), 40), $item->getLink()) ?></h2>
Posted on <?php echo format_date($item->getPubDate(), "EEEE d MMMM 'at' h:mm a") ?>
by <?php echo link_to($item->getFeed()->getTitle(), $item->getFeed()->getLink()) ?>
<div class="summary"><?php echo truncate_text($item->getDescription(), 300) ?></div>
</div>
<?php endforeach; ?>
That's where I'm glad that the sfFeed and sfFeedItem classes provided by the sfFeed2 plugin have the same accessors whatever the format of the feed (Atom, RSS, etc.). It makes displaying a feed item's details very simple.
If you want to see the result, check the "outside" columns of the symfony community page.

Congrats François !!
Great example of the power of Symfony. I think we should show the code to every programmer that still doubts whether Symfony really simplifies web application development.
Is there a way, instead of using a cron job, to make the first person who accesses the feed the 'cacher'? Subsequent hits would then check a 'how old' value, and if the file holding the cached feeds is older than that, that client would re-fetch and re-cache.
I see this as an alternative for those of us without access to cron on our host, or with a sysadmin unwilling to add and/or maintain it.
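The lazy refresh described in this comment could be sketched roughly as follows. This is only an illustration in plain PHP: get_or_refresh(), the one-hour lifetime, and the stand-in rebuild function are all hypothetical, and in the real application the rebuild step would be the fetch-and-aggregate code from the batch.

```php
<?php
// Hypothetical helper: returns the cached data, rebuilding it first when the
// cache file is missing or older than $maxAge seconds.
function get_or_refresh($cacheFile, $maxAge, $rebuild)
{
  clearstatcache(true, $cacheFile);
  $isStale = !is_file($cacheFile) || (time() - filemtime($cacheFile)) > $maxAge;
  if ($isStale)
  {
    // $rebuild stands in for the fetch-and-aggregate step of the batch
    $data = call_user_func($rebuild);

    // Write to a temporary file, then rename it, so that concurrent readers
    // never see a half-written cache
    $tmp = $cacheFile.'.'.uniqid('', true).'.tmp';
    file_put_contents($tmp, serialize($data));
    rename($tmp, $cacheFile);

    return $data;
  }

  return unserialize(file_get_contents($cacheFile));
}

// Usage with a stand-in rebuild function and a one-hour lifetime
$cacheFile = sys_get_temp_dir().'/feeds_demo_'.uniqid().'.cache';
$calls = 0;
$rebuild = function () use (&$calls)
{
  $calls++;

  return array('fetched item');
};

$first  = get_or_refresh($cacheFile, 3600, $rebuild); // cache missing: rebuilds
$second = get_or_refresh($cacheFile, 3600, $rebuild); // cache fresh: reads file
unlink($cacheFile);
```

The trade-off is the one the article warns about: the visitor who triggers the rebuild waits for every feed to be fetched, so this only suits a short list of feeds.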
What if I have thousands of feeds to aggregate? I mean, we need a server-side application to integrate into another one. What we have to do is get content from over 100,000 blogs every day, actually even every 2 hours or less.
All we need is to get the content written into a DB; we'll do the rest.
Do you think this example will handle a job like this?
weboo,
you won't be reading this, but for posterity... You have something like 100,000 blogs. Say each blog feed takes 1 second on average (slow network, slow machines, timeouts, etc.). That's about 1,667 minutes, and a day has only 1,440 minutes. No, THIS script alone can't do it.
What you want to do is have several machines/instances running the script, bearing in mind that you want each run to finish within 90 minutes at most. At 1 second per feed, that is 100,000 seconds of work for a 5,400-second window, so roughly 19 instances running concurrently, and many more once slow feeds and timeouts are factored in. You need a hell of a backend to deal with that volume.
Of course there are useful, network-friendly tricks, such as using the "Last-Modified" HTTP header: send its value back in an "If-Modified-Since" request header, and when nothing has changed the server answers with a short 304 Not Modified response instead of the whole feed, which saves time and bandwidth.
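As a rough illustration of the Last-Modified trick, the sketch below shows only the header handling, with no network call. The two function names are hypothetical and the response lines are made-up sample data; how the headers are actually sent depends on the HTTP client you use.

```php
<?php
// Hypothetical helpers for a conditional GET; the names are illustrative only.

// Builds the extra request header from the Last-Modified value saved on the
// previous fetch (null on the very first fetch).
function conditional_request_headers($savedLastModified)
{
  if (null === $savedLastModified)
  {
    return array();
  }

  return array('If-Modified-Since: '.$savedLastModified);
}

// Reads the status code and the Last-Modified value out of raw response
// header lines such as "HTTP/1.1 304 Not Modified".
function parse_response_headers(array $lines)
{
  $result = array('status' => null, 'last_modified' => null);
  foreach ($lines as $line)
  {
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m))
    {
      $result['status'] = (int) $m[1];
    }
    elseif (0 === stripos($line, 'Last-Modified:'))
    {
      $result['last_modified'] = trim(substr($line, strlen('Last-Modified:')));
    }
  }

  return $result;
}

// Made-up sample data standing in for two real responses from the same feed
$firstResponse = parse_response_headers(array(
  'HTTP/1.1 200 OK',
  'Last-Modified: Tue, 01 May 2007 10:00:00 GMT',
));
$request = conditional_request_headers($firstResponse['last_modified']);
$secondResponse = parse_response_headers(array('HTTP/1.1 304 Not Modified'));

// A 304 means the feed is unchanged, so there is nothing to download or parse
$skipParsing = (304 === $secondResponse['status']);
```

Storing the Last-Modified value next to each feed URL is enough to apply this on every run of the batch.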