
To support Symfony 2's development, Fabien Potencier, the lead developer of the symfony framework, has released four new PHP 5.3-based components: CssSelector, DomCrawler, Process, and BrowserKit.
Though these components will be used by Symfony 2, they're built as standalone components that can easily be used in any PHP 5.3 project. To prove that point, Fabien also released a new web scraper/crawler called Goutte, which uses these four components along with four additional components from Zend Framework. It's a prime example of the flexibility and power that standalone components, along with a willingness to share, can provide.
CssSelector
The first new component, CssSelector, converts CSS selectors to XPath, so that the power of XPath can be used with the familiarity of CSS selectors. The component is a port of code from the Python lxml library: a translation from Python to PHP, with some unit tests added.
Usage is simple, and Fabien covers it in greater detail on his blog. The following code, from Fabien's blog, loops over the anchor tags matched by a CSS selector and prints each one's text and href attribute.
use Symfony\Components\CssSelector\Parser;
$document = new \DOMDocument();
$document->loadHTMLFile('http://fabien.potencier.org/articles');
$xpath = new \DOMXPath($document);
foreach ($xpath->query(Parser::cssToXpath('div.item > h4 > a')) as $node)
{
  printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
}
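To make the CSS-to-XPath idea concrete, here is a minimal sketch that uses only PHP's built-in DOM extension. The XPath expression is hand-written as an equivalent of 'div.item > h4 > a'; it is an assumption about what a correct translation looks like, not necessarily the exact string the component generates, and the sample HTML is made up for illustration.

```php
<?php
// Made-up sample markup, standing in for a real page.
$html = <<<'HTML'
<html><body>
  <div class="item featured"><h4><a href="/one">One</a></h4></div>
  <div class="item"><h4><a href="/two">Two</a></h4></div>
  <div class="sidebar"><h4><a href="/skip">Skip</a></h4></div>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// 'div.item' : a div whose space-separated class list contains "item"
// '> h4 > a' : child combinators, written as single-slash steps in XPath
// Assumption: hand-written equivalent, not the component's actual output.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]/h4/a";

// Collect link text => href for every matching anchor
$links = array();
foreach ($xpath->query($expression) as $node) {
    $links[$node->nodeValue] = $node->getAttribute('href');
}

foreach ($links as $text => $href) {
    printf("%s (%s)\n", $text, $href);
}
```

The class test is the verbose part: XPath 1.0 has no notion of a class list, so the expression pads the attribute with spaces and looks for " item " inside it. That is exactly the kind of boilerplate a CSS selector hides.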
DomCrawler
After the CssSelector, the obvious next step is a component that lets you traverse and query any HTML or XML content. The DomCrawler does just that. Though there's no real documentation yet, the unit tests reveal a powerful system for crawling the DOM.
use Symfony\Components\DomCrawler\Crawler;
$crawler = new Crawler();
$crawler->addHtmlContent('<html><div class="foo"></div></html>');
$crawler->filter('div')->attr('class'); // returns foo
The component has a rich list of methods that can be called to perform tasks on your DOM such as filtering, returning attributes, returning text, calling methods iteratively on nodes, and manipulating link and form elements.
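For comparison, here is a rough sketch of the same kinds of tasks (filtering, reading an attribute, reading text, and visiting each node) written against PHP's built-in DOM classes only. It shows the verbosity a crawler-style API saves you; it is not a description of how the component is implemented, and the markup is invented for the example.

```php
<?php
$document = new DOMDocument();
$document->loadHTML(
    '<html><body><div class="foo">Hello</div><div class="bar">World</div></body></html>'
);
$xpath = new DOMXPath($document);

// Filtering: select every div in the document
$divs = $xpath->query('//div');

// Attribute and text of the first matched node
$firstClass = $divs->item(0)->getAttribute('class');
$firstText  = $divs->item(0)->nodeValue;

// Visiting each matched node in turn
$classes = array();
foreach ($divs as $div) {
    $classes[] = $div->getAttribute('class');
}

printf("%s / %s / %s\n", $firstClass, $firstText, implode(', ', $classes));
```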
Process
The Process component tackles another issue entirely: it allows PHP scripts to be run in separate processes. In other words, "PhpProcess runs a PHP script in a forked process." This is done via a simple class wrapper around the proc_* functions.
use Symfony\Components\Process\PhpProcess;
// PhpProcess expects the PHP source code itself, so load the script's contents
$process = new PhpProcess(file_get_contents('/path/to/script.php'));
$process->run();
echo $process->getOutput();
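To give a feel for what "a wrapper around the proc_* functions" involves, here is a minimal sketch that runs PHP source code in a separate PHP process using proc_open directly. The function name runPhpInSeparateProcess is hypothetical and not part of the component, and the sketch assumes PHP 5.4 or later for the PHP_BINARY constant and a CLI interpreter that reads its script from standard input.

```php
<?php
// Hypothetical helper (not part of the Process component): run PHP source
// code in a separate PHP process and return what it writes to stdout.
function runPhpInSeparateProcess($code)
{
    $descriptors = array(
        0 => array('pipe', 'r'), // child's stdin: we write the script here
        1 => array('pipe', 'w'), // child's stdout: we read the output here
    );

    // PHP_BINARY (PHP 5.4+) is the path of the interpreter running this script.
    $process = proc_open(escapeshellarg(PHP_BINARY), $descriptors, $pipes);
    if ($process === false) {
        throw new RuntimeException('Unable to start a PHP process.');
    }

    // Send the script, then close stdin so the child knows it is complete
    fwrite($pipes[0], $code);
    fclose($pipes[0]);

    // Read everything the child prints, then clean up
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);

    return $output;
}

echo runPhpInSeparateProcess('<?php echo 6 * 7;');
```

A real wrapper would also capture stderr, enforce a timeout, and expose the exit code; this sketch leaves those out to keep the pipe handling visible.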
BrowserKit
Finally, the BrowserKit component brings all of the components together. The BrowserKit makes a request (via a method you define), and then allows you to interact with the page (e.g. click, submit) or retrieve information from the page (via the DomCrawler).
The best way to understand the BrowserKit is to see it in action with Goutte.
Goutte - a screen scraping and web crawling library
Goutte combines the above four components along with Zend Framework's Date, Uri, Http, and Validate components to form an easy and powerful way to programmatically crawl and interact with web pages.
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony-project.org/');
// Click on a link
$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);
// Read through a list of error messages
$nodes = $crawler->filter('ul.error_list');
foreach ($nodes as $node)
{
  // Iterating over a crawler yields DOM nodes, so read the text via nodeValue
  echo 'Error: ' . $node->nodeValue;
}