PHP Architect · April 22, 2010

Introducing four new PHP 5.3 components and Goutte, a simple web scraper

To support Symfony 2's development, Fabien Potencier, the lead developer of the symfony framework, has released four new PHP 5.3-based components: CssSelector, DomCrawler, Process, and BrowserKit. Though Symfony 2 will use these components, each is built to stand alone and can easily be used in any PHP 5.3 project. To prove that point, Fabien also released a new web scraper/crawler called Goutte, which uses these four components along with four additional components from Zend Framework. It's a prime example of the flexibility and power that standalone components, along with a willingness to share, can provide.

CssSelector

The first new component, CssSelector, converts CSS selectors to XPath expressions, so the power of XPath can be used with the familiarity of CSS selectors. The component is a port of the CSS selector support in the Python library lxml: a translation from Python to PHP, along with the addition of some unit tests. Usage is simple and is covered in greater detail by Fabien on his blog. The following code, from Fabien's blog, finds the anchors that match a CSS selector and prints each one's text and href attribute.
  use Symfony\Components\CssSelector\Parser;

  $document = new \DOMDocument();
  $document->loadHTMLFile('http://fabien.potencier.org/articles');

  $xpath = new \DOMXPath($document);
  foreach ($xpath->query(Parser::cssToXpath('div.item > h4 > a')) as $node)
  {
    printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
  }
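
To make the CSS-to-XPath idea concrete, here is a minimal sketch that uses only PHP's DOM extension. The XPath expression is written by hand as a plausible equivalent of div.item > h4 > a. The expression the component actually generates may differ in form, and the inline HTML is made up for illustration.

```php
<?php
// Made-up markup; only the first two links sit under a div with class "item".
$html = <<<HTML
<html><body>
  <div class="item featured"><h4><a href="/first">First post</a></h4></div>
  <div class="item"><h4><a href="/second">Second post</a></h4></div>
  <div class="sidebar"><h4><a href="/about">About</a></h4></div>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);

// Hand-written XPath (an assumption, not the component's real output).
// CSS "div.item" must match one class token among several, hence the
// space-padded contains() test. CSS ">" maps to the XPath child step "/".
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]/h4/a";

$xpath = new DOMXPath($document);
$links = array();
foreach ($xpath->query($expression) as $node)
{
    $links[$node->getAttribute('href')] = $node->nodeValue;
    printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
}
```

The length of the XPath needed for a single class test is a fair illustration of why writing the selector in CSS is more comfortable.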

DomCrawler

After CssSelector, the obvious next step is a component for navigating and querying any HTML or XML content. DomCrawler does just that. Although it has no real documentation yet, its unit tests reveal a powerful system for crawling the DOM.
  use Symfony\Components\DomCrawler\Crawler;

  $crawler = new Crawler();
  $crawler->addHtmlContent('<html><div class="foo"></div></html>');

  $crawler->filter('div')->attr('class'); // returns foo
The component has a rich list of methods that can be called to perform tasks on your DOM such as filtering, returning attributes, returning text, calling methods iteratively on nodes, and manipulating link and form elements.
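
To show the kind of operations listed above (filtering, reading an attribute, calling a function on each node), here is a rough sketch built only on PHP's DOM extension. TinyCrawler and its filterTag(), attr(), and each() methods are hypothetical stand-ins invented for this example. They are not the DomCrawler API, and they filter by tag name only rather than by CSS selector.

```php
<?php
// TinyCrawler is a hypothetical illustration, not the DomCrawler class.
class TinyCrawler
{
    private $nodes;

    public function __construct(array $nodes)
    {
        $this->nodes = $nodes;
    }

    public static function fromHtml($html)
    {
        $document = new DOMDocument();
        $document->loadHTML($html);

        return new self(array($document->documentElement));
    }

    // Returns a new crawler holding the descendants with the given tag name.
    // Nested matches are not de-duplicated in this sketch.
    public function filterTag($tag)
    {
        $matches = array();
        foreach ($this->nodes as $node)
        {
            foreach ($node->getElementsByTagName($tag) as $match)
            {
                $matches[] = $match;
            }
        }

        return new self($matches);
    }

    // Returns an attribute of the first node in the list.
    public function attr($name)
    {
        if (!$this->nodes)
        {
            throw new InvalidArgumentException('The current node list is empty.');
        }

        return $this->nodes[0]->getAttribute($name);
    }

    // Calls $callback on every node and collects the return values.
    public function each($callback)
    {
        return array_map($callback, $this->nodes);
    }
}

$crawler = TinyCrawler::fromHtml(
    '<html><body><div class="foo">One</div><div class="bar">Two</div></body></html>'
);

$class = $crawler->filterTag('div')->attr('class');
$texts = $crawler->filterTag('div')->each(function ($node) {
    return $node->nodeValue;
});

echo $class, "\n";
echo implode(', ', $texts), "\n";
```

Returning a new object from each filtering call is what makes chained calls like filter(...)->attr(...) possible.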

Process

The Process component tackles another issue entirely: it allows PHP scripts to be run in separate processes. In short, "PhpProcess runs a PHP script in a forked process." This is done via a simple class wrapper around the proc_* functions.
  use Symfony\Components\Process\PhpProcess;

  // PhpProcess takes the PHP source to run rather than a path to a file
  $process = new PhpProcess('<?php echo "Hello from another process";');
  $process->run();

  echo $process->getOutput();
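
For a sense of what wrapping the proc_* functions involves, the following sketch runs a snippet of PHP in a child process and captures its output. runPhpCode() is a hypothetical helper written for this article, not part of the Process component, and it assumes the script is run from the command line so that PHP_BINARY points at a PHP CLI executable.

```php
<?php
// runPhpCode() is a hypothetical helper, not the Process component's API.
function runPhpCode($code)
{
    $descriptors = array(
        0 => array('pipe', 'r'), // child's stdin: the script is written here
        1 => array('pipe', 'w'), // child's stdout: the output is read here
        2 => array('pipe', 'w'), // child's stderr
    );

    // With no file argument, the PHP CLI reads the script from stdin.
    $binary = PHP_BINARY ? PHP_BINARY : 'php';
    $process = proc_open(escapeshellarg($binary), $descriptors, $pipes);
    if ($process === false)
    {
        throw new RuntimeException('Unable to start the PHP child process.');
    }

    fwrite($pipes[0], $code);
    fclose($pipes[0]);

    // Reading stdout fully before stderr is fine for small outputs; a
    // robust wrapper would read both pipes without blocking on either.
    $output = stream_get_contents($pipes[1]);
    $errors = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    // proc_close() waits for the child and returns its exit code.
    $exitCode = proc_close($process);

    return array('exitCode' => $exitCode, 'output' => $output, 'errors' => $errors);
}

$result = runPhpCode('<?php echo 6 * 7;');
echo $result['output'], "\n";
```

Because the child is a separate process, a fatal error or a changed setting in the script cannot affect the parent, which is the main reason to take on the extra plumbing.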

BrowserKit

Finally, the BrowserKit component brings all of the components together. The BrowserKit makes a request (via a method you define), and then allows you to interact with the page (e.g. click, submit) or retrieve information from the page (via the DomCrawler). The best way to understand the BrowserKit is to see it in action with Goutte.
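
The "method you define" idea can be sketched as an abstract class with a single request hook. MiniBrowser, ArrayBrowser, and clickLink() are hypothetical names made up for this example, and the pages come from an in-memory array instead of the network. The real BrowserKit classes are richer and differ in their details.

```php
<?php
// MiniBrowser and ArrayBrowser are hypothetical, not BrowserKit classes.
abstract class MiniBrowser
{
    private $history = array();

    // The hook a concrete browser defines: return the HTML for a request.
    abstract protected function doRequest($method, $uri);

    public function request($method, $uri)
    {
        $this->history[] = $uri;

        $document = new DOMDocument();
        $document->loadHTML($this->doRequest($method, $uri));

        return $document;
    }

    // "Clicks" the first link with the given text by requesting its href.
    public function clickLink(DOMDocument $page, $text)
    {
        foreach ($page->getElementsByTagName('a') as $anchor)
        {
            if (trim($anchor->nodeValue) === $text)
            {
                return $this->request('GET', $anchor->getAttribute('href'));
            }
        }

        throw new InvalidArgumentException(sprintf('No link with text "%s".', $text));
    }

    public function getHistory()
    {
        return $this->history;
    }
}

// A concrete browser that serves made-up pages from an array.
class ArrayBrowser extends MiniBrowser
{
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    protected function doRequest($method, $uri)
    {
        if (!isset($this->pages[$uri]))
        {
            throw new RuntimeException(sprintf('No page for "%s".', $uri));
        }

        return $this->pages[$uri];
    }
}

$browser = new ArrayBrowser(array(
    '/'        => '<html><body><a href="/plugins">Plugins</a></body></html>',
    '/plugins' => '<html><body><h1>Plugin list</h1></body></html>',
));

$home    = $browser->request('GET', '/');
$plugins = $browser->clickLink($home, 'Plugins');
$heading = $plugins->getElementsByTagName('h1')->item(0)->nodeValue;

echo $heading, "\n";
```

Keeping the transport behind one method means the same navigation code can run against a real HTTP client or against canned pages in a test.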

Goutte - a screen scraping and web crawling library

Goutte combines the above four components along with Zend Framework's Date, Uri, Http, and Validate components to form an easy and powerful way to programmatically crawl and interact with web pages.
  use Goutte\Client;

  $client = new Client();
  $crawler = $client->request('GET', 'http://www.symfony-project.org/');

  // Click on a link
  $link = $crawler->selectLink('Plugins')->link();
  $crawler = $client->click($link);

  // Read through a list of error messages
  $nodes = $crawler->filter('ul.error_list');
  foreach ($nodes as $node)
  {
    // Iterating a crawler yields plain DOM nodes, so read nodeValue
    echo 'Error: ' . $node->nodeValue . "\n";
  }

April 22, 2010
