Documentation

Crawler
in package
implements Countable, IteratorAggregate

Crawler eases navigation of a list of \DOMNode objects.

Tags
author

Fabien Potencier fabien@symfony.com

Interfaces, Classes and Traits

Countable
IteratorAggregate

Table of Contents

$uri  : string|null
$baseHref  : string|null
The base href value.
$defaultNamespacePrefix  : string
The default namespace prefix to be used with XPath and CSS expressions.
$document  : DOMDocument|null
$html5Parser  : HTML5|null
$isHtml  : bool
Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
$namespaces  : array<string, string>
A map of manually registered namespaces.
$nodes  : array<string|int, DOMNode>
__construct()  : mixed
add()  : mixed
Adds a node to the current list of nodes.
addContent()  : mixed
Adds HTML/XML content.
addDocument()  : mixed
Adds a \DOMDocument to the list of nodes.
addHtmlContent()  : mixed
Adds HTML content to the list of nodes.
addNode()  : mixed
Adds a \DOMNode instance to the list of nodes.
addNodeList()  : mixed
Adds a \DOMNodeList to the list of nodes.
addNodes()  : mixed
Adds an array of \DOMNode instances to the list of nodes.
addXmlContent()  : mixed
Adds XML content to the list of nodes.
attr()  : string|null
Returns the attribute value of the first node of the list.
children()  : static
Returns the child nodes of the current selection.
clear()  : mixed
Removes all the nodes.
closest()  : self|null
Returns the first parent (heading toward the document root) of the element that matches the provided selector.
count()  : int
each()  : array<string|int, mixed>
Calls an anonymous function on each node of the list.
eq()  : static
Returns a node given its position in the node list.
evaluate()  : array<string|int, mixed>|Crawler
Evaluates an XPath expression.
extract()  : array<string|int, mixed>
Extracts information from the list of nodes.
filter()  : static
Filters the list of nodes with a CSS selector.
filterXPath()  : static
Filters the list of nodes with an XPath expression.
first()  : static
Returns the first node of the current selection.
form()  : Form
Returns a Form object for the first node in the list.
getBaseHref()  : string|null
Returns base href.
getIterator()  : ArrayIterator|array<string|int, DOMNode>
getNode()  : DOMNode|null
getUri()  : string|null
Returns the current URI.
html()  : string
Returns the first node of the list as HTML.
image()  : Image
Returns an Image object for the first node in the list.
images()  : array<string|int, Image>
Returns an array of Image objects for the nodes in the list.
last()  : static
Returns the last node of the current selection.
link()  : Link
Returns a Link object for the first node in the list.
links()  : array<string|int, Link>
Returns an array of Link objects for the nodes in the list.
matches()  : bool
nextAll()  : static
Returns the next sibling nodes of the current selection.
nodeName()  : string
Returns the node name of the first node of the list.
outerHtml()  : string
parents()  : static
Returns the parent nodes of the current selection.
previousAll()  : static
Returns the previous sibling nodes of the current selection.
reduce()  : static
Reduces the list of nodes by calling an anonymous function.
registerNamespace()  : mixed
selectButton()  : static
Selects a button by name or alt value for images.
selectImage()  : static
Selects images by alt value.
selectLink()  : static
Selects links by name or alt value for clickable images.
setDefaultNamespacePrefix()  : mixed
Overrides the default namespace prefix to be used with XPath and CSS expressions.
siblings()  : static
Returns the sibling nodes of the current selection.
slice()  : static
Slices the list of nodes by $offset and $length.
text()  : string
Returns the text of the first node of the list.
xpathLiteral()  : string
Converts a string into a literal for use in XPath expressions.
sibling()  : array<string|int, mixed>
canParseHtml5String()  : bool
convertToHtmlEntities()  : string
Converts charset to HTML-entities to ensure valid parsing.
createCssSelectorConverter()  : CssSelectorConverter
createDOMXPath()  : DOMXPath
createSubCrawler()  : static
Creates a crawler for some subnodes.
discoverNamespace()  : string|null
filterRelativeXPath()  : static
Filters the list of nodes with an XPath expression.
findNamespacePrefixes()  : array<string|int, mixed>
isValidHtml5Heading()  : bool
parseHtml5()  : DOMDocument
parseHtmlString()  : DOMDocument
Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.
parseXhtml()  : DOMDocument
relativize()  : string
Makes the XPath relative to the current context.

Properties

$baseHref

The base href value.

private string|null $baseHref

$defaultNamespacePrefix

The default namespace prefix to be used with XPath and CSS expressions.

private string $defaultNamespacePrefix = 'default'

$document

private DOMDocument|null $document

$html5Parser

private HTML5|null $html5Parser

$isHtml

Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).

private bool $isHtml = true

$namespaces

A map of manually registered namespaces.

private array<string, string> $namespaces = []

$nodes

private array<string|int, DOMNode> $nodes = []

Methods

__construct()

public __construct([DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node = null ][, string $uri = null ][, string $baseHref = null ]) : mixed
Parameters
$node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null = null

A Node to use as the base for the crawling

$uri : string = null
$baseHref : string = null
Return values
mixed

add()

Adds a node to the current list of nodes.

public add(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node) : mixed

This method uses the appropriate specialized add*() method based on the type of the argument.

Parameters
$node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null

A node

Tags
throws
InvalidArgumentException

when node is not the expected type

Return values
mixed

addContent()

Adds HTML/XML content.

public addContent(string $content[, string $type = null ]) : mixed

If the charset is not set via the content type, it is assumed to be UTF-8, with ISO-8859-1 (the default charset defined by the HTTP 1.1 specification) as a fallback.
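The fallback order described above can be sketched in plain PHP. This is only an illustration, not the component's actual code: the function name guessCharsetSketch(), the regular expression, and the UTF-8 validity check are assumptions made for the example.

```php
<?php
// Hypothetical helper: picks a charset following the documented order.
// 1. charset declared in the content type, 2. UTF-8, 3. ISO-8859-1.
function guessCharsetSketch(string $content, ?string $type = null): string
{
    // Use the charset from the content type when one is declared.
    if (null !== $type && preg_match('/;\s*charset=([^\s;"\']+)/i', $type, $matches)) {
        return $matches[1];
    }

    // The /u modifier makes preg_match() fail on malformed UTF-8,
    // so a successful match means the content is valid UTF-8.
    if (1 === preg_match('//u', $content)) {
        return 'UTF-8';
    }

    // Last resort: the HTTP 1.1 default charset.
    return 'ISO-8859-1';
}

$declared = guessCharsetSketch('<p>x</p>', 'text/html; charset=windows-1252');
$utf8     = guessCharsetSketch("caf\xC3\xA9", 'text/html');
$latin1   = guessCharsetSketch("caf\xE9");
```

Here $declared comes from the content type, $utf8 from the validity check, and $latin1 from the final fallback.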

Parameters
$content : string
$type : string = null
Return values
mixed

addDocument()

Adds a \DOMDocument to the list of nodes.

public addDocument(DOMDocument $dom) : mixed
Parameters
$dom : DOMDocument

A \DOMDocument instance

Return values
mixed

addHtmlContent()

Adds HTML content to the list of nodes.

public addHtmlContent(string $content[, string $charset = 'UTF-8' ]) : mixed

The libxml errors are disabled when the content is parsed.

To get parsing errors, enable internal errors with libxml_use_internal_errors(true), then retrieve them with libxml_get_errors(). Be sure to clear the errors with libxml_clear_errors() afterward.

Parameters
$content : string
$charset : string = 'UTF-8'
Return values
mixed

addNode()

Adds a \DOMNode instance to the list of nodes.

public addNode(DOMNode $node) : mixed
Parameters
$node : DOMNode

A \DOMNode instance

Return values
mixed

addNodeList()

Adds a \DOMNodeList to the list of nodes.

public addNodeList(DOMNodeList $nodes) : mixed
Parameters
$nodes : DOMNodeList

A \DOMNodeList instance

Return values
mixed

addNodes()

Adds an array of \DOMNode instances to the list of nodes.

public addNodes(array<string|int, DOMNode> $nodes) : mixed
Parameters
$nodes : array<string|int, DOMNode>

An array of \DOMNode instances

Return values
mixed

addXmlContent()

Adds XML content to the list of nodes.

public addXmlContent(string $content[, string $charset = 'UTF-8' ][, int $options = LIBXML_NONET ]) : mixed

The libxml errors are disabled when the content is parsed.

To get parsing errors, enable internal errors with libxml_use_internal_errors(true), then retrieve them with libxml_get_errors(). Be sure to clear the errors with libxml_clear_errors() afterward.
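The error-collection pattern can be tried with PHP's built-in DOM extension alone. The sketch below parses deliberately malformed XML with DOMDocument rather than with the Crawler, so it shows only the libxml calls. The sample markup and the $messages array are made up for the example, and wrapping an addXmlContent() call the same way is an assumption.

```php
<?php
// Keep libxml errors in an internal buffer instead of discarding them,
// and remember the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Malformed on purpose: <item> is never closed.
$loaded = $dom->loadXML('<root><item></root>', LIBXML_NONET);

// Copy the buffered errors into readable messages.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting keeps the change local to this one parsing call.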

Parameters
$content : string
$charset : string = 'UTF-8'
$options : int = LIBXML_NONET

Bitwise OR of the libxml option constants. LIBXML_PARSEHUGE is dangerous; see http://symfony.com/blog/security-release-symfony-2-0-17-released

Return values
mixed

attr()

Returns the attribute value of the first node of the list.

public attr(string $attribute) : string|null
Parameters
$attribute : string
Tags
throws
InvalidArgumentException

When current node is empty

Return values
string|null

The attribute value or null if the attribute does not exist

children()

Returns the child nodes of the current selection.

public children([string $selector = null ]) : static
Parameters
$selector : string = null
Tags
throws
InvalidArgumentException

When current node is empty

throws
RuntimeException

If the CssSelector Component is not available and $selector is provided

Return values
static

clear()

Removes all the nodes.

public clear() : mixed
Return values
mixed

count()

public count() : int
Return values
int

each()

Calls an anonymous function on each node of the list.

public each(Closure $closure) : array<string|int, mixed>

The anonymous function receives the position and the node wrapped in a Crawler instance as arguments.

Example:

$crawler->filter('h1')->each(function ($node, $i) {
    return $node->text();
});
Parameters
$closure : Closure

An anonymous function

Return values
array<string|int, mixed>

An array of values returned by the anonymous function

eq()

Returns a node given its position in the node list.

public eq(int $position) : static
Parameters
$position : int
Return values
static

evaluate()

Evaluates an XPath expression.

public evaluate(string $xpath) : array<string|int, mixed>|Crawler

Since an XPath expression might evaluate to either a simple type or a \DOMNodeList, this method will return either an array of simple types or a new Crawler instance.
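The two kinds of result can be seen with PHP's own DOMXPath::evaluate(), used here as a stand-in to show the difference between a simple type and a node list. The sample markup is made up for the example, and the sketch does not show how the Crawler wraps either result.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<ul><li>a</li><li>b</li></ul>');
$xpath = new DOMXPath($dom);

// A function such as count() evaluates to a simple type (a float in PHP).
$count = $xpath->evaluate('count(//li)');

// A location path evaluates to a DOMNodeList.
$items = $xpath->evaluate('//li');
```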

Parameters
$xpath : string
Return values
array<string|int, mixed>|Crawler

An array of evaluation results or a new Crawler instance

extract()

Extracts information from the list of nodes.

public extract(array<string|int, mixed> $attributes) : array<string|int, mixed>

You can extract attributes and/or the node value (_text).

Example:

$crawler->filter('h1 a')->extract(['_text', 'href']);
Parameters
$attributes : array<string|int, mixed>
Return values
array<string|int, mixed>

An array of extracted values

filter()

Filters the list of nodes with a CSS selector.

public filter(string $selector) : static

This method only works if you have installed the CssSelector Symfony Component.

Parameters
$selector : string
Tags
throws
RuntimeException

if the CssSelector Component is not available

Return values
static

filterXPath()

Filters the list of nodes with an XPath expression.

public filterXPath(string $xpath) : static

The XPath expression is evaluated in the context of the crawler, which is considered as a fake parent of the elements inside it. This means that a child selector "div" or "./div" will match only the div elements of the current crawler, not their children.

Parameters
$xpath : string
Return values
static

first()

Returns the first node of the current selection.

public first() : static
Return values
static

form()

Returns a Form object for the first node in the list.

public form([array<string|int, mixed> $values = null ][, string $method = null ]) : Form
Parameters
$values : array<string|int, mixed> = null
$method : string = null
Tags
throws
InvalidArgumentException

If the current node list is empty or the selected node is not an instance of DOMElement

Return values
Form

A Form instance

getBaseHref()

Returns base href.

public getBaseHref() : string|null
Return values
string|null

getIterator()

public getIterator() : ArrayIterator|array<string|int, DOMNode>
Return values
ArrayIterator|array<string|int, DOMNode>

getNode()

public getNode(int $position) : DOMNode|null
Parameters
$position : int
Return values
DOMNode|null

getUri()

Returns the current URI.

public getUri() : string|null
Return values
string|null

html()

Returns the first node of the list as HTML.

public html([string|null $default = null ]) : string
Parameters
$default : string|null = null

When not null: the value to return when the current node is empty

Tags
throws
InvalidArgumentException

When current node is empty

Return values
string

The node HTML

image()

Returns an Image object for the first node in the list.

public image() : Image
Tags
throws
InvalidArgumentException

If the current node list is empty

Return values
Image

An Image instance

images()

Returns an array of Image objects for the nodes in the list.

public images() : array<string|int, Image>
Return values
array<string|int, Image>

An array of Image instances

last()

Returns the last node of the current selection.

public last() : static
Return values
static

link()

Returns a Link object for the first node in the list.

public link([string $method = 'get' ]) : Link
Parameters
$method : string = 'get'
Tags
throws
InvalidArgumentException

If the current node list is empty or the selected node is not an instance of DOMElement

Return values
Link

A Link instance

links()

Returns an array of Link objects for the nodes in the list.

public links() : array<string|int, Link>
Tags
throws
InvalidArgumentException

If the current node list contains non-DOMElement instances

Return values
array<string|int, Link>

An array of Link instances

matches()

public matches(string $selector) : bool
Parameters
$selector : string
Return values
bool

nextAll()

Returns the next sibling nodes of the current selection.

public nextAll() : static
Tags
throws
InvalidArgumentException

When current node is empty

Return values
static

nodeName()

Returns the node name of the first node of the list.

public nodeName() : string
Tags
throws
InvalidArgumentException

When current node is empty

Return values
string

The node name

outerHtml()

public outerHtml() : string
Return values
string

parents()

Returns the parent nodes of the current selection.

public parents() : static
Tags
throws
InvalidArgumentException

When current node is empty

Return values
static

previousAll()

Returns the previous sibling nodes of the current selection.

public previousAll() : static
Tags
throws
InvalidArgumentException
Return values
static

reduce()

Reduces the list of nodes by calling an anonymous function.

public reduce(Closure $closure) : static

To remove a node from the list, the anonymous function must return false.
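The removal rule can be illustrated with a stand-in that works on a plain array instead of a node list. The function reduceSketch() is hypothetical. It assumes the comparison against false is strict, so a callback that returns null keeps the item.

```php
<?php
// Hypothetical stand-in for reduce(): keeps every item unless the
// callback returns exactly false (strictness is an assumption).
function reduceSketch(array $items, Closure $closure): array
{
    $kept = [];
    foreach ($items as $i => $item) {
        if (false !== $closure($item, $i)) {
            $kept[] = $item;
        }
    }

    return $kept;
}

$headings = ['Intro', '', 'Usage', 'FAQ'];

// Returning false for the empty string removes it from the result.
$nonEmpty = reduceSketch($headings, function (string $text, int $i) {
    return '' !== $text;
});

// Returning null is not the same as false here, so nothing is removed.
$everything = reduceSketch($headings, function (string $text, int $i) {
    return null;
});
```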

Parameters
$closure : Closure

An anonymous function

Return values
static

registerNamespace()

public registerNamespace(string $prefix, string $namespace) : mixed
Parameters
$prefix : string
$namespace : string
Return values
mixed

selectButton()

Selects a button by name or alt value for images.

public selectButton(string $value) : static
Parameters
$value : string
Return values
static

selectImage()

Selects images by alt value.

public selectImage(string $value) : static
Parameters
$value : string
Return values
static

A new instance of Crawler with the filtered list of nodes

selectLink()

Selects links by name or alt value for clickable images.

public selectLink(string $value) : static
Parameters
$value : string
Return values
static

setDefaultNamespacePrefix()

Overrides the default namespace prefix to be used with XPath and CSS expressions.

public setDefaultNamespacePrefix(string $prefix) : mixed
Parameters
$prefix : string
Return values
mixed
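To show why a prefix is needed for a document's default namespace, the sketch below uses PHP's built-in DOMXPath instead of the Crawler. The Atom sample document is made up for the example. The prefix name default mirrors the documented default value of $defaultNamespacePrefix, but binding it by hand with DOMXPath::registerNamespace() is only an illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry><title>Hi</title></entry></feed>'
);
$xpath = new DOMXPath($dom);

// In XPath 1.0 an unprefixed name only matches elements in no namespace,
// so this finds nothing in a document that declares a default namespace.
$unprefixed = $xpath->query('//entry')->length;

// Binding a prefix to the namespace URI makes the elements addressable.
$xpath->registerNamespace('default', 'http://www.w3.org/2005/Atom');
$prefixed = $xpath->query('//default:entry')->length;
```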

siblings()

Returns the sibling nodes of the current selection.

public siblings() : static
Tags
throws
InvalidArgumentException

When current node is empty

Return values
static

slice()

Slices the list of nodes by $offset and $length.

public slice(int $offset[, int $length = null ]) : static
Parameters
$offset : int
$length : int = null
Return values
static

text()

Returns the text of the first node of the list.

public text([string|null $default = null ][, bool $normalizeWhitespace = true ]) : string

Pass true as the second argument to normalize whitespace; per the signature above, true is the default.
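As a rough picture of what normalization means here (trimming, then collapsing runs of whitespace to single spaces), consider the approximation below. The function normalizeWhitespaceSketch() and its regular expression are assumptions for illustration and may differ from the component's exact rules.

```php
<?php
// Hypothetical approximation: collapse whitespace runs to one space,
// then trim the ends. Falls back to the input if the regex fails.
function normalizeWhitespaceSketch(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

$normalized = normalizeWhitespaceSketch("  Hello \n\t world  ");
```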

Parameters
$default : string|null = null

When not null: the value to return when the current node is empty

$normalizeWhitespace : bool = true

Whether whitespace should be trimmed and normalized to single spaces

Tags
throws
InvalidArgumentException

When current node is empty

Return values
string

The node value

xpathLiteral()

Converts a string into a literal for use in XPath expressions.

public static xpathLiteral(string $s) : string

The escaped characters are double quotes (") and apostrophes (').

Examples:

echo Crawler::xpathLiteral('foo " bar'); //prints 'foo " bar'

echo Crawler::xpathLiteral("foo ' bar"); //prints "foo ' bar"

echo Crawler::xpathLiteral('a\'b"c'); //prints concat('a', "'", 'b"c')
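The three documented outputs suggest the following quoting strategy: use single quotes when the string contains no apostrophe, use double quotes when it contains no double quote, and otherwise build a concat() call. The function xpathLiteralSketch() below is a hypothetical re-creation written to reproduce those examples. It is not the component's source.

```php
<?php
// Hypothetical re-creation of the documented quoting behaviour.
function xpathLiteralSketch(string $s): string
{
    // No apostrophe: a single-quoted literal is safe.
    if (false === strpos($s, "'")) {
        return "'" . $s . "'";
    }

    // No double quote: a double-quoted literal is safe.
    if (false === strpos($s, '"')) {
        return '"' . $s . '"';
    }

    // Both kinds of quote: split on apostrophes and join the pieces
    // with a double-quoted apostrophe inside concat().
    $parts = [];
    foreach (explode("'", $s) as $i => $chunk) {
        if ($i > 0) {
            $parts[] = "\"'\"";
        }
        $parts[] = "'" . $chunk . "'";
    }

    return 'concat(' . implode(', ', $parts) . ')';
}

$single = xpathLiteralSketch('foo " bar');
$double = xpathLiteralSketch("foo ' bar");
$mixed  = xpathLiteralSketch('a\'b"c');
```

XPath 1.0 has no escape character inside string literals, which is why a string containing both kinds of quote has to be assembled with concat().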

Parameters
$s : string
Return values
string

Converted string

sibling()

protected sibling(DOMNode $node[, string $siblingDir = 'nextSibling' ]) : array<string|int, mixed>
Parameters
$node : DOMNode
$siblingDir : string = 'nextSibling'
Return values
array<string|int, mixed>

canParseHtml5String()

private canParseHtml5String(string $content) : bool
Parameters
$content : string
Return values
bool

convertToHtmlEntities()

Converts charset to HTML-entities to ensure valid parsing.

private convertToHtmlEntities(string $htmlContent[, string $charset = 'UTF-8' ]) : string
Parameters
$htmlContent : string
$charset : string = 'UTF-8'
Return values
string

createCssSelectorConverter()

private createCssSelectorConverter() : CssSelectorConverter
Tags
throws
LogicException

If the CssSelector Component is not available

Return values
CssSelectorConverter

createDOMXPath()

private createDOMXPath(DOMDocument $document[, array<string|int, mixed> $prefixes = [] ]) : DOMXPath
Parameters
$document : DOMDocument
$prefixes : array<string|int, mixed> = []
Tags
throws
InvalidArgumentException
Return values
DOMXPath

createSubCrawler()

Creates a crawler for some subnodes.

private createSubCrawler(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $nodes) : static
Parameters
$nodes : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null
Return values
static

discoverNamespace()

private discoverNamespace(DOMXPath $domxpath, string $prefix) : string|null
Parameters
$domxpath : DOMXPath
$prefix : string
Tags
throws
InvalidArgumentException
Return values
string|null

filterRelativeXPath()

Filters the list of nodes with an XPath expression.

private filterRelativeXPath(string $xpath) : static

The XPath expression should already be processed to apply it in the context of each node.

Parameters
$xpath : string
Return values
static

findNamespacePrefixes()

private findNamespacePrefixes(string $xpath) : array<string|int, mixed>
Parameters
$xpath : string
Return values
array<string|int, mixed>

isValidHtml5Heading()

private isValidHtml5Heading(string $heading) : bool
Parameters
$heading : string
Return values
bool

parseHtml5()

private parseHtml5(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument
Parameters
$htmlContent : string
$charset : string = 'UTF-8'
Return values
DOMDocument

parseHtmlString()

Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.

private parseHtmlString(string $content, string $charset) : DOMDocument

Uses the libxml parser otherwise.

Parameters
$content : string
$charset : string
Return values
DOMDocument

parseXhtml()

private parseXhtml(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument
Parameters
$htmlContent : string
$charset : string = 'UTF-8'
Return values
DOMDocument

relativize()

Makes the XPath relative to the current context.

private relativize(string $xpath) : string

The returned XPath will match elements matching the XPath inside the current crawler when running in the context of a node of the crawler.

Parameters
$xpath : string
Return values
string
