Crawler

in package Application

implements Countable, IteratorAggregate

Crawler eases navigation of a list of \DOMNode objects.

Interfaces, Classes and Traits

Countable
IteratorAggregate

Table of Contents

$uri : string|null
$baseHref : string|null: The base href value.
$defaultNamespacePrefix : string: The default namespace prefix to be used with XPath and CSS expressions.
$document : DOMDocument|null
$html5Parser : HTML5|null
$isHtml : bool: Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
$namespaces : array<string, string>: A map of manually registered namespaces.
$nodes : array<string|int, DOMNode>
__construct() : mixed
add() : mixed: Adds a node to the current list of nodes.
addContent() : mixed: Adds HTML/XML content.
addDocument() : mixed: Adds a \DOMDocument to the list of nodes.
addHtmlContent() : mixed: Adds HTML content to the list of nodes.
addNode() : mixed: Adds a \DOMNode instance to the list of nodes.
addNodeList() : mixed: Adds a \DOMNodeList to the list of nodes.
addNodes() : mixed: Adds an array of \DOMNode instances to the list of nodes.
addXmlContent() : mixed: Adds XML content to the list of nodes.
attr() : string|null: Returns the attribute value of the first node of the list.
children() : static: Returns the child nodes of the current selection.
clear() : mixed: Removes all the nodes.
closest() : self|null: Returns the first parent of the element (heading toward the document root) that matches the provided selector.
count() : int
each() : array<string|int, mixed>: Calls an anonymous function on each node of the list.
eq() : static: Returns a node given its position in the node list.
evaluate() : array<string|int, mixed>|Crawler: Evaluates an XPath expression.
extract() : array<string|int, mixed>: Extracts information from the list of nodes.
filter() : static: Filters the list of nodes with a CSS selector.
filterXPath() : static: Filters the list of nodes with an XPath expression.
first() : static: Returns the first node of the current selection.
form() : Form: Returns a Form object for the first node in the list.
getBaseHref() : string|null: Returns base href.
getIterator() : ArrayIterator|array<string|int, DOMNode>
getNode() : DOMNode|null
getUri() : string|null: Returns the current URI.
html() : string: Returns the first node of the list as HTML.
image() : Image: Returns an Image object for the first node in the list.
images() : array<string|int, Image>: Returns an array of Image objects for the nodes in the list.
last() : static: Returns the last node of the current selection.
link() : Link: Returns a Link object for the first node in the list.
links() : array<string|int, Link>: Returns an array of Link objects for the nodes in the list.
matches() : bool
nextAll() : static: Returns the next sibling nodes of the current selection.
nodeName() : string: Returns the node name of the first node of the list.
outerHtml() : string
parents() : static: Returns the parent nodes of the current selection.
previousAll() : static: Returns the previous sibling nodes of the current selection.
reduce() : static: Reduces the list of nodes by calling an anonymous function.
registerNamespace() : mixed
selectButton() : static: Selects a button by name or alt value for images.
selectImage() : static: Selects images by alt value.
selectLink() : static: Selects links by name or alt value for clickable images.
setDefaultNamespacePrefix() : mixed: Overrides the default namespace prefix to be used with XPath and CSS expressions.
siblings() : static: Returns the sibling nodes of the current selection.
slice() : static: Slices the list of nodes by $offset and $length.
text() : string: Returns the text of the first node of the list.
xpathLiteral() : string: Converts a string for use in XPath expressions.
sibling() : array<string|int, mixed>
canParseHtml5String() : bool
convertToHtmlEntities() : string: Converts charset to HTML-entities to ensure valid parsing.
createCssSelectorConverter() : CssSelectorConverter
createDOMXPath() : DOMXPath
createSubCrawler() : static: Creates a crawler for some subnodes.
discoverNamespace() : string|null
filterRelativeXPath() : static: Filters the list of nodes with an XPath expression.
findNamespacePrefixes() : array<string|int, mixed>
isValidHtml5Heading() : bool
parseHtml5() : DOMDocument
parseHtmlString() : DOMDocument: Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.
parseXhtml() : DOMDocument
relativize() : string: Makes the XPath relative to the current context.

$uri


    protected
        string|null
    $uri

$baseHref

The base href value.


    private
        string|null
    $baseHref

$defaultNamespacePrefix

The default namespace prefix to be used with XPath and CSS expressions.


    private
        string
    $defaultNamespacePrefix
     = 'default'

$document


    private
        DOMDocument|null
    $document

$html5Parser


    private
        HTML5|null
    $html5Parser

$isHtml

Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).


    private
        bool
    $isHtml
     = true

$namespaces

A map of manually registered namespaces.


    private
        array<string, string>
    $namespaces
     = []

$nodes


    private
        array<string|int, DOMNode>
    $nodes
     = []

__construct()


    public
                __construct([DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node = null ][, string $uri = null ][, string $baseHref = null ]) : mixed

Parameters

$node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null = null: A Node to use as the base for the crawling
$uri : string = null
$baseHref : string = null

Return values

mixed —

add()

Adds a node to the current list of nodes.


    public
                add(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node) : mixed

This method uses the appropriate specialized add*() method based on the type of the argument.

Parameters

$node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null: A node

Return values

mixed —

addContent()

Adds HTML/XML content.


    public
                addContent(string $content[, string $type = null ]) : mixed

If the charset is not set via the content type, it is assumed to be UTF-8, or ISO-8859-1 as a fallback, which is the default charset defined by the HTTP 1.1 specification.
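As a rough illustration of the charset handling described above, the sketch below passes an explicit content type so that the charset is taken from it instead of being guessed. It assumes this class is the Crawler from Symfony's DomCrawler component (`Symfony\Component\DomCrawler\Crawler`) installed through Composer; the autoload path and the sample markup are made up for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer in ./vendor.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical ISO-8859-1 encoded markup: "\xE9" is "é" in that charset.
$html = "<html><body><p>caf\xE9</p></body></html>";

$crawler = new Crawler();
// The charset is read from the content type given as the second argument.
$crawler->addContent($html, 'text/html; charset=ISO-8859-1');

// filterXPath() is used here so the example does not need the CssSelector component.
$text = $crawler->filterXPath('//p')->text();
echo $text, PHP_EOL;
```

Without the second argument, the same bytes would be interpreted according to the fallback rules described above.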

Parameters

$content : string
$type : string = null

Return values

mixed —

addDocument()

Adds a \DOMDocument to the list of nodes.


    public
                addDocument(DOMDocument $dom) : mixed

Parameters

$dom : DOMDocument: A \DOMDocument instance

Return values

mixed —

addHtmlContent()

Adds HTML content to the list of nodes.


    public
                addHtmlContent(string $content[, string $charset = 'UTF-8' ]) : mixed

The libxml errors are disabled when the content is parsed.

To get parsing errors, enable internal errors via libxml_use_internal_errors(true), then retrieve them via libxml_get_errors(). Be sure to clear the errors with libxml_clear_errors() afterward.
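The error-handling sequence described above could look roughly like the following sketch. It assumes the class is Symfony's DomCrawler `Crawler` installed through Composer; the autoload path and the malformed markup are invented for the example, and whether any errors are actually reported depends on the parser and libxml version in use.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer in ./vendor.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Collect libxml errors in an internal buffer, remembering the previous setting.
$previous = libxml_use_internal_errors(true);

$crawler = new Crawler();
// Hypothetical malformed markup: <foo> is not a known HTML tag.
$crawler->addHtmlContent('<html><body><foo>text</body></html>');

// Each entry is a LibXMLError object exposing level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting at the end keeps the change local, so other code that relies on the default libxml error behavior is not affected.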

Parameters

$content : string
$charset : string = 'UTF-8'

Return values

mixed —

addNode()

Adds a \DOMNode instance to the list of nodes.


    public
                addNode(DOMNode $node) : mixed

Parameters

$node : DOMNode: A \DOMNode instance

Return values

mixed —

addNodeList()

Adds a \DOMNodeList to the list of nodes.


    public
                addNodeList(DOMNodeList $nodes) : mixed

Parameters

$nodes : DOMNodeList: A \DOMNodeList instance

Return values

mixed —

addNodes()

Adds an array of \DOMNode instances to the list of nodes.


    public
                addNodes(array<string|int, DOMNode> $nodes) : mixed

Parameters

$nodes : array<string|int, DOMNode>: An array of \DOMNode instances

Return values

mixed —

addXmlContent()

Adds XML content to the list of nodes.


    public
                addXmlContent(string $content[, string $charset = 'UTF-8' ][, int $options = LIBXML_NONET ]) : mixed

The libxml errors are disabled when the content is parsed.

Parameters

$content : string
$charset : string = 'UTF-8'
$options : int = LIBXML_NONET: Bitwise OR of the libxml option constants. LIBXML_PARSEHUGE is dangerous; see http://symfony.com/blog/security-release-symfony-2-0-17-released

Return values

mixed —

attr()

Returns the attribute value of the first node of the list.


    public
                attr(string $attribute) : string|null

Parameters

$attribute : string

Return values

string|null: The attribute value or null if the attribute does not exist

children()

Returns the child nodes of the current selection.


    public
                children([string $selector = null ]) : static

Parameters

$selector : string = null

Return values

static —

clear()

Removes all the nodes.


    public
                clear() : mixed

Return values

mixed —

closest()

Returns the first parent of the element (heading toward the document root) that matches the provided selector.


    public
                closest(string $selector) : self|null

Parameters

$selector : string

Return values

self|null —

count()


    public
                count() : int

Return values

int —

each()

Calls an anonymous function on each node of the list.


    public
                each(Closure $closure) : array<string|int, mixed>

The anonymous function receives the position and the node wrapped in a Crawler instance as arguments.

Example:

$crawler->filter('h1')->each(function ($node, $i) {
    return $node->text();
});

Parameters

$closure : Closure: An anonymous function

Return values

array<string|int, mixed>: An array of values returned by the anonymous function

eq()

Returns a node given its position in the node list.


    public
                eq(int $position) : static

Parameters

$position : int

Return values

static —

evaluate()

Evaluates an XPath expression.


    public
                evaluate(string $xpath) : array<string|int, mixed>|Crawler

Since an XPath expression might evaluate to either a simple type or a \DOMNodeList, this method will return either an array of simple types or a new Crawler instance.
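The two kinds of result can be seen side by side in the sketch below. It assumes the class is Symfony's DomCrawler `Crawler` installed through Composer; the autoload path and the sample markup are invented for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer in ./vendor.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample document.
$html = <<<'HTML'
<html>
    <body>
        <span id="article-100" class="article">Article 1</span>
        <span id="article-101" class="article">Article 2</span>
        <span id="article-102" class="article">Article 3</span>
    </body>
</html>
HTML;

$crawler = new Crawler();
$crawler->addHtmlContent($html);

// Simple-type result: an array with one value per node of the crawler.
$ids = $crawler
    ->filterXPath('//span[contains(@id, "article-")]')
    ->evaluate('substring-after(@id, "-")');

// Simple-type result evaluated once, against the single root node.
$count = $crawler->evaluate('count(//span[@class="article"])');

// Node-set result: a new Crawler wrapping the matched nodes.
$spans = $crawler->evaluate('//span[1]');

var_dump($ids, $count, $spans instanceof Crawler);
```

Because the return type varies, calling code may want to check with `instanceof` before treating the result as a Crawler.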

Parameters

$xpath : string

Return values

array<string|int, mixed>|Crawler: An array of evaluation results or a new Crawler instance

extract()

Extracts information from the list of nodes.


    public
                extract(array<string|int, mixed> $attributes) : array<string|int, mixed>

You can extract attributes and/or the node value (_text).

Example:

$crawler->filter('h1 a')->extract(['_text', 'href']);

Parameters

$attributes : array<string|int, mixed>

Return values

array<string|int, mixed>: An array of extracted values

filter()

Filters the list of nodes with a CSS selector.


    public
                filter(string $selector) : static

This method only works if you have installed the CssSelector Symfony Component.

Parameters

$selector : string

Return values

static —

filterXPath()

Filters the list of nodes with an XPath expression.


    public
                filterXPath(string $xpath) : static

The XPath expression is evaluated in the context of the crawler, which is considered as a fake parent of the elements inside it. This means that a child selector "div" or "./div" will match only the div elements of the current crawler, not their children.
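The "fake parent" behavior is easier to see with a small experiment. The sketch below assumes the class is Symfony's DomCrawler `Crawler` installed through Composer; the autoload path, the markup, and the variable names are invented for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer in ./vendor.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample document with one list of two items.
$crawler = new Crawler('<html><body><ul><li>a</li><li>b</li></ul></body></html>');

// Select the <ul>; the resulting crawler holds that single node.
$list = $crawler->filterXPath('//ul');

// "ul" is matched against the crawler's own nodes, so it finds the <ul> itself.
$self = $list->filterXPath('ul')->count();

// "li" is matched the same way; none of the crawler's own nodes is an <li>.
$direct = $list->filterXPath('li')->count();

// Descending explicitly from the <ul> reaches its list items.
$items = $list->filterXPath('ul/li')->count();

printf("ul: %d, li: %d, ul/li: %d\n", $self, $direct, $items);
```

In other words, a bare element name behaves like a test on the nodes already held by the crawler, and a path has to start from one of those nodes to reach anything below them.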

Parameters

$xpath : string

Return values

static —

first()

Returns the first node of the current selection.


    public
                first() : static

Return values

static —

form()

Returns a Form object for the first node in the list.


    public
                form([array<string|int, mixed> $values = null ][, string $method = null ]) : Form

Parameters

$values : array<string|int, mixed> = null
$method : string = null

Return values

Form: A Form instance

getBaseHref()

Returns base href.


    public
                getBaseHref() : string|null

Return values

string|null —

getIterator()


    public
                getIterator() : ArrayIterator|array<string|int, DOMNode>

Return values

ArrayIterator|array<string|int, DOMNode> —

getNode()


    public
                getNode(int $position) : DOMNode|null

Parameters

$position : int

Return values

DOMNode|null —

getUri()

Returns the current URI.


    public
                getUri() : string|null

Return values

string|null —

html()

Returns the first node of the list as HTML.


    public
                html([string|null $default = null ]) : string

Parameters

$default : string|null = null: When not null: the value to return when the current node is empty

Return values

string: The node HTML

image()

Returns an Image object for the first node in the list.


    public
                image() : Image

Return values

Image: An Image instance

images()

Returns an array of Image objects for the nodes in the list.


    public
                images() : array<string|int, Image>

Return values

array<string|int, Image>: An array of Image instances

last()

Returns the last node of the current selection.


    public
                last() : static

Return values

static —

link()

Returns a Link object for the first node in the list.


    public
                link([string $method = 'get' ]) : Link

Parameters

$method : string = 'get'

Return values

Link: A Link instance

links()

Returns an array of Link objects for the nodes in the list.


    public
                links() : array<string|int, Link>

Return values

array<string|int, Link>: An array of Link instances

matches()


    public
                matches(string $selector) : bool

Parameters

$selector : string

Return values

bool —

nextAll()

Returns the next sibling nodes of the current selection.


    public
                nextAll() : static

Return values

static —

nodeName()

Returns the node name of the first node of the list.


    public
                nodeName() : string

Return values

string: The node name

outerHtml()


    public
                outerHtml() : string

Return values

string —

parents()

Returns the parent nodes of the current selection.


    public
                parents() : static

Return values

static —

previousAll()

Returns the previous sibling nodes of the current selection.


    public
                previousAll() : static

Return values

static —

reduce()

Reduces the list of nodes by calling an anonymous function.


    public
                reduce(Closure $closure) : static

To remove a node from the list, the anonymous function must return false.
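A minimal sketch of this filtering follows, assuming the class is Symfony's DomCrawler `Crawler` installed through Composer; the autoload path and the markup are invented for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer in ./vendor.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample document with four paragraphs.
$crawler = new Crawler(
    '<html><body><p>one</p><p>two</p><p>three</p><p>four</p></body></html>'
);

// Keep the nodes at even positions (0, 2, ...); returning false drops a node.
$even = $crawler->filterXPath('//p')->reduce(function (Crawler $node, $i) {
    return 0 === $i % 2;
});

// Collect the text of the remaining nodes.
$texts = $even->each(function (Crawler $node, $i) {
    return $node->text();
});

echo implode(', ', $texts), PHP_EOL;
```

The closure receives each node wrapped in a Crawler along with its position, the same arguments that each() passes.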

Parameters

$closure : Closure: An anonymous function

Return values

static —

registerNamespace()


    public
                registerNamespace(string $prefix, string $namespace) : mixed

Parameters

$prefix : string
$namespace : string

Return values

mixed —

selectButton()

Selects a button by name or alt value for images.


    public
                selectButton(string $value) : static

Parameters

$value : string

Return values

static —

selectImage()

Selects images by alt value.


    public
                selectImage(string $value) : static

Parameters

$value : string

Return values

static: A new instance of Crawler with the filtered list of nodes

selectLink()

Selects links by name or alt value for clickable images.


    public
                selectLink(string $value) : static

Parameters

$value : string

Return values

static —

setDefaultNamespacePrefix()

Overrides the default namespace prefix to be used with XPath and CSS expressions.


    public
                setDefaultNamespacePrefix(string $prefix) : mixed

Parameters

$prefix : string

Return values

mixed —

siblings()

Returns the sibling nodes of the current selection.


    public
                siblings() : static

Return values

static —

slice()

Slices the list of nodes by $offset and $length.


    public
                slice(int $offset[, int $length = null ]) : static

Parameters

$offset : int
$length : int = null

Return values

static —

text()

Returns the text of the first node of the list.


    public
                text([string|null $default = null ][, bool $normalizeWhitespace = true ]) : string

Whitespace is normalized by default; pass false as the second argument to get the text as it appears in the node.
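The effect of the two arguments can be sketched as follows, assuming the class is Symfony's DomCrawler `Crawler` installed through Composer and the signature shown above; the autoload path and the markup are invented for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer in ./vendor.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup with irregular whitespace inside the paragraph.
$crawler = new Crawler("<html><body><p>  Hello\n   world  </p></body></html>");
$paragraph = $crawler->filterXPath('//p');

// Second argument left at true: whitespace is trimmed and collapsed.
$normalized = $paragraph->text();

// Second argument set to false: the node value is returned unchanged.
$raw = $paragraph->text(null, false);

// A non-null default is returned when the selection is empty.
$missing = $crawler->filterXPath('//h1')->text('no heading');

var_dump($normalized, $raw, $missing);
```

Passing a default is a convenient way to avoid handling the empty-selection case separately when a missing node is expected.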

Parameters

$default : string|null = null: When not null: the value to return when the current node is empty
$normalizeWhitespace : bool = true: Whether whitespace should be trimmed and normalized to single spaces

Return values

string: The node value

xpathLiteral()

Converts a string for use in XPath expressions.


    public
            static    xpathLiteral(string $s) : string

The escaped characters are the double quote (") and the apostrophe (').

Examples:

echo Crawler::xpathLiteral('foo " bar'); //prints 'foo " bar'

echo Crawler::xpathLiteral("foo ' bar"); //prints "foo ' bar"

echo Crawler::xpathLiteral('a\'b"c'); //prints concat('a', "'", 'b"c')

Parameters

$s : string

Return values

string: Converted string

sibling()


    protected
                sibling(DOMNode $node[, string $siblingDir = 'nextSibling' ]) : array<string|int, mixed>

Parameters

$node : DOMNode
$siblingDir : string = 'nextSibling'

Return values

array<string|int, mixed> —

canParseHtml5String()


    private
                canParseHtml5String(string $content) : bool

Parameters

$content : string

Return values

bool —

convertToHtmlEntities()

Converts charset to HTML-entities to ensure valid parsing.


    private
                convertToHtmlEntities(string $htmlContent[, string $charset = 'UTF-8' ]) : string

Parameters

$htmlContent : string
$charset : string = 'UTF-8'

Return values

string —

createCssSelectorConverter()


    private
                createCssSelectorConverter() : CssSelectorConverter

Return values

CssSelectorConverter —

createDOMXPath()


    private
                createDOMXPath(DOMDocument $document[, array<string|int, mixed> $prefixes = [] ]) : DOMXPath

Parameters

$document : DOMDocument
$prefixes : array<string|int, mixed> = []

Return values

DOMXPath —

createSubCrawler()

Creates a crawler for some subnodes.


    private
                createSubCrawler(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $nodes) : static

Parameters

$nodes : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null

Return values

static —

discoverNamespace()


    private
                discoverNamespace(DOMXPath $domxpath, string $prefix) : string|null

Parameters

$domxpath : DOMXPath
$prefix : string

Return values

string|null —

filterRelativeXPath()

Filters the list of nodes with an XPath expression.


    private
                filterRelativeXPath(string $xpath) : static

The XPath expression should already be processed to apply it in the context of each node.

Parameters

$xpath : string

Return values

static —

findNamespacePrefixes()


    private
                findNamespacePrefixes(string $xpath) : array<string|int, mixed>

Parameters

$xpath : string

Return values

array<string|int, mixed> —

isValidHtml5Heading()


    private
                isValidHtml5Heading(string $heading) : bool

Parameters

$heading : string

Return values

bool —

parseHtml5()


    private
                parseHtml5(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument

Parameters

$htmlContent : string
$charset : string = 'UTF-8'

Return values

DOMDocument —

parseHtmlString()

Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.


    private
                parseHtmlString(string $content, string $charset) : DOMDocument

Uses the libxml parser otherwise.

Parameters

$content : string
$charset : string

Return values

DOMDocument —

parseXhtml()


    private
                parseXhtml(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument

Parameters

$htmlContent : string
$charset : string = 'UTF-8'

Return values

DOMDocument —

relativize()

Makes the XPath relative to the current context.


    private
                relativize(string $xpath) : string

The returned XPath will match elements matching the XPath inside the current crawler when running in the context of a node of the crawler.

Parameters

$xpath : string

Return values

string —
