Crawler
implements Countable, IteratorAggregate
Crawler eases navigation of a list of \DOMNode objects.
Interfaces, Classes and Traits
- Countable
- IteratorAggregate
Table of Contents
- $uri : string|null
- $baseHref : string|null
- The base href value.
- $defaultNamespacePrefix : string
- The default namespace prefix to be used with XPath and CSS expressions.
- $document : DOMDocument|null
- $html5Parser : HTML5|null
- $isHtml : bool
- Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
- $namespaces : array<string, string>
- A map of manually registered namespaces.
- $nodes : array<string|int, DOMNode>
- __construct() : mixed
- add() : mixed
- Adds a node to the current list of nodes.
- addContent() : mixed
- Adds HTML/XML content.
- addDocument() : mixed
- Adds a \DOMDocument to the list of nodes.
- addHtmlContent() : mixed
- Adds HTML content to the list of nodes.
- addNode() : mixed
- Adds a \DOMNode instance to the list of nodes.
- addNodeList() : mixed
- Adds a \DOMNodeList to the list of nodes.
- addNodes() : mixed
- Adds an array of \DOMNode instances to the list of nodes.
- addXmlContent() : mixed
- Adds XML content to the list of nodes.
- attr() : string|null
- Returns the attribute value of the first node of the list.
- children() : static
- Returns the child nodes of the current selection.
- clear() : mixed
- Removes all the nodes.
- closest() : self|null
- Returns the first parent (heading toward the document root) of the element that matches the provided selector.
- count() : int
- each() : array<string|int, mixed>
- Calls an anonymous function on each node of the list.
- eq() : static
- Returns a node given its position in the node list.
- evaluate() : array<string|int, mixed>|Crawler
- Evaluates an XPath expression.
- extract() : array<string|int, mixed>
- Extracts information from the list of nodes.
- filter() : static
- Filters the list of nodes with a CSS selector.
- filterXPath() : static
- Filters the list of nodes with an XPath expression.
- first() : static
- Returns the first node of the current selection.
- form() : Form
- Returns a Form object for the first node in the list.
- getBaseHref() : string|null
- Returns base href.
- getIterator() : ArrayIterator|array<string|int, DOMNode>
- getNode() : DOMNode|null
- getUri() : string|null
- Returns the current URI.
- html() : string
- Returns the first node of the list as HTML.
- image() : Image
- Returns an Image object for the first node in the list.
- images() : array<string|int, Image>
- Returns an array of Image objects for the nodes in the list.
- last() : static
- Returns the last node of the current selection.
- link() : Link
- Returns a Link object for the first node in the list.
- links() : array<string|int, Link>
- Returns an array of Link objects for the nodes in the list.
- matches() : bool
- nextAll() : static
- Returns the next sibling nodes of the current selection.
- nodeName() : string
- Returns the node name of the first node of the list.
- outerHtml() : string
- parents() : static
- Returns the parent nodes of the current selection.
- previousAll() : static
- Returns the previous sibling nodes of the current selection.
- reduce() : static
- Reduces the list of nodes by calling an anonymous function.
- registerNamespace() : mixed
- selectButton() : static
- Selects a button by name or alt value for images.
- selectImage() : static
- Selects images by alt value.
- selectLink() : static
- Selects links by name or alt value for clickable images.
- setDefaultNamespacePrefix() : mixed
- Overrides the default namespace prefix to be used with XPath and CSS expressions.
- siblings() : static
- Returns the sibling nodes of the current selection.
- slice() : static
- Slices the list of nodes by $offset and $length.
- text() : string
- Returns the text of the first node of the list.
- xpathLiteral() : string
- Converts a string for use in XPath expressions.
- sibling() : array<string|int, mixed>
- canParseHtml5String() : bool
- convertToHtmlEntities() : string
- Converts charset to HTML-entities to ensure valid parsing.
- createCssSelectorConverter() : CssSelectorConverter
- createDOMXPath() : DOMXPath
- createSubCrawler() : static
- Creates a crawler for some subnodes.
- discoverNamespace() : string|null
- filterRelativeXPath() : static
- Filters the list of nodes with an XPath expression.
- findNamespacePrefixes() : array<string|int, mixed>
- isValidHtml5Heading() : bool
- parseHtml5() : DOMDocument
- parseHtmlString() : DOMDocument
- Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.
- parseXhtml() : DOMDocument
- relativize() : string
- Makes the XPath relative to the current context.
Properties
$uri
protected
string|null
$uri
$baseHref
The base href value.
private
string|null
$baseHref
$defaultNamespacePrefix
The default namespace prefix to be used with XPath and CSS expressions.
private
string
$defaultNamespacePrefix
= 'default'
$document
private
DOMDocument|null
$document
$html5Parser
private
HTML5|null
$html5Parser
$isHtml
Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
private
bool
$isHtml
= true
$namespaces
A map of manually registered namespaces.
private
array<string, string>
$namespaces
= []
$nodes
private
array<string|int, DOMNode>
$nodes
= []
Methods
__construct()
public
__construct([DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node = null ][, string $uri = null ][, string $baseHref = null ]) : mixed
Parameters
- $node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null = null
  A node to use as the base for the crawling
- $uri : string = null
- $baseHref : string = null
Return values
mixed
add()
Adds a node to the current list of nodes.
public
add(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node) : mixed
This method uses the appropriate specialized add*() method based on the type of the argument.
Parameters
- $node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null
  A node
Return values
mixed
addContent()
Adds HTML/XML content.
public
addContent(string $content[, string $type = null ]) : mixed
If the charset is not set via the content type, it is assumed to be UTF-8, or ISO-8859-1 as a fallback, which is the default charset defined by the HTTP 1.1 specification.
Parameters
- $content : string
- $type : string = null
Return values
mixed
addDocument()
Adds a \DOMDocument to the list of nodes.
public
addDocument(DOMDocument $dom) : mixed
Parameters
- $dom : DOMDocument
  A \DOMDocument instance
Return values
mixed
addHtmlContent()
Adds HTML content to the list of nodes.
public
addHtmlContent(string $content[, string $charset = 'UTF-8' ]) : mixed
The libxml errors are disabled when the content is parsed.
To get parsing errors, enable internal errors via libxml_use_internal_errors(true), then retrieve them via libxml_get_errors(). Be sure to clear errors with libxml_clear_errors() afterward.
Parameters
- $content : string
- $charset : string = 'UTF-8'
Return values
mixed
addNode()
Adds a \DOMNode instance to the list of nodes.
public
addNode(DOMNode $node) : mixed
Parameters
- $node : DOMNode
  A \DOMNode instance
Return values
mixed
addNodeList()
Adds a \DOMNodeList to the list of nodes.
public
addNodeList(DOMNodeList $nodes) : mixed
Parameters
- $nodes : DOMNodeList
  A \DOMNodeList instance
Return values
mixed
addNodes()
Adds an array of \DOMNode instances to the list of nodes.
public
addNodes(array<string|int, DOMNode> $nodes) : mixed
Parameters
- $nodes : array<string|int, DOMNode>
  An array of \DOMNode instances
Return values
mixed
addXmlContent()
Adds XML content to the list of nodes.
public
addXmlContent(string $content[, string $charset = 'UTF-8' ][, int $options = LIBXML_NONET ]) : mixed
The libxml errors are disabled when the content is parsed.
To get parsing errors, enable internal errors via libxml_use_internal_errors(true), then retrieve them via libxml_get_errors(). Be sure to clear errors with libxml_clear_errors() afterward.
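The error-collection pattern described above can be sketched as follows. To keep the example runnable with only PHP's bundled DOM extension, it parses with a plain DOMDocument rather than the Crawler; wrapping a call to addXmlContent() in the same three libxml calls is assumed to work the same way, but that is not verified here. The malformed input string is made up for illustration.

```php
<?php
// Buffer libxml errors internally instead of emitting PHP warnings.
// The previous setting is remembered so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Deliberately malformed XML: <item> is never closed.
$loaded = $doc->loadXML('<root><item></root>', LIBXML_NONET);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the buffer for the next parse
libxml_use_internal_errors($previous);  // restore the caller's setting

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Restoring the previous setting keeps the change local, so other code that relies on the default warning behavior is not affected.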
Parameters
- $content : string
- $charset : string = 'UTF-8'
- $options : int = LIBXML_NONET
  Bitwise OR of the libxml option constants. LIBXML_PARSEHUGE is dangerous; see http://symfony.com/blog/security-release-symfony-2-0-17-released
Return values
mixed
attr()
Returns the attribute value of the first node of the list.
public
attr(string $attribute) : string|null
Parameters
- $attribute : string
Return values
string|null —The attribute value or null if the attribute does not exist
children()
Returns the child nodes of the current selection.
public
children([string $selector = null ]) : static
Parameters
- $selector : string = null
Return values
static
clear()
Removes all the nodes.
public
clear() : mixed
Return values
mixed
closest()
Returns the first parent (heading toward the document root) of the element that matches the provided selector.
public
closest(string $selector) : self|null
Parameters
- $selector : string
Return values
self|null
count()
public
count() : int
Return values
int
each()
Calls an anonymous function on each node of the list.
public
each(Closure $closure) : array<string|int, mixed>
The anonymous function receives the position and the node wrapped in a Crawler instance as arguments.
Example:
$crawler->filter('h1')->each(function ($node, $i) {
return $node->text();
});
Parameters
- $closure : Closure
  An anonymous function
Return values
array<string|int, mixed> —An array of values returned by the anonymous function
eq()
Returns a node given its position in the node list.
public
eq(int $position) : static
Parameters
- $position : int
Return values
static
evaluate()
Evaluates an XPath expression.
public
evaluate(string $xpath) : array<string|int, mixed>|Crawler
Since an XPath expression might evaluate to either a simple type or a \DOMNodeList, this method will return either an array of simple types or a new Crawler instance.
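A short sketch of the two return shapes, assuming symfony/dom-crawler is installed with Composer (the autoload path and the sample markup are assumptions made for this example):

```php
<?php
// Assumption: symfony/dom-crawler is installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<html><body><p id="intro-1">a</p><p id="intro-2">b</p></body></html>');
$paragraphs = $crawler->filterXPath('//p');

// Scalar expression: the result is a plain array with one value per node.
$ids = $paragraphs->evaluate('string(@id)');

// Node-set expression: the matched nodes come back wrapped in a new Crawler.
$all = $crawler->evaluate('//p');

var_dump(is_array($ids));
var_dump($all instanceof Crawler);
```

Because the return type depends on the expression, calling code generally needs to know in advance which kind of expression it is passing.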
Parameters
- $xpath : string
Return values
array<string|int, mixed>|Crawler —An array of evaluation results or a new Crawler instance
extract()
Extracts information from the list of nodes.
public
extract(array<string|int, mixed> $attributes) : array<string|int, mixed>
You can extract attributes and/or the node value (_text).
Example:
$crawler->filter('h1 a')->extract(['_text', 'href']);
Parameters
- $attributes : array<string|int, mixed>
Return values
array<string|int, mixed> —An array of extracted values
filter()
Filters the list of nodes with a CSS selector.
public
filter(string $selector) : static
This method only works if you have installed the CssSelector Symfony Component.
Parameters
- $selector : string
Return values
static
filterXPath()
Filters the list of nodes with an XPath expression.
public
filterXPath(string $xpath) : static
The XPath expression is evaluated in the context of the crawler, which is treated as a fake parent of the elements inside it. This means that a child selector "div" or "./div" matches only the div elements of the current crawler, not their children.
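The "fake parent" rule can be illustrated with a nested pair of div elements. This is a sketch that assumes symfony/dom-crawler is installed with Composer; the markup and the id values are invented for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><div id="outer"><div id="inner">text</div></div></body></html>';
$crawler = new Crawler($html);

// Select only the outer div; it becomes the sole node of the new crawler.
$outer = $crawler->filterXPath('//div[@id="outer"]');

// "div" is resolved against the crawler's own nodes, so it should
// select the outer div again rather than the nested one.
$same = $outer->filterXPath('div');

// One extra step is needed to reach the nested div.
$nested = $outer->filterXPath('div/div');

echo $same->attr('id'), PHP_EOL;
echo $nested->attr('id'), PHP_EOL;
```

In other words, the first step of a relative expression names the nodes the crawler already holds, and only later steps descend into their children.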
Parameters
- $xpath : string
Return values
static
first()
Returns the first node of the current selection.
public
first() : static
Return values
static
form()
Returns a Form object for the first node in the list.
public
form([array<string|int, mixed> $values = null ][, string $method = null ]) : Form
Parameters
- $values : array<string|int, mixed> = null
- $method : string = null
Return values
Form —A Form instance
getBaseHref()
Returns base href.
public
getBaseHref() : string|null
Return values
string|null
getIterator()
public
getIterator() : ArrayIterator|array<string|int, DOMNode>
Return values
ArrayIterator|array<string|int, DOMNode>
getNode()
public
getNode(int $position) : DOMNode|null
Parameters
- $position : int
Return values
DOMNode|null
getUri()
Returns the current URI.
public
getUri() : string|null
Return values
string|null
html()
Returns the first node of the list as HTML.
public
html([string|null $default = null ]) : string
Parameters
- $default : string|null = null
  When not null: the value to return when the current node is empty
Return values
string —The node html
image()
Returns an Image object for the first node in the list.
public
image() : Image
Return values
Image —An Image instance
images()
Returns an array of Image objects for the nodes in the list.
public
images() : array<string|int, Image>
Return values
array<string|int, Image> —An array of Image instances
last()
Returns the last node of the current selection.
public
last() : static
Return values
static
link()
Returns a Link object for the first node in the list.
public
link([string $method = 'get' ]) : Link
Parameters
- $method : string = 'get'
Return values
Link —A Link instance
links()
Returns an array of Link objects for the nodes in the list.
public
links() : array<string|int, Link>
Return values
array<string|int, Link> —An array of Link instances
matches()
public
matches(string $selector) : bool
Parameters
- $selector : string
Return values
bool
nextAll()
Returns the next sibling nodes of the current selection.
public
nextAll() : static
Return values
static
nodeName()
Returns the node name of the first node of the list.
public
nodeName() : string
Return values
string —The node name
outerHtml()
public
outerHtml() : string
Return values
string
parents()
Returns the parent nodes of the current selection.
public
parents() : static
Return values
static
previousAll()
Returns the previous sibling nodes of the current selection.
public
previousAll() : static
Return values
static
reduce()
Reduces the list of nodes by calling an anonymous function.
public
reduce(Closure $closure) : static
To remove a node from the list, the anonymous function must return false.
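A small sketch of that rule, keeping only the even-positioned items of a list. It assumes symfony/dom-crawler is installed with Composer, and that the closure receives the node (wrapped in a Crawler) and its position, as each() is documented to do; the markup is made up for the example.

```php
<?php
// Assumption: symfony/dom-crawler is installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$items = (new Crawler('<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>'))
    ->filterXPath('//li');

// Returning false drops the node; any other return value keeps it.
$kept = $items->reduce(function (Crawler $node, int $i): bool {
    return $i % 2 === 0; // keep positions 0, 2, ...
});

// Collect the text of the remaining nodes.
$texts = $kept->each(function (Crawler $node): string {
    return $node->text();
});

print_r($texts);
```

The original crawler is left untouched; reduce() returns a new instance holding the filtered list.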
Parameters
- $closure : Closure
  An anonymous function
Return values
static
registerNamespace()
public
registerNamespace(string $prefix, string $namespace) : mixed
Parameters
- $prefix : string
- $namespace : string
Return values
mixed
selectButton()
Selects a button by name or alt value for images.
public
selectButton(string $value) : static
Parameters
- $value : string
Return values
static
selectImage()
Selects images by alt value.
public
selectImage(string $value) : static
Parameters
- $value : string
Return values
static —A new instance of Crawler with the filtered list of nodes
selectLink()
Selects links by name or alt value for clickable images.
public
selectLink(string $value) : static
Parameters
- $value : string
Return values
static
setDefaultNamespacePrefix()
Overrides the default namespace prefix to be used with XPath and CSS expressions.
public
setDefaultNamespacePrefix(string $prefix) : mixed
Parameters
- $prefix : string
Return values
mixed
siblings()
Returns the sibling nodes of the current selection.
public
siblings() : static
Return values
static
slice()
Slices the list of nodes by $offset and $length.
public
slice(int $offset[, int $length = null ]) : static
Parameters
- $offset : int
- $length : int = null
Return values
static
text()
Returns the text of the first node of the list.
public
text([string|null $default = null ][, bool $normalizeWhitespace = true ]) : string
Whitespace is normalized by default; pass false as the second argument to return the text unchanged.
Parameters
- $default : string|null = null
  When not null: the value to return when the current node is empty
- $normalizeWhitespace : bool = true
  Whether whitespace should be trimmed and normalized to single spaces
Return values
string —The node value
xpathLiteral()
Converts a string for use in XPath expressions.
public
static xpathLiteral(string $s) : string
Escaped characters are: quotes (") and apostrophes (').
Examples:
echo Crawler::xpathLiteral('foo " bar'); // prints 'foo " bar'
echo Crawler::xpathLiteral("foo ' bar"); // prints "foo ' bar"
echo Crawler::xpathLiteral('a\'b"c'); // prints concat('a', "'", 'b"c')
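The quoting rule behind these examples can be sketched in plain PHP. The function below is a hypothetical stand-in written for illustration, not the component's own implementation; it only aims to reproduce the three documented outputs.

```php
<?php
// Hypothetical helper (not part of the component): a sketch of the quoting rule.
function xpathLiteralSketch(string $s): string
{
    // No apostrophe in the string: single quotes are safe delimiters.
    if (strpos($s, "'") === false) {
        return "'" . $s . "'";
    }

    // No double quote in the string: double quotes are safe delimiters.
    if (strpos($s, '"') === false) {
        return '"' . $s . '"';
    }

    // Both kinds of quote are present: split on apostrophes, quote each
    // piece with single quotes, and rejoin the pieces inside concat()
    // with a double-quoted apostrophe between them.
    $parts = array_map(
        static function (string $part): string {
            return "'" . $part . "'";
        },
        explode("'", $s)
    );

    return 'concat(' . implode(', "\'", ', $parts) . ')';
}

echo xpathLiteralSketch('a\'b"c'), PHP_EOL; // prints concat('a', "'", 'b"c')
```

XPath 1.0 string literals have no escape character, which is why a string containing both quote styles has to be assembled with concat().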
Parameters
- $s : string
Return values
string —Converted string
sibling()
protected
sibling(DOMNode $node[, string $siblingDir = 'nextSibling' ]) : array<string|int, mixed>
Parameters
- $node : DOMNode
- $siblingDir : string = 'nextSibling'
Return values
array<string|int, mixed>
canParseHtml5String()
private
canParseHtml5String(string $content) : bool
Parameters
- $content : string
Return values
bool
convertToHtmlEntities()
Converts charset to HTML-entities to ensure valid parsing.
private
convertToHtmlEntities(string $htmlContent[, string $charset = 'UTF-8' ]) : string
Parameters
- $htmlContent : string
- $charset : string = 'UTF-8'
Return values
string
createCssSelectorConverter()
private
createCssSelectorConverter() : CssSelectorConverter
Return values
CssSelectorConverter
createDOMXPath()
private
createDOMXPath(DOMDocument $document[, array<string|int, mixed> $prefixes = [] ]) : DOMXPath
Parameters
- $document : DOMDocument
- $prefixes : array<string|int, mixed> = []
Return values
DOMXPath
createSubCrawler()
Creates a crawler for some subnodes.
private
createSubCrawler(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $nodes) : static
Parameters
- $nodes : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null
Return values
static
discoverNamespace()
private
discoverNamespace(DOMXPath $domxpath, string $prefix) : string|null
Parameters
- $domxpath : DOMXPath
- $prefix : string
Return values
string|null
filterRelativeXPath()
Filters the list of nodes with an XPath expression.
private
filterRelativeXPath(string $xpath) : static
The XPath expression should already be processed to apply it in the context of each node.
Parameters
- $xpath : string
Return values
static
findNamespacePrefixes()
private
findNamespacePrefixes(string $xpath) : array<string|int, mixed>
Parameters
- $xpath : string
Return values
array<string|int, mixed>
isValidHtml5Heading()
private
isValidHtml5Heading(string $heading) : bool
Parameters
- $heading : string
Return values
bool
parseHtml5()
private
parseHtml5(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument
Parameters
- $htmlContent : string
- $charset : string = 'UTF-8'
Return values
DOMDocument
parseHtmlString()
Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.
private
parseHtmlString(string $content, string $charset) : DOMDocument
Uses the libxml parser otherwise.
Parameters
- $content : string
- $charset : string
Return values
DOMDocument
parseXhtml()
private
parseXhtml(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument
Parameters
- $htmlContent : string
- $charset : string = 'UTF-8'
Return values
DOMDocument
relativize()
Makes the XPath relative to the current context.
private
relativize(string $xpath) : string
The returned XPath will match elements matching the XPath inside the current crawler when running in the context of a node of the crawler.
Parameters
- $xpath : string