use HTML::Element;
$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$a->push_content("The Perl Homepage");

$tag = $a->tag;
print "$tag starts out as:", $a->starttag, "\n";
print "$tag ends as:", $a->endtag, "\n";
print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

$links_r = $a->extract_links();
print "Hey, I found ", scalar(@$links_r), " links.\n";

print "And that, as HTML, is: ", $a->as_HTML, "\n";
$a = $a->delete;
Objects of the HTML::Element class can be used to represent elements of HTML document trees. These objects have attributes, notably attributes that designate each element's parent and content. The content is an array of text segments and other HTML::Element objects. A tree with HTML::Element objects as nodes can represent the syntax tree for an HTML document.
<html lang='en-US'>
  <head>
    <title>Stuff</title>
    <meta name='author' content='Jojo'>
  </head>
  <body>
    <h1>I like potatoes!</h1>
  </body>
</html>
Building a syntax tree out of this document produces a tree structure in memory that could be diagrammed as:
                     html (lang='en-US')
                      / \
                    /     \
                  /         \
                head        body
               /\               \
             /    \               \
           /        \               \
         title     meta              h1
          |       (name='author',     |
       "Stuff"    content='Jojo')    "I like potatoes"
This is the traditional way to diagram a tree, with the ``root'' at the top, and it's this kind of diagram that people have in mind when they say, for example, that ``the meta element is under the head element instead of under the body element''. (The same is also said with ``inside'' instead of ``under'' --- the use of ``inside'' makes more sense when you're looking at the HTML source.)
Another way to represent the above tree is with indenting:
html (attributes: lang='en-US')
  head
    title
      "Stuff"
    meta (attributes: name='author' content='Jojo')
  body
    h1
      "I like potatoes"
Incidentally, diagramming with indenting works much better for very large trees, and is easier for a program to generate. The "$tree->dump" method uses indentation just that way.
However you diagram the tree, it's stored the same in memory --- it's a network of objects, each of which has attributes like so:
element #1:  _tag: 'html'
             _parent: none
             _content: [element #2, element #5]
             lang: 'en-US'

element #2:  _tag: 'head'
             _parent: element #1
             _content: [element #3, element #4]

element #3:  _tag: 'title'
             _parent: element #2
             _content: [text segment "Stuff"]

element #4:  _tag: 'meta'
             _parent: element #2
             _content: none
             name: author
             content: Jojo

element #5:  _tag: 'body'
             _parent: element #1
             _content: [element #6]

element #6:  _tag: 'h1'
             _parent: element #5
             _content: [text segment "I like potatoes"]
The ``treeness'' of the tree-structure that these elements comprise is not an aspect of any particular object, but is emergent from the relatedness attributes (_parent and _content) of these element-objects and from how you use them to get from element to element.
While you could access the content of a tree by writing code that says ``access the 'src' attribute of the root's first child's seventh child's third child'', you're more likely to have to scan the contents of a tree, looking for whatever nodes, or kinds of nodes, you want to do something with. The most straightforward way to look over a tree is to ``traverse'' it; an HTML::Element method ("$h->traverse") is provided for this purpose, and several other HTML::Element methods are based on it.
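As a rough illustration of traversal, the sketch below rebuilds the small example tree with "new_from_lol" (rather than parsing it) and records each element as the traversal enters it. It assumes the HTML::Tree distribution is installed; the variable names ($tree, @seen) are only illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example tree by hand; no parser is involved.
my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

my @seen;    # [depth, tag] pairs, in the order elements are entered
$tree->traverse(
    sub {
        my ($node, $entering, $depth) = @_;
        push @seen, [ $depth, $node->tag ] if $entering;
        return 1;    # a true value: carry on into this node's children
    },
    1,    # ignore text segments, so $node is always an element
);

# Print the tags, indented by depth.
print '  ' x $_->[0], $_->[1], "\n" for @seen;

$tree->delete;
```

The same callback is invoked on the way into and on the way out of most elements, which is why the sketch checks the second argument before recording anything.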
(For everything you ever wanted to know about trees, and then some, see Niklaus Wirth's Algorithms + Data Structures = Programs or Donald Knuth's The Art of Computer Programming, Volume 1.)
If methods are provided for accessing an attribute (like "$h->tag" for ``_tag'', "$h->content_list", etc. below), use those instead of calling "$h->attr", whether for reading or setting.
Note that setting an attribute to "undef" (as opposed to "", the empty string) actually deletes the attribute.
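A minimal sketch of reading, setting, and deleting attributes with "$h->attr" follows. It assumes HTML::Tree is installed; the URLs and variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# Setting returns the previous value.
my $old = $link->attr('href', 'http://www.cpan.org/');

# An empty string is a real value: the attribute is still present.
$link->attr('title', '');
my @with_title = grep { $_ eq 'title' } $link->all_attr_names;

# undef removes the attribute altogether.
$link->attr('title', undef);
my @after_delete = grep { $_ eq 'title' } $link->all_attr_names;

print "old href: $old\n";
print "title present after setting '': ",    scalar(@with_title),   "\n";
print "title present after setting undef: ", scalar(@after_delete), "\n";
```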
There are four kinds of ``pseudo-elements'' that show up as HTML::Element objects:
<!-- I like Pie. Pie is good -->
produces an HTML::Element object with these attributes:
"_tag", "~comment", "text", " I like Pie.\n Pie is good\n "
<!DOCTYPE foo>
produces an element whose attributes include:
"_tag", "~declaration", "text", "DOCTYPE foo"
<?stuff foo?>
produces an element whose attributes include:
"_tag", "~pi", "text", "stuff foo?"
(assuming a recent version of HTML::Parser)
That is, this is useful if you want to insert code into a tree that you plan to dump out with "as_HTML", where you want, for some reason, to suppress "as_HTML"'s normal behavior of amp-quoting text segments.
For example, this:
my $literal = HTML::Element->new('~literal',
  'text' => 'x < 4 & y > 7'
);
my $span = HTML::Element->new('span');
$span->push_content($literal);
print $span->as_HTML;
prints this:
<span>x < 4 & y > 7</span>
Whereas this:
my $span = HTML::Element->new('span');
$span->push_content('x < 4 & y > 7');
  # normal text segment
print $span->as_HTML;
prints this:
<span>x &lt; 4 &amp; y &gt; 7</span>
Unless you're inserting lots of pre-cooked code into existing trees, and dumping them out again, it's not likely that you'll find "~literal" pseudo-elements useful.
You should not use this to directly set the parent of an element. Instead use any of the other methods under ``Structure-Modifying Methods'', below.
Note that not($h->parent) is a simple test for whether $h is the root of its subtree.
In a scalar context, this returns the count of the items, as you may expect.
While older code should feel free to continue to use "$h->content", new code should use "$h->content_list" in almost all conceivable cases. It is my experience that in most cases this leads to simpler code anyway, since it means one can say:
@children = $h->content_list;
instead of the inelegant:
@children = @{$h->content || []};
If you do use "$h->content" (or "$h->content_array_ref"), you should not use the reference returned by it (assuming it returned a reference, and not undef) to directly set or change the content of an element or text segment! Instead use content_refs_list or any of the other methods under ``Structure-Modifying Methods'', below.
foreach my $item_r ($h->content_refs_list) {
  next if ref $$item_r;
  $$item_r =~ s/honour/honor/g;
}
You could currently achieve the same effect with:
foreach my $item (@{ $h->content_array_ref }) {
  # deprecated!
  next if ref $item;
  $item =~ s/honour/honor/g;
}
...except that using the return value of "$h->content" or "$h->content_array_ref" to do that is deprecated, and just might stop working in the future.
(This has nothing to do with the Perl function called ``pos'', for controlling where regular expression matching starts.)
If you set "$h->pos($element)", be sure that $element is either $h, or an element under $h.
If you've been modifying the tree under $h and are no longer sure "$h->pos" is valid, you can enforce validity with:
$h->pos(undef) unless $h->pos->is_inside($h);
Example output of "$h->all_attr()": '_parent', '[object_value]', '_tag', 'em', 'lang', 'en-US', '_content', '[array-ref value]'.
Example output of "$h->all_attr_names()": '_parent', '_tag', 'lang', '_content'.
$body->push_content(
  ['br'],
  ['ul',
    map ['li', $_], qw(Peaches Apples Pears Mangos)
  ]
);
See "new_from_lol" method's documentation, far below, for more explanation.
The push_content method will try to consolidate adjacent text segments while adding to the content list. That's to say, if $h's content_list is
('foo bar ', $some_node, 'baz!')
and you call
$h->push_content('quack?');
then the resulting content list will be this:
('foo bar ', $some_node, 'baz!quack?')
and not this:
('foo bar ', $some_node, 'baz!', 'quack?')
If the latter is what you want, you'll have to override the feature of consolidating text by using splice_content, as in:
$h->splice_content(scalar($h->content_list),0,'quack?');
Similarly, if you wanted to add 'Skronk' to the beginning of the content list, and called this:
$h->unshift_content('Skronk');
then the resulting content list will be this:
('Skronkfoo bar ', $some_node, 'baz!')
and not this:
('Skronk', 'foo bar ', $some_node, 'baz!')
What you'd do to get the latter is:
$h->splice_content(0,0,'Skronk');
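The consolidation behaviour described above can be checked with a short script. This is only a sketch, assuming HTML::Tree is installed; the 'em' element stands in for $some_node and the extra 'moo' segment is made up for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $h    = HTML::Element->new('p');
my $node = HTML::Element->new('em');    # stands in for $some_node
$h->push_content('foo bar ', $node, 'baz!');

# Appending text after a trailing text segment merges the two.
$h->push_content('quack?');
my @merged = $h->content_list;

# splice_content inserts at an offset without merging.
$h->splice_content(scalar($h->content_list), 0, 'moo');
my @spliced = $h->content_list;

printf "after push_content:   %d items, last is '%s'\n",
    scalar(@merged), $merged[-1];
printf "after splice_content: %d items, last is '%s'\n",
    scalar(@spliced), $spliced[-1];
```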
The items of content to be added should each be either a text segment (a string), an HTML::Element object, or an arrayref (which is fed thru "new_from_lol").
The unshift_content method will try to consolidate adjacent text segments while adding to the content list. See above for a discussion of this.
The items of content to be added (if any) should each be either a text segment (a string), an arrayref (which is fed thru "new_from_lol"), or an HTML::Element object that's not already a child of $h.
Also, note that this method does not destroy $h --- use "$h->replace_with(...)->delete" if you need that.
Perl uses garbage collection based on reference counting; when no references to a data structure exist, it's implicitly destroyed --- i.e., when no value anywhere points to a given object anymore, Perl knows it can free up the memory that the now-unused object occupies.
But this fails with HTML::Element trees, because a parent element always holds references to its children, and its children elements hold references to the parent, so no element ever looks like it's not in use. So, to destroy those elements, you need to call "$h->delete" on the parent.
The returned element is parentless. Any '_pos' attributes present in the source element/tree will be absent in the copy. For that and other reasons, the clone of an HTML::TreeBuilder object that's in mid-parse (i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot (currently) be used to continue the parse.
You are free to clone HTML::TreeBuilder trees, just as long as: 1) they're done being parsed, or 2) you don't expect to resume parsing into the clone. (You can continue parsing into the original; it is never affected.)
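To illustrate that a clone is parentless and independent of its source, here is a small sketch. It assumes HTML::Tree is installed; the 'div'/'p' structure and the attribute values are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div', ['p', { class => 'note' }, 'Original text']]
);
my ($p) = $div->content_list;

my $copy = $p->clone;
my $copy_has_parent = defined $copy->parent ? 1 : 0;

# Changes to the copy should not show up in the original.
$copy->attr('class', 'changed');
$copy->push_content(' plus more');

my $orig_class = $p->attr('class');
my $orig_text  = $p->as_text;
my $copy_text  = $copy->as_text;

print "copy has a parent: $copy_has_parent\n";
print "original: class=$orig_class text='$orig_text'\n";
print "copy:     text='$copy_text'\n";

# Both trees need explicit cleanup.
$copy->delete;
$div->delete;
```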
Note that this must be called as a class method, not as an instance method. "clone_list" will croak if called as an instance method. You can also call it like so:
ref($h)->clone_list(...nodes...)
If $indent_char is specified and defined, the HTML to be output is indented, using the string you specify (which you probably should set to ``\t'', or some number of spaces, if you specify it).
If "\%optional_end_tags" is specified and defined, it should be a reference to a hash that holds a true value for every tag name whose end tag is optional. Defaults to "\%HTML::Element::optionalEndTag", which is an alias to %HTML::Tagset::optionalEndTag, which, at time of writing, contains true values for "p, li, dt, dd". A useful value to pass is an empty hashref, "{}", which means that no end-tags are optional for this dump. Otherwise, possibly consider copying %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting values as you like, and passing a reference to that hash.
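As a sketch of passing an empty hashref so that no end-tags are treated as optional, consider the following. It assumes HTML::Tree is installed; the two-space indent string and the list items are arbitrary choices for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $list = HTML::Element->new_from_lol(
    ['ul', ['li', 'Peaches'], ['li', 'Apples']]
);

# undef:  use the default entity encoding
# "  ":   indent with two spaces
# {}:     treat no end-tags as optional for this dump
my $explicit = $list->as_HTML(undef, "  ", {});
print $explicit;

# Count the closing tags that were written out.
my $li_end_tags = () = $explicit =~ m{</li>}g;
print "closing li tags: $li_end_tags\n";

$list->delete;
```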
Text under 'script' or 'style' elements is never included in what's returned. If "skip_dels" is true, then text content under ``del'' nodes is not included in what's returned.
The Lisp form is indented, and contains external (``href'', etc.) as well as internal attributes (``_tag'', ``_content'', ``_implicit'', etc.), except for ``_parent'', which is omitted.
Current example output for a given element:
("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
That is, a particular ``p'' element may happen to have no content, so $that_p_element->is_empty will be true --- even though the prototypical ``p'' element isn't ``empty'' (not in the way that the prototypical ``hr'' element is).
If you think this might make for potentially confusing code, consider simply using the clearer exact equivalent: not($h->content_list)
$h->parent->content->[$h->pindex]
or
($h->parent->content_list)[$h->pindex]
assuming $h isn't root. If the element $h is root, then $h->pindex returns undef.
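A brief sketch of that relationship, assuming HTML::Tree is installed (the list structure is made up for illustration):

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'a'], ['li', 'b'], ['li', 'c']]
);
my @items  = $ul->content_list;
my $second = $items[1];

# Index of $second within its parent's content list.
my $idx = $second->pindex;

# Looking that index up in the parent gives back the same element.
my $same = ($second->parent->content_list)[$idx];

# The root has no parent, so its pindex is undef.
my $root_idx = $ul->pindex;

print "pindex of second item: $idx\n";
print "round trip finds the same element\n" if $same == $second;
print "root pindex is undef\n" unless defined $root_idx;
```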
In list context: returns all the nodes that're the left siblings of $h (starting with the leftmost). If $h is the leftmost (or only) child of its parent (or has no parent), then this returns empty-list.
(See also $h->preinsert(LIST).)
In list context: returns all the nodes that're the right siblings of $h, starting with the leftmost. If $h is the rightmost (or only) child of its parent (or has no parent), then this returns empty-list.
(See also $h->postinsert(LIST).)
So if the way to get to a node starting at the root is to go to child 2 of the root, then child 10 of that, and then child 0 of that, and then you're there --- then that node's address is ``0.2.10.0''.
As a bit of a special case, the address of the root is simply ``0''.
I foresee this being used mainly for debugging, but you may find your own uses for it.
If there is no node at the given address, this returns undef.
You can specify ``relative addressing'' (i.e., that indexing is supposed to start from $h and not from $h->root) by having the address start with a period --- e.g., $h->address(".3.2") will look at child 3 of $h, and child 2 of that.
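The following sketch shows an address being reported and then resolved again, both absolutely and relatively. It assumes HTML::Tree is installed; the tree and the deliberately missing address '0.7' are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $root = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'T']],
        ['body', 'stuff', ['p', 'hello']],
    ]
);

my $p    = $root->look_down('_tag', 'p');
my $addr = $p->address;    # text segments count as children too

# Absolute lookup, starting from the root.
my $found = $root->address($addr);

# Relative lookup: child 1 of the body element.
my $body     = $p->parent;
my $relative = $body->address('.1');

# An address with no node behind it.
my $missing = $root->address('0.7');

print "address of p: $addr\n";
print "absolute lookup matches\n" if $found == $p;
print "relative lookup matches\n" if $relative == $p;
print "no node at 0.7\n" unless defined $missing;
```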
If you simply want a count of the number of elements in $h's lineage, use $h->depth.
This method is deprecated in favor of the more expressive "look_down" method, which new code should use instead.
There are three kinds of criteria you can specify:
my @wide_pix_images = $h->look_down(
  "_tag", "img",
  "alt", "pix!",
  sub { $_[0]->attr('width') > 350 }
);
Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria are almost always faster than coderef criteria, so should presumably be put before them in your list of criteria. That is, in the example above, the sub ref is called only for elements that have already passed the criteria of having a ``_tag'' attribute with value ``img'', and an ``alt'' attribute with value ``pix!''. If the coderef were first, it would be called on every element, and then what elements pass that criterion (i.e., elements for which the coderef returned true) would be checked for their ``_tag'' and ``alt'' attributes.
Note that comparison of string attribute-values against the string value in "(attr_name, attr_value)" is case-INsensitive! A criterion of "('align', 'right')" will match an element whose ``align'' value is ``RIGHT'', or ``right'' or ``rIGhT'', etc.
Note also that "look_down" considers "" (empty-string) and undef to be different things, in attribute values. So this:
$h->look_down("alt", "")
will find elements with an ``alt'' attribute, but where the value for the ``alt'' attribute is "". But this:
$h->look_down("alt", undef)
is the same as:
$h->look_down(sub { !defined($_[0]->attr('alt')) } )
That is, it finds elements that do not have an ``alt'' attribute at all (or that do have an ``alt'' attribute, but with a value of undef --- which is not normally possible).
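The difference between "" and undef as criterion values can be seen in a small sketch like this one. It assumes HTML::Tree is installed; the image file names and alt text are made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div',
        ['img', { src => 'a.png', alt => '' }],          # empty alt
        ['img', { src => 'b.png' }],                     # no alt at all
        ['img', { src => 'c.png', alt => 'A chart' }],   # real alt text
    ]
);

# Elements whose alt attribute exists and is the empty string.
my @empty_alt = $div->look_down('_tag', 'img', 'alt', '');

# Elements that have no alt attribute.
my @no_alt = $div->look_down('_tag', 'img', 'alt', undef);

print "empty alt: ", join(', ', map { $_->attr('src') } @empty_alt), "\n";
print "no alt:    ", join(', ', map { $_->attr('src') } @no_alt),    "\n";
```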
Note that when you give several criteria, this is taken to mean you're looking for elements that match all your criteria, not just any of them. In other words, there is an implicit ``and'', not an ``or''. So if you wanted to express that you wanted to find elements with a ``name'' attribute with the value ``foo'' or with an ``id'' attribute with the value ``baz'', you'd have to do it like:
@them = $h->look_down(
  sub {
    # the lcs are to fold case
    lc($_[0]->attr('name')) eq 'foo'
    or lc($_[0]->attr('id')) eq 'baz'
  }
);
Coderef criteria are more expressive than "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria could be expressed in terms of coderefs. However, "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria are a convenient shorthand. (In fact, "look_down" itself is basically ``shorthand'' too, since anything you can do with "look_down" you could do by traversing the tree, either with the "traverse" method or with a routine of your own. However, "look_down" often makes for very concise and clear code.)
($h, $h->descendants)
$h->look_up instead scans over the list
($h, $h->lineage)
So, for example, this returns all ancestors of $h (possibly including $h itself) that are ``td'' elements with an ``align'' attribute with a value of ``right'' (or ``RIGHT'', etc.):
$h->look_up("_tag", "td", "align", "right");
Consider a document consisting of:
<html lang='i-klingon'>
  <head><title>Pati Pata</title></head>
  <body>
    <h1 lang='la'>Stuff</h1>
    <p lang='es-MX' align='center'>
      Foo bar baz <cite>Quux</cite>.
    </p>
    <p>Hooboy.</p>
  </body>
</html>
If $h is the ``cite'' element, $h->attr_get_i("lang") in list context will return the list ('es-MX', 'i-klingon'). In scalar context, it will return the value 'es-MX'.
If you call with multiple attribute names...
$h->attr_get_i('lang', 'align');
will return:
('es-MX', 'center', 'i-klingon') # in list context
or
'es-MX' # in scalar context.
But note that this:
$h->attr_get_i('align', 'lang');
will return:
('center', 'es-MX', 'i-klingon') # in list context
or
'center' # in scalar context.
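The inheritance lookups above can be reproduced with a sketch along these lines. It assumes HTML::Tree is installed and builds the document with "new_from_lol" instead of parsing it, so whitespace-only text is left out; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'i-klingon' },
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', { lang => 'la' }, 'Stuff'],
            ['p', { lang => 'es-MX', align => 'center' },
                'Foo bar baz ', ['cite', 'Quux'], '.'],
            ['p', 'Hooboy.'],
        ],
    ]
);

my $cite = $tree->look_down('_tag', 'cite');

my @langs      = $cite->attr_get_i('lang');             # list context
my $lang       = $cite->attr_get_i('lang');             # scalar context
my @lang_align = $cite->attr_get_i('lang', 'align');
my @align_lang = $cite->attr_get_i('align', 'lang');

print "lang (list):   @langs\n";
print "lang (scalar): $lang\n";
print "lang, align:   @lang_align\n";
print "align, lang:   @align_lang\n";
```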
{
  # Across $h and all descendants...
  'a'   => [ ...list of all 'a'   elements... ],
  'em'  => [ ...list of all 'em'  elements... ],
  'img' => [ ...list of all 'img' elements... ],
}
(There are entries in the hash for only those tagnames that occur at/under $h --- so if there are no ``img'' elements, there'll be no ``img'' entry in the hash(ref) returned.)
Example usage:
my $map_r = $h->tagname_map();
my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
if(@heading_tags) {
  print "Heading levels used: @heading_tags\n";
} else {
  print "No headings.\n"
}
You might specify that you want to extract links from just some kinds of elements (instead of the default, which is to extract links from all the kinds of elements known to have attributes whose values represent links). For instance, if you want to extract links from only ``a'' and ``img'' elements, you could code it like this:
for (@{ $e->extract_links('a', 'img') }) {
  my($link, $element, $attr, $tag) = @$_;
  print
    "Hey, there's a $tag that links to ",
    $link, ", in its $attr attribute, at ",
    $element->address(), ".\n";
}
That is, if you read a file with lines delimited by "\cm\cj"'s, the text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling $h->nativize_pre_newlines on such a tree will turn "\cm\cj"'s into "\n"'s.
Tabs are expanded to however many spaces it takes to get to the next 8th column --- the usual way of expanding them.
Sameness of descendant elements is tested, recursively, with "$child1->same_as($child_2)", and sameness of text segments is tested with "$segment1 eq $segment2".
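A short sketch of comparing two separately built trees follows. It assumes HTML::Tree is installed; the 'p' structure and class values are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

# Two trees built from the same description.
my @lol = ('p', { class => 'x' }, 'Hi ', ['b', 'there']);
my $one = HTML::Element->new_from_lol([@lol]);
my $two = HTML::Element->new_from_lol([@lol]);

# Same tags, attributes, and content, although distinct objects.
my $same_before = $one->same_as($two) ? 1 : 0;

# Changing one attribute value makes them differ.
$two->attr('class', 'y');
my $same_after = $one->same_as($two) ? 1 : 0;

print "same before change: $same_before\n";
print "same after change:  $same_after\n";

$_->delete for $one, $two;
```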
In each arrayref in that structure, different kinds of values are treated as follows:
Arrayrefs are considered to designate a sub-tree representing children for the node constructed from the current arrayref.
Hashrefs are considered to contain attribute-value pairs to add to the element to be constructed from the current arrayref.
Text segments at the start of any arrayref will be considered to specify the name of the element to be constructed from the current arrayref; all other text segments will be considered to specify text segments as children for the current arrayref.
Existing element objects are either inserted into the treelet constructed, or clones of them are. That is, when the lol-tree is being traversed and elements constructed based on what's in it, if an existing element object is found, if it has no parent, then it is added directly to the treelet constructed; but if it has a parent, then "$that_node->clone" is added to the treelet at the appropriate place.
An example will hopefully make this more obvious:
my $h = HTML::Element->new_from_lol(
  ['html',
    ['head',
      [ 'title', 'I like stuff!' ],
    ],
    ['body',
      {'lang', 'en-JP', _implicit => 1},
      'stuff',
      ['p', 'um, p < 4!', {'class' => 'par123'}],
      ['div', {foo => 'bar'}, '123'],
    ]
  ]
);
$h->dump;
Will print this:
<html> @0
  <head> @0.0
    <title> @0.0.0
      "I like stuff!"
  <body lang="en-JP"> @0.1 (IMPLICIT)
    "stuff"
    <p class="par123"> @0.1.1
      "um, p < 4!"
    <div foo="bar"> @0.1.2
      "123"
And printing $h->as_HTML will give something like:
<html><head><title>I like stuff!</title></head>
<body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
<div foo="bar">123</div></body></html>
You can even do fancy things with "map":
$body->push_content(
  # push_content implicitly calls new_from_lol on arrayrefs...
  ['br'],
  ['blockquote',
    ['h2', 'Pictures!'],
    map ['p', $_],
      $body2->look_down("_tag", "img"),
      # images, to be copied from that other tree.
  ],
  # and more stuff:
  ['ul',
    map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
      qw(Peaches Apples Pears Mangos)
  ],
);
@elements = HTML::Element->new_from_lol(
  ['hr'],
  ['p', 'And there, on the door, was a hook!'],
);
# constructs two elements.
Note that these ``~text'' objects are not recognized as text nodes by methods like as_text. Presumably you will want to call $h->objectify_text, perform whatever task that you needed that for, and then call $h->deobjectify_text before calling anything like $h->as_text.
Note that if $h itself is a ``~text'' pseudo-element, it will be destroyed --- a condition you may need to treat specially in your calling code (since it means you can't very well do anything with $h after that). So that you can detect that condition, if $h is itself a ``~text'' pseudo-element, then this method returns the value of the ``text'' attribute, which should be a defined value; in all other cases, it returns undef.
(This method assumes that no ``~text'' pseudo-element has any children.)
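The round trip through ``~text'' pseudo-elements might look like the sketch below. It assumes HTML::Tree is installed; the paragraph content is made up, and the use of "look_down" to collect the pseudo-elements is just one possible way to reach them.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'Hello, ', ['b', 'world'], '!']
);

# Turn every text segment into a ~text pseudo-element.
$p->objectify_text;

# Text is now reachable with element-oriented methods.
my @text_nodes = $p->look_down('_tag', '~text');
my @texts      = map { $_->attr('text') } @text_nodes;

# Turn the pseudo-elements back into plain text segments.
$p->deobjectify_text;
my $restored    = $p->as_text;
my $first_child = ($p->content_list)[0];

print "found ", scalar(@texts), " text pseudo-elements\n";
print "restored text: $restored\n";
print "first child is a plain string again\n" unless ref $first_child;
```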
This returns empty list (or false, in scalar context) if the subtree's linkage methods are sane; otherwise it returns two items (or true, in scalar context): the element where the error occurred, and a string describing the error.
This method is provided mainly for debugging and troubleshooting --- it should be quite impossible for any document constructed via HTML::TreeBuilder to parse into a non-sane tree (since it's not the content of the tree per se that's in question, but whether the tree in memory was properly constructed); and it should be impossible for you to produce an insane tree just thru reasonable use of normal documented structure-modifying methods. But if you're constructing your own trees, and your program is going into infinite loops during calls to traverse() or any of the secondary structural methods, as part of debugging, consider calling is_insane on the tree.
* There's almost nothing to stop you from making a ``tree'' with cyclicities (loops) in it, which could, for example, make the traverse method go into an infinite loop. So don't make cyclicities! (If all you're doing is parsing HTML files, and looking at the resulting trees, this will never be a problem for you.)
* There's no way to represent comments or processing directives in a tree with HTML::Elements. Not yet, at least.
* There's (currently) nothing to stop you from using an undefined value as a text segment. If you're running under "perl -w", however, this may make HTML::Element's code produce a slew of warnings.
* The value of an element's _parent attribute must either be undef or otherwise false, or must be an element.
* The value of an element's _content attribute must either be undef or otherwise false, or a reference to an (unblessed) array. The array may be empty; but if it has items, they must ALL be either mere strings (text segments), or elements.
* The value of an element's _tag attribute should, at least, be a string of printable characters.
Moreover, bear these rules in mind:
* Do not break encapsulation on objects. That is, access their contents only thru $obj->attr or more specific methods.
* You should think twice before completely overriding any of the methods that HTML::Element provides. (Overriding with a method that calls the superclass method is not so bad, though.)
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
Original authors: Gisle Aas, Sean Burke and Andy Lester.
Thanks to Mark-Jason Dominus for a POD suggestion.