The XPath language was designed for traversing XML tree structures, but we can use it on HTML trees as well. Here’s a sample program for extracting search result links from a Google search. We’ll use XPath to find the data we want, and then pick apart the XPath syntax:
The XPath used in this program is:
//h3/a
In English, this XPath says:
Find all “a” tags with a parent tag whose name is “h3”
Thus, our program finds all “a” tags with “h3” parents, loops over them, and prints out the text content.
XPath works like a directory path: a leading “/” indicates the root of the tree, and slashes separate the steps that match tags. When there’s nothing between two slashes (“//”), it’s a sort of wildcard meaning “any tags may sit in between”, so “//h3” finds “h3” tags at any depth in the tree. The “h3” and “a” are tag name matchers, and match only tags with exactly that name.
2 comments:
OK, could you please suggest how to find all content separated by a <br> tag with Nokogiri?
For example:
"text<br>text"
I need to get an array of the pieces separated by the br tag.
I'm so sorry for my late response. You can loop over the matching nodes as shown below:
# doc is a Nokogiri document that has already been parsed
doc.xpath('//h3/a').each do |node|
  puts node.text
end
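For your case, I think you need to change the XPath so that it selects the text around the 'br' tags rather than the links under 'h3'. Simply swapping 'h3' for 'br' will not match anything, because 'br' tags have no children.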
Thank you.