
Friday, August 13, 2010

Getting Started with Nokogiri

The XPath language was designed for traversing an XML tree, but it works on HTML trees as well. Here’s a sample program that extracts search result links from a Google search. We’ll use XPath to find the data we want, and then pick apart the XPath syntax:
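require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0 and later, where a bare open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=doughnuts"))
# Select every "a" tag whose parent is an "h3" tag and print its text
doc.xpath('//h3/a').each do |node|
  puts node.text
end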
The XPath used in this program is:
//h3/a
In English, this XPath says:
Find all “a” tags with a parent tag whose name is “h3”
Thus, our program finds all “a” tags with “h3” parents, loops over them, and prints out the text content.
XPath works like a directory path, where a leading “/” indicates the root of the tree and slashes separate the tag matchers. A double slash (“//”) means “at any depth”, so “//h3” matches “h3” tags anywhere in the document, not just directly under the root. The “h3” and “a” are tag name matchers: each step matches only tags with that name.

2 comments:

Anonymous said...

OK, can you please suggest how to find all content separated by a br tag with Nokogiri?

For example
"text
text
"

I need to get an array of the pieces separated by the br tag.

byungjin said...

I'm so sorry for my late response. You can get the array of pieces with the code below:

doc.xpath('//h3/a').each do |node|
  puts node.text
end

I think you could change the 'h3' tag to a 'br' tag.
Thank you.
