html_nodes {rvest} | R Documentation
Extract pieces of HTML documents more easily using XPath and CSS selectors. CSS selectors are particularly useful in conjunction with http://selectorgadget.com/, which makes it easy to find exactly which selector you should be using. If you haven't used CSS selectors before, work your way through the fun tutorial at http://flukeout.github.io/.
html_nodes(x, css, xpath)
html_node(x, css, xpath)
x | Either a document, a node set, or a single node.
css, xpath | Nodes to select. Supply one of css or xpath.
html_node vs html_nodes
html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, whereas the result of html_nodes might be longer or shorter.
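The length difference described above can be sketched on a small inline document. The HTML string and the variable names here are made up for illustration and are not part of the package examples.

```r
library(rvest)

# Hypothetical inline document: three list items with 3, 0 and 1 <b> children
doc <- read_html("<ul><li><b>a</b><b>b</b><b>c</b></li><li>plain</li><li><b>d</b></li></ul>")

items <- html_nodes(doc, "li")
length(items)       # 3 list items

# html_nodes() collects every match across all inputs into one node set,
# so the result can be longer (or shorter) than the input
all_bold <- html_nodes(items, "b")
length(all_bold)    # 4

# html_node() returns exactly one entry per input node: the first match,
# or a "missing" node when an input has no match
first_bold <- html_node(items, "b")
length(first_bold)  # 3, same as length(items)
```

Because html_node keeps one result per input, it is the safer choice when the output has to stay aligned with the input, for example when building columns of a data frame.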
CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.
It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:
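To see the translation directly, a minimal sketch is to call selectr::css_to_xpath() yourself and compare both routes. The inline document is hypothetical, and the exact XPath string printed may differ between selectr versions, so no expected output is shown for it.

```r
library(rvest)
library(selectr)

# Hypothetical inline document for illustration
doc <- read_html("<center><font><b>x</b></font><font>y</font></center><b>z</b>")

# Translate a CSS selector into an XPath expression
xp <- css_to_xpath("center font b")
xp

# Selecting by CSS and by the translated XPath should find the same nodes
by_css   <- html_nodes(doc, css = "center font b")
by_xpath <- html_nodes(doc, xpath = xp)
identical(html_text(by_css), html_text(by_xpath))
```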
Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.
It supports :contains(text).
You can use !=: [foo!=bar] is the same as :not([foo=bar]).
:not() accepts a sequence of simple selectors, not just a single simple selector.
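A short sketch of the :contains() and != extensions listed above, run against a made-up inline document. The class names and text are illustrative assumptions only.

```r
library(rvest)

# Hypothetical inline document for illustration
doc <- read_html('<p class="note">alpha</p><p class="warn">beta</p><p>gamma</p>')

# :contains(text) keeps elements whose text includes the given string
hits <- html_nodes(doc, "p:contains('alp')")
html_text(hits)

# [class!=note] and :not([class=note]) should select the same elements
neq    <- html_nodes(doc, "p[class!=note]")
negate <- html_nodes(doc, "p:not([class=note])")
identical(html_text(neq), html_text(negate))
```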
# CSS selectors ----------------------------------------------
ateam <- read_html("http://www.boxofficemojo.com/movies/?id=ateam.htm")
html_nodes(ateam, "center")
html_nodes(ateam, "center font")
html_nodes(ateam, "center font b")

# But html_node is best used in conjunction with %>% from magrittr
# You can chain subsetting:
ateam %>% html_nodes("center") %>% html_nodes("td")
ateam %>% html_nodes("center") %>% html_nodes("font")

td <- ateam %>% html_nodes("center") %>% html_nodes("td")
td
# When applied to a list of nodes, html_nodes() returns all nodes,
# collapsing results into a new nodelist.
td %>% html_nodes("font")
# html_node() returns the first matching node. If there are no matching
# nodes, it returns a "missing" node
if (utils::packageVersion("xml2") > "0.1.2") {
  td %>% html_node("font")
}

# To pick out an element at specified position, use magrittr::extract2
# which is an alias for [[
library(magrittr)
ateam %>% html_nodes("table") %>% extract2(1) %>% html_nodes("img")
ateam %>% html_nodes("table") %>% `[[`(1) %>% html_nodes("img")

# Find all images contained in the first two tables
ateam %>% html_nodes("table") %>% `[`(1:2) %>% html_nodes("img")
ateam %>% html_nodes("table") %>% extract(1:2) %>% html_nodes("img")

# XPath selectors ---------------------------------------------
# chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root node
# regardless of where you currently are in the doc
ateam %>%
  html_nodes(xpath = "//center//font//b") %>%
  html_nodes(xpath = "//b")
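The prefix issue mentioned in the XPath comments can be illustrated without a network request. This is a minimal sketch on a made-up inline document; it contrasts the absolute prefix // with the relative prefix .//, which is standard XPath rather than anything specific to the example above.

```r
library(rvest)

# Hypothetical inline document: one <b> inside a <div>, one outside it
doc <- read_html("<div><b>inside</b></div><b>outside</b>")
div <- html_nodes(doc, xpath = "//div")

# "//b" restarts from the document root, so it also finds the <b>
# that is not a descendant of the <div>
from_root <- html_nodes(div, xpath = "//b")
html_text(from_root)

# ".//b" searches only the descendants of the current node(s)
from_here <- html_nodes(div, xpath = ".//b")
html_text(from_here)
```

When chaining XPath calls, starting the expression with .// is usually what keeps the search scoped to the nodes selected in the previous step.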