Using R to scrape and parse the web
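This article uses rvest, an R package for web scraping, to fetch and parse web pages. rvest is modeled on libraries such as Python's Beautiful Soup and supports the %>% pipe syntax, which keeps the code concise and readable.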
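To make the pipe style concrete, here is a minimal sketch that parses an inline HTML fragment instead of a live site. The markup, the class name .item, and the variable names are invented for illustration only; the point is just that the piped form and the nested form do the same work.

```r
library(rvest)

# Hypothetical markup, invented for this sketch (not taken from any real site)
page <- read_html('<html><body>
  <div class="item"><h3>Alpha</h3></div>
  <div class="item"><h3>Beta</h3></div>
</body></html>')

# Nested-call style: has to be read from the inside out
nested <- html_text(html_nodes(page, ".item h3"))

# Pipe style: reads left to right, one step per stage
piped <- page %>% html_nodes(".item h3") %>% html_text()

# Both styles give the same character vector
identical(nested, piped)
```

The pipe passes the result on its left as the first argument of the function on its right, which is why each rvest step can be chained without intermediate variables.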
Loading the R packages
library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(GGally)
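Scraping the web page
This article is adapted from the example in the references. It scrapes hero attributes from a Blizzard game. The attributes of interest are name, role, attack range, HP, mana, attack damage, and attack speed; the code below extracts all of these except attack range.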
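# Download and parse the hero listing page (read_html() is the current rvest entry point; html() is deprecated)
heroes = read_html("http://www.heroesnexus.com/heroes")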
hero_name = heroes %>% html_nodes(".hero-champion h3") %>% html_text()
hero_type = heroes %>% html_nodes(".hero-champion .hero-type .role") %>% html_text()
hero_hp = heroes %>% html_nodes(".hero-champion .hero-hp") %>% html_text() %>%
  str_extract("(HP: \\d*)") %>% str_replace("HP: ", "") %>% as.numeric()
hero_Mana = heroes %>% html_nodes(".hero-champion .hero-mana") %>% html_text() %>%
  str_extract("(Mana: \\d*)") %>% str_replace("Mana: ", "") %>% as.numeric()
hero_attack_damage = heroes %>% html_nodes(".hero-champion .hero-atk") %>% html_text() %>%
  str_extract("(Attack Damage: \\d*)") %>% str_replace("Attack Damage: ", "") %>% as.numeric()
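# In an R string literal the backslash itself must be escaped: write \\d, not \d
# The pattern matches an integer with an optional decimal part; \\. is a literal dot
hero_attack_speed = heroes %>% html_nodes(".hero-champion .hero-atk") %>% html_text() %>%
  str_extract("Attack Speed: \\d+(\\.\\d+)?") %>% str_replace("Attack Speed: ", "") %>% as.numeric()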
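Note that regular expressions in R are written slightly differently from those in Python and Ruby. The most visible difference is that backslashes must be doubled inside ordinary R string literals. See the help pages (for example ?regex) for the details.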
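The escaping rule can be tried out without touching the network. The sketch below runs the same extract-then-strip steps on made-up strings; the sample text and the variable names are assumptions for illustration, not output from the site.

```r
library(stringr)

# Made-up text resembling what html_text() might return for two heroes
raw <- c("Attack Damage: 58 Attack Speed: 1.25",
         "Attack Damage: 102 Attack Speed: 10")

# "\\d+" in R source is the regex \d+ (one or more digits)
damage_txt <- str_extract(raw, "Attack Damage: \\d+")
damage <- as.numeric(str_replace(damage_txt, "Attack Damage: ", ""))

# "\\." is a literal dot; the group makes the decimal part optional
speed_txt <- str_extract(raw, "Attack Speed: \\d+(\\.\\d+)?")
speed <- as.numeric(str_replace(speed_txt, "Attack Speed: ", ""))

# The string "\\d" holds only two characters: a backslash and the letter d
escaped_length <- nchar("\\d")

print(damage)
print(speed)
print(escaped_length)
```

Escaping the dot matters here: an unescaped . matches any character, so a pattern like \\d?(.\\d*) would cut a value such as 10.5 down to 10.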
Building the final data set
final_data = data.frame(name = hero_name, type = hero_type, hp = hero_hp,
                        mana = hero_Mana, atk = hero_attack_damage, atk_spd = hero_attack_speed)
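data.frame() needs every column to have the same length, so if one CSS selector matches fewer nodes than the others the call fails. One possible guard, sketched here with made-up values and hypothetical names (sample_cols, check_data) rather than the scraped vectors, is to compare the lengths first:

```r
# Made-up values standing in for the scraped vectors
sample_cols <- list(
  name = c("Alpha", "Beta", "Gamma"),
  hp   = c(1520, 1300, 1700),
  mana = c(500, 490, NA)  # NA where no number could be extracted
)

# Number of elements in each candidate column
lens <- lengths(sample_cols)

# Stop early with a readable message if the columns do not line up
if (length(unique(lens)) != 1) {
  stop("Column lengths differ: ",
       paste(names(lens), lens, sep = "=", collapse = ", "))
}

check_data <- as.data.frame(sample_cols, stringsAsFactors = FALSE)
str(check_data)
```

Missing values kept as NA keep the rows aligned, whereas a selector that skips a hero entirely would shift every later value by one row.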
Visualizing the data
# Pairwise plots of the numeric attributes, colored by hero type
final_data %>%
  select(hp, atk, atk_spd, type) %>%
  ggpairs(mapping = aes(color = type), title = "Hero types")