In this guide, I’ll go over how you can use web scraping with rvest and Selenium to get translations from Google Translate. Note: I encourage responsible scraping - I always try to leave some space between requests. The free Google Translate only accepts 5000 characters at a time. I first tried to do this with just rvest, relying on the predictability of the links for Google Translate, but I could not get rvest to pull the right data off the page, so here’s a slightly more difficult approach that appears to work. Happy to hear comments!
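Since the free page only takes 5000 characters at a time, longer texts have to be split up before you send them. Here’s a minimal sketch of one way to do that. chunk_text is a hypothetical helper name of mine (it is not part of rvest or RSelenium), and it assumes words are separated by single spaces and that no single word is longer than the limit.

```r
# Split a long string into pieces of at most `limit` characters,
# breaking only at spaces. chunk_text is a hypothetical helper name.
chunk_text <- function(text, limit = 5000) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  chunks <- character(0)
  current <- ""
  for (w in words) {
    candidate <- if (nchar(current) == 0) w else paste(current, w)
    if (nchar(candidate) > limit && nchar(current) > 0) {
      chunks <- c(chunks, current)  # current piece is full, start a new one
      current <- w
    } else {
      current <- candidate
    }
  }
  if (nchar(current) > 0) chunks <- c(chunks, current)
  chunks
}

# Tiny limit here just to make the splitting visible
pieces <- chunk_text("hebben deze van door heet woord maar wat sommige", limit = 20)
pieces
```

Each piece can then be sent as its own request, with a pause in between.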
First, load the rvest and RSelenium libraries. I wish I could remember precisely what I did to set up RSelenium, but I don’t :| There are good tutorials out there if you need help with setting it up.
library(rvest)
library(RSelenium)
## Loading required package: xml2
Next, put in the text you would like to translate:
words_translate <- c("hebben deze van door heet woord maar wat sommige")
This next part controls the browser:
- rsDriver tells you what browser to control/open and gets the session started. If you get an error that there’s already something open on that port, run rD[["server"]]$stop() to stop the session and try again.
- The second line sets you up as the client for controlling the session.
- $navigate is exactly how it sounds: go to this page.
- When you run these, you will see a browser open, then go to the Google page.
##an example to show you what's happening
rD <- rsDriver(browser = "firefox")
remDr <- rD[["client"]]
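The step that sends the browser to the page might look like the sketch below. The URL is my assumption (the plain Google Translate home page), so swap in whatever page you actually want. go_to_page is a hypothetical wrapper; the only RSelenium call it makes is the client’s navigate method.

```r
# Assumed URL: the Google Translate home page. Adjust as needed.
translate_url <- "https://translate.google.com/"

# go_to_page is a hypothetical wrapper; `client` is the object from rD[["client"]].
go_to_page <- function(client, url = translate_url) {
  client$navigate(url)  # tell the browser to load the page
  invisible(url)
}

# With a live session from the code above you would run:
# go_to_page(remDr)
```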
Once you get the page open, this part is a bit harder. You have to figure out the area of the page you want to control. I have used the SelectorGadget plugin for this, as well as right-clicking -> Inspect Element to find the right class ids, and also just View Page Source because I understand HTML. You should start with SelectorGadget if you aren’t familiar with HTML and CSS.
- $findElement finds a specific area of the page.
- $sendKeysToElement sends text to the area of the page you found. You can also do things like clickElement to click on a certain area of the page. Note that \uE007 is the Enter key. So, we are filling in the words we want and hitting Enter.
- $getPageSource gets the page source. rvest can read a page as well, but I could not get that to find all the right information to get the translated text back.
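# find the text box by its class name
webElem <- remDr$findElement(using = "class name", value = "goog-textarea")
# type the words, then press Enter (\uE007)
webElem$sendKeysToElement(list(words_translate, "\uE007"))
# grab the page source (it comes back as a list)
webpage <- remDr$getPageSource()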
Next, you need to translate the page source into something usable. I will say that in theory, html_nodes allows you to specify the class id you are looking for (that’s the result-shield stuff), but I could not get that to work. So, I grabbed the text and the class codes, slapped them together, and then sorted it out.
#load dplyr
library(dplyr, quietly = TRUE)
#get all the text
answers <- webpage %>% #your webpage
  unlist() %>% #unlist, as it saves as a list
  read_html() %>% #read the html
  html_nodes("div") %>% #grab all the divs
  html_text() #get the text from those divs
#get the class names
class_names <- webpage %>%
  unlist() %>%
  read_html() %>%
  html_nodes("div") %>%
  html_attrs() %>% #get the attributes, that's the class codes
  sapply(function(x) x[1]) #just the first one is good
#get the answer that has this class code
answers[class_names == "result-shield-container tlid-copy-target"][1]
## [1] "have this van by hot word but some"
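To see the matching step in isolation, here is a small sketch that runs the same idea on a made-up HTML snippet rather than the live page. The snippet and its div contents are invented for illustration; only the class string comes from the code above. It reads the class directly with html_attr and compares with %in%, so divs without a class (which come back as NA) are skipped rather than returned as NA.

```r
library(rvest)

# Made-up page source for illustration; not the real Google Translate markup.
toy_source <- paste0(
  "<html><body>",
  "<div>no class here</div>",
  "<div class='other'>something else</div>",
  "<div class='result-shield-container tlid-copy-target'>have this</div>",
  "</body></html>"
)

divs <- html_nodes(read_html(toy_source), "div")
texts <- html_text(divs)             # text of every div
classes <- html_attr(divs, "class")  # class of every div, NA if missing

target <- "result-shield-container tlid-copy-target"
translation <- texts[classes %in% target][1]
translation
```

On a real page the class names may well be different, so check them with SelectorGadget or Inspect Element first.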
Now we have the translation of some top Dutch words. You could loop over a set of translations you want to do, storing them in a data frame, tibble, list, etc. I would recommend a Sys.sleep() between loops so you don’t make the website angry. I usually use something like Sys.sleep(runif(1,0,5)) to get a random sleep time between 0 and 5 seconds.
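As a rough sketch of such a loop: translate_one below is a hypothetical stand-in for the Selenium steps above (send the text, read back the result), and translate_all is likewise a name I made up. Passing the translator in as an argument keeps the looping and sleeping separate from the browser code.

```r
# translate_all and translate_one are hypothetical names for illustration.
# translate_one should take one string and return its translation.
translate_all <- function(texts, translate_one, max_wait = 5) {
  results <- data.frame(original = texts,
                        translated = rep(NA_character_, length(texts)),
                        stringsAsFactors = FALSE)
  for (i in seq_along(texts)) {
    results$translated[i] <- translate_one(texts[i])
    # random pause between requests, skipped after the last one
    if (i < length(texts)) Sys.sleep(runif(1, 0, max_wait))
  }
  results
}

# Stand-in translator so the sketch runs without a browser.
fake_translate <- function(x) toupper(x)
out <- translate_all(c("hebben", "deze"), fake_translate, max_wait = 0)
out
```

In real use you would replace fake_translate with a function that wraps the findElement, sendKeysToElement and getPageSource steps, and leave max_wait at 5 or higher.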
When you are done be sure to close the remote session/connection:
#close the browser
remDr$close()
# stop the selenium server
rD[["server"]]$stop()
The nice thing about this setup is that you could pull the automatic translation here, and then “click” on a different translation using Selenium - you would just have to figure out where to click on the page. I find myself doing a lot of trial and error for clicks, so just play around with it until it clicks where you want.