In this guide, I’ll go over how you can use web scraping with Selenium to get translations from Google Translate. Note: I encourage responsible scraping, and I always try to leave some space between requests. The free Google Translate only accepts 5000 characters at a time. I will say that I tried to do this with just rvest and the predictable links for Google Translate, but I could not get rvest to pull the right data off the page, so here’s a slightly more difficult approach that appears to work. Happy to hear comments!
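If your text is longer than the 5000 character limit, one option is to cut it into pieces first and translate each piece separately. Here is a minimal sketch in base R. The helper name chunk_text is made up for this example, and it assumes that no single word is longer than the limit.

```r
# Hypothetical helper (not part of any package): split a long string into
# pieces of at most `limit` characters, breaking only on spaces.
# Assumes no single word is longer than `limit`.
chunk_text <- function(text, limit = 5000) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  chunks <- character(0)
  current <- ""
  for (w in words) {
    # Try adding the next word to the piece we are building
    candidate <- if (nchar(current) == 0) w else paste(current, w)
    if (nchar(candidate) <= limit) {
      current <- candidate
    } else {
      # The piece is full: store it and start a new one with this word
      if (nchar(current) > 0) chunks <- c(chunks, current)
      current <- w
    }
  }
  # Store whatever is left over
  if (nchar(current) > 0) chunks <- c(chunks, current)
  chunks
}

# A tiny limit just to show the splitting
chunks <- chunk_text("hebben deze van door heet woord maar wat sommige", limit = 20)
```

Each element of chunks can then be sent to the page on its own.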
First, load the rvest and RSelenium libraries. I wish I could remember precisely what I did to set up RSelenium, but I don’t :| There are good tutorials out there if you need help with setting it up.
library(rvest)
library(RSelenium)
## Loading required package: xml2
Next, put in the text you would like to translate:
## words
words_translate <- c("hebben deze van door heet woord maar wat sommige")
This next part controls the browser:
- rsDriver tells Selenium which browser to control/open and gets the session started. If you get an error that there’s already something open on that port, run rD[["server"]]$stop() to stop the session and try again.
- The second line sets you up as the client for controlling the session.
- $navigate is exactly how it sounds: go to this page.
- When you run these, you will see a browser open, then go to the Google page.
## an example to show you what's happening
rD <- rsDriver(browser = "firefox")
remDr <- rD[["client"]]
remDr$navigate("https://translate.Google.com/")
Once you get the page open, this part is a bit harder. You have to figure out which area of the page you want to control. I have used the SelectorGadget plugin for this, as well as right click -> Inspect Element to find the right class ids, and also just View Page Source because I understand HTML. You should start with SelectorGadget if you aren’t familiar with HTML and CSS.
- $findElement finds a specific area of the page.
- $sendKeysToElement sends text to the area of the page you found. You can also do things like clickElement to click on a certain area of the page. Note that \uE007 is the Enter key. So, we are filling in the words we want and hitting Enter.
- $getPageSource gets the page source. I tried read_html on its own, but I could not get that to find all the right information to get the translated text back.
webElem <- remDr$findElement(using = "class name", "goog-textarea")
webElem$sendKeysToElement(list(words_translate, "\uE007"))
webpage <- remDr$getPageSource()
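If the raw \uE007 code is hard to remember, RSelenium also lets you name special keys inside the list you pass to sendKeysToElement, using key = "enter". Here is a small sketch of that; the wrapper name type_and_submit is made up for this example, and it expects an element like the webElem found above.

```r
# Hypothetical wrapper (the name is an assumption, not part of RSelenium):
# type `text` into an element returned by findElement, then press Enter.
type_and_submit <- function(web_elem, text) {
  # key = "enter" is RSelenium's named form of the "\uE007" Enter key
  web_elem$sendKeysToElement(list(text, key = "enter"))
}

# With a live session this would be called as:
# type_and_submit(webElem, words_translate)
```

Both versions should send the same keystrokes, so use whichever reads better to you.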
Next, you need to turn the page source into something usable. I will say that in theory, html_nodes allows you to specify the class id you are looking for (that’s the result-shield stuff), but I could not get that to work. So, I grabbed the text and the class codes, slapped them together, and then sorted it out.
# load dplyr
library(dplyr, quietly = TRUE)

# get all the text
answers <- webpage %>% # your webpage
  unlist() %>% # unlist, as it saves as a list
  read_html() %>% # read the html
  html_nodes("div") %>% # grab all the divs
  html_text() # get the text from those divs

# get the class names
class_names <- webpage %>%
  unlist() %>%
  read_html() %>%
  html_nodes("div") %>%
  html_attrs() %>% # get the attributes, that's the class codes
  sapply(function(x) x[1]) # just the first one is good

# get the answer that has this class code
answers[class_names == "result-shield-container tlid-copy-target"]
## [1] "have this van by hot word but some"
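For reference, here is what the "in theory" html_nodes approach looks like on a small static snippet. The HTML below is made up for this example; only the class names come from the page above. It shows how a CSS class selector behaves in rvest, and it is not a claim that the same selector works on the live Google page, whose markup can change.

```r
library(rvest)

# Made-up HTML standing in for the page source (an assumption for this
# sketch); only the class names are taken from the real page.
sample_html <- '<html><body>
  <div class="other">ignore me</div>
  <div class="result-shield-container tlid-copy-target">have this van</div>
</body></html>'

translation <- sample_html %>%
  read_html() %>% # parse the html string
  html_nodes("div.result-shield-container.tlid-copy-target") %>% # divs with both classes
  html_text() # get the text from the matching div
```

In a CSS selector, each class gets its own dot, so a div with two classes is written as div.first.second rather than with a space between the class names.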
Now we have the translation of some top Dutch words. You could loop over a set of translations you want to do, storing them in a data frame, tibble, list, etc. I would recommend a Sys.sleep() between loops so you don’t make the website angry. I usually use something like Sys.sleep(runif(1,0,5)) to get a random sleep time between 0 and 5 seconds.
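The loop could be sketched roughly like this. The function name translate_all and its arguments are made up for this example; translate_fn stands in for whatever function you write around the Selenium steps above, taking one string and returning its translation.

```r
# Hypothetical helper (names are assumptions): translate each text in turn,
# pausing a random amount of time between requests.
translate_all <- function(texts, translate_fn, max_wait = 5) {
  results <- vector("list", length(texts))
  for (i in seq_along(texts)) {
    results[[i]] <- data.frame(
      original = texts[i],
      translation = translate_fn(texts[i]),
      stringsAsFactors = FALSE
    )
    # random sleep between 0 and max_wait seconds, skipped after the last one
    if (i < length(texts)) Sys.sleep(runif(1, 0, max_wait))
  }
  # stack the one-row data frames into a single data frame
  do.call(rbind, results)
}

# Stand-in "translator" so the loop can be tried without a browser
demo <- translate_all(c("een", "twee"), toupper, max_wait = 0)
```

Passing the translator in as an argument means you can test the loop with a fake function first, before pointing it at the real site.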
When you are done be sure to close the remote session/connection:
# close the browser
remDr$close()
# stop the selenium server
rD[["server"]]$stop()
The nice thing about this setup is that you could pull the automatic translation here, and then “click” on a different translation using Selenium. You would just have to figure out where to click on the page. I find myself doing a lot of trial and error for clicks, so just play around with it until it clicks where you want.
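A click could be wrapped up something like this. The helper name click_on is made up for this example, and the selector in the usage comment is a placeholder, so you would need to find the real one with SelectorGadget or Inspect Element.

```r
# Hypothetical helper (the name is an assumption): find an element by CSS
# selector in an open RSelenium session and click it.
click_on <- function(remDr, css) {
  elem <- remDr$findElement(using = "css selector", value = css)
  elem$clickElement()
  invisible(elem) # hand the element back in case you want to reuse it
}

# With a live session this would be called as, for example:
# click_on(remDr, "span.some-alternative") # placeholder selector
```

After the click, you would grab the page source again with $getPageSource and pull out the text the same way as before.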