Thursday, September 3, 2015

Crawling the IMDB Website



Yesterday Farnoush asked me if I teach some web crawling with R in my graduate course.
Well! I thought I must do some web crawling myself first, then teach it to others. To start, you need to know some basic HTML and CSS. These languages are somewhat like LaTeX: HTML and CSS describe the structure and styling of a web page.
First we need a web crawling tool that reads web pages and puts them into a proper R class. I suggest installing the rvest R package.
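If rvest is not on your machine yet, something like the following should do. This is only a setup sketch: it assumes a standard R installation with internet access, and the CRAN mirror address is my choice, not a requirement.

```r
# Install rvest from CRAN only if it is not already available.
# The mirror URL below is an assumption; any CRAN mirror works.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "https://cloud.r-project.org")
}

# Load the package; this also makes the %>% pipe available.
library(rvest)
```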
Now let’s set our mission: we want to extract the rating of a movie from the IMDB website. I chose a well-known movie, “A Separation” by Asghar Farhadi, an Oscar-winning Iranian movie. The director’s family name matches Farnoush’s family name, a nice match; perhaps Farnoush and Asghar are relatives ;).
The IMDB webpage of the movie is here: http://www.imdb.com/title/tt1832382/



> library("rvest")
> separation_movie <- html("http://www.imdb.com/title/tt1832382/") ; class(separation_movie)

## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"
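If you want to play with this without hitting IMDB, you can parse a literal HTML string instead of a URL. A small sketch, assuming a newer rvest release where read_html() is the documented function for this job (the html() call above is from an older release); the toy markup is made up, and the class you get back differs from the one printed above.

```r
library(rvest)

# A made-up page standing in for a real web page (an assumption).
toy_html <- "<html><body><h1>Toy page</h1><p>Hello, crawler.</p></body></html>"

# read_html() accepts a URL, a file path, or a literal HTML string.
toy_page <- read_html(toy_html)

# The result is a parsed document object, not a plain character string.
class(toy_page)
is.character(toy_page)
```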

Watch carefully! The object separation_movie is no longer just a text file downloaded from the web; it is an HTML object in R. Let’s see how we can find what we need in this HTML object. If you know some basic HTML, you know that paragraphs are enclosed in <p> and </p> tags. Let’s check the 15th paragraph in the HTML object:

> (separation_movie %>% html_nodes("p"))[15]

## [[1]]
## 
## A married couple are faced with a difficult decision - to improve the life of their child by moving to another country or to stay in Iran and look after a deteriorating parent who has Alzheimer's disease.
## 
## attr(,"class")
## [1] "XMLNodeSet"
Wow! The 15th paragraph of the IMDB page is the story of the movie. Now let’s extract the text from it:


> (separation_movie %>% html_nodes("p"))[15] %>% html_text()

## [1] "\nA married couple are faced with a difficult decision - to improve the life of their child by moving to another country or to stay in Iran and look after a deteriorating parent who has Alzheimer's disease.
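The "\n" at the start of that output is just whitespace from the page source. Here is a small sketch of picking one paragraph by position and trimming it, run on a made-up HTML string rather than the live IMDB page. The markup and the index 2 are assumptions for illustration; the trim argument of html_text() does the cleanup.

```r
library(rvest)

# Made-up markup: the second paragraph plays the role of the plot summary.
toy_page <- read_html("<html><body>
  <p>Cast and crew</p>
  <p>
A married couple are faced with a difficult decision.</p>
</body></html>")

# Select all <p> nodes, keep the second one, and extract its text as is.
raw_text <- (toy_page %>% html_nodes("p"))[2] %>% html_text()

# trim = TRUE strips leading and trailing whitespace, including newlines.
clean_text <- (toy_page %>% html_nodes("p"))[2] %>% html_text(trim = TRUE)

raw_text
clean_text
```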
Well! Now let’s check if we can extract the rating of the movie from the HTML file. We just need to know how the movie rating is structured inside the HTML file. If you check the whole HTML object separation_movie, you will find that the rating sits in a span tag nested inside a strong tag, which is what the CSS selector "strong span" matches. Let’s try isolating that span then:
> separation_movie %>% html_nodes("strong span")

## [[1]]
## 8.4 
## 
## attr(,"class")
## [1] "XMLNodeSet"
If you want to extract the rating, just isolate the text inside the HTML tag:

> separation_movie %>% html_nodes("strong span") %>% html_text()

## [1] "8.4"
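Note that html_text() gives back a character string, so the rating needs converting before you do any arithmetic with it. A sketch on made-up markup that mimics the strong/span nesting the selector relies on; the real IMDB markup may differ and may change over time.

```r
library(rvest)

# Made-up fragment with a rating nested as <strong><span>...</span></strong>.
toy_page <- read_html("<html><body>
  <div><strong><span>8.4</span></strong> <span>/10</span></div>
</body></html>")

# "strong span" matches only <span> elements that sit inside a <strong>,
# so the "/10" span is not selected.
rating_text <- toy_page %>% html_nodes("strong span") %>% html_text()

# Convert the character value to a number for later computation.
rating <- as.numeric(rating_text)
rating
```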
We can easily build a database of IMDB with stories, actors, actor photos, directors, number of ratings, and so on.
We just need to crawl the links to other movies, save their web page addresses, and then crawl all of those addresses to save the information about each movie.
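The link-collecting step could look roughly like this. It is a sketch on a made-up page with placeholder title IDs, not a finished crawler: get_rating() is a hypothetical helper of mine, and it reuses the "strong span" selector from above, which may not match the current IMDB markup.

```r
library(rvest)

# Made-up page containing links to other titles (placeholder IDs).
toy_page <- read_html('<html><body>
  <a href="/title/tt0000001/?ref=rec">Movie one</a>
  <a href="/title/tt0000002/">Movie two</a>
  <a href="/title/tt0000001/">Movie one again</a>
  <a href="/name/nm0000001/">A person</a>
</body></html>')

# Collect the href attribute of every link on the page.
hrefs <- toy_page %>% html_nodes("a") %>% html_attr("href")

# Keep only title links, cut them down to the /title/tt.../ part,
# and drop duplicates. Non-matching links are discarded by regmatches().
title_paths <- unique(regmatches(hrefs, regexpr("^/title/tt[0-9]+/", hrefs)))

# Build the absolute addresses to visit next.
movie_urls <- paste0("http://www.imdb.com", title_paths)
movie_urls

# Hypothetical helper: fetch one movie page and return its first
# "strong span" value as a number.
get_rating <- function(url) {
  page <- read_html(url)
  as.numeric(page %>% html_nodes("strong span") %>% html_text())[1]
}

# Crawling all saved addresses would then look like this (not run here):
# ratings <- sapply(movie_urls, get_rating)
```

Repeating the same link extraction on every newly fetched page is what turns this into a crawl.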

1 comment:

  1. Here is an interview video of Sergey Brin (co-founder of Google with Larry Page) talking about crawlers,
    https://www.youtube.com/watch?v=CDXOcvUNBaA

    And here is a complete course in Python,
    https://www.udacity.com/course/intro-to-computer-science--cs101
