2019 - A Year in Bollywood
Before getting started
Source of Data:
This is one of the first posts for which I scraped the dataset myself, from Wikipedia.
Based on the structure of the data, I want to answer the following questions:
1. Which actor had the most movies in 2019?
2. Which genres work and which don’t in Bollywood?
3. Which production houses collaborate most often to make movies?
4. How many movies get released on “Non-Friday” days?
5. Lastly, are English titles in Bollywood movies the new normal?
Let's get started
I start by scraping the Bollywood movie data from the Wikipedia link shared earlier.
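library(purrr)
library(dplyr)      # provides %>% and mutate(), used below
library(htmltab)
library(lubridate)
url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
# Scrape tables 4 to 7 from the page and stack them into one data frame
tbls <- map2(url, 4:7, htmltab)
tbls <- do.call(rbind, tbls)
# Insert a space before every capital letter in Cast, then replace spaces with hyphens
tbls$Cast <- gsub('([[:upper:]])', ' \\1', tbls$Cast)
tbls$Cast <- gsub(' ', '-', tbls$Cast)
# Build the release date from the day (Opening.1) and month (Opening) columns,
# then derive the weekday of each release
tbls <- tbls %>%
  mutate(release_date = dmy(paste0(Opening.1, "-", Opening, "-", 2019)),
         day = wday(release_date, label = TRUE))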
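To make the date step easier to follow without hitting Wikipedia, here is a minimal sketch on a toy data frame. The rows and film titles are hypothetical, and I am assuming that Opening holds an abbreviated month name and Opening.1 the day of the month, which is what the dmy() call above implies.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows shaped like the scraped table (not real scraped data).
# Assumption: Opening = abbreviated month name, Opening.1 = day of month.
toy <- data.frame(
  Title     = c("Film A", "Film B"),
  Opening   = c("Jan", "Mar"),
  Opening.1 = c("11", "21"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(
    # Paste day, month and year into one string and parse it as day-month-year
    release_date = dmy(paste0(Opening.1, "-", Opening, "-", 2019)),
    # Weekday of the release as a labelled factor
    day = wday(release_date, label = TRUE)
  )

print(toy)
```

Month names and weekday labels follow the session locale, so this sketch assumes an English locale.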
Viewing scraped data
1. Which actor made the most movies?
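One possible way to count films per actor is sketched below. The cast strings are hypothetical, and I am assuming each film's cast is a comma-separated string; the Cast column scraped for this post is reshaped with gsub(), so the real delimiter may well differ.

```r
# Hypothetical cast strings, one per film.
# Assumption: names within a film are separated by commas.
cast <- c(
  "Actor A, Actor B",
  "Actor B, Actor C",
  "Actor B"
)

# Split each film's cast into individual names and trim surrounding whitespace
actors <- trimws(unlist(strsplit(cast, ",")))

# Count films per actor and sort from most to fewest
actor_counts <- sort(table(actors), decreasing = TRUE)

print(head(actor_counts, 10))
```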
2. Which genres work in Bollywood?
3. Which production houses make the most movies?
4. How many movies get released on “Non-Friday” days?
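A rough sketch of how this count could be done is below. The release dates are hypothetical stand-ins for tbls$release_date, and the column names non_friday and by_day are my own, not from the original analysis.

```r
library(dplyr)
library(lubridate)

# Hypothetical release dates; the real values would come from tbls$release_date
releases <- data.frame(
  release_date = as.Date(c("2019-01-11", "2019-01-18", "2019-03-21", "2019-06-05"))
)

releases <- releases %>%
  mutate(
    # Weekday as a labelled factor, as in the scraping step
    day = wday(release_date, label = TRUE),
    # With week_start = 7 the week starts on Sunday (= 1), so Friday is 6
    non_friday = wday(release_date, week_start = 7) != 6
  )

# Number of releases per weekday
by_day <- count(releases, day)

# Total number of releases that did not fall on a Friday
n_non_friday <- sum(releases$non_friday)

print(by_day)
print(n_non_friday)
```

Comparing on the numeric weekday rather than the label keeps the Friday check independent of the session locale.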
5. Are English titles the new normal in Bollywood?