TOC
Excel in R
Before getting started
Inspiration for this Post :
This post is basically for people working in Excel & looking to move to R. While importing Tidy Data in R is straight-forward, I have seen even seasoned users struggling to import Messy data from Excel to R. This creates sort of a barrier or a two step process, first clean data in Excel then import in R or use Excel altogether, both of which is not a great Idea.
This Post is basically my walk-through through this awesome book by Duncan Garmonsway. If you want to go in details of any step, you can go through the book itself.
The sample excel workbook which will be worked throughout this post can be downloaded from here
After the end of this intro post, we should be comfortable with the below
1. Importing any excel data directly into R?
2. Working with merged cells, highlighted cells, tables, pivots etc & importing them as required in R?
3. Utilise imported data from Excel to generate insights either by plotting or other advanced functions
Lets get started
1. Tidy Data
So, we can read a normal excel file in R using below & most of time this should be enough, however most is not good enough when you are looking to build an expertise, hence using knowledge from Mr.Garmonsway book we would be taking more power in our hand & delve into advance ways to import/handle excel data in R
#Loading all libraries needed first
library(tibble)
library(tidyr)
library(dplyr)
library(purrr)
library(stringr)
library(readr)
library(readxl)
library(tidyxl)
library(unpivotr)
library(smungs) # GitHub only https://github.com/nacnudus/smungs
theme_set(theme_minimal())
- Normal way to import data in R
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
library(readxl)
read_excel(path, sheet = "clean")
# A tibble: 3 x 2
Name Age
<chr> <dbl>
1 Matilda 1
2 Nicholas 3
3 Olivia 5
- Utilising unpivotR & tidyxl to import data in R, as you can see its more complicated for the same task, however soon we will realise its importance when data gets a little messy in nature
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
xlsx_cells(path, sheet = "clean") %>%
behead("N", header) %>%
select(row, data_type, header, character, numeric) %>%
spatter(header) %>%
select(-row)
# A tibble: 3 x 2
Age Name
<dbl> <chr>
1 1 Matilda
2 3 Nicholas
3 5 Olivia
2. Not-so tidy Data
When the same data is in transposed state, it gets tricky, even after using t() transpose function the output obtained is some sort of matrix which is tricky to decode & could take another 5-10 mins to get to right order
- Normal way using transpose post importing
base::t(read_excel(path, sheet = "transposed"))
[,1]
Name "Age"
Matilda "1"
Nicholas "3"
Olivia "5"
- Importing the transpose is the same is importing the usual layout, merely using the “W” (west) direction instead of “N” (north) when beheading the headers.
xlsx_cells(path, sheet = "transposed") %>%
behead("W", header) %>%
select(col, data_type, header, character, numeric) %>%
spatter(header) %>%
select(Name, Age)
# A tibble: 3 x 2
Name Age
<chr> <dbl>
1 Matilda 1
2 Nicholas 3
3 Olivia 5
3. Excel Data which does not begin at cell A1
- This seems neat, skipping 2 rows & dropping NA results to get our final data
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
readxl::read_excel(path,
sheet = "notes",
skip = 2,
n_max = 4,
col_types = c("guess", "guess", "skip")) %>%
drop_na()
# A tibble: 2 x 2
Name Age
<chr> <dbl>
1 Matilda 1
2 Nicholas 3
- Replicating same bit using unpivotR is a bit tricky though more empowering, I agree that over here the better way would be to use read_excel. However, an important part over here is to understand rectify() function which can create a blueprint of your excel sheet & then depending on your data-wrangling skills, one can work your way out
cells <- xlsx_cells(path, sheet = "notes")
rectify(cells)
# A tibble: 7 x 5
`row/col` `1(A)` `2(B)` `3(C)` `4(D)`
<int> <chr> <chr> <chr> <chr>
1 1 Title text <NA> <NA> <NA>
2 2 <NA> <NA> <NA> <NA>
3 3 <NA> Name Age <NA>
4 4 <NA> Matilda 1 <NA>
5 5 <NA> Nicholas 3 <NA>
6 6 <NA> <NA> <NA> <NA>
7 7 <NA> <NA> <NA> Footnote
rectify(cells) %>%
select(`2(B)`,`3(C)`)
# A tibble: 7 x 2
`2(B)` `3(C)`
<chr> <chr>
1 <NA> <NA>
2 <NA> <NA>
3 Name Age
4 Matilda 1
5 Nicholas 3
6 <NA> <NA>
7 <NA> <NA>
rectify(cells) %>%
select(`2(B)`,`3(C)`) %>%
drop_na()
# A tibble: 3 x 2
`2(B)` `3(C)`
<chr> <chr>
1 Name Age
2 Matilda 1
3 Nicholas 3
temp <- rectify(cells) %>%
select(`2(B)`,`3(C)`) %>%
drop_na()
temp
# A tibble: 3 x 2
`2(B)` `3(C)`
<chr> <chr>
1 Name Age
2 Matilda 1
3 Nicholas 3
names(temp) <- temp %>% slice(1) %>% unlist()
temp <- temp %>% slice(-1)
temp
# A tibble: 2 x 2
Name Age
<chr> <chr>
1 Matilda 1
2 Nicholas 3
4. There’s a whole section on importing formatted data from excel in R, which I would not be covering in this post since I feel it is highly need based, I would rather import data & do all the processing in R itself
5. Accessing the formatting of cells in Excel
This is not possible via vanilla read_excel() in R
The possiblity of using this could be endless once we have data in place. I personally this is one of the biggest USP tidyxl(xlsx_cells) as it could read meta-data in a way
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
cells <-
xlsx_cells(path, sheet = "formatting") %>%
select(row, col, character, style_format, local_format_id)
cells
# A tibble: 14 x 5
row col character style_format local_format_id
<int> <int> <chr> <chr> <int>
1 1 1 bold Normal 6
2 2 1 italic Normal 8
3 3 1 underline Normal 51
4 4 1 strikethrough Normal 52
5 5 1 red text Normal 12
6 6 1 font size 14 Normal 53
7 7 1 font arial Normal 54
8 8 1 yellow fill Normal 11
9 9 1 black border Normal 43
10 10 1 thick border Normal 55
11 11 1 dashed border Normal 56
12 12 1 row height 30 Normal 1
13 13 2 column width 16.76 Normal 1
14 14 1 Bad' style Explanatory Text 57
If you see formatting sheet from the workbook, you would realise that all the formatting from excel is imported as it is into the dataframe & now we can utilise it in any way
6. Reading data with date headers from Excel to R
- This happens to columns with date headers when imported via vanilla read_excel. The date part gets converted to 43101 etc. integers. Not great !
read_excel(path, sheet = "non-text headers")
# A tibble: 2 x 3
Name `43101` `42736`
<chr> <dbl> <dbl>
1 Matilda 2 4
2 Nicholas 1 3
- This is the way to tidy it using unpivotr libraries. Data gets imported as intended. That’s great!
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
xlsx_cells(path, sheet = "non-text headers") %>%
behead("W", name) %>%
behead("N", `academic-year`) %>%
# mutate(`academic-year` = strftime(`academic-year`, "%Y")) %>%
select(row, data_type, `academic-year`, name, numeric) %>%
spatter(`academic-year`) %>%
select(-row)
# A tibble: 2 x 3
name `2017-01-01` `2018-01-01`
<chr> <dbl> <dbl>
1 Matilda 4 2
2 Nicholas 3 1
For this post, I think that’s all Folks. If you want to know more then do check out the amazing book (link above)
comments powered by Disqus