Excel in R

This post goes through importing Messy data from Excel Spreadsheets to R

Posted by Vaibhav Singh on Friday, December 20, 2019

TOC

Excel in R

Before getting started

Inspiration for this Post :

This post is basically for people working in Excel & looking to move to R. While importing Tidy Data in R is straight-forward, I have seen even seasoned users struggling to import Messy data from Excel to R. This creates sort of a barrier or a two step process, first clean data in Excel then import in R or use Excel altogether, both of which is not a great Idea.

This Post is basically my walk-through through this awesome book by Duncan Garmonsway. If you want to go in details of any step, you can go through the book itself.

The sample excel workbook which will be worked throughout this post can be downloaded from here

After the end of this intro post, we should be comfortable with the below

1. Importing any excel data directly into R?
2. Working with merged cells, highlighted cells, tables, pivots etc & importing them as required in R?
3. Utilise imported data from Excel to generate insights either by plotting or other advanced functions

Lets get started

1. Tidy Data

So, we can read a normal excel file in R using below & most of time this should be enough, however most is not good enough when you are looking to build an expertise, hence using knowledge from Mr.Garmonsway book we would be taking more power in our hand & delve into advance ways to import/handle excel data in R

#Loading all libraries needed first
library(tibble)
library(tidyr)
library(dplyr)
library(purrr)
library(stringr)
library(readr)
library(readxl)
library(tidyxl)
library(unpivotr)
library(smungs) # GitHub only https://github.com/nacnudus/smungs

theme_set(theme_minimal())
  1. Normal way to import data in R
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")

library(readxl)
read_excel(path, sheet = "clean")
# A tibble: 3 x 2
  Name       Age
  <chr>    <dbl>
1 Matilda      1
2 Nicholas     3
3 Olivia       5
  1. Utilising unpivotR & tidyxl to import data in R, as you can see its more complicated for the same task, however soon we will realise its importance when data gets a little messy in nature
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")

xlsx_cells(path, sheet = "clean") %>%
  behead("N", header) %>%
  select(row, data_type, header, character, numeric) %>%
  spatter(header) %>%
  select(-row)
# A tibble: 3 x 2
    Age Name    
  <dbl> <chr>   
1     1 Matilda 
2     3 Nicholas
3     5 Olivia  

2. Not-so tidy Data

When the same data is in transposed state, it gets tricky, even after using t() transpose function the output obtained is some sort of matrix which is tricky to decode & could take another 5-10 mins to get to right order

  1. Normal way using transpose post importing
base::t(read_excel(path, sheet = "transposed"))
         [,1] 
Name     "Age"
Matilda  "1"  
Nicholas "3"  
Olivia   "5"  
  1. Importing the transpose is the same is importing the usual layout, merely using the “W” (west) direction instead of “N” (north) when beheading the headers.
xlsx_cells(path, sheet = "transposed") %>%
  behead("W", header) %>%
  select(col, data_type, header, character, numeric) %>%
  spatter(header) %>%
  select(Name, Age)
# A tibble: 3 x 2
  Name       Age
  <chr>    <dbl>
1 Matilda      1
2 Nicholas     3
3 Olivia       5

3. Excel Data which does not begin at cell A1

  1. This seems neat, skipping 2 rows & dropping NA results to get our final data
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")

readxl::read_excel(path,
                   sheet = "notes",
                   skip = 2,
                   n_max = 4,
                   col_types = c("guess", "guess", "skip")) %>%
  drop_na()
# A tibble: 2 x 2
  Name       Age
  <chr>    <dbl>
1 Matilda      1
2 Nicholas     3
  1. Replicating same bit using unpivotR is a bit tricky though more empowering, I agree that over here the better way would be to use read_excel. However, an important part over here is to understand rectify() function which can create a blueprint of your excel sheet & then depending on your data-wrangling skills, one can work your way out
cells <- xlsx_cells(path, sheet = "notes")

rectify(cells)
# A tibble: 7 x 5
  `row/col` `1(A)`     `2(B)`   `3(C)` `4(D)`  
      <int> <chr>      <chr>    <chr>  <chr>   
1         1 Title text <NA>     <NA>   <NA>    
2         2 <NA>       <NA>     <NA>   <NA>    
3         3 <NA>       Name     Age    <NA>    
4         4 <NA>       Matilda  1      <NA>    
5         5 <NA>       Nicholas 3      <NA>    
6         6 <NA>       <NA>     <NA>   <NA>    
7         7 <NA>       <NA>     <NA>   Footnote
rectify(cells) %>% 
  select(`2(B)`,`3(C)`) 
# A tibble: 7 x 2
  `2(B)`   `3(C)`
  <chr>    <chr> 
1 <NA>     <NA>  
2 <NA>     <NA>  
3 Name     Age   
4 Matilda  1     
5 Nicholas 3     
6 <NA>     <NA>  
7 <NA>     <NA>  
rectify(cells) %>% 
  select(`2(B)`,`3(C)`) %>% 
  drop_na()
# A tibble: 3 x 2
  `2(B)`   `3(C)`
  <chr>    <chr> 
1 Name     Age   
2 Matilda  1     
3 Nicholas 3     
temp <- rectify(cells) %>% 
  select(`2(B)`,`3(C)`) %>% 
  drop_na()  

temp
# A tibble: 3 x 2
  `2(B)`   `3(C)`
  <chr>    <chr> 
1 Name     Age   
2 Matilda  1     
3 Nicholas 3     
names(temp) <- temp %>% slice(1) %>% unlist()
temp <- temp %>% slice(-1)

temp
# A tibble: 2 x 2
  Name     Age  
  <chr>    <chr>
1 Matilda  1    
2 Nicholas 3    

4. There’s a whole section on importing formatted data from excel in R, which I would not be covering in this post since I feel it is highly need based, I would rather import data & do all the processing in R itself

5. Accessing the formatting of cells in Excel

  1. This is not possible via vanilla read_excel() in R

  2. The possiblity of using this could be endless once we have data in place. I personally this is one of the biggest USP tidyxl(xlsx_cells) as it could read meta-data in a way

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")

cells <-
  xlsx_cells(path, sheet = "formatting") %>%
  select(row, col, character, style_format, local_format_id)

cells
# A tibble: 14 x 5
     row   col character          style_format     local_format_id
   <int> <int> <chr>              <chr>                      <int>
 1     1     1 bold               Normal                         6
 2     2     1 italic             Normal                         8
 3     3     1 underline          Normal                        51
 4     4     1 strikethrough      Normal                        52
 5     5     1 red text           Normal                        12
 6     6     1 font size 14       Normal                        53
 7     7     1 font arial         Normal                        54
 8     8     1 yellow fill        Normal                        11
 9     9     1 black border       Normal                        43
10    10     1 thick border       Normal                        55
11    11     1 dashed border      Normal                        56
12    12     1 row height 30      Normal                         1
13    13     2 column width 16.76 Normal                         1
14    14     1 Bad' style         Explanatory Text              57

If you see formatting sheet from the workbook, you would realise that all the formatting from excel is imported as it is into the dataframe & now we can utilise it in any way

6. Reading data with date headers from Excel to R

  1. This happens to columns with date headers when imported via vanilla read_excel. The date part gets converted to 43101 etc. integers. Not great !
read_excel(path, sheet = "non-text headers")
# A tibble: 2 x 3
  Name     `43101` `42736`
  <chr>      <dbl>   <dbl>
1 Matilda        2       4
2 Nicholas       1       3
  1. This is the way to tidy it using unpivotr libraries. Data gets imported as intended. That’s great!
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")

xlsx_cells(path, sheet = "non-text headers") %>%
  behead("W", name) %>%
  behead("N", `academic-year`) %>%
#  mutate(`academic-year` = strftime(`academic-year`, "%Y")) %>%
  select(row, data_type, `academic-year`, name, numeric) %>%
  spatter(`academic-year`) %>%
  select(-row)
# A tibble: 2 x 3
  name     `2017-01-01` `2018-01-01`
  <chr>           <dbl>        <dbl>
1 Matilda             4            2
2 Nicholas            3            1

For this post, I think that’s all Folks. If you want to know more then do check out the amazing book (link above)


comments powered by Disqus