What is Data Blog ?

This post is for beginners - If this is your first time - Come right in !

Posted by Vaibhav Singh on Sunday, November 15, 2020

TOC

Introduction

This is a sample data blog, where in I will be working with one of the most famous & processed dataset in Data analytics & ML community mtcars. The comments & workings would be around the dataset, hence it is advisable to go through the dataset once.

mtcars dataset –> 32 rows & 11 columns, column names as below:

[, 1]   mpg     Miles/(US) gallon
[, 2]   cyl     Number of cylinders
[, 3]   disp    Displacement (cu.in.)
[, 4]   hp      Gross horsepower
[, 5]   drat    Rear axle ratio
[, 6]   wt      Weight (lb/1000)
[, 7]   qsec    1/4 mile time
[, 8]   vs      Engine (0 = V-shaped, 1 = straight)
[, 9]   am      Transmission (0 = automatic, 1 = manual)
[,10]   gear    Number of forward gears
[,11]   carb    Number of carburetors

mtcars dataset
Download data set here
Read about the variables here

Break the Syntax

This is an important section, if you want to just skim through what I have done & not getting details of things.

Anything written in red is the question which I will answer during the post

#Anything written under this grey box is R Code & its output 

Anything written like this is the comment on output produced

Anything written like this is hyperlinked for further readings

Anything written like this is to just get your attention

Anything left are the Images or output produced (I am sure you can identify them :) For example refer plot 1

A fancy pie chart.

Figure 1: A fancy pie chart.

Lets get started

Data Exploration

The initial bit where we are getting familiar with dataset is called Data Exploration phase. In this stage, we check the dimensions (rows,columns) of the dataset, view first few rows to get an idea of how data is structured, check summary alongside plethora of other steps. Let’s do some exploration on mtcars dataset.

  1. Checking the dimensions then Viewing the first 10 rows of the data
library(tidyverse)
library(broom)
library(knitr)
library(printr)

dim(mtcars)
[1] 32 11
kable(head(mtcars),align = 'c')
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

We observe that the dataset contains 32 rows & 11 columns. The column names are shortened due to lengthy names but a fair idea can be derived from that they could mean. (To read more about description of variable click link mentioned above)

For ex:-
mpg –> Miles/(US) gallon
disp –> Displacement (cu.in.)

  1. Summary & structure of the dataset
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Min. :10.4000 Min. :4.0000 Min. : 71.100 Min. : 52.000 Min. :2.76000 Min. :1.51300 Min. :14.5000 Min. :0.0000 Min. :0.00000 Min. :3.0000 Min. :1.0000
1st Qu.:15.4250 1st Qu.:4.0000 1st Qu.:120.825 1st Qu.: 96.500 1st Qu.:3.08000 1st Qu.:2.58125 1st Qu.:16.8925 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:3.0000 1st Qu.:2.0000
Median :19.2000 Median :6.0000 Median :196.300 Median :123.000 Median :3.69500 Median :3.32500 Median :17.7100 Median :0.0000 Median :0.00000 Median :4.0000 Median :2.0000
Mean :20.0906 Mean :6.1875 Mean :230.722 Mean :146.688 Mean :3.59656 Mean :3.21725 Mean :17.8487 Mean :0.4375 Mean :0.40625 Mean :3.6875 Mean :2.8125
3rd Qu.:22.8000 3rd Qu.:8.0000 3rd Qu.:326.000 3rd Qu.:180.000 3rd Qu.:3.92000 3rd Qu.:3.61000 3rd Qu.:18.9000 3rd Qu.:1.0000 3rd Qu.:1.00000 3rd Qu.:4.0000 3rd Qu.:4.0000
Max. :33.9000 Max. :8.0000 Max. :472.000 Max. :335.000 Max. :4.93000 Max. :5.42400 Max. :22.9000 Max. :1.0000 Max. :1.00000 Max. :5.0000 Max. :8.0000

Here we explore two other things related to dataset

  1. The structure of the mtcars data, which we can are all numeric column. To read up more on data structures, check this link

  2. The summary of the dataset, since all columns are numeric we get to see an overview of broad stastical parameters like mean, median, mode etc.

Data Visualization

Let’s say we want to analyse what is the relationship between mpg(miles per gallon) & hp (horse power) in different cyl (cylinder types)

mtcars %>% 
  ggplot(aes(x=mpg, y=hp, color=as.factor(cyl)))+
  geom_point()+
  labs(title="Relationship between mpg & hp",
       subtitle = "We can observe that as number of cylinders increases so does mpg but horse power decreases",
       x="Miles Per Gallon",
       y="Gross Horse Power ",
       color="Number of cylinders")

Linear Regression

This would be the last section for this intro post. Lets quickly see that if we have to predict mpg using all other variables which variable would be most important feature in our prediction. While there are many alogrithms to do this, we would be using linear regression in this post.

head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
linear <- lm(mpg~. , data=mtcars)

tidy(summary(linear))
term estimate std.error statistic p.value
(Intercept) 12.303374156 18.717884429 0.657305808 0.518124397
cyl -0.111440478 1.045023363 -0.106639222 0.916087376
disp 0.013335240 0.017857500 0.746758487 0.463488650
hp -0.021482119 0.021768579 -0.986840654 0.334955314
drat 0.787110972 1.635373069 0.481303616 0.635277898
wt -3.715303928 1.894414300 -1.961188706 0.063252151
qsec 0.821040750 0.730844796 1.123413280 0.273941270
vs 0.317762814 2.104508606 0.150991454 0.881423472
am 2.520226887 2.056650553 1.225403549 0.233989711
gear 0.655413017 1.493259961 0.438914211 0.665206434
carb -0.199419255 0.828752498 -0.240625827 0.812178713

Above, we can see that wt is the most important predictor as its p-value is the lowest. (Well, its not so simple, but we'll call off at this point this as of now)

That’s all Folks for the intro post. This was just an intro post to get first timers familiar with the syntax & flow of the data blogs. Incase, you would love to explore more detailed & advanced posts, head over to other post from home page


comments powered by Disqus