class: center, middle, inverse, title-slide

# Data Science Story Telling with R
## klikR
### Tatjana Kecojevic
### 24 Nov 2018

---
background-image: url(https://upload.wikimedia.org/wikipedia/commons/c/c1/Rlogo.png)

???

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Rlogo.png)

---
class: inverse, center, middle

#R Workshop: part I

## Hi, Zdravo, Ciao!

Welcome to the Data Science Story Telling with R! Let me introduce you to our team:

- Hi, I'm Tanja! A Data Scientist at [DataTeka](https://datateka.com/).
- Hi, I'm Zeljko! I work at [InfostudHub](https://www.infostudhub.rs).

---
## How's the day planned

- There will be a little bit of instruction, and a few exercises, then some more instruction, and some more exercises, some reading, some more exercises, ...
- The goal for the day is to work with your team and mentor to make a web app for looking at data.
- We are going to learn about the software R, and the language of data analysis. There are a lot of things to learn. It's ok if you can't remember it all. The most important thing is to have fun and play, break things and fix them, try out new stuff!
- We will have breaks whenever you feel you want them - there are snacks, drinks and pizzas.

### CODE of CONDUCT:

1. Be positive
2. Be inclusive
3. Ask for help, and give some help

---
## MATERIALS for WORKSHOP:

To download the workshop material please go to: <https://github.com/DataTeka/klikrws>

<img src="images/GitHub.png" width="750px" style="display: block; margin: auto;" />

---
class: center, middle

# How do we do it?
###Steps of a typical data science project:

<img src="images/Program_HW.png" width="500px" />

---
class: inverse, center, middle

#Get Started

<img src="images/George_Desk.gif" width="600px" />

---
## Write R Code

To start using **R** you need to:

1) Install [R](https://cran.r-project.org/) [(and RStudio)](https://www.rstudio.com/products/rstudio/download/#download)

2) Launch it and set your working directory, which tells R where to find all of your files.

- **On a Mac**, it would look like this: `setwd("~/Documents/DS_Story")`
- **On a PC**, it might look like this: `setwd("C:/Documents/DS_Story")`

3) Start writing **R** code!

**Tip**:

- When you start working on new R code or a new R project in the [RStudio IDE](https://support.rstudio.com/hc/en-us/sections/200107586-Using-the-RStudio-IDE), use ***File -> New Project***. This sets up your working directory when you create the project and saves all your files in it. The next time you open the project, its directory is set as the working directory. Projects help you with so much [more](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects).

---
class: center, middle

##[RStudio IDE Cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf)

<img src="images/RStudio.png" width="500px" />

***Top Left:*** Code Editor;

--

***Bottom Left:*** R Console;

--

***Top Right:*** Environment

--

***Bottom Right:*** Plots and Files

---
#Dataset

Today we will examine [Olympic games data](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) covering the Games from Athens 1896 to Rio 2016. The file `athlete_events.csv` contains `\(271,116\)` rows and `\(15\)` columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events).

<img src="images/olympicCSV.png" width="800px" />

**Note**: there are 15 columns, each of which we call a **variable**.
--- class: inverse, center, middle # Let's get introduced to some basic statistical concepts π§ --- ##What will I learn in Part I? During Part I of the workshop you will be introduced to some basic R syntax and a set of methods that enable data to be explored using `R` with the **objective** - of summarising and understanding the main features of the variables contained within the data and - to investigate the nature of any linkages between the variables that may exist. The starting point is to understand **what data is**. - What is the **population**? - Why do we use **samples**? So, from where do I start? - **Do I understand the problem** under investigation and are the objectives of the investigation clear? *The only way to obtain this information is to ask questions, and keep asking questions until satisfactory answers have been obtained.* - Do I understand exactly **what each variable is measuring/recording?** --- #Describing Variables A starting point is to examine the characteristics of each individual variable in the data set. The way to proceed depends upon the type of variable being examined. **Classification of variable types** The variables can be one of two broad types: - Attribute variables - Measured variables .pull-left[ **attribute** gender days in a week ] .pull-right[ **measured** age weight ] --- ##The Concept of Statistical Distribution **The concept of the statistical distribution is central to statistical analysis.** This concept relates to the population and conceptually assumes that we have perfect information, the exact composition of the population is known. 
.pull-left[
**attribute:**

![](DSSR_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

.pull-right[
**measured:**

![](DSSR_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

---
class: center, middle

##Summary Statistics

![](DSSR_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---
##Investigating the relationship between variables

One of the key steps required of the Data Analyst is to investigate the relationship between variables. This requires a further **classification of the variables** contained within the data, as either a <span style="color:darkred">**response**</span> variable or an <span style="color:darkred">**explanatory**</span> variable.

A **response** variable is a variable that measures either directly or indirectly the objectives of the analysis.

An **explanatory** variable is a variable that may influence the response variable.

---
class: center, middle

##Bivariate Relationships

<img src="images/RelationshipMatrix.png" width="500px" />

---
class: center, middle

##DA Methodology

<img src="images/DaMethodology.png" width="600px" />

Note that the 'Further Data Analysis' stage may or may not be required, depending on the outcome of the 'Initial Data Analysis' at stage 1.

---
class: center, middle

##Measured Vs Attribute(2-levels)

<img src="images/MvAMethodology.png" width="700px" />

---
class: center, middle

##Measured Vs Measured

<img src="images/MvMMethodology.png" width="700px" />

---
##Further Data Analysis

If the <span style="color:darkblue">'**Initial Data Analysis**'</span> is <span style="color:blue">*inconclusive*</span> then <span style="color:darkblue">'**Further Data Analysis**'</span> is required.
The 'Further Data Analysis' is a procedure that enables a decision to be made, based on the sample evidence, as to one of two outcomes:

- There is no relationship
- There is a relationship

These statistical procedures are called <span style="color:darkred">**hypothesis tests**</span>, which essentially <span style="color:red">*provide a decision rule for choosing between one of the two outcomes*</span>: "There is no relationship" or "There is a relationship" based on the sample evidence.

All hypothesis tests are carried out in four stages:

- Stage 1: Specifying the hypotheses.
- Stage 2: Defining the test parameters and the decision rule.
- Stage 3: Examining the sample evidence.
- Stage 4: The conclusions.

---
class: inverse, center, middle

#How do we do it in R?: part II

##klikR

---
##Before Tidyverse R, there is Base R!

When you download and install **R** for the first time, you are installing **the Base R** software. **Base R** contains most of the functions you'll use on a daily basis: `mean()`, `subset()`...

To learn about **R**'s basic operations, data structures and base functions you could look at one of the R-Ladies Manchester's handouts: [Introduction to base R](https://tanjakec.github.io/blog/introduction-to-r/).

If you want to access data and code written by other people, you'll need to install it as a **package**. An **R package** is a bundle of functions (code), data, documentation and vignettes (examples), stored in one neat place.

"In **R**, the fundamental unit of shareable code is the package." [Hadley Wickham](http://r-pkgs.had.co.nz/intro.html)

---
##The verse!

An opinionated collection of **R packages** for data science.

[`install.packages("tidyverse")`](https://www.tidyverse.org/)

[`library(tidyverse)`](https://www.tidyverse.org/packages/)

- Have you tried learning data science by reading books?
  [**R for Data Science**](http://r4ds.had.co.nz/) by Garrett Grolemund & Hadley Wickham

- Have you tried learning data science by posting your questions and discussing them with other people within the R community?

  [**RStudio Community**](https://community.rstudio.com/)

---
##The <span style="color:blue">`dplyr`</span> Package

Provides a "<span style="color:red">grammar</span>" (the verbs) for data manipulation and for operating on data frames. The **key operator and the essential verbs** are:

- <span style="color:blue">`%>%`</span>: **the "pipe" operator** used to connect multiple verb actions together into a pipeline.
- <span style="color:blue">`select()`</span>: return a subset of the columns of a data frame.
- <span style="color:blue">`mutate()`</span>: add new variables/columns or transform existing variables.
- <span style="color:blue">`filter()`</span>: extract a subset of rows from a data frame based on logical conditions.
- <span style="color:blue">`arrange()`</span>: reorder rows of a data frame according to single or multiple variables.
- <span style="color:blue">`summarise()`</span> / <span style="color:blue">`summarize()`</span>: reduces each group to a single row by calculating aggregate measures.

---
##The Olympic Games Data

A historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016.
The main data frame `olympic` has **271,116 rows** and **15 variables**:

- **ID** - Unique number for each athlete
- **Name** - Athlete's name
- **Sex** - M or F
- **Age** - Integer
- **Height** - In centimeters
- **Weight** - In kilograms
- **Team** - Team name
- **NOC** - National Olympic Committee 3-letter code
- **Games** - Year and season
- **Year** - Integer
- **Season** - Summer or Winter
- **City** - Host city
- **Sport** - Sport
- **Event** - Event
- **Medal** - Gold, Silver, Bronze, or NA

---
##The Olympic Games Data

```r
# import csv data file into R
olympic <- read.csv("data/athlete_events.csv")
olympic[1:5,]
```

```
## ID Name Sex Age Height Weight Team NOC
## 1 1 A Dijiang M 24 180 80 China CHN
## 2 2 A Lamusi M 23 170 60 China CHN
## 3 3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN
## 4 4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN
## 5 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## Games Year Season City Sport
## 1 1992 Summer 1992 Summer Barcelona Basketball
## 2 2012 Summer 2012 Summer London Judo
## 3 1920 Summer 1920 Summer Antwerpen Football
## 4 1900 Summer 1900 Summer Paris Tug-Of-War
## 5 1988 Winter 1988 Winter Calgary Speed Skating
## Event Medal
## 1 Basketball Men's Basketball <NA>
## 2 Judo Men's Extra-Lightweight <NA>
## 3 Football Men's Football <NA>
## 4 Tug-Of-War Men's Tug-Of-War Gold
## 5 Speed Skating Women's 500 metres <NA>
```

**Note**: we are showing only the first 5 rows, and there are 15 columns!

---
##Setting up Working Environment

Install the packages you will be working with!

```r
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("DT", repos = "http://cran.us.r-project.org")
```

And now we're ready to start practicing Elain's Dance!!!
ππ΅πΆ <img src="images/ElainDanceI.png" width="300px" style="display: block; margin: auto;" /> --- ## First look at the data: <span style="color:blue">`dim()`</span> & <span style="color:blue">`head()`</span> ```r dim(olympic) ``` ``` ## [1] 271116 15 ``` ```r head(olympic, n = 3) ``` ``` ## ID Name Sex Age Height Weight Team NOC Games ## 1 1 A Dijiang M 24 180 80 China CHN 1992 Summer ## 2 2 A Lamusi M 23 170 60 China CHN 2012 Summer ## 3 3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN 1920 Summer ## Year Season City Sport Event Medal ## 1 1992 Summer Barcelona Basketball Basketball Men's Basketball <NA> ## 2 2012 Summer London Judo Judo Men's Extra-Lightweight <NA> ## 3 1920 Summer Antwerpen Football Football Men's Football <NA> ``` This is hard to read...?! π --- ##Examine the structure of the data: <span style="color:blue">`str()`</span> ```r str(olympic) ``` ``` ## 'data.frame': 271116 obs. of 15 variables: ## $ ID : int 1 2 3 4 5 5 5 5 5 5 ... ## $ Name : Factor w/ 134732 levels " Gabrielle Marie \"Gabby\" Adcock (White-)",..: 8 9 44318 29412 21469 21469 21469 21469 21469 21469 ... ## $ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ... ## $ Age : int 24 23 24 34 21 21 25 25 27 27 ... ## $ Height: int 180 170 NA NA 185 185 185 185 185 185 ... ## $ Weight: num 80 60 NA NA 82 82 82 82 82 82 ... ## $ Team : Factor w/ 1184 levels "30. Februar",..: 199 199 273 278 705 705 705 705 705 705 ... ## $ NOC : Factor w/ 230 levels "AFG","AHO","ALB",..: 42 42 56 56 146 146 146 146 146 146 ... ## $ Games : Factor w/ 51 levels "1896 Summer",..: 38 49 7 2 37 37 39 39 40 40 ... ## $ Year : int 1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ... ## $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ... ## $ City : Factor w/ 42 levels "Albertville",..: 6 18 3 27 9 9 1 1 17 17 ... ## $ Sport : Factor w/ 66 levels "Aeronautics",..: 9 33 25 62 54 54 54 54 54 54 ... 
## $ Event : Factor w/ 765 levels "Aeronautics Mixed Aeronautics",..: 160 398 349 710 623 619 623 619 623 619 ... ## $ Medal : Factor w/ 3 levels "Bronze","Gold",..: NA NA NA 2 NA NA NA NA NA NA ... ``` The **output could look messy** and it might not fit the screen when dealing with a big data set that has lots of variables! π€ͺ --- ##Do it in a tidy way: <span style="color:blue">`glimpse()`</span> ```r suppressPackageStartupMessages(library(dplyr)) glimpse(olympic) ``` ``` ## Observations: 271,116 ## Variables: 15 ## $ ID <int> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7... ## $ Name <fct> A Dijiang, A Lamusi, Gunnar Nielsen Aaby, Edgar Lindena... ## $ Sex <fct> M, M, M, M, F, F, F, F, F, F, M, M, M, M, M, M, M, M, M... ## $ Age <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31,... ## $ Height <int> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 18... ## $ Weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75,... ## $ Team <fct> China, China, Denmark, Denmark/Sweden, Netherlands, Net... ## $ NOC <fct> CHN, CHN, DEN, DEN, NED, NED, NED, NED, NED, NED, USA, ... ## $ Games <fct> 1992 Summer, 2012 Summer, 1920 Summer, 1900 Summer, 198... ## $ Year <int> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1... ## $ Season <fct> Summer, Summer, Summer, Summer, Winter, Winter, Winter,... ## $ City <fct> Barcelona, London, Antwerpen, Paris, Calgary, Calgary, ... ## $ Sport <fct> Basketball, Judo, Football, Tug-Of-War, Speed Skating, ... ## $ Event <fct> Basketball Men's Basketball, Judo Men's Extra-Lightweig... ## $ Medal <fct> NA, NA, NA, Gold, NA, NA, NA, NA, NA, NA, NA, NA, NA, N... ``` Ahhh... this π better! 
---
##The pipeline operator: <span style="color:blue">`%>%`</span>

<pre>
**Left Hand Side (LHS)** <span style="color:blue">`%>%`</span> **Right Hand Side (RHS)**
</pre>

<pre>
<span style="color:blue">x %>% f(y)</span> <==> <span style="color:blue"> f(x, y)</span>
</pre>

The "pipe" passes the **result** of the **LHS** as the first argument of the **function** on the **RHS**.

<pre>
<span style="color:blue">3 %>% sum(4)</span> <==> <span style="color:blue"> sum(3, 4)</span>
</pre>

<span style="color:blue">`%>%`</span> is very practical for chaining together multiple <span style="color:blue">`dplyr`</span> functions in a sequence of operations.

---
##Pick variables by their names: <span style="color:blue">`select()`</span>

<img src="images/select().png" width="450px" />

- <span style="color:blue">`starts_with("X")`</span>: every name that starts with "X".
- <span style="color:blue">`ends_with("X")`</span>: every name that ends with "X".
- <span style="color:blue">`contains("X")`</span>: every name that contains "X".
- <span style="color:blue">`matches("X")`</span>: every name that matches "X", where "X" can be a regular expression.
- <span style="color:blue">`num_range("x", 1:5)`</span>: the variables named x1, x2, x3, x4, x5.
- <span style="color:blue">`one_of(x)`</span>: every name that appears in x, which should be a character vector.

---
##Select your variables

Use the `olympic` data frame to select the variables

1) whose names end with the letter `t`

2) whose names start with the letter `S`.

Try to do this selection using base R.

Check out all the [`select()`](https://dplyr.tidyverse.org/reference/select_helpers.html) options that are available.
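---
The helpers listed for `select()` can be tried out on a small data frame before using them on the Olympic data. The sketch below is only an illustration: the data frame `toy` and its column names are made up for this example and are not part of the workshop material.

```r
library(dplyr)

# Hypothetical toy data frame, invented for this illustration
toy <- data.frame(
  id = 1:2,
  x1 = c(0.1, 0.2),
  x2 = c(1.5, 2.5),
  score_math = c(90, 75),
  score_art = c(60, 88)
)

# contains(): every column whose name contains "score"
score_cols <- toy %>% select(contains("score"))

# num_range(): the columns x1 and x2
x_cols <- toy %>% select(num_range("x", 1:2))

# matches(): a regular expression, here names ending in "art" or "math"
re_cols <- toy %>% select(matches("(art|math)$"))

names(score_cols)
names(x_cols)
names(re_cols)
```

Columns come back in the order in which they appear in the data frame, whichever helper picked them.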
---
##Solutions:

```r
end_t <- select(olympic, ends_with("t"))
head(end_t, n = 1)
```

```
## Height Weight Sport Event
## 1 180 80 Basketball Basketball Men's Basketball
```

```r
beg_S <- select(olympic, starts_with("S"))
head(beg_S, n = 1)
```

```
## Sex Season Sport
## 1 M Summer Basketball
```

Of course, all of this could be done using **base R**, for example:

```r
beg_S_base <- olympic[c("Sex", "Season", "Sport")]
head(beg_S_base, n = 1)
```

```
## Sex Season Sport
## 1 M Summer Basketball
```

but it is less intuitive and often requires more typing.

---
##Create new variables from existing variables: <span style="color:blue">`mutate()`</span>

<img src="images/mutate().png" width="400px" />

`mutate()` allows you to add a new column to the data frame `df`, for example `z`, the product of the columns `x` and `y`: `mutate(df, z = x * y)`.

If we would like to observe the `BMI` of the athletes, we could create a new column `BMI`. The BMI is universally expressed in kg/m², calculated from mass in kilograms and height in metres.

**Note**: the variable **Height** is in centimetres!

```r
olympic <- mutate(olympic, BMI = Weight / (Height/100)^2)
head(olympic, n = 1)
```

```
## ID Name Sex Age Height Weight Team NOC Games Year Season
## 1 1 A Dijiang M 24 180 80 China CHN 1992 Summer 1992 Summer
## City Sport Event Medal BMI
## 1 Barcelona Basketball Basketball Men's Basketball <NA> 24.69136
```

Check [here](https://dplyr.tidyverse.org/reference/mutate.html) for more functionalities with mutate.
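---
To see what the BMI calculation does row by row, including when a measurement is missing, it may help to run it on a few made-up athletes first. This is a minimal sketch: the data frame `toy` and the helper column `Height_m` are assumptions introduced for the example.

```r
library(dplyr)

# Hypothetical athletes; NA marks a missing measurement
toy <- data.frame(
  Name = c("A", "B", "C"),
  Height = c(180, 170, NA),   # centimetres
  Weight = c(80, 60, 70),     # kilograms
  stringsAsFactors = FALSE
)

toy_bmi <- toy %>%
  mutate(Height_m = Height / 100,      # convert cm to m first
         BMI = Weight / Height_m^2)    # a new column can be used straight away

toy_bmi
```

The athlete without a recorded height gets `NA` for `BMI`: arithmetic on a missing value stays missing.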
---
##Pick observations by their values: <span style="color:blue">`filter()`</span>

<img src="images/filter().png" width="450px" />

There is a set of logical operators in **R** that you can use inside `filter()`:

- `x < y`: `TRUE` if `x` is less than `y`
- `x <= y`: `TRUE` if `x` is less than or equal to `y`
- `x == y`: `TRUE` if `x` equals `y`
- `x != y`: `TRUE` if `x` does not equal `y`
- `x >= y`: `TRUE` if `x` is greater than or equal to `y`
- `x > y`: `TRUE` if `x` is greater than `y`
- `x %in% c(a, b, c)`: `TRUE` if `x` is in the vector `c(a, b, c)`
- `is.na(x)`: `TRUE` if `x` is `NA`
- `!is.na(x)`: `TRUE` if `x` is not `NA`

---
##Filter your data:

Use the `olympic` data frame to filter:

1) only Serbian teams and save it as `olympicSR`

2) only Serbian teams from 2000 onward and save it as `olympicSR21c`

3) athletes whose weight is greater than 100 kg and whose height is over 2 m.

Don't forget to **use `==` instead of `=`**! And don't forget the quotes ** `""` **

---
##Solutions:

```r
olympicSR <- filter(olympic, Team == "Serbia")
dim(olympicSR)
```

```
## [1] 388 16
```

```r
olympicSR21c <- filter(olympicSR, Year >= 2000)
dim(olympicSR21c)
```

```
## [1] 386 16
```

```r
big_athlete <- filter(olympic, Weight > 100 & Height > 200)
dim(big_athlete)
```

```
## [1] 894 16
```

---
##Reorder the rows: <span style="color:blue">`arrange()`</span>

`arrange()` reorders the rows of a **d**ata **f**rame (df) according to one or more of its variables/columns.

<img src="images/arrange().png" width="300px" />

- If you pass `arrange()` a character variable, **R** will rearrange the rows in alphabetical order according to the values of the variable.
- If you pass a factor variable, **R** will rearrange the rows according to the order of the levels in your factor (running `levels()` on the variable reveals this order).

---
##Arranging your data

1) Arrange Serbian athletes in the `olympicSR21c` data frame by `Height` in ascending and descending order.

2) Using the `olympicSR` data frame

- Find the youngest athlete.
- Find the heaviest athlete.
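---
Before working on the exercises, the `filter()` operators and `arrange()` can be tried on a tiny example. The sketch below uses a made-up data frame `toy` (names, teams and ages are invented), so it does not give away the answers for the Olympic data.

```r
library(dplyr)

# Hypothetical data, invented for this illustration
toy <- data.frame(
  Name = c("Ana", "Boris", "Ceca", "Dragan"),
  Team = c("Serbia", "Croatia", "Serbia", "Slovenia"),
  Age = c(31, 24, NA, 27),
  stringsAsFactors = FALSE
)

# Keep rows from two teams whose Age is recorded, youngest first.
# Conditions separated by a comma are combined with "and".
known_age <- toy %>%
  filter(Team %in% c("Serbia", "Slovenia"), !is.na(Age)) %>%
  arrange(Age)

# The same rows, oldest first
oldest_first <- arrange(known_age, desc(Age))

known_age$Name
oldest_first$Name
```

Wrapping a variable in `desc()` is all it takes to reverse the order.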
--- ##Solution 1): ```r olympicSR21c_hs <- arrange(olympicSR21c, Height) head(olympicSR21c_hs, 2) ``` ``` ## ID Name Sex Age Height Weight Team NOC Games Year ## 1 81094 Olivera Moldovan F 23 158 62 Serbia SRB 2012 Summer 2012 ## 2 81094 Olivera Moldovan F 27 158 62 Serbia SRB 2016 Summer 2016 ## Season City Sport ## 1 Summer London Canoeing ## 2 Summer Rio de Janeiro Canoeing ## Event Medal BMI ## 1 Canoeing Women's Kayak Doubles, 500 metres <NA> 24.83576 ## 2 Canoeing Women's Kayak Singles, 200 metres <NA> 24.83576 ``` ```r *olympicSR21c_ht <- arrange(olympicSR21c, desc(Height)) head(olympicSR21c_ht, 2) ``` ``` ## ID Name Sex Age Height Weight Team NOC Games ## 1 98227 Miroslav Raduljica M 28 213 130 Serbia SRB 2016 Summer ## 2 115246 Vladimir timac M 28 211 112 Serbia SRB 2016 Summer ## Year Season City Sport Event Medal ## 1 2016 Summer Rio de Janeiro Basketball Basketball Men's Basketball Silver ## 2 2016 Summer Rio de Janeiro Basketball Basketball Men's Basketball Silver ## BMI ## 1 28.65393 ## 2 25.15667 ``` --- ##Solution 2): ```r head(arrange(olympicSR, Age), 5) ``` ``` ## ID Name Sex Age Height Weight Team NOC ## 1 23792 Anja Crevar F 16 164 49 Serbia SRB ## 2 23792 Anja Crevar F 16 164 49 Serbia SRB ## 3 89864 Milica Ostoji F 16 172 60 Serbia SRB ## 4 54201 Tatjana Jelaa (-Mirkovi ) F 17 178 85 Serbia SRB ## 5 80027 Duan Miloevi M 17 171 62 Serbia SRB ## Games Year Season City Sport ## 1 2016 Summer 2016 Summer Rio de Janeiro Swimming ## 2 2016 Summer 2016 Summer Rio de Janeiro Swimming ## 3 2008 Summer 2008 Summer Beijing Swimming ## 4 2008 Summer 2008 Summer Beijing Athletics ## 5 1912 Summer 1912 Summer Stockholm Athletics ## Event Medal BMI ## 1 Swimming Women's 200 metres Individual Medley <NA> 18.21832 ## 2 Swimming Women's 400 metres Individual Medley <NA> 18.21832 ## 3 Swimming Women's 200 metres Freestyle <NA> 20.28123 ## 4 Athletics Women's Javelin Throw <NA> 26.82742 ## 5 Athletics Men's 100 metres <NA> 21.20311 ``` --- ```r 
head(arrange(olympicSR, desc(Weight)), 5)
```

```
## ID Name Sex Age Height Weight Team NOC Games
## 1 62130 Asmir Kolainac M 23 187 140 Serbia SRB 2008 Summer
## 2 62130 Asmir Kolainac M 27 187 140 Serbia SRB 2012 Summer
## 3 62130 Asmir Kolainac M 31 187 140 Serbia SRB 2016 Summer
## 4 98227 Miroslav Raduljica M 28 213 130 Serbia SRB 2016 Summer
## 5 106231 Dejan Savi M 33 190 120 Serbia SRB 2008 Summer
## Year Season City Sport Event Medal
## 1 2008 Summer Beijing Athletics Athletics Men's Shot Put <NA>
## 2 2012 Summer London Athletics Athletics Men's Shot Put <NA>
## 3 2016 Summer Rio de Janeiro Athletics Athletics Men's Shot Put <NA>
## 4 2016 Summer Rio de Janeiro Basketball Basketball Men's Basketball Silver
## 5 2008 Summer Beijing Water Polo Water Polo Men's Water Polo Bronze
## BMI
## 1 40.03546
## 2 40.03546
## 3 40.03546
## 4 28.65393
## 5 33.24100
```

---
##Collapse many values down to a single summary: <span style="color:blue">`summarise()`</span>

<img src="images/summarise().png" width="450px" />

- uses the same syntax as `mutate()`, but the result is a single row of summary values, whereas `mutate()` adds an entire new column.
- builds a new dataset that contains only the summarising statistics.

Use `summarise()`:

1) to print out a summary of the `olympicSR` data frame containing two variables: max_Age and max_BMI.

2) to print out a summary of the `olympicSR` data frame containing two variables: mean_Age and mean_BMI.

Explore more about [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html).

---
##Solution: Summarise your data

```r
summarise(olympicSR, max_Age = max(Age), max_BMI = max(BMI))
```

```
## max_Age max_BMI
## 1 46 40.03546
```

```r
summarise(olympicSR, mean_Age = mean(Age), mean_BMI = mean(BMI))
```

```
## mean_Age mean_BMI
## 1 26.38918 23.34068
```

---
class: inverse, center, middle

## Let's `%>%` all up!

Confer with your team members. What relationship do you expect to see between:

`Age` and `Height` of the athletes?
`Age` and `BMI`?

---

<img src="images/pipe_short_cut.png" width="750px" style="display: block; margin: auto;" />

---
**Do you know what this code does?**

```r
olympicSR_pipe <- olympic %>%
  filter(Team == "Serbia" & Year > 2000) %>%
  mutate(BMI = Weight / (Height/100)^2)

plot(olympicSR_pipe$Age, olympicSR_pipe$Height, cex = 0.5, col = "red")
```

<img src="images/Cosmo.jpg" width="250px" style="display: block; margin: auto;" />

---

<img src="DSSR_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

##We have learnt all of Elain's moves!!!

<img src="images/ElainDanceII.png" width="300px" />

---
class: inverse, center, middle

## Can we make it look better?: ggplot; part III

##klikR

---
class: inverse, center, middle

#"The simple graph has brought more information to the data analyst's mind than any other device." John Tukey

---
## grammar of graphics

Enables you to specify the building blocks of a plot and to combine them to create the graphical display you want. There are 8 building blocks:

- data
- aesthetic mapping
- geometric object
- statistical transformations
- scales
- coordinate system
- position adjustments
- faceting

---
##<span style="color:blue">ggplot()</span>

1. "Initialise" a plot with `ggplot()`
2. Add layers with `geom_` functions

```r
library(ggplot2)
ggplot(olympicSR_pipe, aes(x = Age, y = Height)) +
  geom_point(col = "red")
```

<img src="DSSR_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />

**Tip**: You can use the following code template to make graphs with [`ggplot2`](https://ggplot2.tidyverse.org):

```r
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>()
```

---
#<span style="color:blue">ggplot()</span> gallery

Run the following code to see which graphs it produces.
```r
ggplot(data = olympic, mapping = aes(x = Height)) + geom_histogram(binwidth = 10)
#
ggplot(data = olympic, mapping = aes(x = Height)) + geom_density()
#
ggplot(data = olympic, mapping = aes(x = Season, color = Sex)) + geom_bar()
#
ggplot(data = olympic, mapping = aes(x = Sex, fill = Season)) + geom_bar()
```

You can see a nice list of all kinds of `ggplot`s at <http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html>

---
##Confer with your neighbours:

**Does the BMI of the athletes depend upon their Age?**

`$$\hat{y}=\hat{\beta}_0 + \hat{\beta}_1 x$$`

Run this code in your console to fit the model of `BMI` on `Age`. Pay attention to spelling, capitalization, and parentheses!

```r
m1 <- lm(olympic$BMI ~ olympic$Age)
summary(m1)
```

---
**Can you answer the question using the output of the fitted model?**

```r
m1 <- lm(olympic$BMI ~ olympic$Age)
summary(m1)
```

```
##
## Call:
## lm(formula = olympic$BMI ~ olympic$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.301 -1.790 -0.232 1.410 41.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.880164 0.029279 679.0 <2e-16 ***
## olympic$Age 0.115906 0.001142 101.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.842 on 206163 degrees of freedom
## (64951 observations deleted due to missingness)
## Multiple R-squared: 0.04762, Adjusted R-squared: 0.04762
## F-statistic: 1.031e+04 on 1 and 206163 DF, p-value: < 2.2e-16
```

---
## Your turn!

Use the `olympic` data. **Does the Weight depend upon Age?**

1) The data set is big, so let us use a sample of 10,000 athletes (tip: `sample_n(df, n)`)

2) Produce a scatter plot: what does it tell you?

3) Fit a regression model: is there a relationship? How strong is it? Is the relationship linear? What conclusion(s) can you draw?

4) What other questions could you ask, and could you provide the answers to them?
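---
A random sample is different every time the code runs, so results will not match between team members unless the random seed is fixed. The sketch below shows the idea on simulated data; the data frame `toy`, the seed value and the numbers used to simulate `Weight` are all assumptions made for this illustration, not the Olympic data.

```r
library(dplyr)

set.seed(2018)  # any fixed number makes the random sample repeatable

# Simulated stand-in for the Olympic data (hypothetical values)
toy <- data.frame(Age = rep(18:37, each = 50))
toy$Weight <- 55 + 0.5 * toy$Age + rnorm(nrow(toy), sd = 5)

# Draw 200 of the 1000 rows at random
sam_toy <- sample_n(toy, 200)

# The data argument keeps the formula short and the output readable
toy_model <- lm(Weight ~ Age, data = sam_toy)

coef(toy_model)                # intercept and slope estimates
summary(toy_model)$r.squared   # share of the variation explained
```

Writing `lm(Weight ~ Age, data = sam_toy)` instead of `lm(sam_toy$Weight ~ sam_toy$Age)` fits the same model; it only changes how the variables are labelled in the output.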
---
## Possible Solution Q1 & Q2: sample and scatter plot

```r
sam_olymp <- sample_n(olympic, 10000)
ggplot(sam_olymp, aes(x = Age, y = Weight)) +
  geom_point(alpha = 0.2, shape = 21, fill = "blue", colour = "black", size = 5) +
  geom_smooth(method = "lm", se = FALSE, col = "maroon3") +
  geom_smooth(method = "loess", se = FALSE, col = "limegreen")
```

<img src="DSSR_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />

---
## Possible Solution Q3: simple regression model

```r
my.model <- lm(sam_olymp$Weight ~ sam_olymp$Age)
summary(my.model)
```

```
##
## Call:
## lm(formula = sam_olymp$Weight ~ sam_olymp$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.643 -9.922 -1.358 8.078 143.205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.70828 0.74080 76.55 <2e-16 ***
## sam_olymp$Age 0.56346 0.02883 19.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 7734 degrees of freedom
## (2264 observations deleted due to missingness)
## Multiple R-squared: 0.04706, Adjusted R-squared: 0.04693
## F-statistic: 381.9 on 1 and 7734 DF, p-value: < 2.2e-16
```

---
## Adding layers to your <span style="color:blue">`ggplot()`</span>

```r
ggplot(sam_olymp, aes(x = Age, y = Weight, col = "red")) +
  geom_point(alpha = 0.2, shape = 21, fill = "blue", colour = "black", size = 5) +
  geom_smooth(method = "lm", se = FALSE, col = "maroon3") +
  geom_smooth(method = "loess", se = FALSE, col = "limegreen") +
  labs(title = "Age vs Weight", x = "Age", y = "Weight") +
  theme(legend.position = "none",
        panel.border = element_rect(fill = NA, colour = "black", size = .75),
        plot.title = element_text(hjust = 0.5)) +
  # label positions are on the Age (x) and Weight (y) scales; adjust them to suit your sample
  geom_text(x = 60, y = 125, label = "regression line", col = "maroon3") +
  geom_text(x = 65, y = 75, label = "smooth line", col = "limegreen")
```

---
## Voila

<img src="DSSR_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />

---
## **There is a
challenge:**

- `dplyr`'s `group_by()` function enables you to group your data. It allows you to create a separate df that splits the original df by a variable.
- `datatable()` from the `DT` package displays an R data object as a table on an HTML page, where it can be filtered, sorted, etc.
- The `boxplot()` function produces boxplot(s) of the given (grouped) values.

Knowing about the `group_by()` and `DT::datatable()` functions, could we find the number of medals for each team?

```r
olympic %>%
  filter(!is.na(Medal)) %>%
  group_by(Team, Medal) %>%
  summarize(cases = n()) %>%
  DT::datatable()
```

Could you find the number of medals for each team at the last Games in Rio?

**Hint**: Games in Rio were in 2016!

---
## Possible Solution:
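One way to approach it, sketched here on a small made-up data frame so that the code runs on its own: add `Year == 2016` to the filter and keep the rest of the pipeline. The data frame `toy` and its values are hypothetical; only the column names `Team`, `Year` and `Medal` are taken from the `olympic` data.

```r
library(dplyr)

# Hypothetical stand-in with the same column names as `olympic`
toy <- data.frame(
  Team  = c("Serbia", "Serbia", "Croatia", "Croatia", "Serbia", "Croatia"),
  Year  = c(2016, 2016, 2016, 2012, 2016, 2016),
  Medal = c("Silver", "Silver", "Gold", "Gold", NA, NA),
  stringsAsFactors = FALSE
)

rio_medals <- toy %>%
  filter(!is.na(Medal), Year == 2016) %>%   # medal rows from Rio only
  group_by(Team, Medal) %>%
  summarise(cases = n()) %>%                # count the rows in each group
  ungroup()

rio_medals
```

With the workshop data loaded, replacing `toy` with `olympic` and ending the pipeline with `DT::datatable()` should give the interactive table. Keep in mind that each row is an athlete-event, so a medal in a team sport is counted once per athlete.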
---
**Exercise:**

Let us visualise the number of female and male athletes from the ex-YU countries available in the data set: "Bosnia and Herzegovina", "Croatia", "Serbia", "Serbia and Montenegro", "Montenegro", "Slovenia".

First we need to get the data we want to present on a graph.

```r
exyu <- olympic %>%
  filter(Team %in% c("Bosnia and Herzegovina", "Croatia", "Serbia", "Serbia and Montenegro", "Montenegro", "Slovenia")) %>%
  group_by(Team, Sex) %>%
*  summarize(total = n())
exyu
```

```
## # A tibble: 12 x 3
## # Groups: Team [?]
## Team Sex total
## <fct> <fct> <int>
## 1 Bosnia and Herzegovina F 39
## 2 Bosnia and Herzegovina M 95
## 3 Croatia F 236
## 4 Croatia M 640
## 5 Montenegro F 36
## 6 Montenegro M 58
## 7 Serbia F 139
## 8 Serbia M 249
## 9 Serbia and Montenegro F 58
## 10 Serbia and Montenegro M 263
## 11 Slovenia F 410
## 12 Slovenia M 697
```

---
**How do we plot this?**

```r
# we need a bar chart with each team on the x axis and the number of male and female athletes on the y axis
ggplot(data = exyu, aes(x = Team, y = total, fill = Sex)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  # to make it easier to read we flip the x & y coordinates
  coord_flip() +
  # add descriptions for the x and y axes, a title and a subtitle
  labs(x = "ex YU country", y = "No of athletes",
       title = "Comparisons of M and F representatives in exYU Teams",
       subtitle = "for klikR workshop",
       caption = "Data from: kaggle - 120 years of Olympic history") +
  # add a border to the graph
  theme(panel.border = element_rect(fill = NA, colour = "black", size = 1)) +
  # remove the grid lines
  theme(plot.title = element_text(size = 14, vjust = 2),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_blank())
```

---
**Our graph!**

<img src="DSSR_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

##Let's do Elain's Dance!!!
<img src="images/Elain_dance.gif" width="500px" />

---

## useful links:

cheatsheets:

- [data-wrangling-cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
- [ggplot2-cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)

websites:

- [tidyverse, visualization, and manipulation basics](https://www.rstudio.com/resources/webinars/tidyverse-visualization-and-manipulation-basics/)
- [ggplot part of tidyverse](http://ggplot2.tidyverse.org/index.html)
- [Introduction to R graphics with ggplot2](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html#introduction)

---

class: inverse, center, middle

#R Workshop: part IV

##klikR

---

#R Markdown

Enables you to:

- save and execute code and display its output
- create high quality reports that could include [LaTeX](https://www.latex-project.org/) equations

[R Markdown](https://rmarkdown.rstudio.com/) documents are fully reproducible and support many static and dynamic output formats, to name a few: PDF, HTML, MS Word, Beamer...

It is a variant of [Markdown](https://daringfireball.net/projects/markdown/) that has embedded **R code chunks** (denoted by three backticks), to be used with [knitr](https://yihui.name/knitr/) to make it easy to create reproducible web-based reports.
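
As a rough sketch of how these pieces fit together: an R Markdown file is plain text with a YAML header, Markdown prose and R chunks between backtick fences, and it can be rendered to a report from R. The file name and contents below are made up for illustration, and the rendering step is skipped unless the `rmarkdown` package and pandoc are already available on your machine.

```r
# Hypothetical file name and contents, made up for this sketch.
rmd_file <- file.path(tempdir(), "sketch.Rmd")
fence <- strrep("`", 3)  # three backticks open and close a code chunk

rmd_lines <- c(
  "---",
  "title: \"A tiny sketch\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text, followed by an R code chunk:",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)
writeLines(rmd_lines, rmd_file)

# Render only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)  # path to the generated HTML report
}
```

In RStudio the same rendering happens when you click the **Knit** button on an open `.Rmd` file.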
To use **R Markdown** you will need to install the package from [CRAN](https://cran.r-project.org/) and load it with:

```r
install.packages("rmarkdown", repos = "http://cran.us.r-project.org")
suppressPackageStartupMessages(library(rmarkdown))
```

---

class: middle

<img src="images/RMarkdown.png" width="1000px" />

You will definitely find the following useful:

- [The R Markdown Cheatsheet](https://ntaback.github.io/UofT_STA130/rmarkdown-2.0.pdf)
- [The R Markdown Reference Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf)

---

#Starting with RMarkdown

**<span style="color:red">Task 1</span>:** Open the file `RMarkdown_Intro.Rmd`

- Change the title of the Markdown Document from `My First Markdown Document` to `RMarkdown Introduction`.
- Click the **"Knit"** button to see the compiled version of your sample code.

---

class: inverse, center, middle

##Congratulations!

You've just Knitted your `\(1^{st}\)` Rmd document!!!!

<img src="images/kramer_congrats.gif" width="300px" />

---

## Basic Text editing

**<span style="color:red">Task 2</span>:** Let's format this document further by

- Changing the author of the document to your own name.
- Rewriting the first sentence of the document to say "This is my first R Markdown document."
- Recompiling the document so you can see your changes.

---

##Adding a link

You can turn a word into a link by surrounding it in **square brackets: [ ]** and then placing the link immediately after it, with no space in between, in **parentheses: ( )**, like this: `[RStudio](https://www.rstudio.com)`

**<span style="color:red">Task 3</span>:** Make GitHub in the following paragraph link to https://github.com/DataTeka/DSStory

---

#Text formatting

To embed formatting instructions into your document using Markdown, you would surround text by:

- one asterisk to make it italic: *italic*;
- two asterisks to make it bold: **bold** and
- backticks to make it monospaced: `monospaced`.
To make an ordered list you need to place each item on a new line after a number followed by a period followed by a space:

1. ordered list
2. item 2

Note that you need to place a blank line between the list and any paragraphs that come before it.

---

##**<span style="color:red">Task 4</span>:**

- Make the following paragraph (line #20) in your Rmd document look like this:

The variables can be one of two broad types:

1) **Attribute variable**: has its outcomes described in terms of its characteristics or attributes;

2) **Measured variable**: has the resulting outcome expressed in numerical terms.

- Make the word Knit in the following paragraph bold.

---

#Embedding the `R` code

To embed an R code chunk you would use three backticks:

` ```{r} `

` chunk of code`

` ``` `

**<span style="color:red">Task 5</span>**: Replace the `cars` data set with the `olympic` data set (but don't forget to read the data!).

You can also embed plots, setting `echo = FALSE` in the code chunk to prevent printing of the R code that generates the plot:

` ```{r, echo=FALSE} `

` chunk of code`

` ``` `

**<span style="color:red">Task 6</span>**: Replace the base boxplot of mpg vs. cyl with one of the ggplots you created earlier (remember to load the necessary packages!).

---

##Adding **LaTeX** equations

Finally, if you wish to add mathematical equations to your Markdown document you can easily embed LaTeX math equations into your report.

To display an equation on its own line it needs to be surrounded by double dollar symbols: `$$` `y = a + bx` `$$`. To embed an equation in line within the text you would use single dollar symbols: `$y = a + bx$`.

**<span style="color:red">Task 7</span>**: Display the equation on its own line.

---

class: inverse, center, middle

#Congratulations!

You have got the basics to start creating your own fabulous dynamic documents...!!!!
<img src="images/giphy.gif" width="300px" />

**Useful Links**:

R Markdown: <http://www.stat.cmu.edu/~cshalizi/rmarkdown/>

RStudio R Markdown: <https://rmarkdown.rstudio.com>

[RStudio Cheatsheets](https://www.rstudio.com/resources/cheatsheets/)

<img src="images/SeinfeldDance.gif" width="300px" style="display: block; margin: auto;" />

---

class: center, middle

# Thanks!

[www.datateka.com](www.datateka.com)

[tanjakec.github.io](tanjakec.github.io)

@DataTeka

@Tatjana_Kec

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).

The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com).