We will create a data analysis report about air pollution in Belgrade with the explanation of the R code used. Data that will be used for the analysis comes from the stations, which are set up by an independent initiative Vazduh Gradjanima and is available from their GitHub repository. The report will be created as an R Markdown file that can be hosted on Netlify.
We will start our workshop by installing R and RStudio, by following the instructions available on this link: Install R i RStudia.
To learn how to use R and develop a report like this visit Data Challenge Platform with learning material developed in partnership with UNDP Serbia. In particular we will focus on matirijal Day 1 and Day 2.
After we get familiar with the basic R syntax and data wrangling using data available from the open data portal Republic of Serbia https://data.gov.rs/sr/, we can dig into the air pollution data set following the instructions given below.
We will start the analysis by uploading the necessary packages and data into R.
If you have not got the packages used in the following code, you will need to uncomment the first few lines (delete the # symbol) in the code below.
It is always a good idea to have a look at data you upload, before you start using it for your analysis.
## Rows: 29,426
## Columns: 11
## $ timestamp <chr> "2019-10-13T14:00:00Z", "2019-10-13T15:00:00Z", "2019-10-…
## $ device_id <chr> "qr0Kde0VpQOQLCQNrFyIZWfy", "qr0Kde0VpQOQLCQNrFyIZWfy", "…
## $ device_title <chr> "Klimerko Krunska", "Klimerko Krunska", "Klimerko Krunska…
## $ location <chr> "44.80868072883205,20.465567163036557", "44.8086807288320…
## $ pm10 <dbl> 20.5, 13.5, 22.0, 25.5, 34.0, 33.5, 31.5, 50.0, 51.5, 43.…
## $ humidity <dbl> 30.62500, 33.50000, 37.45000, 41.15000, 44.07500, 44.8750…
## $ pm1 <dbl> 13.0, 11.0, 15.5, 20.5, 22.0, 23.0, 22.0, 31.0, 32.0, 27.…
## $ pm2.5 <dbl> 17.0, 11.5, 19.0, 25.5, 31.0, 32.0, 30.0, 41.0, 44.0, 38.…
## $ pressure <dbl> 1006.000, 1006.000, 1006.000, 1006.750, 1007.000, 1007.00…
## $ temperature <dbl> 29.100, 28.025, 26.325, 25.000, 23.950, 23.475, 22.750, 2…
## $ uv.index <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
We can do some tweaking, by removing a few variables and separating day from time and creating separate columns for latitude and longitude variables.
## Rows: 29,426
## Columns: 11
## $ date <chr> "2019-10-13", "2019-10-13", "2019-10-13", "2019-10-13", "…
## $ time <chr> "14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00…
## $ device_title <chr> "Klimerko Krunska", "Klimerko Krunska", "Klimerko Krunska…
## $ lat <chr> "44.80868072883205", "44.80868072883205", "44.80868072883…
## $ lng <chr> "20.465567163036557", "20.465567163036557", "20.465567163…
## $ pm10 <dbl> 20.5, 13.5, 22.0, 25.5, 34.0, 33.5, 31.5, 50.0, 51.5, 43.…
## $ humidity <dbl> 30.62500, 33.50000, 37.45000, 41.15000, 44.07500, 44.8750…
## $ pm1 <dbl> 13.0, 11.0, 15.5, 20.5, 22.0, 23.0, 22.0, 31.0, 32.0, 27.…
## $ pm2.5 <dbl> 17.0, 11.5, 19.0, 25.5, 31.0, 32.0, 30.0, 41.0, 44.0, 38.…
## $ pressure <dbl> 1006.000, 1006.000, 1006.000, 1006.750, 1007.000, 1007.00…
## $ temperature <dbl> 29.100, 28.025, 26.325, 25.000, 23.950, 23.475, 22.750, 2…
We will check how many unique records each variable in our data has.
## date time device_title lat lng pm10
## 93 24 26 26 26 3403
## humidity pm1 pm2.5 pressure temperature
## 7572 2853 3245 1949 5291
To us it is most interesting to notice that we have uniq[[3]] = 26 stations in total. Next, we’ll check if all of the stations are active, by examining the number of readings for each station.
There are a couple of stations with only one or two readings. We will remove those stations form the data set and focus in our study only on the active stations.
To see where the stations are allocated we will plot them on Google maps. We will use the leaflet package to do this and create a sub data with only a list of the stations and their positions.
## device_title lat lng
## Length:23 Min. :44.27 Min. :19.88
## Class :character 1st Qu.:44.77 1st Qu.:20.42
## Mode :character Median :44.79 Median :20.46
## Mean :44.75 Mean :20.46
## 3rd Qu.:44.80 3rd Qu.:20.48
## Max. :44.84 Max. :20.92
Once we have the subset data we can plot it using the leaflet package as below.
Notice that there are a few stations outside of Belgrade. By clicking on their location points on the map, we can identify them and remove them from our data set, as we want to focus only on Belgrade’s stations.
Once removed, we will create a new map, by again creating sub dataframe with only stations we want to use in the analysis.
## device_title lat lng
## Length:22 Min. :44.75 Min. :20.40
## Class :character 1st Qu.:44.78 1st Qu.:20.42
## Mode :character Median :44.80 Median :20.46
## Mean :44.79 Mean :20.46
## 3rd Qu.:44.81 3rd Qu.:20.48
## Max. :44.84 Max. :20.52
In this section we will look through the readings and try to make some sense of it all. If you are not familiar with the information collected by the stations, ie. about the pm particles you can check the following link: https://www.irceline.be/en/documentation/faq/what-is-pm10-and-pm2.5.
The next chunk of code counts the number of readings per each week day, it shows it in a table and visualises it using the ggplot2() package.
The previous two plots look very similar. It would be interesting to see the bars of the two plots next to each other.
It would be better to have January showing after December, so we will reorder them.
In this section we will try to investigate if the concentration of pm particles depens upon other factors such as pressure and humidity.
To investigate the relationship between air pressure and concentration of pm10 particles we will create a scatterplot:
The data is quite messy, with a number of very extreme readings, which is why we “chopped” the axis using the xlim() and the ylim() functions after checking the distributions of the two variables. This is a part of informal investigation of the possible relationship between pm10 and pressure, and a more rigorous procedure should be adopted which is beyond the scope of this workshop.
We will adopt the same approach to investigate a possible relationship between humidity and concentration of the pm10 particles.
As a challenge try to use a scatterplot to investigate a possible relationship between the concentration of the p10 particles and the temperature.
Happy R coding!!! 😃🙌💜