Tidyverse, an ideal tool for Descriptive Statistics with R

Descriptive statistics is a crucial step in data analysis, providing a detailed overview of the main characteristics of a dataset. R, with its vast ecosystem of packages, offers a powerful and coherent solution for this phase. Among these packages, the Tidyverse stands out: a set of packages designed to improve data manipulation, analysis and visualization in R.

Tidyverse

Tidyverse is a set of R packages designed to work together cohesively, offering a consistent and intuitive programming style for data analysis. It was developed by statistician Hadley Wickham and his team. Tidyverse includes several packages that are commonly used for data analysis and manipulation of data frames. Some of the most notable packages included are:

ggplot2: A package for creating elegant and flexible graphs based on the grammar of graphics. It is very powerful for creating data visualizations.
dplyr: Provides a set of functions for manipulating data, such as selecting columns, filtering rows, creating new variables, aggregating data, and more.
tidyr: Useful for managing the shape of data, such as pivoting between wide and long layouts and rearranging data into “tidy” format, a tabular format in which each variable is a column and each observation is a row.
readr: A package for efficiently reading and importing data from various formats, including CSV, TSV, and others.
purrr: Provides tools for functional programming, enabling advanced operations on data and data structures.
tibble: A modern version of R’s core dataframe, offering improved features, such as cleaner printing and indexing of columns by name.
stringr: Useful for string manipulation and textual data cleansing.
forcats: A package for efficiently managing factors in analyses.

Installing and Loading the Tidyverse

To use tidyverse in R, you need to install it by entering the following command:

install.packages("tidyverse")

Downloading the packages and installing them will take a few minutes. Once done, you can load tidyverse with the command:

library(tidyverse)

Once you load tidyverse, you will have access to all included packages and can use their features in an integrated way to analyze and visualize data more efficiently and consistently.
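
As a small sketch, the installation step can be made conditional so that a script does not reinstall the packages on every run. requireNamespace() is part of base R, and tidyverse_packages() is a helper provided by the tidyverse package itself that lists the packages it bundles. This is one possible way to set things up, not a required one.

```r
# Install the tidyverse only if it is not already available
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Attach the core tidyverse packages (ggplot2, dplyr, tidyr, ...)
library(tidyverse)

# List the names of the packages that belong to the tidyverse
bundled <- tidyverse_packages()
print(bundled)
```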

Use Tidyverse for Descriptive Statistics

The Tidyverse is well suited to descriptive statistics: its packages were designed to make data manipulation, analysis and visualization consistent with one another. Let’s walk through a short series of steps that outline how to study a dataset.

Data manipulation with dplyr

The dplyr package offers a clear and consistent set of functions for filtering, selecting, aggregating and manipulating data. These operations are fundamental in the data preparation phase for descriptive analysis.

For example, we can use the mtcars dataset which is available by default in R. This dataset contains information about different cars. Here’s an example of how to get started with dplyr using the mtcars dataset:

# Load the mtcars dataset
data(mtcars)

# Display the first few rows of the dataset
head(mtcars)

Running these two commands prints the first six rows of the dataset.

Now that we have loaded the dataset, we can take a quick look at what it contains. We have a series of car models in which the values of some characteristics such as the number of cylinders, weight, number of gears, etc. are reported. This dataset is in fact quite simple and intuitive, excellent for starting to become familiar with data manipulation.
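
Beyond head(), a few other calls give a quick first impression of a dataset. The sketch below uses dim() and summary() from base R and glimpse(), which dplyr makes available; it is only one possible way to take a first look at the data.

```r
library(dplyr)

# Number of rows (car models) and columns (characteristics)
mtcars_dim <- dim(mtcars)
print(mtcars_dim)

# Compact overview: one line per column with its type and first values
glimpse(mtcars)

# Minimum, quartiles, median, mean and maximum of a single column
summary(mtcars$mpg)
```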

The dplyr package is designed for exactly this kind of work. For example, we can select just a few columns, filter the rows based on a condition, and aggregate the data. Let’s start by selecting just the three columns of the dataset we are interested in: mpg, cyl and disp.

# Load the dplyr package
library(dplyr)

# Select only specific columns
mtcars_selected <- mtcars %>%
  select(mpg, cyl, disp)

head(mtcars_selected)

You get:

As you can see from the result, we removed all the data columns that we were not interested in. We continue with the selection, this time eliminating all the models whose characteristics do not meet certain conditions. For example, if we wanted to select all car models that have a number of cylinders greater than 4, we can use the following commands:

# Filter rows based on a condition (e.g., cars with more than 4 cylinders)
mtcars_filtered <- mtcars_selected %>%
  filter(cyl > 4)

print(mtcars_filtered)

This time, given the small number of car models selected, we can replace head() with print(), which outputs all the rows of the filtered dataset rather than only the first 6.
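
filter() also accepts several conditions at once, separated by commas and combined with a logical AND. The following sketch is an illustration only: the mpg threshold of 20 and the name efficient_large are chosen here for the example. It also uses count(), another dplyr function, to check how many cars fall into each cylinder group.

```r
library(dplyr)

# Keep cars with more than 4 cylinders AND more than 20 miles per gallon
efficient_large <- mtcars %>%
  select(mpg, cyl, disp) %>%
  filter(cyl > 4, mpg > 20)

print(efficient_large)

# Count how many cars there are for each number of cylinders
cyl_counts <- mtcars %>% count(cyl)
print(cyl_counts)
```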

Now that we have selected the data both by row and by column, another typical operation will be to group the models having a common characteristic from which to extract a statistic for each group. For example, we could group models with the same number of cylinders, and know what the average of the other two values (mpg and disp) is.

# Group by and calculate the mean for each number of cylinders
mean_per_cyl <- mtcars_filtered %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            mean_disp = mean(disp))

# Display the results
print(mean_per_cyl)

Running this code prints the mean mpg and mean disp for the 6-cylinder and 8-cylinder groups.

This is just an example of how to get started with dplyr. You can further explore the many functions offered by dplyr, such as mutate(), arrange(), rename(), and others, to manipulate and transform data based on your specific needs. The %>% (pipe) syntax is used to chain operations, making the code more readable and sequential.
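
To give an idea of how these functions fit together, here is a minimal sketch that chains rename(), mutate(), summarise() and arrange() in a single pipeline. The column names displacement and km_per_litre are chosen here for illustration, and the conversion factor assumes US gallons, which is the unit mtcars uses for mpg.

```r
library(dplyr)

car_summary <- mtcars %>%
  # rename(): new_name = old_name
  rename(displacement = disp) %>%
  # mutate(): add a derived column (1 mile per US gallon = 0.425144 km per litre)
  mutate(km_per_litre = mpg * 0.425144) %>%
  group_by(cyl) %>%
  # summarise(): several descriptive statistics for each group
  summarise(
    n = n(),
    mean_kml = mean(km_per_litre),
    median_kml = median(km_per_litre),
    sd_kml = sd(km_per_litre),
    mean_displacement = mean(displacement)
  ) %>%
  # arrange(): sort the groups from most to least fuel efficient
  arrange(desc(mean_kml))

print(car_summary)
```

Adding n(), median() and sd() next to the mean gives a fuller description of each group than the mean alone.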

Data visualization with ggplot2

The ggplot2 package is a powerful tool for creating informative and understandable graphs. In descriptive statistics, data visualization is key to understanding the distribution and structure of data.

Previously, we performed a series of data manipulation operations with dplyr on the mtcars dataset. Now, we can move on to data visualization using ggplot2. We will use the result obtained before, mean_per_cyl, to create some graphs. Here is an example of how to create a bar graph to display the average mpg and disp for each number of cylinders (cyl):

# Load the ggplot2 package
library(ggplot2)

# Bar chart of the mean mpg and disp for each number of cylinders
ggplot(mean_per_cyl, aes(x = factor(cyl))) +
  geom_bar(aes(y = mean_mpg, fill = "Mean MPG"), stat = "identity", 
              width=0.3, position = position_nudge(x = 0.15)) +
  geom_bar(aes(y = mean_disp, fill = "Mean Disp"), stat = "identity", 
              width=0.3, position = position_nudge(x = - 0.15)) +
  labs(title = "Mean MPG and Mean Disp by Number of Cylinders", 
              x = "Number of Cylinders", y = "Mean Value") +
  scale_fill_manual(values = c("Mean MPG" = "blue", "Mean Disp" = "red")) +
  theme_minimal()

This code creates a bar chart showing the average MPG and Disp for each number of cylinders, with the two bars of each group drawn side by side. You can further customize the chart to your preferences using other ggplot2 features.
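
Because mpg and disp are measured on very different scales, the mpg bars look small when both share one y axis. One possible alternative, sketched below, reshapes the summary into long format with pivot_longer() from tidyr and draws one panel per statistic with facet_wrap(). The names mean_long, statistic and value are chosen here for illustration, and mean_per_cyl is recomputed so that the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Recompute the grouped means so the example is self-contained
mean_per_cyl <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

# Reshape to long format: one row per combination of cyl and statistic
mean_long <- mean_per_cyl %>%
  pivot_longer(cols = c(mean_mpg, mean_disp),
               names_to = "statistic", values_to = "value")

# One panel per statistic, each with its own y axis
p_facets <- ggplot(mean_long, aes(x = factor(cyl), y = value, fill = statistic)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  facet_wrap(~ statistic, scales = "free_y") +
  labs(title = "Mean MPG and Mean Disp by Number of Cylinders",
       x = "Number of Cylinders", y = "Mean Value") +
  theme_minimal()

print(p_facets)
```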

The mtcars dataset lends itself to many other ggplot2 visualizations. We can create different types of graphs depending on the information we want to represent. Here are some examples.

Histogram of mpg

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "Histogram of MPG", x = "MPG", y = "Frequency")

Running this code produces a histogram showing the distribution of MPG values across the car models, grouped into intervals of width 2 (the binwidth value).
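
The shape seen in the histogram can be complemented with a few numeric summaries. As a minimal sketch, summarise() can compute the usual measures of centre and spread for mpg in a single call; the result name mpg_stats and its column names are chosen here for illustration.

```r
library(dplyr)

mpg_stats <- mtcars %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    q1_mpg = quantile(mpg, 0.25),
    q3_mpg = quantile(mpg, 0.75),
    max_mpg = max(mpg)
  )

print(mpg_stats)
```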

Scatter plot of MPG versus Displacement

ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(color = "darkgreen") +
  labs(title = "Scatter Plot of MPG vs. Displacement", x = "Displacement", y = "MPG")

Running this code produces a scatter plot in which each point is a car model, showing the relationship between MPG and displacement.
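
The relationship visible in the scatter plot can also be quantified. The sketch below computes the Pearson correlation with cor() from base R, which for these two columns is strongly negative (about -0.85), and adds a straight trend line with geom_smooth(). A linear fit is an assumption made here for illustration; the points may be described better by a curve.

```r
library(dplyr)
library(ggplot2)

# Pearson correlation between displacement and fuel efficiency
mpg_disp_cor <- mtcars %>%
  summarise(r = cor(disp, mpg)) %>%
  pull(r)

print(mpg_disp_cor)

# Same scatter plot with a fitted straight line to make the trend visible
p_trend <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  labs(title = "Scatter Plot of MPG vs. Displacement",
       x = "Displacement", y = "MPG")

print(p_trend)
```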

Boxplot of mpg for each number of cylinders

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(title = "Boxplot of MPG by Number of Cylinders", x = "Number of Cylinders", y = "MPG") +
  scale_fill_manual(values = c("red", "blue", "green"))
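
A boxplot draws the median, the quartiles and the range of each group, and the same quantities can be computed as a table. The following is a minimal sketch using group_by() and summarise(); the column names are chosen here for illustration, and the quartiles returned by quantile() with its default settings can differ slightly from the box edges that geom_boxplot() draws for small groups.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    min_mpg = min(mpg),
    q1_mpg = quantile(mpg, 0.25),
    median_mpg = median(mpg),
    q3_mpg = quantile(mpg, 0.75),
    max_mpg = max(mpg),
    iqr_mpg = IQR(mpg)
  )

print(mpg_by_cyl)
```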

Efficient data management with tibble

The tibble package can be used to manage data efficiently. It provides the tibble, an improved data structure compared with R’s basic dataframes.

If we check the mean_per_cyl value calculated previously we will see that it is already in tibble format.

print(mean_per_cyl)

This is because group_by() converts its input into a (grouped) tibble, so the summarise() call used to obtain this value returns a tibble as well. To see a conversion in action, we therefore start from a dataset that is not a tibble, such as mtcars_filtered.

Converting this dataset into a tibble is very simple: just call the as_tibble() function. Note that as_tibble() drops row names by default, so the car model names do not appear in the result.

mtcars_filtered_tibble <- as_tibble(mtcars_filtered)
print(mtcars_filtered_tibble)

By performing the conversion, we will see a different formatting of the data in print.
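
Since as_tibble() drops row names unless told otherwise, the car model names stored as row names in mtcars are lost in a plain conversion. As a sketch of one way to keep them, the rownames argument of as_tibble() moves them into a regular column; the column name model and the result name mtcars_with_model are chosen here for illustration.

```r
library(dplyr)  # dplyr makes as_tibble() from the tibble package available

# Convert first, storing the row names in a column called "model",
# then apply the same selection and filter as before
mtcars_with_model <- as_tibble(mtcars, rownames = "model") %>%
  select(model, mpg, cyl, disp) %>%
  filter(cyl > 4)

print(mtcars_with_model)
```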

The main differences between a basic R dataframe and a tibble concern how the data is printed and how some operations behave during analysis.

Here are some of the practical differences between a dataframe and a tibble:

Cleaner printing: When you display a tibble in R, you get more concise output than with a basic dataframe. Tibbles show only the first 10 rows and all columns that fit on the screen, making it easier to explore the data.
Columns of consistent type: Tibbles never change the type of their inputs. For example, tibble() never converts strings to factors, whereas data.frame() did so by default before R 4.0.
Consistent subsetting: Extracting a single column from a basic dataframe with square brackets, as in df[, "mpg"], returns a plain vector, while the same operation on a tibble always returns a tibble. dplyr functions such as select() keep the original column types in both cases.
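Handling special characters: tibble() keeps column names exactly as given, including names containing spaces or other special characters, whereas data.frame() by default rewrites them into syntactic names (a column called my col becomes my.col).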
Column type information: When printed, a tibble shows the type of each column (for example <dbl> or <chr>) under its name, together with the dimensions of the table.
Stricter column access: A tibble never partially matches column names, so using $ with a name that does not exist returns NULL with a warning instead of silently matching a longer column name, which makes interactive work in the R console or in scripts more predictable.
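
Some of these differences are easy to check directly. The sketch below builds the same two-column table once with data.frame() and once with tibble(), then compares how each one handles a column name containing a space, single-column subsetting with square brackets, and partial name matching with $. The column names and values are made up for illustration.

```r
library(tibble)

df <- data.frame(`fuel use` = c(21.0, 22.8), model = c("Mazda RX4", "Datsun 710"))
tb <- tibble(`fuel use` = c(21.0, 22.8), model = c("Mazda RX4", "Datsun 710"))

# Column names: data.frame() rewrites "fuel use", tibble() keeps it
names(df)
names(tb)

# Single-column subsetting: a plain numeric vector versus a one-column tibble
df_col <- df[, "fuel.use"]
tb_col <- tb[, "fuel use"]
class(df_col)
class(tb_col)

# Partial matching with $: the dataframe matches "mod" to "model",
# the tibble returns NULL (with a warning, suppressed here)
df_partial <- df$mod
tb_partial <- suppressWarnings(tb$mod)
```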

In general, while tibbles are designed to maintain greater consistency and cleanliness when exploring and manipulating data, both can be used similarly in most basic operations. The choice between dataframes and tibbles often depends on personal preferences and specific project requirements.

Conclusion

In conclusion, the tidyverse proves to be an essential tool for anyone involved in descriptive statistics with R. The consistent syntax, powerful functions, and “tidy” data philosophy simplify the workflow, allowing for more efficient analyses and clearer visualizations. The combination of dplyr, ggplot2, tibble, and other packages within the tidyverse offers a comprehensive suite of tools for exploring and understanding data in detail.