Longitudinal data in statistics and study techniques with Python


Longitudinal data in statistics refers to observations collected on the same study units (for example, an individual, a family, a company) repeatedly over time. In other words, instead of collecting data from different study units at a single point in time, you follow the same units over time to analyze the variations and changes that occur within each unit. In this article we will look at what longitudinal data are and which study techniques to apply, using Python as the analysis tool.

Longitudinal Data

Longitudinal data refers to data collected through repeated observations on a set of study units over time. These observations can be collected at regular or irregular intervals over time and are used to study changes over time, developmental processes, causal relationships, and more.

Here are some examples of longitudinal data:

  • Child Growth Study: A classic example of longitudinal data is a study that follows the physical and cognitive development of children over time. Measurements could include height, weight, motor development, language skills, etc. Observations are collected at regular intervals (for example, every six months or every year) from childhood through adulthood.
  • Clinical Study: In the medical field, longitudinal data is commonly used to monitor the progression of a disease or the effectiveness of a treatment over time. For example, a clinical trial might involve monitoring the blood sugar levels of diabetes patients every month for a period of two years to evaluate the effectiveness of a new drug.
  • Longitudinal family income surveys: These studies follow families over time to collect information on their income, expenses, living conditions, etc. Observations are collected at regular intervals (for example, annually) to analyze changes in household income over time and identify factors that influence those changes.
  • Longitudinal employment study: Another example would be a study that follows individuals throughout their working lives, recording information such as job positions, salary, job satisfaction, etc. This type of data can be used to study career patterns, factors influencing job mobility, and more.

In all these examples, the main goal is to analyze how variables change over time and what factors influence those changes. Longitudinal data provide a unique perspective that allows you to explore temporal dynamics and gain a more complete understanding of the phenomena studied.

An example of Longitudinal Data in Python

To better understand the nature of longitudinal data, we can implement a simple example with Python using the pandas module, which is commonly used to manipulate and analyze tabular data. Suppose we have a dataset representing the heights of a group of children measured at six-month intervals over a period of two and a half years.

import pandas as pd

# Creating a DataFrame with longitudinal data
data = {
    'ID': [1, 2, 3, 4, 5],            # Children's IDs
    'Age': [3, 3.5, 4, 4.5, 5],        # Children's ages (years)
    'Height_0m': [90, 92, 88, 95, 91],   # Initial height (cm)
    'Height_6m': [93, 95, 90, 98, 94],   # Height at 6 months (cm)
    'Height_1y': [96, 98, 93, 101, 97],  # Height at 1 year (cm)
    'Height_1.5y': [100, 102, 97, 105, 101],  # Height at 1.5 years (cm)
    'Height_2y': [102, 104, 99, 107, 103],    # Height at 2 years (cm)
    'Height_2.5y': [104, 106, 101, 109, 105]  # Height at 2.5 years (cm)
}

df = pd.DataFrame(data)

# Display the DataFrame
df

This code creates a DataFrame with columns for the children’s ID, their age, and their height measured at six-month intervals over a period of two and a half years. Each row represents a child and each column a measurement moment. For example, the Height_0m column holds the children’s initial height, Height_6m their height after six months, and so on.

Longitudinal Data - dataframe example data

With this DataFrame, you can perform various longitudinal analyses, such as viewing growth trends over time, calculating children’s average growth over the years, and analyzing correlations between height and age.
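For instance, most longitudinal modeling tools expect data in long format (one row per measurement) rather than the wide layout above. Here is a minimal sketch of the reshaping with pandas’ melt, reusing a subset of the columns defined above, plus the average growth over the first six months:

```python
import pandas as pd

# A subset of the wide-format data from the example above
data = {
    'ID': [1, 2, 3, 4, 5],
    'Height_0m': [90, 92, 88, 95, 91],
    'Height_6m': [93, 95, 90, 98, 94],
    'Height_1y': [96, 98, 93, 101, 97],
}
df = pd.DataFrame(data)

# Reshape from wide (one column per time point) to long (one row per measurement)
long_df = df.melt(id_vars='ID', var_name='Time', value_name='Height')

# Average growth over the first six months
growth = (df['Height_6m'] - df['Height_0m']).mean()
print(long_df.head())
print("Average 6-month growth (cm):", growth)
```

The long format makes the measurement moment an explicit variable (Time), which is what the mixed and GEE models shown later in the article work with.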

Longitudinal study designs

Longitudinal study designs refer to the design of a study that involves observing a set of study units over time, in order to collect repeated data on these units over specific periods. These study designs are used to study changes over time and to better understand developmental processes, causal relationships, and other phenomena that may vary over time.

There are several types of longitudinal study designs, including:

  • Pure longitudinal study: In this type of study, the same study units are observed at different points in time. For example, one could follow the same group of individuals over the course of several years to evaluate how they change over time.
  • Longitudinal panel study: In this case, a sample of study units is selected and observed at more than one point in time (in waves). It can be distinguished from the pure longitudinal study in that parts of the sample may be rotated out and replaced over time, while the units that remain in the panel are followed across waves.
  • Longitudinal cohort study: This type of study follows a cohort of individuals over time, observing the group at subsequent points in time.
  • Retrospective longitudinal study: This type of study collects data on past events through interviews or records.

Longitudinal data offer numerous advantages, including the ability to evaluate changes over time, analyze developmental processes, and identify causal relationships. However, they also present unique challenges, such as managing loss to follow-up and controlling for variability over time. Analyzing longitudinal data therefore often requires specialized statistical techniques, such as mixed models or generalized estimating equation (GEE) models.

Mixed Models vs. Generalized Estimating Equation Models

Mixed models and generalized estimating equation (GEE) models are both statistical methods used to analyze longitudinal data or data that has inherent correlation between observations. However, they have slightly different approaches.

Mixed Effects Models:
Mixed models are a type of statistical model that takes into account the hierarchical structure of longitudinal data. These models incorporate both fixed effects and random effects. Fixed effects are parameters that are assumed to be constant across the population and are estimated directly from the model. Random effects, on the other hand, are considered sampled from a probability distribution and are used to capture variation across study units. In other words, mixed models treat study units as random samples from a larger population. Mixed models are often used to analyze longitudinal data with a hierarchical structure, such as data in which observations are grouped within individuals or other clusters.

Generalized estimating equation (GEE) models:
GEE models, on the other hand, focus on the analysis of group means and provide parameter estimates that are consistent even when the correlation structure between observations is not specified correctly. These models consider only fixed effects and do not incorporate random effects. An advantage of GEE models is that they are robust to misspecification of the correlation structure between observations. GEE models are used when you want to make inference about group means and are not interested in estimating random effects.

In summary, mixed models are more appropriate when we want to account for variability between study units and make inference on random effects, while GEE models are more appropriate when we want to make inference on group means and maintain greater flexibility in specifying the correlation structure. Both types of models are useful tools for analyzing longitudinal data, and the choice between the two depends on the specific research questions and the characteristics of the data.

Python example of Mixed Effects Models

To exemplify the use of mixed models on a longitudinal dataset, we will create a simple dataset that represents repeated measurements of a group of individuals over time. Let’s imagine we have a dataset that contains the heights of children measured at different ages.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Creating the DataFrame with the data
data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}

df = pd.DataFrame(data)
df

Here is an example of what the dataset might look like:

Longitudinal Data - dataframe example 01

Each row represents an observation of an individual at a given time (age) with his or her respective height recorded.

Now, we’ll use Python along with the statsmodels module to run a mixed effects model on this data. Make sure you have installed the statsmodels module before running the code.

# Definition of the mixed effects model
model = smf.mixedlm("Height ~ Age", df, groups=df["Individual"])

# Model training
result = model.fit()

# Printing the model results
print(result.summary())

In this example:

  • We use Pandas to create a DataFrame with our data.
  • We define a mixed model using the mixedlm function from statsmodels.formula.api. Here we specify the formula Height ~ Age to indicate that we are modeling height as a function of age, and specify the group as Individual to capture the random effects of individuals.
  • We train the model by calling the fit() method.
  • We print the model results using result.summary().

Running this code produces the following result:

        Mixed Linear Model Regression Results
=====================================================
Model:            MixedLM Dependent Variable: Height 
No. Observations: 6       Method:             REML   
No. Groups:       2       Scale:              0.5556 
Min. group size:  3       Log-Likelihood:     -7.7385
Max. group size:  3       Converged:          Yes    
Mean group size:  3.0                                
-----------------------------------------------------
           Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-----------------------------------------------------
Intercept  73.307    1.156 63.389 0.000 71.040 75.574
Age         5.228    0.188 27.738 0.000  4.859  5.597
Group Var   1.051    2.746                           
=====================================================

The result obtained from the summary provides you with a complete overview of the results of your mixed model, allowing you to evaluate the effect of the independent variables on changes in height over time, while simultaneously controlling for the individual effects of the groups (individuals) in the dataset. Here’s how to interpret the various elements:

Model and dependent variable: The model used is a mixed linear model (MixedLM). The dependent (or response) variable is height (Height).

Number of observations and groups:

  • The total number of observations in the dataset is 6.
  • The total number of groups (or individuals) in the dataset is 2.

Estimation method: The method used to estimate model parameters is Restricted Maximum Likelihood (REML), which is a common method for estimating parameters in mixed models.

Model scale: The model scale, which represents the residual variance not explained by the model, is 0.5556.

Log-Likelihood and Convergence: The Log-Likelihood value is -7.7385, which provides a measure of the adequacy of the model. The model has converged, which indicates that the parameter estimation process has been successfully completed.

Estimated parameters:

  • Intercept: The estimated coefficient for the intercept is 73.307. This represents the model’s predicted average height at age zero (an extrapolation, since no child was observed at that age).
  • Age: The estimated coefficient for the Age variable is 5.228. This indicates that, on average, height increases by 5.228 cm for each additional year of age.
  • Group Var: This represents the variance between groups (individuals) not explained by the variables in the model.

Confidence intervals: For each estimated coefficient, a 95% confidence interval is provided (the [0.025, 0.975] columns).
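These quantities can also be extracted programmatically from the fitted result, which is handy for building reports. The sketch below refits the model on a minimal copy of the data so it runs on its own:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Minimal copy of the dataset used in this section
data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}
df = pd.DataFrame(data)

result = smf.mixedlm("Height ~ Age", df, groups=df["Individual"]).fit()

# Estimated coefficients and their 95% confidence intervals
coefs = result.params        # Series indexed by parameter name
conf = result.conf_int()     # DataFrame with lower and upper bounds
print(coefs)
print(conf)
```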

Now let’s also add some graphical representations to visually understand what we are studying.

import matplotlib.pyplot as plt
import seaborn as sns

# Data visualization: scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Age', y='Height', hue='Individual', palette='Set1', s=100)
plt.title('Scatter plot of height measurements by age and individual')
plt.xlabel('Age')
plt.ylabel('Height')
plt.legend(title='Individual')
plt.grid(True)
plt.show()

Running the code produces the following scatter plot of the measurements (the longitudinal data).

Longitudinal Data - scatterplot of height measurementes

We will now display the plot of residuals. The residuals plot is a useful tool for evaluating the goodness of fit of the model and identifying any patterns or violations of the model assumptions. By examining the residuals, which are the differences between the observed values and those predicted by the model, we can evaluate whether the model adequately captures the variation in the data and whether there are any residual structures that have not been modeled correctly.

Here are some of the main reasons why a residual plot is used:

  • Evaluate linearity
  • Detecting heteroskedasticity
  • Identify outliers or influencers
  • Check the assumption of normality

# Visualization of model results: residual plot
plt.figure(figsize=(8, 6))
sns.residplot(x=result.fittedvalues, y=result.resid, lowess=True, scatter_kws={'alpha': 0.5})
plt.title('Residual plot of the mixed effects model')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.grid(True)
plt.show()

Running the code produces the residual plot of the mixed model.

Longitudinal Data - residual plot

Finally, we will graphically report the results of the model, i.e. the coefficients obtained from the mixed model. The coefficient plot shows the estimated values of the mixed model coefficients, along with the confidence intervals. This graph is useful for evaluating the estimated effect of each independent variable and for comparing their impacts on the outcome. It can also be useful for identifying variables that have a significant effect on the outcome versus those that do not. Furthermore, by comparing the estimated coefficients with their confidence intervals, we can determine whether a coefficient is statistically significant.

# Visualization of model results: coefficient plot
plt.figure(figsize=(8, 6))
sns.barplot(x=result.params.index, y=result.params.values)
plt.title('Coefficients of the mixed effects model')
plt.xlabel('Coefficient')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

Running the code produces the graph of the coefficients of the mixed model.

Longitudinal Data - coefficient of the mixed effects model

This is just a very basic example of how to use mixed models on longitudinal data using Python. Mixed models can be extended further to include other variables as covariates and can be adapted to meet the specific needs of your dataset and research question.

Python Example of Generalized Estimating Equations (GEE) Models

We will create an example of how to use Generalized Estimating Equation (GEE) Models on a longitudinal dataset. For the example, let’s say we have a dataset that contains repeated blood pressure measurements of a group of patients over time.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Creating the DataFrame with the data
data = {
    'Patient': [1, 1, 1, 2, 2, 2],
    'Time': [1, 2, 3, 1, 2, 3],
    'Blood_Pressure': [120, 118, 115, 130, 128, 125],
    'Treatment': ['A', 'A', 'A', 'B', 'B', 'B']
}

df = pd.DataFrame(data)
df

Here is an example of what the dataset might look like:

Longitudinal Data - dataframe example 02

Each row represents an observation of a patient at a given point in time (time) with their respective blood pressure and treatment received recorded. Now, we will use Python together with the statsmodels module to run a GEE model on this data. Make sure you have installed the statsmodels module before running the code.

# Definition of the GEE model
model = sm.GEE.from_formula("Blood_Pressure ~ Treatment", groups="Patient", data=df)

# Model training
result = model.fit()

# Print the model results
print(result.summary())

In this example:

  • We use Pandas to create a DataFrame with our data.
  • We define a GEE model using the GEE class of statsmodels.api. Here we specify the formula Blood_Pressure ~ Treatment to indicate that we are modeling blood pressure as a function of the treatment received, and specifying the group as Patient to take into account the correlation between measurements from the same patient.
  • We train the model by calling the fit() method.
  • We print the model results using result.summary().

Running this code produces the following result:

                              GEE Regression Results                              
===================================================================================
Dep. Variable:              Blood_Pressure   No. Observations:                    6
Model:                                 GEE   No. clusters:                        2
Method:                        Generalized   Min. cluster size:                   3
                      Estimating Equations   Max. cluster size:                   3
Family:                           Gaussian   Mean cluster size:                 3.0
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Tue, 26 Mar 2024   Scale:                           6.333
Covariance type:                    robust   Time:                         09:21:33
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          117.6667   4.74e-15   2.48e+16      0.000     117.667     117.667
Treatment[T.B]      10.0000    6.7e-15   1.49e+15      0.000      10.000      10.000
==============================================================================
Skew:                         -0.2391   Kurtosis:                      -1.5000
Centered skew:                -0.2391   Centered kurtosis:             -1.5000
==============================================================================

One way to graph the results of a GEE model would be via a bar graph showing the mean blood pressure for each treatment, along with confidence intervals. This allows us to easily view the average differences in blood pressure between different treatments.

import matplotlib.pyplot as plt

# Calculate blood pressure averages for each treatment
mean_pressure = df.groupby('Treatment')['Blood_Pressure'].mean()

# Calculate standard errors for each treatment
std_error = df.groupby('Treatment')['Blood_Pressure'].std() / (df.groupby('Treatment')['Blood_Pressure'].count() ** 0.5)

# Plotting
plt.figure(figsize=(8, 6))
mean_pressure.plot(kind='bar', yerr=std_error, capsize=5, color=['blue', 'green'], alpha=0.7)
plt.title('Average Blood Pressure by Treatment')
plt.xlabel('Treatment')
plt.ylabel('Blood Pressure')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()

Executing this will give you the following graph:

Longitudinal Data - average blood pressure

This is just a very basic example of how to use GEE models on longitudinal data using Python. GEE models can be extended further to include other variables as covariates and specify different correlation structures between observations.

Longitudinal data study

Analyzing longitudinal data involves a series of steps and the use of different statistical parameters to understand and interpret the dynamics of the data over time. Here is an overview of some of these metrics and why they are calculated:

  • Attrition rate
  • ANCOVA (Analysis of Covariance)
  • Fixed effects and random effects
  • Growth model

The calculation of these parameters is essential for a correct interpretation of longitudinal data and for a more in-depth understanding of developmental processes, risk and protective factors, and the effectiveness of interventions over time. It also helps mitigate potential sources of bias and makes the analysis more accurate and reliable. We will look at them one by one in the rest of the article with some simple examples.

The Attrition Rate

Attrition rate, also known as loss-to-follow-up rate or dropout rate, refers to the percentage of participants in a longitudinal study who are no longer available or cannot be followed over time. This can happen for a variety of reasons, including refusal to continue participation, loss of contact with participants, relocation, or death.

Attrition rate is an important factor to consider when analyzing longitudinal data, as it can influence the validity and reliability of conclusions drawn from the study. A high attrition rate can lead to problems with sample representativeness, bias in results, and reduction in statistical power.

To manage attrition, researchers often adopt several strategies, such as improving data collection methods, offering incentives for participants to remain in the study, maintaining a good relationship with participants over time, and appropriately analyzing missing data.

In general, it is important to carefully monitor the attrition rate and consider its implications when interpreting longitudinal study results.

Here is an example of how to calculate the attrition rate in Python using a Pandas DataFrame:

import pandas as pd

# Creating a dummy DataFrame with the data
data = {
    'Individual': [1, 2, 3, 4, 5],
    'Number of Observations': [5, 4, 3, 2, 1]  # Number of observations for each individual
}

df = pd.DataFrame(data)

# The maximum observed count is taken here as the number of planned
# observations (waves) per individual
planned_waves = df['Number of Observations'].max()

# Calculating the total number of individuals
total_individuals = len(df)

# Calculating the total number of observations actually collected
total_observations = df['Number of Observations'].sum()

# Calculating the number of expected and missing observations
expected_observations = planned_waves * total_individuals
missing_observations = expected_observations - total_observations

# Calculating the attrition rate
attrition_rate = (missing_observations / expected_observations) * 100

print("Number of expected observations:", expected_observations)
print("Total number of observations:", total_observations)
print("Number of missing observations:", missing_observations)
print("Attrition rate:", attrition_rate, "%")

In this example, we have a DataFrame containing the number of observations collected for each individual. Taking the largest observed count (5) as the number of planned waves, the attrition rate is calculated as the percentage of missing observations out of the expected total. Finally, we print the results.

Number of expected observations: 25
Total number of observations: 15
Number of missing observations: 10
Attrition rate: 40.0 %
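Attrition can also be tracked wave by wave when the data is in long format. The sketch below uses a small hypothetical long-format table and counts how many individuals are still observed at each wave:

```python
import pandas as pd

# Hypothetical long-format data: one row per observed wave per individual
long_data = pd.DataFrame({
    'Individual': [1, 1, 1, 2, 2, 3],
    'Wave':       [1, 2, 3, 1, 2, 1],
})

# Number of individuals still observed at each wave
remaining = long_data.groupby('Wave')['Individual'].nunique()
baseline = remaining.loc[1]

# Attrition at each wave, relative to the first wave
attrition_by_wave = (1 - remaining / baseline) * 100
print(attrition_by_wave)
```

This per-wave view makes it easy to see whether dropout is gradual or concentrated at a particular point in the study.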

There are several classic ways to visualize the attrition rate. One of the most common ways is to use a bar chart or pie chart to show the proportion of missing observations out of total observations. Here is an example of how you can visualize attrition rate using a pie chart in Python.

import matplotlib.pyplot as plt

# Creating a list of labels for the pie chart
labels = ['Completed observations', 'Missing observations']

# Creating a list of values for the pie chart
sizes = [total_observations, missing_observations]

# Creation of the pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Attrition rate')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

This code will create a pie chart showing the proportion of completed and missing observations in your dataset.

Longitudinal Data - Attrition Rate

Fixed effects and Random effects

Fixed effects and random effects are two fundamental concepts in mixed models used to analyze longitudinal data or data with hierarchical structure. These terms refer to how the variables in the model are considered and the nature of their relationship to the units of study.

Fixed effects:
Fixed effects are parameters that are assumed to be constant across the population and are estimated directly from the model. These effects represent the average effect of an independent variable on a dependent variable. For example, in a model that studies the effect of a treatment on an outcome, the treatment fixed effect would represent the average difference in the outcome between the treated group and the control group.

Random Effects:
Random effects are considered to be sampled from a probability distribution and are used to capture variation across study units. These effects represent individual deviations from the population average. For example, if we are studying the impact of a treatment on different individuals, random effects would capture individual differences in response to treatment, which cannot be explained by the independent variables in the model alone.

In short, fixed effects are parameters that are assumed to be constant across the population and focus on the average effects of the independent variables, while random effects capture individual variation across study units and provide information about how these units differ from the population average. Both effects are important to consider when analyzing longitudinal data to understand both average effects and individual variation over time.

Suppose we have a dataset representing the heights of children measured at different ages. We will use a mixed model to analyze this data, with fixed effects for age and random effects for each individual.

import pandas as pd
import statsmodels.api as sm

# Creating the DataFrame with the data
data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}

df = pd.DataFrame(data)

# Defining the mixed effects model with fixed effects for age and random effects for individual
model = sm.MixedLM.from_formula("Height ~ Age", groups="Individual", data=df)
result = model.fit()

# Printing the model results
print(result.summary())

In this example, we are using the statsmodels MixedLM module to create a mixed model. The formula specified in the from_formula function indicates that we are modeling height as a function of age, with a fixed effect for age and a random effect for the individual. The Individual variable is specified as group to capture random effects specific to each individual. Finally, the model results are printed using the summary() method. This will give us information about the estimated coefficients, significance tests, and other model parameters.

         Mixed Linear Model Regression Results
========================================================
Model:             MixedLM  Dependent Variable:  Height 
No. Observations:  6        Method:              REML   
No. Groups:        2        Scale:               0.5556 
Min. group size:   3        Log-Likelihood:      -7.7385
Max. group size:   3        Converged:           Yes    
Mean group size:   3.0                                  
--------------------------------------------------------
              Coef.  Std.Err.   z    P>|z| [0.025 0.975]
--------------------------------------------------------
Intercept     73.307    1.156 63.389 0.000 71.040 75.574
Age            5.228    0.188 27.738 0.000  4.859  5.597
Individual Var 1.051    2.746                           
========================================================

Now let’s also implement a graphical representation.

import matplotlib.pyplot as plt
import seaborn as sns

# Creation of scatter plot with regression lines
sns.lmplot(x='Age', y='Height', data=df, hue='Individual', ci=None, scatter_kws={"s": 100})
plt.title('Fixed and random effects')
plt.xlabel('Age')
plt.ylabel('Height')
plt.show()

In this code, we are using Seaborn’s sns.lmplot to create a scatterplot with regression lines. Each individual is represented by a different color on the graph.

Longitudinal Data - Fixed and random effects

The regression lines show the fixed effects, while the scatter of points around the lines reflects the random effects. This provides a visual representation of how children’s height varies with age, taking into account both average effects and individual differences.

Analysis of Covariance (ANCOVA)

In the context of longitudinal data, analysis of covariance (ANCOVA) can be extended to take into account the longitudinal structure of the data and to evaluate differences between groups on a continuous dependent variable, while simultaneously controlling for the effect of continuous variables (covariates) on more points over time. This approach is often called longitudinal ANCOVA or time-varying ANCOVA.

Longitudinal ANCOVA considers the variability observed between participants over time and attempts to isolate the effects of interest, controlling for initial or pre-existing differences between groups and other variables that might influence the dependent variable over time.

The main procedure of longitudinal ANCOVA involves the specification of a model that incorporates the effects of groups of interest, control variables, and time. This model can be implemented using multivariate linear regression techniques, such as mixed effects models or generalized linear models.

Some key points to consider when analyzing covariance in longitudinal data include:

  • Correlation Structure: Since longitudinal data usually exhibit correlation between observations of the same individual over time, this correlation structure must be taken into account in the analysis to avoid biased estimates and erroneous inferences.
  • Model assumptions: It is important to verify that model assumptions are met, such as homoskedasticity of errors and residual normality.
  • Controlling for confounding: Longitudinal ANCOVA allows you to control for confounding variables that may influence the outcome over time.
  • Interaction between time and groups: Interactions between group variables and time can be explored to evaluate whether the effects of groups vary over time.

Understanding and managing these considerations are critical to obtaining valid and interpretable results in the analysis of covariance in longitudinal data.

Suppose we have a dataset similar to the one used in the previous examples, where we measured the height of children at different ages, and we want to evaluate whether age affects children's height while controlling for a covariate such as gender.

import pandas as pd
import statsmodels.api as sm

# Creating the DataFrame with the data
data = {
    'Individual': [1, 1, 1, 1, 2, 2, 2, 2],
    'Age': [2, 4, 6, 8, 3, 5, 7, 9],
    'Height': [85, 95, 105, 111, 88, 98, 110, 123],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'F']
}

df = pd.DataFrame(data)

# Defining the ANCOVA model
model = sm.OLS.from_formula("Height ~ Age + Gender", data=df)
result = model.fit()

# Printing the model results
print(result.summary())

In this example, we are using the statsmodels module to perform an ANCOVA analysis. In the model formula, we are specifying that we want to model height as a function of age and gender as a covariate. The fit() method will train the model and return the results.

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Height   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.966
Method:                 Least Squares   F-statistic:                     99.27
Date:                Tue, 26 Mar 2024   Prob (F-statistic):           9.46e-05
Time:                        09:46:30   Log-Likelihood:                -16.380
No. Observations:                   8   AIC:                             38.76
Df Residuals:                       5   BIC:                             39.00
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      74.0000      2.543     29.095      0.000      67.462      80.538
Gender[T.M]    -0.6250      1.718     -0.364      0.731      -5.042       3.792
Age             5.1250      0.375     13.667      0.000       4.161       6.089
==============================================================================
Omnibus:                        0.297   Durbin-Watson:                   1.169
Prob(Omnibus):                  0.862   Jarque-Bera (JB):                0.367
Skew:                          -0.320   Prob(JB):                        0.832
Kurtosis:                       2.168   Cond. No.                         19.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

With this code, we will get ANCOVA results that will show whether age and gender have a significant effect on children’s height, controlling for each other.

An example of how to display ANCOVA results is to use a scatter plot with separate regression lines for each level of the categorical variable (for example, gender) and a different color for each level.

import seaborn as sns
import matplotlib.pyplot as plt

# Creation of scatter plot with regression lines for each gender level
sns.lmplot(x='Age', y='Height', hue='Gender', data=df, ci=None)
plt.title('ANCOVA: Height as a function of Age and Gender')
plt.xlabel('Age')
plt.ylabel('Height')
plt.show()

In this code, we are using Seaborn’s lmplot function to create a scatterplot with separate regression lines for each gender level. This lets us graphically visualize how height varies as a function of age, controlling for gender. The hue parameter specifies the categorical variable (gender) used to separate and color the regression lines.

Longitudinal Data - ANCOVA analysis

The Growth Model

A growth model is a type of statistical model used to describe and analyze the change of a variable over time. These models are commonly employed in longitudinal research, in which data are collected at multiple points in time for the same individual, group, or study unit.

Growth models are often used to examine and understand the processes of development, change and learning over time. They can be applied to a wide range of phenomena, including cognitive development, physical growth, behavioral changes, and many other areas of interest.

There are several types of growth models, including:

  • Linear growth models: These models assume that change over time is constant and linear. They can be used to study changes that occur at a constant rate over time.
  • Nonlinear growth models: These models allow you to capture changes that are nonlinear over time. They can include S-shaped, exponential, quadratic, or polynomial curves.
  • Hierarchical growth models: These models take into account the hierarchical structure of data, such as data collected about individuals within groups or communities. These models allow us to study both individual and group changes.
  • Latent variable growth models: These models involve the use of latent or unobservable variables to describe change over time. They can be used to model complex constructs such as intelligence, mood or ability.

In summary, growth models are useful tools for understanding processes of change over time and for testing hypotheses regarding the factors that influence such changes. The choice of model depends on the nature of the data and the specific objectives of the research.
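As an illustration of the first type in the list above, a linear growth model can be fit as a hierarchical (mixed-effects) model: a fixed effect of age captures the average growth rate, while a random intercept per individual captures each child's starting level. Here is a minimal sketch with hypothetical data, using statsmodels' MixedLM:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical longitudinal data: height of three children at four ages
data = {
    'Individual': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Age': [2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8],
    'Height': [85, 95, 105, 111, 88, 98, 110, 118, 84, 93, 103, 112],
}
df = pd.DataFrame(data)

# Linear growth model: a fixed effect of Age (the average growth rate)
# plus a random intercept for each individual (their starting level)
model = smf.mixedlm("Height ~ Age", data=df, groups=df["Individual"])
result = model.fit()
print(result.summary())
```

Adding `re_formula="~Age"` would also give each individual a random slope, allowing growth rates, and not just starting levels, to vary across children.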

Other fundamental parameters in the study of longitudinal data

There are other key parameters that are used in the study of longitudinal data to provide a complete and in-depth understanding of the processes under investigation. Here are some of them:

  • Change over time: This metric measures the direction and magnitude of change in the variables of interest over time. It can be examined through trajectory graphs, regression analysis and growth models.
  • Correlation between observations: It is important to evaluate the correlation between multiple observations of the same individual over time. This can influence the choice of statistical model to use and the validity of inferences.
  • Interaction effects: Interaction effects evaluate whether the effect of an independent variable on the dependent variable varies based on the levels of one or more other independent variables over time. This parameter is important for understanding how the relationships between variables can change over time.
  • Explained and Unexplained Variance: These metrics measure the proportion of the total variance in the variables of interest that is explained by the statistical models used. These are important for evaluating how well the model fits the data and how much of the variation is due to factors not included in the model.
  • Model evaluation: Model evaluation involves using criteria such as the coefficient of determination (R²), Akaike information criterion (AIC), and Bayesian information criterion (BIC) to evaluate goodness of fit and complexity of the models.
  • Survival analysis: In some cases, survival analysis can be used to model the time until the event of interest, for example, the time until a disease is diagnosed or the time until unemployment.
  • Cluster analysis: If longitudinal data are collected from different sites or groups, cluster analysis can be used to examine differences between groups and to evaluate the heterogeneity of the longitudinal data.
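To illustrate the survival-analysis idea above, here is a minimal, self-contained sketch of the Kaplan-Meier estimator of the survival function. The follow-up times and event indicators are invented for the example; in practice a dedicated library such as lifelines would normally be used:

```python
# Hypothetical follow-up times (e.g. months until diagnosis) and
# event indicators: 1 = event observed, 0 = censored (lost to follow-up)
durations = [3, 5, 5, 8, 12, 12, 15, 20]
observed  = [1, 1, 0, 1, 1, 0, 1, 0]

def kaplan_meier(durations, observed):
    """Return (time, survival probability) pairs for the KM estimator."""
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # events and total observations (events + censorings) at time t
        deaths = sum(1 for d, e in pairs if d == t and e == 1)
        total_at_t = sum(1 for d, _ in pairs if d == t)
        if deaths > 0:
            survival *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, survival))
        n_at_risk -= total_at_t
        i += total_at_t
    return curve

for t, s in kaplan_meier(durations, observed):
    print(f"t={t}: S(t)={s:.3f}")
```

The estimated survival probability drops at each observed event time, while censored observations only reduce the number at risk.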

These are just a few examples of important parameters when studying longitudinal data. The choice of parameters to use depends on the objective of the study, the nature of the data, and the specific research questions.
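As a sketch of the model-evaluation criteria mentioned above (R², AIC, BIC), one can compare a linear and a quadratic growth curve fitted to the same hypothetical data with statsmodels; lower AIC/BIC indicates a better trade-off between fit and complexity:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: height of children at different ages
data = {
    'Age': [2, 4, 6, 8, 3, 5, 7, 9],
    'Height': [85, 95, 105, 111, 88, 98, 110, 123],
}
df = pd.DataFrame(data)

# Fit a linear and a quadratic model to the same data
linear = smf.ols("Height ~ Age", data=df).fit()
quadratic = smf.ols("Height ~ Age + I(Age ** 2)", data=df).fit()

# Compare goodness of fit (R²) against complexity penalties (AIC, BIC)
for name, res in [("linear", linear), ("quadratic", quadratic)]:
    print(f"{name}: R2={res.rsquared:.3f} AIC={res.aic:.2f} BIC={res.bic:.2f}")
```

Note that R² never decreases when a term is added, which is exactly why penalized criteria such as AIC and BIC are needed to judge whether the extra complexity is worthwhile.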
