Abstract
The traj package implements the 3-step procedure proposed by Leffondré et al. (2004) to
identify clusters of longitudinal trajectories. The first step
calculates 24 summary measures that describe features of the
trajectories. The second step performs a factor analysis on these 24
measures to select the measures that best describe the main features of the
trajectories. The third step classifies the trajectories into clusters
based on the previously selected factors. The traj package also offers a wide
variety of plotting functions to visualize the results.
This vignette illustrates the use of the traj package using simulated
data. A more detailed description of the methods can be found in
Sylvestre et al. (2006) or Leffondré et al. (2004).
Data
The example data consist of two dataframes, of which only the first is needed here. The first
dataframe, example.data$data, contains the values of each individual trajectory. Each row
corresponds to one trajectory: an ID followed by the values X1 to X6.
library(traj)
head(example.data$data)
#> ID X1 X2 X3 X4 X5 X6
#> 1 1 5.658914 9.339839 3.770285 17.360689 8.824336 9.281445
#> 2 2 23.592764 11.752246 7.684052 12.829819 13.001762 9.664881
#> 3 3 15.468982 8.756455 6.493185 11.260783 10.419991 17.405468
#> 4 4 7.311962 11.687510 12.476206 8.890432 6.521589 7.701249
#> 5 5 12.843652 11.087720 7.649965 10.268853 12.453166 11.557388
#> 6 6 3.521960 15.285008 7.860331 7.113819 17.953799 4.167628
Analysis
The first step in the analysis consists of computing 24 measures for each trajectory.
The 24 measures are:
The 24 measures can be computed using the step1measures function.
s1 = step1measures(example.data$data, ID = TRUE)
#> [1] "Correlation of m5 and m6 : 1"
#> [1] "Correlation of m12 and m13 : 1"
#> [1] "Correlation of m17 and m18 : 0.999"
head(s1$measurments)
#> ID m1 m2 m3 m4 m5 m6 m7
#> 1 1 13.590405 9.039251 4.661120 51.56534 3.622531 0.60375512 0.64014590
#> 2 2 15.908712 13.087587 5.534055 42.28476 -13.927883 -2.32131390 -0.59034555
#> 3 3 10.912283 11.634144 4.107025 35.30148 1.936486 0.32274765 0.12518509
#> 4 4 5.954618 9.098158 2.447025 26.89583 0.389287 0.06488117 0.05323975
#> 5 5 5.193687 10.976791 1.875271 17.08396 -1.286263 -0.21437719 -0.10014778
#> 6 6 14.431839 9.317091 5.954592 63.91042 0.645668 0.10761133 0.18332632
#> m8 m9 m10 m11 m12 m13 m14
#> 1 0.40075561 0.86161571 1.195954e-01 13.590405 8.656240 8.656240 6.366869
#> 2 -1.06420558 -1.73557433 3.442450e-01 5.145766 6.236869 6.236869 4.912661
#> 3 0.16644851 0.55544664 6.401740e-02 6.985477 5.515075 5.515075 4.313933
#> 4 0.04278745 -0.48963151 1.401296e-01 4.375548 3.146345 3.146345 2.459704
#> 5 -0.11718026 0.00811164 6.548737e-05 2.618888 2.598210 2.598210 2.178533
#> 6 0.06929931 0.29966281 8.864000e-03 11.763048 11.197463 11.197463 8.912078
#> m15 m16 m17 m18 m19 m20 m21
#> 1 13.590405 1.5034878 15.773162 10.046521 -0.8059541 14.882664 22.126757
#> 2 11.840519 0.9047136 -6.822248 -3.593548 2.1259093 6.367233 9.213959
#> 3 6.985477 0.6004290 12.576325 9.929082 3.4245010 6.228696 7.826269
#> 4 4.375548 0.4809268 -8.936410 -6.425945 -0.7989718 3.181689 4.374471
#> 5 3.437756 0.3131840 423.805221 320.306329 0.2150385 2.813283 6.056643
#> 6 13.786171 1.4796647 46.005610 37.366875 -6.3873047 15.519633 24.626151
#> m22 m23 m24
#> 1 2.4478528 3.475296 2.337517
#> 2 0.7040228 1.875554 1.296087
#> 3 0.6726983 1.814184 1.443856
#> 4 0.4808084 1.778454 1.293525
#> 5 0.5517681 2.780148 1.291366
#> 6 2.6431159 2.763233 1.741416
Each row in the dataframe returned by step1measures corresponds to the
trajectory on the same row in the input data (example.data$data). For each
trajectory, the 24 measures have been calculated and correspond to
columns m1 to m24.
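To make the output concrete, a few of the measures for trajectory 1 can be reproduced by hand in base R. This is a minimal sketch, not the package's implementation: reading m1, m2, m3, m4, m5, m7 and m8 as the range, mean, standard deviation, coefficient of variation, change, change relative to the first value and change relative to the mean is an assumption inferred from matching the printed values, and the variable names are illustrative.

```r
# Trajectory 1, copied from the head(example.data$data) output
y <- c(5.658914, 9.339839, 3.770285, 17.360689, 8.824336, 9.281445)

# Assumed definitions, chosen because they reproduce the printed values
m1 <- max(y) - min(y)        # range
m2 <- mean(y)                # mean over time
m3 <- sd(y)                  # standard deviation
m4 <- 100 * m3 / m2          # coefficient of variation, in percent
m5 <- y[length(y)] - y[1]    # change: last value minus first value
m7 <- m5 / y[1]              # change relative to the first value
m8 <- m5 / m2                # change relative to the mean over time

round(c(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5, m7 = m7, m8 = m8), 4)
```

Up to rounding of the printed inputs, these values agree with the first row of s1$measurments.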
In the second step of the analysis, a factor analysis is performed to select a subset of measures that describes the main features of the trajectories. The function step2factors is used to perform the factor analysis.
s2 = step2factors(s1)
#> [1] "m6 is removed because it is perfectly correlated with m5"
#> [2] "m13 is removed because it is perfectly correlated with m12"
#> [1] "Computing reduced correlation e-values..."
head(s2$factors)
#> ID m4 m5 m21 m24
#> 1 1 51.56534 3.622531 22.126757 2.337517
#> 2 2 42.28476 -13.927883 9.213959 1.296087
#> 3 3 35.30148 1.936486 7.826269 1.443856
#> 4 4 26.89583 0.389287 4.374471 1.293525
#> 5 5 17.08396 -1.286263 6.056643 1.291366
#> 6 6 63.91042 0.645668 24.626151 1.741416
In this example, the step2factors function has identified measures 4, 5, 21 and 24 as the main factors of this set of trajectories. Measures 6, 13 and 18 were not considered because they were too highly correlated with other measures (measures with a correlation higher than 0.95 are omitted from the factor analysis).
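The correlation screen described above can be sketched in base R. This illustrates only the filtering idea, not the factor analysis itself and not the package's code: drop_correlated is a hypothetical helper, the 0.95 cutoff is taken from the text, and dropping the later column of a correlated pair (and comparing absolute correlations) is an assumption suggested by the printed messages.

```r
# Hypothetical helper: drop every column whose absolute correlation with an
# earlier, still-kept column exceeds the threshold
drop_correlated <- function(measures, threshold = 0.95) {
  cors <- abs(cor(measures))
  keep <- rep(TRUE, ncol(measures))
  for (j in seq_len(ncol(measures))[-1]) {
    earlier <- which(keep[seq_len(j - 1)])
    if (any(cors[earlier, j] > threshold)) keep[j] <- FALSE
  }
  measures[, keep, drop = FALSE]
}

# Toy measures: b is a linear function of a, c is unrelated noise
set.seed(1)
x <- rnorm(50)
toy <- data.frame(a = x, b = 2 * x + 1, c = rnorm(50))

kept <- drop_correlated(toy)
names(kept)
```

In the toy data, column b is perfectly correlated with column a and is therefore the one removed.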
Once this step is done, the third step of the procedure consists of clustering the trajectories based on the measures identified in the factor analysis. This step is implemented in the step3clusters function. Two options are available to select the number of clusters. First, the user can decide on the number of clusters a priori; in this example, the number of clusters is set to 4.
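The call that creates the s3 object used later in this vignette would look something like the commented line in the sketch below; the argument name nclusters is an assumption, so check ?step3clusters in your installed version. The runnable part is a base R stand-in that clusters a toy table of selected measures for a fixed number of clusters. The text does not say which algorithm step3clusters uses, so k-means on standardized measures is an assumption made for illustration.

```r
# Assumed traj call (not run here; the argument name is an assumption):
# s3 <- step3clusters(s2, nclusters = 4)

set.seed(42)
# Toy stand-in for the selected measures: 4 groups of 25 trajectories
centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
toy <- do.call(rbind, lapply(1:4, function(g) {
  cbind(rnorm(25, centers[g, 1], 0.5), rnorm(25, centers[g, 2], 0.5))
}))
colnames(toy) <- c("measure_a", "measure_b")

# Standardize so that no measure dominates the distance, then fix k = 4
fit <- kmeans(scale(toy), centers = 4, nstart = 25)

# One row per trajectory, with its cluster label
clusters <- data.frame(ID = seq_len(nrow(toy)), cluster = fit$cluster)
head(clusters)
table(clusters$cluster)   # number of trajectories per cluster
```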
Alternatively, the number of clusters can be left blank, in which case
the step3clusters function relies on the NbClust function from the NbClust
package to determine the optimal number of clusters based on one of the
criteria available in NbClust. Please see the NbClust documentation for more
details.
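The idea of choosing the number of clusters from a criterion can be sketched without NbClust. The example below is a hand-rolled stand-in, not what step3clusters does internally: it scores k-means solutions for several candidate values of k with the Calinski-Harabasz index (one of the indices NbClust offers) and keeps the value of k with the highest score. The toy data and the ch_index helper are illustrative assumptions.

```r
set.seed(1)
# Toy data: three well-separated groups of 30 points in two dimensions
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(30, centers[g, 1], 0.5), rnorm(30, centers[g, 2], 0.5))
}))

# Calinski-Harabasz index: between-cluster over within-cluster dispersion,
# each divided by its degrees of freedom
ch_index <- function(x, k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  n <- nrow(x)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

candidates <- 2:6
scores <- sapply(candidates, function(k) ch_index(toy, k))
best_k <- candidates[which.max(scores)]
best_k
```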
The function step3clusters assigns each trajectory to one and only one cluster and returns a dataframe that identifies cluster membership.
head(s3$clusters)
#> ID cluster
#> 1 1 3
#> 2 2 3
#> 3 3 3
#> 4 4 3
#> 5 5 3
#> 6 6 3
s3$clust.distr
#>
#> 1 2 3 4
#> 24 30 67 9
The traj object returned by the function step3clusters can be plotted by an array of
plotting functions, as described in the next section.
Plotting the traj object
The traj object created by step3clusters can be plotted by an array of plotting functions.
The generic plot function, called on the traj object, selects 10 random trajectories from each cluster and plots them using randomly selected colours. The user can specify the number of trajectories to plot, the colours or any other generic plotting parameter. The user can also request that trajectories from only one cluster be plotted.
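The sampling behaviour described above can be imitated in base R. This is a sketch under stated assumptions rather than the package's plotting code: the trajectory matrix, the cluster labels and the n_samples variable are all made up for illustration.

```r
set.seed(7)
# Toy data: 40 trajectories over 6 time points, in 2 hypothetical clusters
n <- 40
traj_mat <- matrix(rnorm(n * 6, mean = 10, sd = 3), nrow = n)
cluster <- rep(1:2, each = n / 2)

n_samples <- 10                    # trajectories drawn per cluster
op <- par(mfrow = c(1, 2))         # one panel per cluster
for (k in sort(unique(cluster))) {
  members <- which(cluster == k)
  # Draw at most n_samples row indices from this cluster
  picked <- members[sample.int(length(members), min(n_samples, length(members)))]
  # matplot draws one line per column, hence the transpose
  matplot(t(traj_mat[picked, , drop = FALSE]), type = "l", lty = 1,
          col = sample(colors(), length(picked)),
          xlab = "Time point", ylab = "Value", main = paste("Cluster", k))
}
par(op)
```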
The plotMeanTraj function plots the mean trajectory of every cluster. The user can
request that trajectories from only one cluster be plotted.
The plotMedTraj function plots the median trajectory of every cluster with 10th and 90th
percentiles. The user can request that trajectories from only one
cluster be plotted.
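The summaries behind the mean and median plots are simple to compute by hand. The sketch below, which uses toy data and a hypothetical summarise_cluster helper rather than anything from traj, computes the mean, the median and the 10th and 90th percentiles at each time point for one cluster and draws the median with its percentile band.

```r
set.seed(3)
# Toy data: 30 trajectories over 6 time points, in 3 hypothetical clusters
traj_mat <- matrix(rnorm(30 * 6, mean = 10, sd = 2), nrow = 30)
cluster <- rep(1:3, each = 10)

# Summaries per time point (column) for the trajectories of one cluster
summarise_cluster <- function(k) {
  rows <- traj_mat[cluster == k, , drop = FALSE]
  rbind(
    mean   = colMeans(rows),
    median = apply(rows, 2, median),
    p10    = apply(rows, 2, quantile, probs = 0.10),
    p90    = apply(rows, 2, quantile, probs = 0.90)
  )
}

s <- summarise_cluster(1)
# Median as a solid line, 10th and 90th percentiles as dashed lines
matplot(t(s[c("median", "p10", "p90"), ]), type = "l",
        lty = c(1, 2, 2), col = "black",
        xlab = "Time point", ylab = "Value", main = "Cluster 1")
```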
The plotBoxplotTraj function plots a box plot of the values at every time point in each
cluster. The user can request that trajectories from only one cluster be
plotted.
The plotCombTraj function plots the mean or median of all the clusters on a single
graph. Different colours can be selected, as well as different line
styles.
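A combined graph of this kind can be approximated in base R with matplot. This is only a sketch on toy data with made-up cluster labels, not the plotCombTraj implementation: it overlays the mean trajectory of every cluster, giving each its own colour and line type.

```r
set.seed(11)
# Toy data: 3 hypothetical clusters of 10 trajectories with different levels
cluster <- rep(1:3, each = 10)
traj_mat <- matrix(rnorm(30 * 6, mean = 5 * cluster, sd = 1), nrow = 30)

# One row per cluster, one column per time point
cluster_means <- t(sapply(split(as.data.frame(traj_mat), cluster), colMeans))

# All cluster means on a single graph, one colour and line type per cluster
matplot(t(cluster_means), type = "l", lwd = 2,
        col = c("black", "red", "blue"), lty = 1:3,
        xlab = "Time point", ylab = "Mean value")
legend("topleft", legend = paste("Cluster", rownames(cluster_means)),
       col = c("black", "red", "blue"), lty = 1:3, lwd = 2)
```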
Sylvestre, M.P. et al. (2006). Classification of patterns of delirium severity scores over time in an elderly population. International Psychogeriatrics, 18(4), 667-680. doi:10.1017/S1041610206003334.
Leffondré, K. et al. (2004). Statistical measures were proposed for identifying longitudinal patterns of change in quantitative health indicators. Journal of Clinical Epidemiology, 57, 1049-1062. doi:10.1016/j.jclinepi.2004.02.012.