select() does not keep variable.labels attribute created by foreign functions #5831

@stragu

I am not sure this is a dplyr problem per se, but thought I'd report it to see what you think and if I should take it somewhere else instead.

A dataframe that has column labels stored as a dataframe attribute variable.labels will lose them when using dplyr::select(). An example of such a dataframe is a labelled SPSS file imported with foreign::read.spss(). Other verbs in dplyr (as far as I have tested) do not lose the labels, which is why it is surprising and looks like a bug.

A workaround is to stick to Tidyverse packages and import the data with haven::read_sav(), which stores the label attribute per column.
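If the data has to stay with foreign::read.spss(), another possible workaround is to copy each entry of variable.labels onto its column as a "label" attribute, mimicking the layout haven uses. This is only a sketch: labels_to_columns() is a hypothetical helper written for illustration, not a function from dplyr, foreign or haven.

```r
library(foreign)
library(dplyr, warn.conflicts = FALSE)

sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)

# Hypothetical helper (assumption, not part of any package):
# copy the data-frame-level "variable.labels" entries onto each
# matching column as a per-column "label" attribute.
labels_to_columns <- function(df) {
  labs <- attr(df, "variable.labels", exact = TRUE)
  for (nm in intersect(names(df), names(labs))) {
    attr(df[[nm]], "label") <- labs[[nm]]
  }
  df
}

out <- dat_for %>%
  labels_to_columns() %>%
  select(1:2)

# the label now travels with the column itself
attr(out$CASEID, "label")
```

Because the label is stored on the column rather than on the dataframe, it does not depend on select() keeping dataframe-level attributes.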

But would it be possible for dplyr to cater for the first case, as the attribute comes from a base package (even though base square-bracket subsetting drops it too)? Or is it out of scope?
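To make the request concrete, here is a rough user-side sketch of the behaviour I have in mind: a wrapper that re-attaches the matching subset of variable.labels after select(). The name select_keep_labels() is made up for illustration, and the small dataframe is a hand-built stand-in for an imported SPSS file. It is not how dplyr works internally, and it ignores columns renamed inside select(), whose names no longer match the labels.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper (assumption, not a dplyr function):
# run select(), then restore the labels of the columns that remain.
select_keep_labels <- function(.data, ...) {
  labs <- attr(.data, "variable.labels", exact = TRUE)
  out <- dplyr::select(.data, ...)
  if (!is.null(labs)) {
    kept <- intersect(names(out), names(labs))
    attr(out, "variable.labels") <- labs[kept]
  }
  out
}

# Toy dataframe with a dataframe-level variable.labels attribute
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
attr(toy, "variable.labels") <- c(a = "Alpha", b = "Beta", c = "Gamma")

res <- select_keep_labels(toy, a, c)
attr(res, "variable.labels")
```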

Here is a reproducible example:

library(foreign)
# import example SAV from foreign
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)

# labels are saved as a variable.labels attribute for the whole dataframe
str(dat_for)
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# dplyr::select() loses the variable.labels attribute:
library(dplyr, warn.conflicts = FALSE)
dat_for %>% select(1:2) %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# other verbs don't lose them, for example:
dat_for %>% mutate(test = "test") %>% str()
#> 'data.frame':    240 obs. of  14 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ test    : chr  "test" "test" "test" "test" ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
dat_for %>% filter(AGE > 50) %>% str()
#> 'data.frame':    77 obs. of  13 variables:
#>  $ CASEID  : num  151 161 178 314 344 349 429 460 467 482 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 2 5 3 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  53 54 52 53 52 52 54 53 54 54 ...
#>  $ DBP58   : num  102 93 83 91 107 83 90 99 98 87 ...
#>  $ EDUYR   : num  12 9 9 7 NA 14 8 NA 8 11 ...
#>  $ CHOL58  : num  216 265 269 292 351 292 235 273 215 283 ...
#>  $ CGT58   : num  0 0 15 0 0 10 0 30 20 22 ...
#>  $ HT58    : num  67 65.5 68.8 60.9 68 64.4 74.9 67.3 67.4 68.9 ...
#>  $ WT58    : num  172 165 140 149 160 154 254 168 171 181 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: 3 7 4 NA 6 5 6 4 4 2 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 2 2 2 1 2 1 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 1 2 2 1 2 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
dat_for %>% rename(ages = AGE) %>% str()
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ ages    : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# but to be fair, base slicing also loses the attribute:
dat_for[,1:2] %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# import same file with haven
library(haven)
dat_hav <- read_sav(file = sav)

# labels are attributes of columns
str(dat_hav$CASEID)
#>  num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>  - attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>  - attr(*, "format.spss")= chr "F4.0"
#>  - attr(*, "display_width")= int 0

# dplyr::select() keeps them
dat_hav %>% select(1:2) %>% str()
#> tibble[,2] [240 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ CASEID  : num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>   ..- attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>   ..- attr(*, "format.spss")= chr "F4.0"
#>   ..- attr(*, "display_width")= int 0
#>  $ FIRSTCHD: dbl+lbl [1:240] 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 6, 2, 3, 5, 3, 3, 3, 3, ...
#>    ..@ label        : chr "FIRST CHD EVENT"
#>    ..@ format.spss  : chr "F1.0"
#>    ..@ display_width: int 0
#>    ..@ labels       : Named num [1:5] 1 2 3 5 6
#>    .. ..- attr(*, "names")= chr [1:5] "NO CHD" "SUDDEN  DEATH" "NONFATALMI" "FATAL   MI" ...
#>  - attr(*, "label")= chr "                       SPSS/PC+"

Created on 2021-03-30 by the reprex package (v1.0.0)

using:

  • R 4.0.4
  • dplyr 1.0.5
  • foreign 0.8-81
  • haven 2.3.1

Labels: bug (an unexpected problem or unintended behavior)