Skip to content

merging

Ben Tupper edited this page Jul 7, 2025 · 1 revision

Merging databases

For a given region, such as “chfc”, it is conceivable to read the multi-year data (MY) with the analysis-forecast (ANFC). As long as the datasets are compatible, for the most part they play well together, but there are some pitfalls to consider which we’ll try to point out. Here we simply read the databases and merge them using the multiple = TRUE argument. Easy!

suppressPackageStartupMessages({
  library(andreas)
  library(dplyr)
  library(stars)
})
path = copernicus_path("chfc")
DB = read_database(path, multiple = TRUE)
dplyr::count(DB, product)
## # A tibble: 2 × 2
##   product                                 n
##   <chr>                               <int>
## 1 GLOBAL_ANALYSISFORECAST_PHY_001_024  4858
## 2 GLOBAL_MULTIYEAR_PHY_001_030        82845

We can go a bit further and tally the number of variables per product family.

tally = dplyr::count(DB, product, id, variable)
print(tally, n = nrow(tally))
## # A tibble: 24 × 4
##    product                             id                         variable     n
##    <chr>                               <chr>                      <chr>    <int>
##  1 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy-cur_anf… uo          49
##  2 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy-cur_anf… vo          49
##  3 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy-so_anfc… so          49
##  4 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy-thetao_… thetao      49
##  5 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy-wcur_an… wo        1141
##  6 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy_anfc_0.… mlotst      49
##  7 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy_anfc_0.… pbo       1141
##  8 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy_anfc_0.… sob       1141
##  9 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy_anfc_0.… tob       1141
## 10 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy_anfc_0.… zos         49
## 11 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… bottomT  10408
## 12 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… mlotst   10408
## 13 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… so       10408
## 14 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… thetao   10408
## 15 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… uo       10408
## 16 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… vo       10408
## 17 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.08… zos      10408
## 18 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… bottomT   1427
## 19 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… mlotst    1427
## 20 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… so        1427
## 21 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… thetao    1427
## 22 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… uo        1427
## 23 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… vo        1427
## 24 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0… zos       1427

Reading rasters

We can filter the database in such a way that we are drawing from two different product suites within the same regional database.

First filter…

db = DB |>
  filter(variable == "thetao") |>
  group_by(product, id) |>
  slice_head(n = 1) |>
  arrange(date)
print(select(db, product, id, date))
## # A tibble: 3 × 3
## # Groups:   product, id [3]
##   product                             id                              date      
##   <chr>                               <chr>                           <date>    
## 1 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_my_0.083deg_… 1993-01-01
## 2 GLOBAL_MULTIYEAR_PHY_001_030        cmems_mod_glo_phy_myint_0.083d… 2021-07-01
## 3 GLOBAL_ANALYSISFORECAST_PHY_001_024 cmems_mod_glo_phy-thetao_anfc_… 2025-05-28

Now read…

s = read_andreas(db, path)
s
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##              Min.  1st Qu.   Median     Mean  3rd Qu.     Max.   NA's
## thetao  -1.722617 5.689129 12.22019 12.83201 20.24998 28.99976 103364
## dimension(s):
##      from  to offset    delta  refsys
## x       1 415 -77.04  0.08333  WGS 84
## y       1 261  56.71 -0.08333  WGS 84
## time    1   3     NA       NA POSIXct
##                                              values x/y
## x                                              NULL [x]
## y                                              NULL [y]
## time 1993-01-01 UTC, 2021-07-01 UTC, 2025-05-28 UTC

Plot…

plot(s)

## Issues to consider

Keeps these issues in mind when trying to work with multiple databases merged into one.

The database must have the product column

If the product is accidentally dropped then we lack the context to read files from the databases. Be sure you check that product is retained.

Slightly mismatched names

And astute observer might notice that two variables for sea water temperature at the floor are tob and bottomT. There are ways to deal with that in your code, but the variable column in the database must retain these values.

Incompatible datasets

Currently, every regional database has one or more product databases - and they are compatible. Incompatible databases include those where the database columns don’t match, or the rasters can’t be easily combined into one compound array (stars).

Don’t try to merge different regions

While the tabular databases can be merged form different regions, their rasters cannot - they have different bounding boxes for starters. Just don’t do it; instead script as if they are separate data streams.

Don’t try to save a merged database

You can, but the simplicity of mergign them (almost not time cost) doesn’t warrant saving a special merged database. Why pay all the overhead of duplicative databases for almost no gain?

Clone this wiki locally