What I want to do:
I'm trying to download daily weather data for two stations, both listed under the station name Port Hardy A. Their observation periods don't overlap: station 202 runs from 1944 until 2013, while station 51319 picks up in 2013 and continues to today. I would like a single time series that switches from one station to the other where the first leaves off.
Issue?
The download creates a single data frame but duplicates the time series: one full copy of the requested date range for each station ID. I get the real data from each station (which is what I'm asking for), but I also get rows of missing data for each station outside its own observation period. In effect, every requested date appears twice, once per station, with NAs filling the dates a station didn't cover.
I'm not sure whether this behaviour for merging data across stations is intended or not. I could attempt to remove the duplicated dates manually, but I might have to do some quality control on that. Suggestions?
Example:
Here are the stations for Port Hardy. Notice that Port Hardy A has two station IDs with two ranges that don't overlap.
stations_search("Port Hardy", interval = "day")
Then I download those two stations:
portHardy_pg <- weather_dl(station_ids = c(202, 51319), start = "1975-01-01", end = "2018-12-31", interval = "day", trim = TRUE, format = TRUE)
And we can start to see the problem when we look at the temperature for station 202 at the start and end of the range:
head(portHardy_pg[portHardy_pg$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 202 1975-01-01 3.9 "" 2
2 PORT HARDY A 202 1975-01-02 6.1 "" 3.1
3 PORT HARDY A 202 1975-01-03 3.9 "" 2
4 PORT HARDY A 202 1975-01-04 3.9 "" 2.3
5 PORT HARDY A 202 1975-01-05 5 "" 3.6
6 PORT HARDY A 202 1975-01-06 2.8 "" 0.9
Here we see the duplicated NAs for station 202 at the end of the range:
tail(portHardy_pg[portHardy_pg$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 202 2018-12-26 NA "" NA
2 PORT HARDY A 202 2018-12-27 NA "" NA
3 PORT HARDY A 202 2018-12-28 NA "" NA
4 PORT HARDY A 202 2018-12-29 NA "" NA
5 PORT HARDY A 202 2018-12-30 NA "" NA
6 PORT HARDY A 202 2018-12-31 NA "" NA
I get a similar issue for station 51319, mirrored: NAs at the start of the range and real data at the end:
head(portHardy_pg[portHardy_pg$station_id==51319,c(1,2,11,22:24)]) # here we see the duplicated NAs for station 51319 at the beginning of the range
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 51319 1975-01-01 NA "" NA
2 PORT HARDY A 51319 1975-01-02 NA "" NA
3 PORT HARDY A 51319 1975-01-03 NA "" NA
4 PORT HARDY A 51319 1975-01-04 NA "" NA
5 PORT HARDY A 51319 1975-01-05 NA "" NA
6 PORT HARDY A 51319 1975-01-06 NA "" NA
tail(portHardy_pg[portHardy_pg$station_id==51319,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 51319 2018-12-26 5.5 "" 2.8
2 PORT HARDY A 51319 2018-12-27 5.1 "" 2.3
3 PORT HARDY A 51319 2018-12-28 4.4 "" 4
4 PORT HARDY A 51319 2018-12-29 10.8 "" 7.4
5 PORT HARDY A 51319 2018-12-30 7 "" 3.2
6 PORT HARDY A 51319 2018-12-31 5.2 "" 2.1
Interestingly, if I download only one station but specify a "bad range" (one that extends beyond the station's observation period), the download trims itself to the observation period.
For example:
new_dl <- weather_dl(station_ids = 202, start = "1975-01-01", end = "2018-12-31",interval = "day",trim=TRUE,format=TRUE)
tail(new_dl[new_dl$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 202 2013-06-07 16.4 "" 13.1
2 PORT HARDY A 202 2013-06-08 13.1 "" 11.4
3 PORT HARDY A 202 2013-06-09 13.8 "" 10.1
4 PORT HARDY A 202 2013-06-10 15.1 "" 10.5
5 PORT HARDY A 202 2013-06-11 14.8 "" 12.3
6 PORT HARDY A 202 2013-06-12 15.5 "" 12.5
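For reference, the manual cleanup I have in mind would look roughly like the sketch below. It uses a small toy data frame rather than the real download: the station 202 values are copied from the output above, while the station 51319 values and dates are made up for illustration. The choice of `value_cols` is also my own assumption. Note that dropping every all-NA row would also remove genuinely missing days inside a station's observation period, which is part of why I'd want some quality control.

```r
# Toy stand-in for the combined download: two stations, the same requested
# dates, each padded with NA outside its own observation period.
# Station 51319 values below are invented for illustration only.
dates <- as.Date("2013-06-10") + 0:5
combined <- data.frame(
  station_id = rep(c(202, 51319), each = length(dates)),
  date       = rep(dates, times = 2),
  max_temp   = c(15.1, 14.8, 15.5, NA, NA, NA,
                 NA, NA, NA, 16.0, 17.2, 15.9),
  mean_temp  = c(10.5, 12.3, 12.5, NA, NA, NA,
                 NA, NA, NA, 12.0, 13.1, 11.8)
)

# Measurement columns to inspect (an assumption; extend for the real data).
value_cols <- c("max_temp", "mean_temp")

# Keep only rows with at least one non-missing measurement.
has_data <- rowSums(!is.na(combined[, value_cols, drop = FALSE])) > 0
single_series <- combined[has_data, ]
single_series <- single_series[order(single_series$date), ]

# Quality control: after filtering, each date should appear at most once.
stopifnot(anyDuplicated(single_series$date) == 0)
```

If the two stations ever reported on the same date, the final check would fail, and I would need to decide which station to prefer for the overlapping days.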
My Environment
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] weathercan_0.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 rstudioapi_0.10 magrittr_1.5 tidyselect_0.2.5 R6_2.4.1 rlang_0.4.1
[7] fansi_0.4.0 stringr_1.4.0 httr_1.4.1 dplyr_0.8.3 tools_3.6.1 packrat_0.5.0
[13] utf8_1.1.4 cli_1.1.0 ellipsis_0.3.0 assertthat_0.2.1 lifecycle_0.1.0 tibble_2.1.3
[19] crayon_1.3.4 tidyr_1.0.0 purrr_0.3.3 vctrs_0.2.0 curl_4.2 zeallot_0.1.0
[25] glue_1.3.1 stringi_1.4.3 compiler_3.6.1 pillar_1.4.2 backports_1.1.5 lubridate_1.7.4
[31] pkgconfig_2.0.3