Stochastic weather simulation models are commonly employed in water
resources management, agricultural applications, forest management,
transportation management, and recreational activities. Stochastic
simulation of multisite precipitation occurrence is a challenge because of
its intermittent characteristics as well as spatial and temporal
cross-correlation. This study proposes a novel simulation method for
multisite precipitation occurrence employing a nonparametric technique, the
discrete version of the

Stochastic simulation of weather variables has been employed for water resources management, hydrological design, agricultural irrigation, forest management, transportation planning and evacuation, recreation activities, filling in missing historical data, extending observed records, and simulating different weather conditions. Stochastic simulation models play a key role in producing weather sequences while preserving the statistical characteristics of observed data. A number of stochastic weather simulation models have been developed using parametric and nonparametric approaches (Lee, 2017; Lee et al., 2012; Wilby et al., 2003; Wilks, 1999; Wilks and Wilby, 1999).

Parametric approaches simulate the statistical characteristics of observed weather data with a set of parameters that are determined by fitting (Jeong et al., 2012; Lee, 2016; Zheng and Katz, 2008), whereas nonparametric approaches search for historical analogs of the current conditions and take the values that follow them as the simulated weather data (Buishand and Brandsma, 2001; Lee et al., 2012). Combinations of parametric and nonparametric approaches have also been proposed (Apipattanavis et al., 2007; Frost et al., 2011).

Among weather variables, precipitation possesses intermittency and zero
values between precipitation events, which make it difficult to properly
reproduce the events (Beersma and Buishand, 2003; Hughes et al., 1999;
Katz and Zheng, 1999). To overcome the problem of intermittency and zero
values, precipitation is simulated separately from other variables. The main
method for reproducing intermittency has been the multiplication of
precipitation occurrence and an amount as

Wilks (1998) presented a multisite simulation model for the occurrence
process (i.e.,

Lee et al. (2010) proposed a nonparametric-based stochastic simulation model
for hydrometeorological variables. Their model overcame the shortcomings of a
previous nonparametric simulation model (Lall and Sharma, 1996), called

KNNR is employed to find historical analogues of multisite occurrence that are similar to the current state of the simulated series, while GA is applied to generate a new descendant from the historical parent chosen with KNNR. In this procedure, the multisite occurrence of precipitation can be simulated while preserving spatial and temporal correlations. Metaheuristic techniques, such as GA, have been widely employed in a number of hydrometeorological applications (Chau, 2017; Fotovatikhah et al., 2018; Taormina et al., 2015; Wang et al., 2013). Although a number of variants of KNNR-GA have been applied (Lee et al., 2012; Lee and Park, 2017), none of them can simulate the multisite occurrence of precipitation, whose characteristics are binary and both temporally and spatially related.

Therefore, this study proposes a stochastic simulation method for multisite occurrence of precipitation with the KNNR-GA-based nonparametric approach that (1) simulates multisite occurrence with a simple and direct procedure without parameterization of all the required occurrence probabilities, and (2) reproduces the complex temporal and spatial correlation between stations, as well as the basic occurrence probabilities. The proposed nonparametric model is compared with the popular model proposed by Wilks (1998). Even though the multisite occurrence data generated from the Wilks model preserve various statistical characteristics of the observed data well, significant underestimation of lagged cross-correlation still exists. Furthermore, estimating the relation between the standard normal variable and the occurrence variable relies on a long stochastic simulation.

The paper is organized as follows. The next section presents the mathematical background of existing multisite occurrence modeling, and the section after it discusses the modeling procedure. The study area and data are reported in Sect. 4. The model application is presented in Sect. 5. Results of the proposed model are discussed in Sect. 6, and the summary and conclusions are presented in Sect. 7.

Let

Wilks (1998) suggested a multisite occurrence model using a standard
normal random number (here denoted as MONR) that is spatially dependent but
serially independent. The correlation of the standard normal variate for a
site pair of

Since direct estimation of

In the current study, a novel multisite simulation model for discrete
occurrence of precipitation variable with the KNNR technique (Lall and Sharma, 1996; Lee and Ouarda, 2011; Lee et
al., 2017) for a discrete case (denoted as discrete KNNR; DKNNR) is proposed
by combining a mixture mechanism with GA. Provided the
number of nearest neighbors,

Estimate the distance between the current (i.e., time index:

Arrange the estimated distances from step (1) in ascending order, select the
first

Randomly select one of the stored

Assume the selected time index from step (3) as

Assign the binary vector of the succeeding index of the selected time
as

Execute the following steps for GA mixing if GA mixing is subjectively
selected. Otherwise, skip this step.

Reproduction: select one additional time index using steps (1) through
(4) and denote this index as

Crossover: replace each element

Mutation: replace each element (i.e., each station,

Repeat steps (1)–(6) until the required data are generated.
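
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MATLAB code of the Supplement. The function names (`select_neighbor`, `dknnr_step`, `simulate`) are hypothetical, the distance is taken as the sum of absolute differences between occurrence vectors (as the appendix example suggests), the selection weights proportional to 1/j are the usual KNNR choice of Lall and Sharma (1996) and are assumed here, and the mutation rule that draws a replacement from the historical record of the same station is likewise an assumption. The default probabilities are illustrative values only.

```python
import random

def select_neighbor(current, obs, k, rng):
    """Steps (1)-(4): return the time index of one of the k nearest historical days."""
    # Only days that have a successor can be chosen, hence len(obs) - 1.
    dists = [(sum(abs(c - o) for c, o in zip(current, obs[i])), i)
             for i in range(len(obs) - 1)]
    dists.sort()                                   # ascending distance, ties by time index
    nearest = [i for _, i in dists[:k]]
    weights = [1.0 / (j + 1) for j in range(len(nearest))]   # assumed 1/j weights
    return rng.choices(nearest, weights=weights, k=1)[0]

def dknnr_step(current, obs, k, p_c, p_m, rng):
    """Simulate the next multisite occurrence vector from the current one."""
    p = select_neighbor(current, obs, k, rng)
    child = list(obs[p + 1])                       # step (5): successor of the selected day
    q = select_neighbor(current, obs, k, rng)      # step (6), reproduction: second parent
    for s in range(len(child)):
        if rng.random() < p_c:                     # crossover: take the second parent's value
            child[s] = obs[q + 1][s]
        if rng.random() < p_m:                     # mutation: draw from the station's record
            child[s] = obs[rng.randrange(len(obs))][s]   # (assumed rule)
    return child

def simulate(obs, n_days, k, p_c=0.02, p_m=0.001, seed=0):
    """Step (7): repeat until n_days occurrence vectors are generated."""
    rng = random.Random(seed)
    series = [list(obs[rng.randrange(len(obs))])]  # arbitrary initial state
    while len(series) < n_days:
        series.append(dknnr_step(series[-1], obs, k, p_c, p_m, rng))
    return series

# Toy record: 6 days, 3 stations (hypothetical data)
obs = [[0, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 0], [0, 0, 0], [0, 1, 0]]
sim = simulate(obs, n_days=10, k=2)
```

With both probabilities set to zero the sketch reduces to plain resampling, so every simulated vector is one of the observed patterns; nonzero crossover or mutation probabilities are what allow unobserved patterns to appear.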

The selection of the number of nearest neighbors (

In Appendix A, an example of the DKNNR simulation procedure is explained in detail.

The capability of a model to take climate change into account is critical. For
example, the marginal distributions and transition probabilities in Eqs. (5) and (3) can change in future
climate scenarios. It is known that nonparametric simulation models have
difficulty adapting to climate change, since the models employ in general
the current observation sequences. However, the proposed model in the
current study possesses the capability to adapt to the variations of
probabilities by tuning the crossover and mutation probabilities in

For example, the probability of

In addition, further adjustment can be made with the mutation process in Eq. (14) as

For testing the occurrence model, 12 weather stations were selected from the Yeongnam province, which is located in the southeastern part of South Korea, as shown in Fig. 1. Table 1 lists the order index and identification number (first and second columns), the area name (third column), and the longitude and latitude (fourth and fifth columns) of these stations, which are operated by the Korea Meteorological Administration. The employed precipitation dataset presents strong seasonality, since this area is dry from late fall to early spring and humid and rainy during the remaining seasons, especially in summer. The employed stations are not far from each other, at most 100 km apart, and not many high mountains are located in the current study area. Therefore, this region can be considered a homogeneous region (Lee et al., 2007).

Locations of 12 selected weather stations in the Yeongnam province. See Table 1 for further information about the stations.

Figure 1 illustrates the locations of the selected weather stations. All the stations are inside Yeongnam province, which consists of two regions (North and South Gyeongsang) as well as the self-governing cities of Busan, Daegu, and Ulsan. Most of the Yeongnam region drains to the Nakdong River. To validate the proposed model appropriately, the test sites must be highly correlated with each other and have a significant temporal relation. The stations inside the Yeongnam area cover one of the most important watersheds, the Nakdong River basin, and hydrological assessments of this basin for agriculture and climate change are particularly valuable for water resources management, including the control of floods and droughts.

Information on 12 selected stations from the Yeongnam province, South Korea.

It is important to analyze the impact of weather conditions for planning agricultural operations and water resources management, especially during the summer season, because around 50 %–60 % of the annual precipitation occurs during the summer season from June to September. The daily precipitation record spans 1976 to 2015, and the summer season record was employed, since a large number of rainy days occur during summer and it is important to preserve these characteristics. Also, the whole-year dataset was tested and other seasons were further applied, but the correlation coefficients were relatively high and the estimated correlation matrix was not positive semi-definite for the MONR model.

To analyze the performance of the proposed DKNNR model, the occurrence of
precipitation was simulated. The DKNNR simulation was compared with that of
the MONR model. For each model, 100 series of daily occurrence with the same
record length were simulated. The key statistics of observed data and each
generated series, such as transition probabilities (

The 100 simulated statistic values were illustrated with box plots to show
their variability as shown in Figs. 5–7. The box of the box plot represents the
interquartile range (IQR), ranging from the 25th percentile to the 75th percentile. The
whiskers extend up and down to 1.5 times the IQR. The data beyond the whiskers
(

The roles of crossover probability

Testing for different probabilities of crossover

Testing for different probabilities of mutation

We further tested and discuss why the GA mixing is necessary in the proposed
DKNNR model as follows. For example, assume that three weather stations are
considered and observed data only have the occurrence cases of 000,
001, 011, 010, 011, 100, and 111, among

This can be problematic, since one of the major purposes of simulation is to generate sequences that might possibly happen in the future. Whether a station is wet (1) or dry (0) in multisite precipitation occurrence is decided by the spatial distribution of a precipitation weather system. A humid air mass can be distributed randomly, depending on wind velocity and direction as well as the surrounding air pressure. In general, any combination of wet and dry stations is possible, especially when the simulation continues indefinitely. Therefore, the simulated data must be allowed to contain any possible combination (here 4096), even one that has not been observed in the historical records. However, the probability of generating such a new pattern must not be high, since it has not been observed in the historical records, and this can be taken into account by assigning low probabilities to crossover and mutation.

This drawback of the KNNR model frequently happens in multisite occurrence
as the number of stations increases. Note that the number of patterns
increases as 2

Frequency of the observed
patterns among all the possible cases (2

Note that the data employed in the case study cover 40 years with 122 days (the summer months) in each year. The total number of observed days is therefore 4880, and the number of possible patterns is 4096. We checked how many of the possible patterns were not found in the observed data. The result shows that 3379 of the 4096 possible patterns were never observed, as shown in Fig. 4.
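
The counts quoted above can be checked with a few lines of Python. The sketch below is only illustrative: `pattern_coverage` is a hypothetical helper, and the toy record is made-up data, since the station records themselves are not reproduced here. The arithmetic at the end uses only the figures given in the text.

```python
def pattern_coverage(occurrence):
    """Return (days, possible patterns, observed patterns, unobserved patterns)."""
    n_sites = len(occurrence[0])
    observed = {tuple(day) for day in occurrence}   # distinct wet/dry patterns
    n_possible = 2 ** n_sites                       # every station is either 0 or 1
    return len(occurrence), n_possible, len(observed), n_possible - len(observed)

# Toy record with 3 stations and 7 days (hypothetical data)
toy = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
print(pattern_coverage(toy))        # (7, 8, 6, 2)

# Figures quoted in the text for the 12-station summer record
n_days = 40 * 122                   # 4880 observed days
n_possible = 2 ** 12                # 4096 possible patterns
n_observed = n_possible - 3379      # 717 distinct patterns actually observed
print(n_days, n_possible, n_observed)
```

Even with more observed days (4880) than possible patterns (4096), only 717 distinct patterns occur in the record, which is why resampling alone cannot produce the remaining combinations.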

Box plots of the

Box plots of the

Box plots of the

We further investigated the number of new patterns that were generated with
the probabilities

The data simulated from the proposed DKNNR model and the existing MONR model
were analyzed. The estimated transition probabilities (

Occurrence and transition probabilities of observed data and data simulated by DKNNR and MONR for 12 stations from the Yeongnam province, South Korea, during the summer season. Note that 100 sets with the same record length as the observed data were simulated and the statistics of 100 sets were averaged.

As shown in Fig. 5, the probability

In the DKNNR modeling procedure, the simple distance measurement in Eq. (11) allows the transition probabilities to be preserved, in that
the following multisite occurrence is resampled from the historical data
whose previous states of multisite occurrence (

As shown in Fig. 6, the

The behavior of

Cross-correlation is a measure of the relationship between sites. The preservation of cross-correlation is important for the simulation of precipitation occurrence and is required in the regional analysis for water resources management or agricultural applications. Furthermore, lagged cross-correlation is also as essential as cross-correlation (i.e., contemporaneous correlation). For example, the amount of streamflow for a watershed from a certain precipitation event is highly related to lagged cross-correlation.

Cross-correlation of observed data for 12 stations from the Yeongnam province, South Korea.

Averaged cross-correlation of the 100 simulated series from the DKNNR model for 12 stations from the Yeongnam province, South Korea.

Averaged cross-correlation of 100 simulated series from the MONR model for 12 stations from the Yeongnam province.

Daily precipitation occurrence, in general, shows the strongest serial correlation at lag-1 and its correlation decays as the lag gets longer. This is because a precipitation weather system moves according to the surrounding pressure and wind direction that dynamically change within a day or week. Therefore, we analyzed the lag-1 cross-correlation in the current study as the representative lagged correlation structure.
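
As an illustration of the statistic analyzed here, the following Python sketch computes the lagged cross-correlation between the binary occurrence series of two stations. The function names are hypothetical, the pairing of station i at day t with station j at day t + lag is an assumed convention, and the series is treated as one continuous record, which is a simplification: with seasonal data, pairs that straddle two different years would normally be excluded.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined for constant series)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lagged_cross_correlation(occ, i, j, lag=1):
    """Correlation between station i at day t and station j at day t + lag."""
    x = [day[i] for day in occ]
    y = [day[j] for day in occ]
    if lag > 0:
        x, y = x[:-lag], y[lag:]       # align day t with day t + lag
    return pearson(x, y)

# Toy record (hypothetical): station 1 repeats station 0 one day later
site0 = [0, 1, 1, 0, 1, 0, 0, 1]
site1 = [0] + site0[:-1]
occ = list(zip(site0, site1))
r_lag1 = lagged_cross_correlation(occ, 0, 1, lag=1)   # perfect lag-1 dependence
r_lag0 = lagged_cross_correlation(occ, 0, 1, lag=0)   # contemporaneous correlation
```

Setting `i == j` gives the lag-1 autocorrelation of a single station, and `lag=0` gives the contemporaneous cross-correlation.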

The difference of RMSE of cross-correlation between MONR and DKNNR. Note that the positive value indicates that the DKNNR model performs better in preserving the cross-correlation, while a negative value (in bold font) shows that the MONR model performs better.

Note that no negative value can be found, implying that the DKNNR model preserves the cross-correlation better than the MONR model.

Lag-1 cross-correlation of observed data for 12 stations from the Yeongnam province, South Korea.

The cross-correlation of observed data is shown in Table 3. High cross-correlation was found among grouped sites, such as sites 6, 7, and 8 (northern part) and sites 3, 4, 5, and 12 (southeast coastal area, 0.68–0.87). As expected, sites 5 and 12 had the highest cross-correlation (0.87) due to their proximity. The cross-correlation between the northern sites and the coastal sites was low. The observed cross-correlation was well preserved in the data generated from both the DKNNR and MONR models, as shown in Fig. 8 as well as Tables 4 and 5. However, a slight but consistent and significant underestimation of cross-correlation was seen for the data generated by the MONR model (see Fig. 8b). Note that the error bars extend above and below the circles by 1.95 times the standard deviation. The difference of RMSE in Table 6 reflects this characteristic, as most of the values were positive, indicating that the proposed DKNNR model performed better for cross-correlation.

Scatterplot of cross-correlations between 12 weather stations for
the observed data (

The lag-1 cross-correlation of observed data, as shown in
Table 7, ranged from 0.22 to 0.35. The lag-1
cross-correlation for the same site (i.e.,

The difference of RMSE of lag-1 cross-correlation between MONR and DKNNR. Note that a positive value indicates that the DKNNR model performs better in preserving lag-1 cross-correlation, while a negative value (in bold font) shows that the MONR model performs better.

Bias of lag-1 cross-correlation of the generated data from the DKNNR
model. Note that a positive value indicates the overestimation of lag-1
cross-correlation, while a negative value shows underestimation. Note that

Bias of lag-1 cross-correlation of the generated data from the Wilks model. Note that a positive value indicates the overestimation of lag-1 cross-correlation, while a negative value shows underestimation.

The observed lag-1 cross-correlations were well preserved in the data
generated by the DKNNR model, as shown in Fig. 9a, while the MONR model showed significant
underestimation, as seen in Fig. 9b. The difference of RMSE shown in
Table 8 reflects this behavior. In Fig. 9b, some of the lag-1 cross-correlations
were well preserved and aligned with the baseline. As the shaded values in
Table 8 show, the MONR model reproduced the
autocorrelations well. This is because the lag-1
autocorrelation was indirectly parameterized with the transition
probabilities of

Scatterplot of lag-1 cross-correlations between 12 weather stations
for the observed data (

We further evaluated the mean absolute error (MAE) and bias as performance measures; the MAE estimates led to the same conclusions as the RMSE. In addition, the bias of lag-1 correlation presented significant negative values, implying underestimation in the simulated data of the MONR model as shown in Table 9, while Table 10 of the DKNNR model showed a much smaller bias.

Also, the whole-year data instead of the summer season data were tested for model fitting. Note that all the results presented above were for the summer season data (June–September), as mentioned in Sect. 4 in the data description. The lag-1 cross-correlation is shown in Fig. 10, which indicates the same characteristic as for the summer season: the proposed DKNNR model preserved the lagged cross-correlation better than the existing MONR model. Other statistics, such as the correlation matrix and transition probabilities, exhibited the same results (not shown). Other seasons were also tried, but the estimated correlation matrix was not positive semi-definite and could not be inverted for the multivariate normal distribution in the MONR model. This was because the selected stations were close to each other (around 50–100 km) and produced high cross-correlation, especially in the occurrence during dry seasons. A special remedy would have to be applied to the existing MONR model, such as forcibly decreasing the cross-correlation, but no such remedy was applied in the current study since it was outside the current scope and focus.
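
The positive-definiteness issue can be illustrated with a short Python sketch. Everything in it is hypothetical: the 3 x 3 matrix is made up to mimic pairwise-estimated correlations that are mutually inconsistent, the Cholesky attempt is one generic way to test whether a matrix is usable for multivariate normal simulation, and the shrinkage toward the identity matrix is only one possible way of forcibly decreasing cross-correlation. No such remedy was applied in this study.

```python
from math import sqrt

def cholesky(matrix, tol=1e-12):
    """Return lower-triangular L with L L^T = matrix, or None if not positive definite."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:           # non-positive pivot: factorization fails
                    return None
                low[i][j] = sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def shrink(matrix, alpha):
    """Scale off-diagonal correlations by (1 - alpha); the unit diagonal is kept."""
    n = len(matrix)
    return [[(1 - alpha) * matrix[i][j] + (alpha if i == j else 0.0)
             for j in range(n)] for i in range(n)]

# Hypothetical pairwise estimates: two very high correlations and one moderate one
bad = [[1.00, 0.95, 0.50],
       [0.95, 1.00, 0.95],
       [0.50, 0.95, 1.00]]

print(cholesky(bad) is None)                     # True: not positive definite
print(cholesky(shrink(bad, 0.3)) is not None)    # True: usable after shrinkage
```

The shrinkage lowers every cross-correlation by the same factor, so it trades fidelity to the estimated correlations for a matrix that can be factorized.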

Scatterplot of lag-1 cross-correlations between 12 weather stations
for the observed data (

Model adaptability to climate change in hydro-meteorological simulation models is a critical factor, since one of the major applications of the models is to assess the impact of climate change. Therefore, we tested the capability of the proposed model in the current study by adjusting the probabilities of crossover and mutation as in Eqs. (15) and (16). A number of variations can be made with different conditions.

In Fig. 11, the changes of transition and
marginal probabilities are shown, along with the increase of crossover
probability

Transition probabilities and marginal distribution for the selected
five stations along with changing the crossover probability

The changes in transition and marginal probabilities are presented in Fig. 12
with increasing mutation probability

Transition probabilities and marginal distribution along with changing the mutation probability with the condition that the mutation is processed only if the candidate value is 1. See Eq. (16) for details.

As an example, assume that the occurrence probability (

Climate change, however, may refer to a larger phenomenon, which cannot be addressed directly through modifying only the marginal and transition probabilities as in the current study. Further model development on systematically varying temporal and spatial cross-correlations is required to properly address climate change of the regional precipitation system.

In the current study, a discrete version of a nonparametric simulation model based on KNNR is proposed to overcome the shortcomings of the existing MONR model, such as the long stochastic simulation required for parameter estimation and the underestimation of the lagged cross-correlation between sites; the adaptability of the proposed model to climate change is also tested. Occurrence and transition probabilities, cross-correlation, and lag-1 cross-correlation are estimated for both models. The DKNNR model preserves cross-correlation and lag-1 cross-correlation better than the MONR model. For some cases (i.e., the whole-year data and seasons other than the summer season), the estimated cross-correlation matrix is not positive semi-definite because the tested sites are close to each other with high cross-correlation, so the multivariate normal simulation is not applicable for the MONR model.

Results of this study indicate that the proposed DKNNR model reproduces the occurrence and transition probabilities satisfactorily and preserves the cross-correlations better than the existing MONR model. Furthermore, not much effort is required to estimate the parameters in the DKNNR model, while the MONR model requires a long stochastic simulation just to estimate each parameter. Thus, the proposed DKNNR model can be a good alternative for simulating multisite precipitation occurrence.

We further tested the enhancement of the proposed model for adapting to
climate change by modifying the mutation and crossover probabilities
(

The DKNNR code is written in MATLAB and is available as a Supplement.

The precipitation data employed in the current study are downloadable through

In this appendix, one example of DKNNR simulation is presented with observed
dataset in Table A1 (i.e.,

Estimate the distance

The daily index values are sorted according to the smallest distances shown
in the first two columns of Table A3. The sorted
day indices and their corresponding distances are shown in the third and
fourth columns of Table A3. From the

Simulate a uniform random number (

For GA mixture, another set must be chosen as in step (3). Say

With two sets, the crossover and mutation process is performed as follows:

Crossover: for each station, a uniform random number (

Mutation: for each station, a uniform random number (

Repeat steps (1)–(5) until the target simulation length is reached.

Example dataset of daily rainfall with 12 weather stations and 16 days for measured rainfall (mm) in the upper part of this table and its corresponding occurrences in the bottom part of this table.

Example dataset for estimating distances.
The second row presents the current daily precipitation occurrences for
12 stations and the rows below show the absolute difference between the
current occurrences (

Example for selecting one sequence for

Example for GA mixture for

The supplement related to this article is available online at:

TL and VPS conceived of the presented idea. TL developed the theory and programming. VPS supervised the findings of the current work and the writing of the manuscript.

The authors declare that they have no conflict of interest.

This work was supported by the National Research Foundation of Korea (NRF) grant (NRF-2018R1A2B6001799) funded by the South Korean government (MEST).

This paper was edited by Jeffrey Neal and reviewed by two anonymous referees.