A Bayesian Framework Based on Gaussian Mixture Model and Radial Basis Function Fisher Discriminant Analysis (BayGmmKda V1.1) for Spatial Prediction of Flood

In this study, a probabilistic model, named BayGmmKda, is proposed for flood susceptibility assessment with a study area in Central Vietnam. The new model is a Bayesian framework constructed by a combination of a Gaussian Mixture Model (GMM), Radial Basis Function Fisher Discriminant Analysis (RBFDA), and a Geographic Information System (GIS) database. In the Bayesian framework, GMM is used for modeling the data distribution of flood-influencing factors in the GIS database, whereas RBFDA is utilized to construct a latent variable aiming to enhance the model performance. As a result, the posterior probability of flood, which is the output of the BayGmmKda model, is used as the flood susceptibility index. Experimental results showed that the proposed hybrid framework is superior to other benchmark models, including the Adaptive Neuro Fuzzy Inference System and the Support Vector Machine. To facilitate the model implementation, a software program for BayGmmKda has been developed in MATLAB. The BayGmmKda program can accurately establish a flood susceptibility map for the study region. Accordingly, local authorities can overlay this susceptibility map onto various land-use maps for the purpose of land-use planning or management.


Introduction
Flooding is one of the most destructive natural hazards, causing heavy loss of human lives and property over immense spatial extents (Dottori et al., 2016; Komi et al., 2017). Recent statistics on flood damage show that flooding affected 109 million people around the globe per year (Alfieri et al., 2017) and killed more than 220 000 people (Winsemius et al., 2015). Although the frequency of flooding has decreased in several regions (i.e., in central Asia and America), flood occurrences have increased globally by 42 % (Hirabayashi et al., 2013).
Notably, Southeast Asia is one of the most heavily flood-damaged regions in the world due to monsoonal rainfalls and tropical hurricane patterns (Loo et al., 2015). Located in this region, Vietnam is a storm center on the western Pacific, and this nation has faced the destructive consequences of flooding in many of its provinces. In Vietnam, floods are often triggered by tropical cyclones. More than 71 % of Vietnam's population and 59 % of the total land area of Vietnam are susceptible to the impacts of these natural hazards (Tien Bui et al., 2016c). Based on a report by Kreft et al. (2014), from 1994 to 2013, Vietnam endured an annual economic loss equivalent to USD 2.9 billion.
Additionally, the occurrences of flood in Vietnam are expected to rise rapidly in the near future due to increases in poorly planned infrastructure developments and urbanization near watercourses, as well as increased deforestation and climate change. Hence, an accurate model for evaluating flood hazards becomes a crucial need for land-use planning as well as for the establishment of disaster mitigation strategies. Based on flood prediction models, flood-prone areas can be identified and mapped (Tien Bui et al., 2016c).
Needless to say, the identification of susceptible areas can significantly reduce flood damage to the national economy and human lives by avoiding infrastructure developments and densely populated settlements in highly flood-susceptible areas (Zhou et al., 2016). This identification also helps government agencies to issue appropriate flood management policies and to focus their limited financial resources on constructing large-scale flood defense infrastructure in areas that have great economic value but are highly susceptible to flood (Bubeck et al., 2012; Mason et al., 2010). Therefore, a tool for spatial flood modeling is of great usefulness.
To predict flood occurrence, conventional approaches require time series of meteorological and streamflow data at gauging stations (Machado et al., 2015). However, this is difficult for many areas in developing countries where no gauging stations are available. Therefore, new modeling approaches should be explored and investigated. Given these motivations, this study proposes a novel methodology designed for achieving a high prediction accuracy as well as deriving probabilistic evaluations of flood susceptibility on a regional scale. Accordingly, the spatial prediction of flooding is carried out based on the statistical assumption that floods in the future will occur under the same conditions that triggered them in the past (Tien Bui et al., 2016b). In this way, the flood prediction problem boils down to an on-off supervised classification task, where flood inventories are used to define the class of flood occurrence. Moreover, the class of nonflood occurrence is derived from areas that have not yet been damaged by flooding. Consequently, the spatial prediction of flooding within the study area is achieved based on the probability of pixels belonging to the class of flood occurrences. To yield probabilistic outputs of flood susceptibility, this study proposes a Bayesian framework established on the basis of an integration of a Gaussian mixture model (GMM) and kernel Fisher discriminant analysis with a radial basis function kernel (RBFDA). GMM is employed for density approximation to calculate the posterior probability of flood (the flood susceptibility index); in addition, RBFDA constructs a latent variable based on the geoenvironmental conditions to enhance the performance of the Bayesian model.
In essence, the proposed integrated framework contains two phases of analysis. RBFDA is first employed for latent variable construction. The Bayesian approach assisted by GMM is then used to perform probabilistic pattern recognition. The first level performs pattern discriminant analysis tasks, and the second level carries out the prediction process to derive the model output of flood evaluation. Based on previous studies which indicate that hierarchical model structures can produce improved prediction accuracy, the proposed framework could potentially bring about desirable flood assessment results. The subsequent parts of this study are organized in the following order: related works on flood prediction are summarized in Sect. 2. The next section introduces the research method of the current paper, followed by Sect. 4, which describes the proposed Bayesian model for flood susceptibility forecasting. Section 5 reports the model prediction accuracy and comparison. The last section discusses some conclusions on this work.

A review of related works on flood susceptibility prediction
Because of the criticality of flood prediction, this problem has gained increasing attention from the academic community. Following this trend, various flood analysis tools have been developed (Winsemius et al., 2013; Papaioannou et al., 2015; Gao et al., 2017; Alfieri et al., 2014). Basically, these tools can be classified into statistical analysis, rainfall-runoff models, and classification models. Statistical analysis uses long-term recorded time series data at gauged stations to establish regression models; accordingly, the constructed regression models are used to transfer flood information to ungauged basins (Yue et al., 1999; Cunnane, 1988; McCuen, 2016). Thus, these models are capable of providing discharge predictions both in space and time. However, long-term data are not always available; in many cases, they are generally too short for reliable estimations of extreme quantiles (Seckin et al., 2013b; Nguyen et al., 2014). Rainfall-runoff models, which deal with the estimation of runoff from rainfall, are considered to be the most extensively used approach for flood prediction and management (Nayak et al., 2013; Ciabatta et al., 2016; Bennett et al., 2016). Various types of rainfall-runoff models can be found in the literature, varying from empirical models to highly sophisticated physical-process models. Empirical models can be established based on statistical techniques (Brocca et al., 2011; Neal et al., 2013) or advanced machine learning algorithms (Lohani et al., 2011); such models can be effectively employed to analyze rainfall and runoff on the basis of historical time series data. In addition, physical-process models focus on simulating hydrological processes in a basin based on a set of mathematical equations governing the physical processes of water flow and surfaces (Aronica et al., 2012; Chiew et al., 1993; Beven et al., 1984; Birkel et al., 2010; Grimaldi et al., 2013). In general, rainfall-runoff models require relatively long-term time series data at gauging stations. However, the density of gauging stations in developing countries is very low, and this fact creates a great obstacle to the establishment of accurate hydrological models (Fenicia et al., 2008). In addition, large-scale fieldwork and deployments of measuring equipment are necessary for collecting data.
In recent years, a new flood modeling approach called "on-off" classification of flood occurrence has been successfully proposed for the spatial prediction of flood (or, alternatively, a flood susceptibility index; Tien Bui et al., 2016d; Tehrany et al., 2014, 2015b). Accordingly, no time series data are required for the model calibration, and the establishment of flood models is based on flood inventories (flood class) and nonflood areas (nonflood class). The probability of a pixel in the study area belonging to the flood class is used as the flood susceptibility index. Moreover, it is noted that the results of the model depend on the collection of sufficient training data. Although the flood susceptibility map provides no temporal prediction or return period of flood, the flood map is capable of delineating highly susceptible areas. Thus, it is a powerful flood analysis tool for decision-makers that can be used in land-use planning and flood management.
The literature review shows that data-driven methods integrated with GIS databases have demonstrated their effectiveness and accuracy in large-scale flood susceptibility predictions. A fuzzy-logic-based algorithm, established by Pulvirenti et al. (2011), has been used to develop a map of flooded areas from synthetic aperture radar imagery; this algorithm is used in the operational flood management system in Italy. A model based on the frequency ratio approach and GIS for the spatial prediction of flooded regions was first introduced by Lee et al. (2012); the spatial database was constructed from field surveys and maps of the topography, geology, land cover, and infrastructure.
Prediction models with artificial neural networks (ANNs) have been employed for flood susceptibility evaluation by various scholars (Kia et al., 2012; Seckin et al., 2013a; Rezaeianzadeh et al., 2014; Radmehr and Araghinejad, 2014); previous works have shown that an ANN is a capable nonlinear modeling tool. Nevertheless, ANN learning is prone to overfitting, and its performance has been shown to be inferior to that of support vector machines (SVMs; Hoang and Pham, 2016). Kazakis et al. (2015) introduced a multicriteria index to assess flood hazard areas that relies on GIS and analytical hierarchy processes (AHPs); in this methodology, the relative importance of each flood-influencing factor for the occurrence and severity of flood was determined via AHP. More recently, support-vector-machine-based flood susceptibility analysis approaches have been proposed by Tehrany et al. (2015a, b); the research finding is that SVM is more accurate than other benchmark models, including the decision tree classifier and the conventional frequency ratio model. Mukerji et al. (2009) constructed flood forecasting models based on an adaptive neuro-fuzzy inference system (ANFIS) and a genetic-algorithm-optimized ANFIS; experiments demonstrated that ANFIS attained the most desirable accuracy. Recently, a metaheuristic-optimized neuro-fuzzy inference system, named MONF, was introduced by Tien Bui et al. (2016c); this research pointed out that MONF is more capable than decision tree, ANN, SVM, and conventional ANFIS methods.
As can be seen from the literature review, various data-driven and advanced soft-computing approaches have been proposed to construct different flood forecasting models. In most previous studies, flood prediction was formulated as a binary pattern recognition problem in which the model output is either flood or no flood. Probabilistic models have rarely been examined to cope with the complexity as well as the uncertainty of the problem under concern. Therefore, our research aims to enrich the body of knowledge by proposing a novel Bayesian probabilistic model to estimate flood vulnerability with the use of a GIS database. In this research, Tuong Duong district (central Vietnam) is selected as the study area (see Fig. 1). This is by far one of the most heavily flood-affected regions in the country (Reynaud and Nguyen, 2016). The area of the district is approximately 2803 km². The district has two separate seasons, namely a cold season (from November to March) and a hot season (from April to October). The yearly rainfall of the district is within the range of 1679-3259 mm. The rainfall is primarily concentrated in the rainy period, which contributes roughly 90 % of the total annual rainfall. Due to the district's location as well as its topographic and climatic features, the study area is highly susceptible to flood events with immense effects on the rate of human casualties and economic loss. An examination carried out by Reynaud and Nguyen (2016) reported that approximately 40 % of families have been affected by floods and roughly 20 % of families must be relocated away from the flooded areas; the average loss from flooding is up to 24 % of the family income each year.

Flood inventory map
Prediction of flood zones can be based on the assumption that future flood events are governed by conditions very similar to those of flooded zones in the past. Therefore, flood inventories and the geoenvironmental conditions (e.g., topological and hydrological features) that produced them must be extensively determined and collected (Tien Bui et al., 2016c; Tehrany et al., 2015b). The first step of this analysis is to establish a flood inventory map for the region under investigation. In this study, the flood inventory map established by Tien Bui et al. (2016c) was used to analyze the relationships between flood occurrences and influencing factors. The flood inventory map stores documentation of past flood events (see Fig. 1). It is noted that the floods in this study area are flash floods; this is the main flood type in this region due to the characteristics of the terrain. The map was constructed by gathering information on the study area, fieldwork at flooded areas, and analyses of results from the Landsat-8 operational land imagery (from 2010 to 2014) with a resolution of 30 m (retrieved from http://earthexplorer.usgs.gov). Furthermore, the locations of flood events were also verified by fieldwork carried out in 2014 with handheld GPS devices. In summary, the total number of flood locations during the last 5 years was recorded to be 76. It is noted that flood locations were determined by overlaying the flood polygons in the inventory map on the digital elevation model (DEM). Moreover, only pixels in the map that are associated with flood points are used to extract the influencing factors used for flood prediction.
Although the data for this study were collected from 2010 to 2014, recurrent flash floods occurred during tropical typhoons in this period. Thus, it is reasonable to conclude that all significant flash flood locations in the study area have been revealed and determined. It should be noted that, due to the statistical assumption used in this study, the inclusion of flood locations from the distant past (i.e., before 2009) in the flood susceptibility analysis may cause bias. This is because the construction of new hydropower dams such as Ban Ve (from 2010) and Nam Non (from 2011), together with deforestation or forestation, has changed the geoenvironmental conditions in the study area (Dao, 2017; Manley et al., 2013). In other words, the geoenvironmental conditions of the distant past are very different from those of the present time; therefore, flood locations from the distant past should not be included in the current analysis.

Flood-influencing factors
To construct a flood prediction model, besides the flood inventory map, it is crucial to determine the flood-influencing factors (Tehrany et al., 2015a). It is proper to note that the selection of flood-governing factors varies due to the different characteristics of study areas and the availability of data (Papaioannou et al., 2015). Based on the previous work of Tien Bui et al. (2016c), the physical relationships between influencing factors and flood processes have been analyzed. Accordingly, a total of 10 influencing factors were selected in this study; they include slope (IF1), elevation (IF2), curvature (IF3), topographic wetness index (TWI; IF4), stream power index (SPI; IF5), distance to river (IF6), stream density (IF7), normalized difference vegetation index (NDVI; IF8), lithology (IF9), and rainfall (IF10). These factors are used to analyze the flood vulnerability of the studied area, and a GIS database consisting of the flood inventory map and the chosen factors has been established. The descriptions of the 10 influencing factors of flood occurrence employed in this study are given in Table 1 (flood-influencing factors and their categories).

Bayesian framework for flood classification
The flood prediction in this study is considered as a pattern classification problem within which "flood" and "nonflood" are the two class labels of interest. As a result, the probabilities (posterior probabilities) of pixels belonging to the flood class, which are derived from the model, will be used as susceptibility indices. These susceptibility indices of the pixels are then used to generate the flood susceptibility map. To cope with the complexity as well as the uncertainty of the problem of interest, a Bayesian framework is employed in this study to evaluate the flood susceptibility of each data sample. Figure 3 demonstrates the general concept of the Bayesian framework used for classification.
The Bayesian framework provides a flexible way for probabilistic modeling. This method features a strong ability to deal with uncertainty and noisy data (Theodoridis, 2015; Cheng and Hoang, 2016). Nevertheless, previous studies have rarely examined the capability of this approach for inferring flood susceptibility. Basically, pattern classification aims at assigning a pattern to one of M = 2 distinctive class labels C_k, in which k is either 1 or 2; C_1 = 1 and C_2 = 0 denote the flood class and the nonflood class, respectively. To recognize an input pattern based on the information supplied by its feature vector X, we need to attain the posterior probability P(C_k|X), which indicates the likelihood that the feature vector X falls into a certain group C_k. Based on such information, the pattern will be categorized into the group with the highest posterior probability. The posterior probability P(C_k|X) is calculated as follows (Webb and Copsey, 2011):

P(C_k|X) = \frac{p(X|C_k) P(C_k)}{p(X)},   (1)

where P(C_k|X) denotes the posterior probability. The term p(X|C_k) represents the likelihood, which is also called the class-conditional probability density function (PDF). P(C_k) denotes the prior probability, which implies the probability of the class before any feature is measured. The denominator p(X) is the evidence factor; this quantity is merely a scale factor guaranteeing that the posterior probabilities are valid; it can be calculated as follows:

p(X) = \sum_{k=1}^{M} p(X|C_k) P(C_k).   (2)

Generally, the prior probabilities P(C_k) can be calculated by computing the ratio of training instances in each class. Thus, the bulk of establishing a Bayesian classification model is the calculation of the likelihood p(X|C_k). This likelihood expresses the density of input patterns in the learning space within a certain group of data. In most situations, p(X|C_k) is unknown and must be estimated from the available data. In this research, the Gaussian mixture model is utilized for computing the class-conditional probability density function p(X|C_k).
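As a concrete illustration of Eqs. (1) and (2), the sketch below computes posterior class probabilities from priors and class-conditional densities. The one-dimensional Gaussian densities and all parameter values here are hypothetical stand-ins, not values from the study:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density, standing in for a class-conditional PDF."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x, priors, likelihoods):
    """Posterior P(C_k|x) = p(x|C_k) P(C_k) / p(x) (Eqs. 1 and 2)."""
    # Unnormalized posteriors: likelihood times prior for each class
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()  # normalization by the evidence p(x)

# Hypothetical class-conditional densities for flood (C1) and nonflood (C2)
flood_pdf = lambda x: gauss_pdf(x, 2.0, 1.0)
nonflood_pdf = lambda x: gauss_pdf(x, -2.0, 1.0)
post = posterior(1.5, priors=[0.5, 0.5], likelihoods=[flood_pdf, nonflood_pdf])
```

With balanced priors, the sample at x = 1.5 receives a far higher flood posterior because it lies much closer to the flood density's mean.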
Gaussian mixture model for density estimation

Gaussian mixture model
It is noted that the posterior probability value (Eq. 1) for each pixel of the study area is used as the flood susceptibility index. To obtain the posterior probability, the class-conditional PDF must be estimated. This section presents how the PDF is estimated by a Gaussian mixture model. A GMM is selected in this research because it has been shown to be an effective parametric method for the modeling of data distributions, especially in high-dimensional spaces (McLachlan and Peel, 2000; Theodoridis and Koutroumbas, 2009). Previous studies (Paalanen, 2004; Figueiredo and Jain, 2002; Gómez-Losada et al., 2014; Arellano and Dahyot, 2016) point out that any continuous distribution can be approximated arbitrarily well by a finite mixture of Gaussian distributions. Due to their usefulness as a flexible modeling tool, GMMs have received an increasing amount of attention from the academic community (Zhang et al., 2016; Khanmohammadi and Chou, 2016; Ju and Liu, 2012). In a d-dimensional space, the Gaussian PDF is defined mathematically in the following form:

p(x|\theta) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right),   (3)

where \mu denotes the vector of variable means, \Sigma represents the covariance matrix, and \theta = \{\mu, \Sigma\} denotes the set of distribution parameters.
A GMM is, in essence, an aggregation of several multivariate normal distributions; hence, its PDF for each data sample is computed as a weighted summation of Gaussian distributions (see Fig. 4):

p(x|\Theta) = \sum_{i=1}^{k} \alpha_i p(x|\theta_i),   (4)

where \Theta = \{\alpha_1, \alpha_2, ..., \alpha_k, \theta_1, \theta_2, ..., \theta_k\}; \{\alpha_1, \alpha_2, ..., \alpha_k\} are the mixing coefficients of the k Gaussian components, with \sum_{i=1}^{k} \alpha_i = 1 and \alpha_i \geq 0. Accordingly, the PDF for all data samples can be expressed as follows (Ju and Liu, 2012):

p(X|\Theta) = \prod_{t=1}^{N} \sum_{i=1}^{k} \alpha_i p(x_t|\theta_i).   (5)

Identifying a GMM's parameters can be considered as an unsupervised learning task within which a dataset of independently distributed data points X = \{x_1, ..., x_N\} is generated from an integrated distribution dictated by the PDF p(X|\Theta). The goal is to find the most appropriate value of \Theta, denoted as \widehat{\Theta}, that maximizes the log-likelihood function:

\widehat{\Theta} = \arg\max_{\Theta} \log p(X|\Theta) = \arg\max_{\Theta} \sum_{t=1}^{N} \log \sum_{i=1}^{k} \alpha_i p(x_t|\theta_i).   (6)

Practically, instead of dealing with the log-likelihood function, an equivalent objective function Q is optimized (Ju and Liu, 2012):
Q(\Theta) = \sum_{t=1}^{N} \sum_{i=1}^{k} w_{it} \log(\alpha_i p(x_t|\theta_i)),   (7)

where w_{it} is the a posteriori probability of the ith component, i = 1, ..., k, for the tth data point:

w_{it} = \frac{\alpha_i p(x_t|\theta_i)}{\sum_{j=1}^{k} \alpha_j p(x_t|\theta_j)},   (8)

and w_{it} satisfies the conditions \sum_{i=1}^{k} w_{it} = 1 and 0 \leq w_{it} \leq 1. In order to compute \widehat{\Theta} in Eq. (6), the expectation maximization (EM) algorithm is employed. In addition, an unsupervised learning approach proposed by Figueiredo and Jain (2002) is used for determining \Theta. These two algorithms are briefly reviewed in the next section of the paper.
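The densities of Eqs. (3)-(5) can be sketched directly in numpy; this is a minimal illustration with helper names of our own, not code from BayGmmKda:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov) in d dimensions (Eq. 3)."""
    d = mu.size
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(cov) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_pdf(x, alphas, mus, covs):
    """Mixture density: weighted sum of Gaussian components (Eq. 4)."""
    return sum(a * gaussian_pdf(x, m, c) for a, m, c in zip(alphas, mus, covs))

def log_likelihood(X, alphas, mus, covs):
    """Log-likelihood of the whole sample, the quantity maximized in Eq. (6)."""
    return float(sum(np.log(gmm_pdf(x, alphas, mus, covs)) for x in X))
```

A one-component mixture with weight 1 reduces exactly to the single Gaussian density, which is a quick sanity check on any implementation.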

Learning of the finite-mixture model with the expectation maximization algorithm
The expectation maximization (EM) method is a statistical approach for fitting a GMM based on historical data; this method converges to a maximum likelihood estimate of the model parameters (McLachlan and Krishnan, 2008). The two steps of the EM procedure are stated as follows: (i) E step: estimating the expected class memberships w_{it} of all data samples based on Eq. (8); and (ii) M step: calculating the maximum likelihood estimates given the data's class membership distribution using the following equations:

\alpha_i^{new} = \frac{1}{N} \sum_{t=1}^{N} w_{it},   (9)

\mu_i^{new} = \frac{\sum_{t=1}^{N} w_{it} x_t}{\sum_{t=1}^{N} w_{it}},   (10)

\Sigma_i^{new} = \frac{\sum_{t=1}^{N} w_{it} (x_t - \mu_i^{new})(x_t - \mu_i^{new})^T}{\sum_{t=1}^{N} w_{it}}.   (11)
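The E and M steps above can be sketched in a compact numpy implementation. This is a hedged illustration, not the authors' MATLAB code; the small ridge term added to each covariance is our addition, for numerical stability:

```python
import numpy as np

def mvn_pdf(X, mu, cov):
    """Vectorized multivariate normal density for all rows of X."""
    d = mu.size
    diff = X - mu
    inv = np.linalg.inv(cov)
    q = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def em_gmm(X, k, n_iter=100, seed=0):
    """Fit a k-component GMM with EM (E step: Eq. 8; M step: Eqs. 9-11)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, k, replace=False)]          # init means at data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E step: responsibilities w_it (Eq. 8)
        dens = np.stack([alphas[i] * mvn_pdf(X, mus[i], covs[i])
                         for i in range(k)], axis=1)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means, covariances (Eqs. 9-11)
        Nk = w.sum(axis=0)
        alphas = Nk / n
        mus = (w.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            covs[i] = (w[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return alphas, mus, covs
```

A production version would also add a convergence test on the log-likelihood instead of a fixed iteration count.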

Unsupervised learning of finite-mixture model
The EM algorithm increases the log-likelihood iteratively until convergence is detected, and this approach can generally derive a good set of estimated parameters. Nonetheless, EM suffers from low convergence speed on some datasets, high sensitivity to the initialization condition, and suboptimal estimated solutions (Biernacki et al., 2003). Moreover, additional effort is required to determine an appropriate number of Gaussian distributions within the mixture.
As an attempt to alleviate such drawbacks of EM, Figueiredo and Jain (2002) put forward an unsupervised algorithm for learning a GMM from multivariate data. The algorithm features the capability of identifying a suitable number of Gaussian components autonomously, and through experiments the authors showed that the algorithm is not sensitive to initialization. In other words, this unsupervised approach incorporates the tasks of model estimation and model selection in a unified algorithm. Generally, this method can start with a large number of components. The initial values for the component means can be assigned to data points in the training set; in an extreme case, it is possible to set the number of components equal to the number of data points. The algorithm gradually fine-tunes the number of mixture components by casting out Gaussian components that are irrelevant to the data modeling process (Paalanen, 2004).
Furthermore, Figueiredo and Jain (2002) employed the minimum message length (MML) criterion (Wallace and Dowe, 1999) as an index for model selection; the application of this criterion to the case of GMM learning leads to the following objective function (Figueiredo and Jain, 2002):

L(\Theta, X) = \frac{N}{2} \sum_{i:\, \alpha_i > 0} \log\left(\frac{n \alpha_i}{12}\right) + \frac{C_{nz}}{2} \log\frac{n}{12} + \frac{C_{nz}(N+1)}{2} - \log p(X|\Theta),   (12)

where n denotes the size of the training set, N represents the number of hyper-parameters needed to construct a Gaussian distribution, and C_{nz} is the number of Gaussian components featuring nonzero weight (\alpha_i > 0). Accordingly, the EM method is then utilized to minimize Eq. (12) with a fixed number C_{nz}.
In detail, the EM algorithm is employed to estimate \alpha_i as follows:

\alpha_i^{new} = \frac{\max\left\{0, \left(\sum_{t=1}^{n} w_{it}\right) - \frac{N}{2}\right\}}{\sum_{j=1}^{k} \max\left\{0, \left(\sum_{t=1}^{n} w_{jt}\right) - \frac{N}{2}\right\}}.   (13)

Accordingly, the parameters \mu_i^{new} and \Sigma_i^{new} are updated based on Eqs. (10) and (11), respectively. The algorithm stops when the relative decrease in the objective function L(\Theta, X) becomes smaller than a preset threshold (e.g., 10^{-5}).
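The component-annihilation behavior of Eq. (13) is easy to see in isolation: a component whose summed responsibility falls below N/2 is driven to zero weight and removed from the mixture. A minimal sketch (variable names are ours):

```python
import numpy as np

def mml_alpha_update(w, n_params):
    """Figueiredo-Jain weight update (Eq. 13). w is the (n_samples, k)
    responsibility matrix; n_params is the number of parameters per Gaussian
    component. Components with insufficient support get weight zero."""
    support = w.sum(axis=0) - n_params / 2.0
    clipped = np.maximum(support, 0.0)   # annihilate weakly supported components
    return clipped / clipped.sum()       # renormalize the surviving weights
```

In the toy case below, the second component's total responsibility (0.35) is below N/2 = 0.5, so its weight collapses to zero and all mass moves to the first component.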

Radial-basis-function Fisher discriminant analysis for generation of latent variables
In machine learning, the performance of a model may be enhanced if latent variables are used (Yu, 2011). Therefore, a latent variable approach is employed in this research. Accordingly, radial-basis-function Fisher discriminant analysis (RBFDA), proposed by Mika et al. (1999) as an extension of Fisher discriminant analysis for dealing with data nonlinearity, is used to generate a latent factor for flood analysis. Thus, RBFDA is utilized to project the features from the original learning space to a projected space that expresses a high degree of class separability (Theodoridis and Koutroumbas, 2009). Using this kernel technique, the data from an input space I are first mapped into a high-dimensional feature space F. Hence, discriminant analysis tasks can be performed nonlinearly with respect to I.
Herein, ϕ(.) is defined as a transformation from the input space I to a high-dimensional feature space F; to compute w (the projecting vector), it is necessary to maximize the Fisher discriminant ratio:

J(w) = \frac{w^T S_B^{\varphi} w}{w^T S_W^{\varphi} w},   (14)

where S_B^{\varphi} and S_W^{\varphi} denote the between-class and within-class scatter matrices in F, respectively.

To obtain w, the kernel trick is applied. Thus, one only needs to establish a formulation of the algorithm which requires only dot products ϕ(x) · ϕ(y) of the training data and to employ kernel functions which calculate ϕ(x) · ϕ(y). The widely employed radial basis kernel function (RBKF) is expressed in the following formula (with σ denoting the kernel function bandwidth):

k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).

The mean of the projected samples of class k in F is given by

m_k^{\varphi} = \frac{1}{l_k} \sum_{m=1}^{l_k} \varphi(x_m^k),   (17)

where l_k is the number of samples in class k. Since a solution of the vector w lies in the span of all data samples in the projected space, the transformation vector w can be expanded as

w = \sum_{i=1}^{n} a_i \varphi(x_i).   (19)

From Eqs. (17) and (19), we have the following:

w^T m_k^{\varphi} = \frac{1}{l_k} \sum_{i=1}^{n} \sum_{m=1}^{l_k} a_i k(x_i, x_m^k) = a^T M_k,   (20)

where (M_k)_i = \frac{1}{l_k} \sum_{m=1}^{l_k} k(x_i, x_m^k). Taking into account the formulas of J(w) and S_B^{\varphi}, as well as Eq. (20), we can restate the numerator of Eq. (14) in the following manner:

w^T S_B^{\varphi} w = a^T M a,   (21)

where M = (M_1 - M_2)(M_1 - M_2)^T. Based on Eq. (17), which defines m_k^{\varphi}, the denominator of Eq. (14) can be expressed in the following way:

w^T S_W^{\varphi} w = a^T N a,   (22)

where N = \sum_{k=1,2} K_k (I - 1_{l_k}) K_k^T; K_k is the n × l_k kernel matrix with a typical element k(x_n, x_m^k), I represents the identity matrix, and 1_{l_k} is a matrix within which all entries are 1/l_k.
Considering Eqs. (14), (21), and (22), the solution of RBFDA can be found by maximizing the following ratio:

J(a) = \frac{a^T M a}{a^T N a}.   (23)

The optimization problem with the objective function expressed in Eq. (23) is solved by identifying the leading eigenvector of N^{-1} M. Based on the optimization results, an input pattern x in I is projected onto the line defined by the vector w in the following manner:

w \cdot \varphi(x) = \sum_{i=1}^{n} a_i k(x_i, x).   (24)

The proposed Bayesian framework for flood susceptibility prediction

The established GIS database
To formulate a flood assessment model, the first stage is to construct a GIS database (see Fig. 5) within which locations of past flood events, maps of topographic features, Landsat-8 imagery, maps of geological features, and precipitation statistical records are acquired and integrated. In this study, the data acquisition, processing, and integration were performed with the ArcGIS (version 10.2) and IDRISI Selva (version 17.01) software packages. Furthermore, a C++ application has been developed by the authors to transform the flood susceptibility indices into a GIS format for ArcGIS implementation. Accordingly, the compiled outcomes are employed to form a database that includes the aforementioned flood-influencing features with two class outputs: flood and nonflood. As mentioned earlier, a total of 76 flood locations have been recorded. To balance the dataset and reliably construct the flood prediction model, 76 locations in nonflood areas are randomly sampled and included in the analysis. Hence, the total database consists of 152 data samples.
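The balanced sampling step described above — drawing as many nonflood pixels as there are flood locations — can be sketched as follows. The index arrays and the helper name are hypothetical placeholders for pixel identifiers:

```python
import numpy as np

def build_balanced_dataset(flood_idx, all_idx, seed=0):
    """Balance the two classes by sampling as many nonflood pixels as there
    are flood locations (76 each in this study, 152 samples in total)."""
    rng = np.random.default_rng(seed)
    candidates = np.setdiff1d(all_idx, flood_idx)     # pixels never flooded
    nonflood_idx = rng.choice(candidates, size=len(flood_idx), replace=False)
    X_idx = np.concatenate([flood_idx, nonflood_idx])
    y = np.concatenate([np.ones(len(flood_idx)), np.zeros(len(nonflood_idx))])
    return X_idx, y
```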

The proposed model structure
The proposed model for flood susceptibility assessment, which incorporates RBFDA, the Bayesian classification framework, and GMM, is presented in this section of the study. The overall flowchart of the proposed Bayesian framework based on GMM and RBFDA for flood susceptibility prediction, named BayGmmKda, is demonstrated in Fig. 6.
Firstly, the whole dataset, including 152 data samples, was separated into two sets: a training set (90 %, or 137 samples), employed for model establishment, and a testing set (10 %, or 15 samples), used for model testing. It is noted that the input variables of the dataset have been normalized using minimum-maximum normalization; the purpose of data normalization was to guard against the effects of unbalanced variable magnitudes.
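This data preparation step can be sketched as a minimal illustration of min-max normalization and the 90/10 split (function names are ours):

```python
import numpy as np

def min_max_normalize(X):
    """Minimum-maximum normalization of each input factor to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def train_test_split_90_10(X, y, seed=0):
    """Random 90/10 split (137 training / 15 testing samples for n = 152)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(round(0.9 * len(y)))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```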
Secondly, a latent input factor was generated using RBFDA (explained in Sect. 3.4) and added to the training dataset, with the aim of enhancing the classification performance. Subsequently, feature evaluation was performed to quantify the degree of relevance of each input factor to the flood inventories in the training set. Any nonrelevant factor should be eliminated from the modeling process to reduce noise and enhance the model performance (Tien Bui et al., 2016a, 2017). For this purpose, in this research, the mutual information criterion (Kwak and Choi, 2002; Hoang et al., 2016), a widely employed technique for feature selection in machine learning, was selected to express the pertinence of each influencing factor to the flood. It is noted that the larger the mutual information, the stronger the relevance between the influencing factor and flood occurrence.
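For intuition, the mutual information between a binned influencing factor and the binary flood label can be computed directly from empirical frequencies. This plug-in estimator is a simplified stand-in for the criterion of Kwak and Choi (2002), which handles continuous inputs more carefully:

```python
import numpy as np

def mutual_information(x_binned, y):
    """Mutual information I(X; Y) between a binned influencing factor and the
    binary flood label, in nats. Larger values indicate stronger relevance."""
    mi = 0.0
    for xv in np.unique(x_binned):
        for yv in np.unique(y):
            p_xy = np.mean((x_binned == xv) & (y == yv))   # joint frequency
            p_x, p_y = np.mean(x_binned == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```

A perfectly informative factor attains I = H(Y) (ln 2 ≈ 0.693 nats for a balanced binary label), while an independent factor attains I = 0.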
In the next step, the BayGmmKda model was trained and established using the training set. The purpose of the training process was to find the best parameters for the number of mixture components (k) used in GMM and the kernel function bandwidth (σ) used in RBFDA of the BayGmmKda model. To determine the best k, the EM algorithm with the Akaike information criterion (AIC; Akaike, 1974) was used. Thus, the value of k was varied from 1 to 20, and AIC was estimated and used to select the model that exhibits the best fit to the data at hand. It is noted that a model with a smaller number of mixture components (k) indicates a lesser degree of complexity (Olivier et al., 1999). In addition, the unsupervised GMM learning (Figueiredo and Jain, 2002) is also used for autonomously determining the best k. Accordingly, the model starts with a maximum component number (k) of 20; the algorithm carries out the model selection process by removing irrelevant mixture components if applicable. To determine the best σ, a grid search procedure was performed, and the parameter σ corresponding to the highest classification accuracy rate was selected.
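The AIC-based selection of k can be sketched with scikit-learn's GaussianMixture standing in for the authors' MATLAB implementation (a hedged illustration; BayGmmKda itself uses the EM and unsupervised learners described in Sect. 3):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_aic(X, k_max=20, seed=0):
    """Vary the number of mixture components k from 1 to k_max and keep the
    model with the lowest AIC (AIC = 2 * n_params - 2 * log-likelihood)."""
    best_k, best_aic = None, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        aic = gmm.aic(X)
        if aic < best_aic:
            best_k, best_aic = k, aic
    return best_k, best_aic
```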
Using the best k and σ identified in the previous step, the final BayGmmKda model was constructed and the Bayesian classification framework was derived. The Bayesian framework was then used to estimate the posterior probability (flood susceptibility index) for all pixels in the study area. The flood susceptibility index was then converted to a raster format for display in ArcGIS.
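The posterior-probability step can be illustrated with Bayes' rule for the two classes. In this sketch the class-conditional densities are single Gaussians standing in for the GMM-estimated densities of the actual model, and the prior and parameter values are invented:

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def flood_posterior(x, prior_flood=0.5):
    """Posterior P(flood | x) via Bayes' rule; the two Gaussian densities below are
    illustrative stand-ins for the GMM-estimated class-conditional densities."""
    p_x_flood = gauss_pdf(x, 2.0, 1.0)      # assumed density of x for flood pixels
    p_x_nonflood = gauss_pdf(x, -2.0, 1.0)  # assumed density of x for non-flood pixels
    num = prior_flood * p_x_flood
    den = num + (1.0 - prior_flood) * p_x_nonflood
    return num / den

# A pixel whose feature value lies near the flood-class mode gets a posterior close to 1
print(round(flood_posterior(2.0), 4))
```

In the actual framework this posterior, evaluated for every pixel, is the flood susceptibility index that is rasterized for ArcGIS.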

The developed MATLAB interface of BayGmmKda
It is noted that the coupling of the GMM with the EM training algorithm is implemented with the MATLAB statistical toolbox (MathWorks, 2012a); meanwhile, BayGmmKda performs the unsupervised algorithm with the program code provided by Mário A. T. Figueiredo (http://www.lx.it.pt/~mtf/, last access: 1 April 2016). The RBFDA algorithm and the unified BayGmmKda model have been coded in MATLAB by the authors. In addition, a software program with a graphical user interface (GUI; see Fig. 7) for the implementation of the BayGmmKda model has been developed in the MATLAB environment. The GUI development aims at providing a user-friendly system for performing flood susceptibility predictions.
As shown in Fig. 7, the program consists of three modules. The first module supports data input, data viewing, and preliminary feature selection with mutual information. In the second module, the users simply provide the model parameters, including the kernel function parameter and the GMM training method. The trained model is employed to carry out prediction tasks in the third module, within which the model prediction performance is reported.
5 Experimental results

Feature selection and training of the BayGmmKda model
The outcome of the preliminary examination of the pertinence of flood-influencing factors is reported in Fig. 8a. The classification accuracy rate (CAR) is employed to exhibit the rate of correctly classified instances. In addition, a more detailed analysis of the model capability can be presented by calculating the true positive rate (TPR), false positive rate (FPR), false negative rate (FNR), and true negative rate (TNR). These four rates are widely utilized to exhibit the predictive capability of a prediction model (Hoang and Tien-Bui, 2016). In addition to the four rates, the receiver operating characteristic (ROC) curve (van Erkel and Pattynama, 1998) is used to summarize the global performance of the model. The ROC curve demonstrates the trade-off between TPR and FPR as the threshold for accepting the positive (flood) class varies. The area under the ROC curve (AUC) is employed to quantify the global performance. In general, a better model is characterized by a larger value of the AUC.
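These rates and the AUC can be sketched as follows. The AUC here is computed via its rank interpretation (the probability that a randomly chosen flood sample scores above a randomly chosen non-flood sample) rather than from an explicit ROC curve; all data are illustrative:

```python
def confusion_rates(y_true, y_pred):
    """TPR, TNR, FPR, FNR, and CAR from binary labels (1 = flood, 0 = non-flood)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return {"TPR": tpr, "TNR": tnr, "FPR": 1 - tnr, "FNR": 1 - tpr,
            "CAR": (tp + tn) / len(y_true)}

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and predictions
rates = confusion_rates([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(rates["CAR"])
```

A perfect ranker attains AUC = 1.0, while an uninformative one attains 0.5, which is the baseline against which the reported AUC values of about 0.93-0.94 should be read.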
As mentioned above, the dataset is randomly separated into a training set and a testing set that occupy 90 and 10 % of the data samples, respectively. The training set is employed to train the model; meanwhile, the testing set is used for validating the model capability after training. Since a single selection of data for the training set and the testing set may not truly demonstrate the model's predictive capability, this study carries out a repetitive subsampling procedure within which 30 experimental runs are performed. In each experimental run, 10 % of the dataset is retrieved in a random manner from the database to constitute the testing set; the rest of the database is included in the training set. The testing performance of the proposed Bayesian framework for flood susceptibility is reported in Table 2 and Fig. 9, which provides the average ROC curves of the proposed model framework, obtained from the random subsampling process, with the two methods of GMM training. Herein, the two Bayesian models that employ the EM algorithm and the unsupervised learning (UL) algorithm for training GMM are denoted as BayGmmKda-EM and BayGmmKda-UL, respectively. It can be seen that the BayGmmKda-UL model demonstrates clearly better predictive performance (CAR = 89.58 %, AUC = 0.94, TPR = 0.96, TNR = 0.91) than the BayGmmKda-EM model (CAR = 86.67 %, AUC = 0.93, TPR = 0.95, TNR = 0.85). Although the performances of the two models are comparable in TPR, the BayGmmKda-UL model is deemed more accurate than the BayGmmKda-EM model when predicting samples of the nonflood class.
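The repeated random subsampling scheme can be sketched as follows; the run count and split ratio mirror the description above, while the helper itself (names, seed) is illustrative:

```python
import random

def random_subsampling(n_samples, n_runs=30, test_fraction=0.1, seed=0):
    """Yield (train_idx, test_idx) index pairs for repeated 90/10 random subsampling."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_runs):
        rng.shuffle(indices)
        n_test = round(n_samples * test_fraction)
        # slices copy the shuffled order, so later shuffles do not disturb earlier splits
        yield indices[n_test:], indices[:n_test]

# Each run produces a disjoint 90 % / 10 % partition of the sample indices
for train_idx, test_idx in random_subsampling(100, n_runs=2):
    print(len(train_idx), len(test_idx))
```

Averaging the CAR, AUC, and the four rates over the 30 splits is what yields the aggregate figures reported in Table 2.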

Model comparison
Because this is the first time the BayGmmKda model has been proposed for the measurement of flood susceptibility, the validity of the proposed model should be assessed. Hence, benchmark models were used for the comparison, including the support vector machine (SVM), the adaptive neuro-fuzzy inference system (ANFIS), and the GMM-based Bayesian classifier. These machine learning techniques were selected because SVM and ANFIS have recently been verified to be effective tools for predicting flood susceptibility (Tien Bui et al., 2016c; Tehrany et al., 2015b). It is noted that the GMM-based Bayesian classifier (BayGmm) is the Bayesian framework for classification which employs GMM for density estimation; however, BayGmm is not integrated with the RBFDA algorithm. BayGmm is used in the performance comparison to confirm the advantage of the newly constructed BayGmmKda and to verify the usefulness of RBFDA in enhancing the discriminative capability of the hybrid framework.
To construct the SVM model, the model's hyperparameters, namely the regularization constant (C) and the parameter of the radial-basis kernel function (σ), need to be specified. Herein, a grid search process, identical to the one used to identify the kernel function bandwidth of RBFDA, is employed to fine-tune these hyperparameters of the SVM model. It is noted that the SVM method is implemented in a MATLAB package (MathWorks, 2012b). Meanwhile, the ANFIS model is trained with the metaheuristic approach described in the previous work of Tien Bui et al. (2016c).
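A grid search of this kind can be sketched generically as an exhaustive sweep over candidate (C, σ) pairs scored by a user-supplied evaluation function; here an invented surrogate accuracy surface stands in for cross-validated SVM accuracy:

```python
import itertools

def grid_search(evaluate, c_grid, sigma_grid):
    """Exhaustive grid search: return the (C, sigma) pair maximizing the evaluation score."""
    best_pair, best_score = None, float("-inf")
    for c, sigma in itertools.product(c_grid, sigma_grid):
        score = evaluate(c, sigma)  # e.g., cross-validated classification accuracy
        if score > best_score:
            best_pair, best_score = (c, sigma), score
    return best_pair, best_score

# Made-up surrogate accuracy surface peaking at C = 10, sigma = 0.5
surrogate = lambda c, s: -((c - 10) ** 2) / 100 - (s - 0.5) ** 2
pair, score = grid_search(surrogate, [0.1, 1, 10, 100], [0.1, 0.5, 1.0, 2.0])
print(pair)
```

The cost of this sweep grows with the product of the grid sizes, which is the computational burden noted later in the conclusions.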
It is noted that random subsampling with 30 runs is employed for all models in this experiment. The comparison between the proposed BayGmmKda model and the three benchmark models is shown in Table 3. The results show that the proposed model yields the best performance (CAR = 89.58 %, AUC = 0.94). It is followed by the ANFIS model (CAR = 85.63 %, AUC = 0.83), the BayGmm model (CAR = 85.02 %, AUC = 0.92), and the SVM model (CAR = 83.75 %, AUC = 0.82).
To confirm that the performance of the proposed BayGmmKda model is significantly higher than that of the three benchmark models, the Wilcoxon signed-rank test is employed. The Wilcoxon signed-rank test is widely used to evaluate whether the classification outcomes of prediction models are significantly different (Tien Bui et al., 2016e). Using this test, the p values obtained from the experimental results of the four models were computed and compared with a significance threshold of 0.05. The result of the Wilcoxon signed-rank test is shown in Table 4.
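A minimal sketch of the Wilcoxon signed-rank test is given below. It uses the normal approximation to the null distribution (adequate for 30 paired runs) and simply drops zero differences; a production analysis would rely on a library routine. The paired accuracy values are invented:

```python
import math

def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic W+ and two-sided p value (normal approximation).
    Zero differences are dropped; tied absolute differences get averaged ranks."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Illustrative paired CARs from 30 runs: model A consistently outperforms model B
car_a = [0.90 + 0.001 * i for i in range(30)]
car_b = [0.85 + 0.001 * i for i in range(30)]
w, p = wilcoxon_signed_rank(car_a, car_b)
print(p < 0.05)
```

A p value below the 0.05 threshold, as in this illustrative case, is what justifies marking a comparison "++" (significant win) in Table 4.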
It is noted that the signs "++", "+", "--", and "-" represent a significant win, a win, a significant loss, and a loss, respectively. The result confirms that the proposed BayGmmKda model achieves significant wins over the other models. Interpretation of the map shows that 10 % of the Tuong Duong district was classified into the very high class, and this class covers 73.68 % of the total historical flood locations. Meanwhile, the high and moderate classes each cover 10 % of the region but account for only 15.79 and 7.9 % of the total historical flood locations, respectively, whereas the low class covers 20 % of the district but contains only 2.63 % of the total historical flood locations. In particular, 50 % of the district, which is categorized into the very low class, contains no flood location. These results indicate that the proposed BayGmmKda model has successfully delineated flood-prone areas. In other words, the interpretation results confirm the reliability of the proposed Bayesian framework.

Conclusion
This research has developed a new tool, named BayGmmKda, for flood susceptibility evaluation, with a case study in a high-frequency flood area in central Vietnam. The newly constructed model is a Bayesian framework that combines GMM and RBFDA for spatial prediction of flooding. A GIS database has been established to train and test the BayGmmKda method. The training phase of BayGmmKda consists of two steps: (i) discriminant analysis with RBFDA, in which a latent factor is generated, and (ii) density estimation using GMM. After the training phase, the Bayesian framework is employed to compute the posterior probability, which is then used as the flood susceptibility index. Furthermore, a MATLAB program with a GUI has been developed to ease the implementation of the BayGmmKda model in flood vulnerability assessment.
It is noted that in this study the GMM training is performed with two methods: the EM algorithm and the unsupervised learning approach. Furthermore, a repeated subsampling process with 30 experimental runs is carried out to evaluate the model prediction outcome. The subsampling results, verified by a statistical test, confirm that the GMM trained by the unsupervised learning approach attains a better prediction accuracy than that trained by the EM algorithm. Therefore, this method of GMM learning is strongly recommended for other studies in the same field.
Furthermore, the experiments demonstrate that the latent factor created by RBFDA is helpful in boosting the classification accuracy of the BayGmmKda model. This improvement in accuracy stems from the model's integrated learning structure. As described earlier, the classification task is performed by a hybridization of discriminant analysis and a Bayesian framework. The Bayesian model carries out the classification task by considering the patterns in the original dataset together with an additional factor produced by the discriminant analysis. As a result, the performance of the BayGmmKda model is better than those obtained from the three benchmarks (SVM, ANFIS, and BayGmm).
The main limitation of this work is that BayGmmKda is a data-driven tool; therefore, field work and GIS-based geoenvironmental data are necessary for the model construction phase. This data collection and analysis can be time-consuming. In addition, the grid search procedure used for hyperparameter setting in the BayGmmKda model requires a high computational cost, especially for large-scale datasets. Furthermore, the outcome of this grid search procedure may not be optimal; therefore, more advanced model selection approaches, e.g., metaheuristic optimization algorithms, could be utilized to further improve the model accuracy.
Despite such limitations, the proposed BayGmmKda model, featuring high predictive accuracy and the capability of delivering probabilistic outputs, is a promising alternative for flood susceptibility prediction. Future extensions of this research may include applying the model to flood prediction in other study areas, investigating other flood-influencing factors (e.g., streamflow and antecedent soil moisture, which may be relevant for flood analysis), and improving the current model with other novel soft computing methods, e.g., feature selection, pattern classification, and dimension reduction, to alleviate the aforementioned drawbacks and enhance the model performance.

Figure 3 .
Figure 3. General concept of the Bayesian Framework for flood classification.

Figure 4 .
Figure 4. Structure of a Gaussian mixture model.
As mentioned earlier, the relevancies of the influencing factors are exhibited by the mutual information criterion. Based on the outcome, IF 5 (SPI) features the highest mutual dependence, followed by IF 7 (stream density) and IF 8 (NDVI). Influencing factors IF 4 (TWI) and IF 10 (rainfall) exhibit comparatively low values of mutual information. Because none of the mutual information values is zero, all influencing factors are deemed relevant and are retained for the subsequent processes of model training and prediction. It is worth keeping in mind that the BayGmmKda training phase is executed in two consecutive steps: training RBFDA and training GMM. RBFDA analyzes the data in the training set to establish a latent factor, which is a one-dimensional representation of the original input pattern.
Figure 8b shows the resulting latent factor constructed by RBFDA. In the next step of the training phase, the GMM is constructed from the original input patterns (the 10 input factors) with their corresponding labels, together with the RBFDA-based latent factor.

Figure 8 .
Figure 8. (a) Mutual information of flood-influencing factors; (b) RBFDA-based latent factor derived in this study.
TP, TN, FP, and FN represent the values of true positive, true negative, false positive, and false negative, respectively.

Figure 10 .
Figure 10. The flood susceptibility map using the proposed BayGmmKda model for the study area.
www.geosci-model-dev.net/10/3391/2017/ Geosci. Model Dev., 10, 3391-3409, 2017 D. Tien Bui and N.-D. Hoang: BayGmmKda V1.1
The EM algorithm alternates between the E step, within which the probabilities p_i(x_t | θ_i) = N(x_t | µ_i, Σ_i) that x_t is generated from the ith mixture component are calculated, and the M step, within which the maximum likelihood estimates of θ_i are updated. The iteration of the EM algorithm terminates when the change in the objective function falls below a threshold value.

Table 3 .
Performance comparison of the BayGmmKda model with the three benchmarks, the SVM model, the ANFIS model, and the BayGmm model.

Table 4 .
Model comparison based on the Wilcoxon signed-rank test.