Interactive comment on “ Evaluation of the wind farm parameterization in the Weather Research and Forecasting model ( version 3 . 8 . 1 ) with meteorological and turbine power data ” by

This manuscript tries to evaluate the wind farm parametrization in WRF by comparing WRF power results to SCADA power data from a given onshore wind farm in Iowa and from additional lidar observations from the CWEX-13 experiment. Main result of the study is said to be the fact that the simulated ambient wind speed is the most important parameter for the quality of the simulation of the wind farm yield. The subject is important and is expected to have both scientific and economic impact.


Introduction
In recent years, numerical weather prediction (NWP) models have become an indispensable tool in the wind-energy industry, not only in day-to-day wind-energy production forecasts , but also to support wide-scale wind-power penetration (Marquis et al., 2011) and wind resource assessment. To forecast power production accurately at wind farms, the simulation tools should resolve all physical processes relevant to the wind field, including possible impacts of the wind turbines themselves. Consequently, including the meteorological effects of wind farms in NWP models can improve power-production forecasts.
Researchers have developed various methods to numerically represent wind farms. Via large-eddy simulations (LESs), some investigators assess the meteorological impacts of wind turbines as well as power production (Abkar and Porté-Agel, 2015b;Aitken et al., 2014;Calaf et al., 2010;Churchfield et al., 2012;Jimenez et al., 2007;Na et al., 2016;Sharma et al., 2016;Wu and Porté-Agel, 2011). Simulating wind turbines and their effects in LESs is, while useful, computationally expensive, making wind-farm-scale simulations unreasonable in an operational setting.
At coarser spatial scales, suitable for global, synoptic, or mesoscale models, numerically representing wind turbine effects may involve unrealistic assumptions. For example, researchers have used exaggerated surface roughness to represent the reduction of wind speed (WS) caused by wind farms in a global model (Barrie and Kirk-Davidoff, 2010;Frandsen et al., 2009;Keith et al., 2004). Similarly, the analytical wind park model of Emeis and Frandsen (1993) considers both the downward momentum flux and the momentum loss due to surface roughness. The revised model by Emeis (2010) accounts for the spatially averaged momentum-extraction coefficient by turbines, and the parameters become atmosphericstability dependent. However, these models omit the consid-eration of turbine-scale interactions between the hub and the surface (Abkar and Porté-Agel, 2015a;Fitch et al., 2012Fitch et al., , 2013b. Aside from indirectly representing wind turbines via exaggerated roughness, another common approach is to use the turbine power curve to deduce elevated drag and turbulence production of wind turbines. A power curve illustrates the relationship between inflow WS at hub height and power production of a particular turbine model. This method can model meteorological impacts of wind turbines and the impact of turbine drag force (Baidya Roy, 2011;Blahak et al., 2010). Based on this technique, Fitch et al. (2012) added the consideration of the turbine thrust coefficient to simulate both turbine drag and power loss.
In the wind farm parameterization (WFP) of the Weather Research and Forecasting (WRF) model, wind turbines in each model grid cell are collectively represented as a turbulence source and a momentum sink within the vertical levels of the turbine rotor disk . A fraction of the kinetic energy extracted by the virtual wind turbines is converted to power, and the turbulence generation is derived from the difference between the thrust and power coefficients. In the WFP scheme, the use of the WS-dependent thrust coefficients accounts for the effects of local wind drag on wind-energy extraction as well as on power estimation. The WRF WFP offers flexibility, where users can modify the parameters of a turbine model, such as its hub height, rotor diameter, power curve, and thrust coefficients, and does not require other empirically derived parameters. By simulating wind farms in a mesoscale weather model, WRF users can simulate aggregated effects of wind-turbine wakes and thus the effects of power production of downwind turbines.
An approach similar to the WRF WFP proposed by Abkar and Porté-Agel (2015a) relies on an extra parameter, which is the ratio of the free-stream velocity to the horizontally averaged hub-height velocity of a turbine-containing grid cell. This ratio depends on various factors such as the wind-farm density and layout, and requires preliminary simulation results (Abkar and Porté-Agel, 2015a). Therefore, the publicly available WFP in the WRF model is chosen in this project for observed power comparison. Several approaches are available to incorporate impacts of wind farms into mesoscale simulations. The explicit wake parameterization (EWP) recently designed by Volker et al. (2015) uses classical wake theory to describe the unresolved wake expansion. Both the WRF WFP and the EWP average the drag force within grid cells. Nevertheless, users of the EWP need to adjust the length scales that determine wake expansion in the EWP for different situations.
In this paper, we evaluate the WFP in the WRF model via comparison to actual power-production data. The WRF WFP has been widely used to assess the impacts of onshore and offshore wind farms at different spatial scales and in different stability regimes (Eriksson et al., 2015;Fitch et al., 2013a, b;Jiménez et al., 2015;Lee and Lundquist, 2017;Miller et al., 2015;Vautard et al., 2014). Whereas WFP predictions have been compared to power production of offshore wind farms for a limited set of WSs (Jiménez et al., 2015), here we explore a range of WSs, wind direction (WD), turbulence, and atmospheric stability conditions. The large range of wind conditions induces spatially and temporally diverse power production, thereby providing a basis for a comprehensive evaluation of the WFP. The uniqueness of this project lies in the in-depth assessment of the WRF WFP performance in forecasting and simulating wind energy of a sizable onshore wind farm, using observed power-production data.
We describe the observation data and the model design in Sect. 2. In Sect. 3, we evaluate the simulations by comparing the meteorological and power-generation data with a statistical examination. We close with a proposal of improvements on the WRF WFP in Sect. 4.

Observations
The 2013 Crop Wind Energy eXperiment (CWEX-13) took place in central Iowa at a 200-turbine wind farm to quantify far-wake impacts of multiple rows of turbines . In CWEX-13, measurements from seven surface flux stations, a radiometer, three profiling lidars, and a scanning lidar were collected. This campaign was a component of the larger CWEX project, which explored the interactions of wind turbines with crops, surface fluxes, and near-surface flows in different atmospheric stability regimes in flat terrain (Rajewski et al., 2013). Research facilitated by the CWEX projects include diurnal changes in observed turbine wakes (Rhodes and Lundquist, 2013), turbine interactions with moisture and carbon dioxide fluxes , LES modelling of turbine wakes in changing stability regimes , nocturnal low-level jet (LLJ) occurrences (Vanderwende et al., 2015), diurnal changes of the microclimate near wind turbines (Rajewski et al., 2016), multiple-wake interactions (Bodini et al., 2017), the evolution of turbine wakes during the evening transition (Lee and Lundquist, 2017), and coupled mesoscalemicroscale modelling (Muñoz-Esparza et al., 2017).
This wind farm consists of 200 wind turbines, represented by the red dots in Fig. 1. Half of the wind turbines in the wind farm are General Electric (GE) 1.5 MW super-long extended (SLE) model, and the other half are GE 1.5 MW extra-long extended (XLE) model (Rajewski et al., 2013). The cut-in and cut-out speeds of the SLE model are 3.5 and 25 m s −1 respectively, and the rated speed is 14 m s −1 . With the same cut-in speed, the XLE model has lower rated and cut-out wind speeds at 11.5 and 20 m s −1 . The hub height of both models is 80 m; the rotor diameters of the SLE and the XLE (b) (a) Figure 1. Map of the three domains (d01, d02, and d03) in the WRF simulations (a), with the white "x" representing the CWEX-13 wind farm. Zoom-in map of the wind farm (b), with the black horizontal and vertical lines outlining the WRF grid cells, the red dots as the actual locations of wind turbines, the blue numbers as the number of wind turbines per WRF grid cell, the yellow square as the WC lidar, the green square as the 200S lidar, and the purple square as the surface flux station. Other instruments were deployed in CWEX-13, and only the instruments used herein are shown. model are 77 and 82.5 m respectively. For simplicity, references to the rotor diameter (D) herein refer to the 77 m rotor diameter. Power generated by each turbine is recorded by the Supervisory Control and Data Acquisition (SCADA) system every 10 min, and we sum up the power production of all turbines for wind-farm production for each 10 min period.
Observations of the wind profile are collected by a profiling lidar and a scanning lidar. The WINDCUBE v1 (WC) profiling lidar (yellow square in Fig. 1) is located 528 m, or 6.3 D, south of the nearest turbine. The WC lidar measures winds at about 0.25 Hz from 40 to 220 m above ground level (a.g.l.) every 20 m via the Doppler beam swinging (DBS) method. The WC lidar derives wind components by measuring radial velocities using DBS at an azimuth angle of 28 • . Note that the WC-observed turbulence parameters, namely the turbulence kinetic energy (TKE) and the turbulence intensity (TI), are derived from the variances of the three wind components in 2 min intervals and hence do not represent small-scale turbulence. The turbulence parameters are de-fined by the following: where σ 2 is the 2 min averaged variances of the u, v, and w wind components, and U is the mean horizontal WS (Stull, 1988). In CWEX-11, wind-turbine wake measurements at a different location in this wind farm were collected with these instruments (Rhodes and Lundquist, 2013), and the errors in the WC lidar measurements due to inhomogeneous flow were explored by Bingöl et al. (2009) and Lundquist et al. (2015). The WINDCUBE 200S scanning lidar (green square in Fig. 1) is positioned 437 m, or 5.7 D, north of the nearest turbine row. In CWEX-13, the 200S lidar scanning strategy included velocity azimuth display (VAD) scans that measures winds from ∼ 100 to ∼ 4800 m a.g.l. approximately every 50 m for every 3 min. In this study, we use the 200S 75 • elevation scans (Vanderwende et al., 2015) to estimate hor-izontal winds every 30 min to verify the simulated winds in the boundary layer. In the case study chosen, the dominant WD is south-easterly to south-westerly (Vanderwende et al., 2015), and thus some of the 200S measurements below the rotor top (about 120 m a.g.l.) could be influenced by turbine wakes during conditions when the wakes persist longer than 5 D downwind from the turbine (Bodini et al., 2017). However, the WC measurements are largely unaffected by turbine wakes except when WD is east of 150 • . The closest upwind turbine during this simulation period was located over 2.7 km (33 D) to the south-east.
The measurements from the surface flux station can also quantify model skill. The surface flux station of interest (purple square in Fig. 1) is located 681 m, or 8.8 D, south of the closest turbine. At 8 m a.g.l., the station measures 20 Hz winds via a CSAT3 sonic anemometer, as well as virtual temperature and water-vapour density via a HMP45C probe. After tilt correction (Wilczak et al., 2001), we calculate surface sensible heat flux using a 30 min averaging time period. We use the Obukhov length (L) to categorize atmospheric stability conditions: where T v is the mean virtual temperature, u * is the frictional velocity, k is the von Karman constant, g is the gravity acceleration, and (w T v ) s is the surface virtual temperature flux calculated from the 20 Hz measurements (Stull, 1988). A positive surface sensible heat flux and Obukhov length ratio (z L −1 ), where z is 8 m, indicates a stable atmosphere, whereas a negative ratio represents unstable conditions. From 24 to 27 August 2013, nocturnal LLJs were observed (Vanderwende et al., 2015). No major synoptic events affected the area during this period. Moreover, when the nearsurface flows are southerly, the WC and the surface flux station measure winds unaffected by wind turbines (Muñoz-Esparza et al., 2017). Additionally, no curtailment of wind turbines occurred and the instruments operated normally during the period, making these 4 days ideal for model verification.

Modelling
To establish direct comparison with the observations, we simulate winds with and without the WFP using the Advanced Research WRF (ARW) model (version 3.8.1) (Skamarock and Klemp, 2008). We simulate the winds on each day separately, from 00:00 to 00:00 UTC, after 12 h of spinup time. The ERA-Interim (Dee et al., 2011) and the 0.5 • Global Forecast System (GFS) reanalysis datasets provide boundary conditions for two different sets of model runs. We set three domains in our simulations with horizontal resolutions of 9, 3, and 1 km respectively, where the finest domain covers the state of Iowa (Fig. 1). To capture the westerly syn-optic flow and the southerly near-surface winds, we position the inner grids north-east of the centres of the coarser grids.
The WFP scheme simulates wind farms and their meteorological influences to the atmosphere. We provide a brief summary here, and the details are discussed in Fitch et al. (2012). Wind turbines slow down ambient wind flow and convert a part of the kinetic energy of wind into electrical energy. The WFP represents this wind-turbine drag force as the kinetic energy harvested by the turbine from the atmosphere: where C T is the turbine-specific thrust coefficient (discussed in detail in Fitch, 2015), V is the horizontal velocity vector, ρ is air density, A = π 4 D 2 is the cross-sectional rotor area, and D is the rotor diameter. This kinetic-energy extraction also causes changes in the atmosphere, namely the kinetic energy loss in the grid cell, which is described by the momentum tendency: where i, j , and k represent the zonal, meridional, and vertical grid indices, N ij t is the number of wind turbines per square metre, and z k is the height at model level k. Of the kinetic energy extracted by the turbines, the WFP accounts for the electricity generation with the following: where P ij k is the power output in the grid cell in watts, and C P is the power coefficient. Assuming negligible mechanical and electrical losses, the rest of the kinetic energy harvested turns into TKE: where TKE ij k is the TKE in the grid cell, and C TKE is the difference between C T and C P .
In this study, we employ two resolutions of vertical grids: approximately 12 m and 22 m resolution below 400 m a.g.l., with 80 and 70 total levels respectively (Fig. 2). Three and six vertical levels intersect the atmosphere below and within the rotor layer in the finer vertical grid, while the 22 m grid only allows one full level below and four levels within the rotor layer (Fig. 2). The vertical levels are further stretched beyond the boundary layer. In past research involving the WRF WFP scheme, the selections of vertical resolution within the rotor layer include 9-18 m in , about 10-16 m in Volker et al. (2015), about 15 m in Fitch et al. (2012Fitch et al. ( , 2013a and , about 20 m in Miller et al. (2015) and Vautard et al. (2014),  about 22 m in Lee and Lundquist (2017), and about 40 m in Eriksson et al. (2015) and Jiménez et al. (2015). Moreover, the Mellor-Yamada-Nakanishi-Niino (MYNN) level 2.5 planetary boundary layer (PBL) scheme is required for the WFP in the WRF model version 3.8.1 . Note that substantial upgrades were made on the MYNN PBL schemes in WRF version 3.8 (WRF-ARW, 2016). The MYNN PBL scheme supports TKE advection, active coupling to radiation, cloud mixing from Ito et al. (2015), and mixing of scalar fields. The MYNN scheme also uses the cloud probability density function from Chaboureau and Bechtold (2002), and here we keep the mass-flux scheme deactivated. We summarize the other model configuration details in Table 1.
After verifying the background flow simulated by the WRF model (first four rows in Table 2), virtual turbines are added via the WFP (last four rows in Table 2). We simulate all the turbines using the 1.5 MW PSU generic turbine model (Schmitz, 2012), in which its specifications are based on the GE 1.5 MW SLE model installed at the wind farm. The turbines within the WRF grid cells are located using the latitudes and longitudes provided by the wind-farm owneroperator. The model grid cells within the wind farm, con-  Fig. 1. With the WFP activated, the model simulates the total power production at each time step in each turbinecontaining grid cell, regardless of the number of turbines per cell. To match the 10 min average power data from the turbines, we sample 10 min power from the WFP output. We also estimate the power generation of the WRF simulations without using the WFP. Based on the ambient WS of the turbine-containing grid cells in the control WRF runs, we use the turbine power curve to obtain an assessment of the power every 10 min. We then multiply the power with the number of turbines per cell to calculate power in each grid cell, as would be done in wind-energy forecasting without a wake parameterization. This method of power estimation omits wake effects, in contrast to the WFP.

Ambient flow evaluation
The WRF model simulations without the WFP simulate accurate ambient winds compared to the lidar measurements. Qualitatively, the ERA12 simulation (see Table 2 for a listing of all the simulations) has skill in simulating WS and WD during the 4-day period, including the occurrence, the strength, and the elevation of the nocturnal LLJs (Fig. 3). The 200S records the vertical shear caused by LLJs above 100 m (Fig. 3a), and the WC measures the near-surface winds with high temporal resolution (Fig. 3b). In the observations and 00:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00 00:00 Height a.g.l. the simulations of WS (Fig. 3c), the night-time WS profile is stratified, whereas the daytime atmosphere is well mixed. The WD simulations also match well with the measurements, where in the evening the winds veer, or turn clockwise with height ( Fig. 4), while the WD remains relatively constant with height during daytime. Except for the last hours on 24 August, the ERA12 captures the general temporal and vertical fluctuations in WS and WD, when the winds change from south-easterly to south-westerly (Figs. 3 and 4). The 200S measurements above the rotor layer (120 m) are unaffected by turbine wakes (Figs. 3a and 4a); the LLJs observed above the rotor layer resemble those from the ERA12, confirming the skill of the simulations. To evaluate the effects of boundary conditions and vertical resolutions on simulating winds, we compare the four no-WFP runs: ERA12, ERA22, GFS12, and GFS22.
Quantitatively, simulations using finer vertical resolution have more skill in simulating winds than those with coarser resolution (Table 3). In comparison to the 200S and WC observations, the mean absolute errors in WS and WD of the 12 m runs are lower than those of the 22 m runs over the 4day period, by 0.3 m s −1 and 0.8 • on average. In particular, in the ERA12, the errors in WS decrease by at least 19 % relative to the ERA22. Although the GFS22 yields smaller WS 00:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00 00:00 Height a.g.l.  errors than the ERA22, refining the vertical grid of the simulations using either boundary condition dataset improves the WS-prediction skill of the WRF model more than changing the boundary conditions (Table 3). The errors in simulating WD remain similar regardless of the choice of boundary condition or vertical grid. Of all our control runs, the ERA12 simulates the most accurate inflow. The RMSEs and biases closest to zero across different days are highlighted in bold.

Power simulations
The simulation omitting the WFP ignores the wake effects on power production of downwind turbines, and therefore overestimates total power. For each 10 min time step, we compare the spatial distribution of power production as well as the total power between the ERA12, the ERA12WF, and the observations; Fig. 5 represents one 10 min time step in the 4-day period. As mentioned above, we calculate the power estimates of ERA12 using the ambient WS, the number of turbines in each grid cell, and the power curve (Fig. 5a). The WRF WFP generates power predictions (Fig. 5b), and we sum up the actual power production in each grid cell (Fig. 5c). We present the total 10 min simulated and observed power of the whole wind farm at the bottom of each panel in Fig. 5, and the total power production of the WFP run matches the observed. We then assemble the 576 10 min total power values over the 4-day period and compare the simulations to the observations (Fig. 6). We also calculate an error and a bias of modelled total power for each 10 min interval, summarizing as the daily root mean-squared errors (RMSEs) and average biases in Tables 4 and 5. The large average biases in Table 5 highlight the consistent power overestimation of the no-WFP runs. Over the 4-day period, the WFP produces total power of the whole wind farm that generally agrees with observation (Fig. 6c). Although the RMSEs between the no-WFP and WFP runs are comparable (Table 4), the average biases are smaller in the WFP simulations (Table 5). For instance, the ERA12WF slightly under-predicts total power by −4.9 MW on average (Figs. 6c and Table 5). The ERA12, by contrast, consistently over-predicts power production by 41.5 MW (Fig. 6a and Table 5). The daily positive biases of the ERA12 in the first 2 days are nearly 20 % of maximum wind farm production ( Table 5). The average positive power bias of 36.2 MW in the ERA22 is also remarkably larger than the mild negative bias of −15.1 MW in the ERA22WF (Fig. 6b, d, Table 5). Furthermore, the ERA12 and the GFS12  generally outperform the ERA22 and the GFS22 in power predictions, particularly in RMSE ( Fig. 6 and Table 5). However, on the last day, with more south-westerly flow, the ERA12 and the ERA22 outperform the ERA12WF and the ERA22WF, while the GFS12WF and the GFS22WF yield smaller errors and biases (Tables 4 and 5). Nonetheless, in aggregate, the simulations using the WFP predict wind-farm power production with more skill than simulations without the WFP. As demonstrated by the average absolute errors (Table 3), the WFP power simulations improve when using 12 m rather than 22 m vertical resolution (Fig. 6). Changing the vertical grid improves the predictions more than changing boundary conditions (Tables 4 and 5). In particular, in the ERA-Interim simulations, the RMSE each day decreases by 19-39 % when switching from ERA22WF to ERA12WF (Table 4, Fig. 6c, d). Since the power-prediction skill of the ERA-Interim-initiated runs and the GFS-initiated runs are comparable, the rest of the paper will focus on the WFP runs using the ERA-Interim as initial and boundary conditions. Moreover, to statistically differentiate the power productions from various model runs, we apply the two-sample Student's t test. The null hypothesis of a two-sample t test is that the two population means are the same, assuming the underlying distributions are Gaussian (Wilks, 2011). Hence, if the resultant p value is equal to or below 0.05, the two distributions are statistically significantly different at the 95 % confidence level. For example, the difference between the 4-day power-production averages from the ERA12 and from the ERA12WF is −46.8 MW, and the respective p value is 0 (Table 6). Thus the difference between the means is statistically significant. In other words, the ERA12 and the ERA12WF yield different power-production distributions at any confidence level. Similarly, the GFS12 and the GFS12WF lead to statistically different power outputs as the p value from t test is 0 as well (Table 7). We also use the two-sample t test to contrast the actual and the modelled power distributions. For instance, all the p values between the no-WFP runs and the observation are 0, implying those simulations yield powergeneration distributions significantly different from the reality (Table 8).
Given the utility of the WFP, assessing the interactions between atmospheric forcing and power production is an important step to further examine the performance of the WFP. As with the ERA12, the ERA12WF adequately simulates the evolution of the meteorological variables over the 4-day pe-   (Fig. 7a-d). Both the ERA12 and the ERA12WF capture the overall trends of hub-height ambient WS and WD measured by the WC (Fig. 7a and b), corresponding to Figs. 3 and 4. However, although the simulations suggest stronger TKE diurnal cycles than the observations, especially in the first 36 h, the simulated values follow the trends of the WCmeasured TKE (Fig. 7c). Although the magnitudes of the surface sensible heat flux of the surface flux station and the simulations differ, their signs change at similar times, particularly in the last 3 days (Fig. 7d). Hence the WRF model is capable of representing diurnal atmospheric stability changes. Note that in Fig. 7c, the lidar derives TKE using 2 min variances, which is intrinsically different from the modelled TKE, as discussed in Kumer et al. (2016) and Rhodes and Lundquist (2013). Hence, readers should focus on the general trends of the TKE time series, rather than their absolute values. The observed WS fluctuates more than the mesoscalesimulated WS during daytime (Fig. 7a). The ramp events, when the WS changes rapidly in a short period (Kamath, 2010;Potter et al., 2009), induce considerable swings in power production (Fig. 7e). The five distinct ramp events with sudden WS increases are from 00:00 to 01:00 UTC on 24 August, from 18:00 to 19:00 UTC 24 August, from 00:00 to 01:00 UTC 25 August, from 00:00 to 02:00 UTC 26 August, and from 00:00 to 02:00 UTC 27 August. Most of the ramp events are related to the LLJs (Fig. 3), and the simulated WS usually lags behind that observed (Fig. 7a). Therefore, the WFP under-predicts total power in nearly all the ramp events (Fig. 7e). Note that the measured WS ranges between the cut-in and rated speed of the wind turbine, when power production is highly sensitive to WS. The strong linkage between the temporal fluctuations of WS and power emphasizes the importance of accurate WS predictions. Along the same line, the WFP power performance changes in different meteorological conditions. To quantify WFP's skill, we use the bias in total power as a benchmark, calculated by subtracting the observed power from the WFP simulated power every 10 min (Fig. 8). In particular, in conditions of strong winds and weak turbulence, the WFP overestimates wake effects and thus underestimates power. However, for calm conditions with moderate or strong turbulence, the WFP tends to underestimate wake effects and thereby overpredicts power (Fig. 8a and c). Besides, the Pearson correlation coefficient between total power bias and WC-observed TKE is 0.48 (not shown).
On the contrary, WD and atmospheric stability have weaker influence on the skill of the WFP in general. The winds gradually rotate from south-easterly to south-westerly over this 4-day period while maintaining similar magnitudes of WS. During this direction shift, the WFP demonstrates a weakly positive power bias when the WD is strictly southerly, while the biases skew negative when the winds have a more easterly or westerly component (Fig. 8b). Similarly, the WFP power bias is generally unresponsive to stability changes, although biases tend to be small in strongly stable conditions (Fig. 8d). Moreover, strongly stable conditions tend to have stronger and more distinct wakes (Abkar and Porté-Agel, 2015b; Lee and Lundquist, 2017;Magnusson and Smedman, 1994;Rhodes and Lundquist, 2013).
To isolate the WFP errors in power predictions from the WRF model errors in simulating ambient winds, we analyse a subset of data where the winds are simulated accurately. When the absolute error in WS is smaller than 1 m s −1 and the absolute error in WD is smaller than 5 • , the relationships between power bias and WS, WD, and TI (Fig. 9ac) remain similar to the general trends shown in Fig. 8ac. The WS-power-bias and TI-power-bias correlations become stronger in this subset (Fig. 9a and c), compared to the correlations using all the data in the 4-day period (Fig. 8a  and c). Moreover, when considering only cases of accurate wind predictions, the correlation between power bias and stability increases from −0.06 (Fig. 8d) to −0.42 (Fig. 9d). In the few (27 10 min time steps) unstable conditions with accurate WS predictions, the power bias is generally positive, given moderate WS and high TI (Fig. 9a, c and d). In the stable regime, the WFP tends to underestimate power, regardless of WD ( Fig. 9b and d): 106 of the 125 stable data points are negatively biased. If the few strongest stability points (z L −1 larger than 0.55) are removed from the subset shown in Fig. 9d, a weakly negative correlation be- Figure 10. Scatter plot between the bias of the ERA12WF 10 min total power compared to observation, and its bias of the simulated hub-height WS in the closest grid cell to the WC. The r represents the Pearson correlation coefficient. tween power bias and stability emerges as the Pearson correlation coefficient becomes −0.61. Additionally, generally south to south-westerly flows yield stronger negative power biases (Fig. 9).
As expected, when the model properly simulates ambient WS, the WFP performs better. When the ERA12WF predicts larger WS than observed, the simulation over-predicts the total power. The positive WFP power bias corresponds to WS overestimation, and the negative bias is associated with WS underestimation (Fig. 10). Interestingly, when the error in simulated total power lies between ±30 MW, the error of the simulated WS is mostly within ±2 m s −1 (Fig. 10). However, the power bias does not seem to be related to WD or to ambient TKE: the correlation between the power bias and the simulated WD (TKE) bias is low, at 0.3 (0.22) (not shown). Although the simulated WD and TKE generally match the WC observations (Fig. 7b and c), the model's skill in simulating WD and TKE does not strongly influence the WFP's power performance.
Although the WFP omits sub-grid-scale wake interactions between the wakes of multiple turbines within a cell, this omission does not affect the accuracy of the ERA12WF in power prediction: the performance of the WFP is insensitive to the number of turbines per model grid cell. The turbinenormalized bias demonstrates no dependence on the number of turbines within the model grid cell (Fig. 11). Each whisker in Fig. 11 marks the maximum, the upper quartile, the median, the lower quartile, and the minimum of the average bias. Despite the large positive biases of the maxima, Figure 11. Box plot of the average bias of the ERA12WF simulated power across different numbers of wind turbine per WRF grid cell ( Fig. 1) every 10 min during the 4-day period. more than half of the average biases fall between ±1.5 MW, regardless of the numbers of turbines per cell (Fig. 11). Simulating one or four turbines in a grid cell (Fig. 1) does not influence the WFP's overall power-prediction performance in the cases shown here.
Furthermore, the WFP performance remains consistent between upwind and downwind turbines, based on their positions against the ambient winds (Fig. 12). Given the square shape of grid cells, we determine the sequential rows of turbines during strictly southerly flows, with WD between 175 and 185 • (Fig. 12a). The bulk of the normalized power biases fall within 0-0.4 MW, regardless of the upwind-downwind positions of turbines (Fig. 12b). Additionally, the power bias is independent of the mean distance between the actual turbine locations and the centre points of their respective grid cells (not shown).

Discussion
Herein, we compare WRF model simulations with different choices of vertical resolutions and boundary conditions. The evidence suggests that, at least for this onshore case with a strong diurnal cycle, the vertical resolution is more crucial than the choice of boundary conditions in simulating accurate winds and wind-power production. Shin et al. (2011) have explored the impacts of the lowest model level on the performance of various PBL schemes in the WRF model, suggesting that increasing the number of model layers can simulate regimes. In this study, we further illustrate that establishing more vertical levels in the boundary layer as well as the rotor layer improves the skill of the WRF model in simulating ambient WS, ambient WD, and wind power (Tables 3, 4, and 5). Furthermore, Carvalho et al. (2014) discussed the effects of different reanalysis datasets on wind-energy production estimates, and found the ERA-Interim presents the most pre-cise initial and boundary conditions, followed by the GFS. Herein, we test the ERA-Interim and the 0.5 • GFS, and both datasets produce simulations that resemble observed winds and power generations. Since the simulated power is sensitive to the resolution of the model vertical grid, particularly near the surface, future WRF WFP users should select vertical levels with care.
Additionally, the outcomes from the statistical tests among the model runs further validate the importance of using the WFP as well as using a fine vertical grid. From the Student's t test, the p values of all the no-WFP and WFP pairs are 0 (Tables 6 and 7), demonstrating that the differences between the power-generation distributions of the no-WFP runs and the WFP runs are statistically significant at any confidence level. Therefore, to accurately simulate power production, applying the WFP is better than not using it, regardless of the choice of vertical resolution and boundary condition, and the corresponding improvements in Tables 4 and 5 are statistically significant. Although the distinction between the GFS12WF and GFS22WF is not statistically significant at the 90 % confidence level (Table 7), switching from ERA22WF to ERA12WF improves power simulations significantly at 99 % confidence (Table 6). In particular, the RMSE drops by 19.1 MW and the bias reduces by 10.2 MW on average in the ERA12WF (Tables 4 and 5), and these are proven statistically significant.
Similarly, results from the statistical tests between the distributions of power from simulations and observations support the value of the WFP applied in a fine vertical grid. The p values of the ERA12WF-observed pair and the GFS12WFobserved pair are 0.106 and 0.167 respectively ( Table 8). The high p values illustrate that the distinctions between the distribution of observed power and the distributions of simulated power from the 12 m WFP simulations are not statistically significant, at the 90 % confidence level. Among all the simulations analysed above, running the WFP over the 12 m vertical grid is the only combination that is not statistically different from observations (Table 8). In other words, the 12 m WFP simulations provide the closest approximations to the actual power production, regardless of the boundarycondition dataset.
One of the objectives of this study is to propose general directions for improvements on the WFP. First of all, as the key determinant of wind-power production, WS plays a critical role. Ramp events pose a challenge to the WRF model in simulating WS as well as to the WFP in predicting power ( Fig. 7a and e). However, windy conditions of WS exceeding 10 m s −1 , although below the rated speed, lead to WFP power underestimation (Fig. 8a). Furthermore, the WFP performance depends more on the horizontal winds and turbulence, rather than their vertical components, since the power bias correlates more strongly with TI than TKE (Fig. 8c). Reducing turbulence diffusion in the WRF model could potentially yield more accurately simulated winds in stable conditions, including LLJs (Sandu et al., 2013); active research in modifying mixing lengths (Jahn et al., 2017) also suggests promising model improvements. More importantly, sharpening the skill of the WRF model in simulating WS can improve the WFP power performance (Fig. 10). Future versions of the WRF model and the WFP should aim to better account for instantaneous horizontal WS variations and the subsequent sub-grid wake interactions.
Besides, necessary improvements in simulating ambient WS, the WFP scheme itself also requires refinements. When background winds are accurately predicted, the power-bias dependence on WS and TI remains strong ( Fig. 9a and c). Moreover, the correlation between the WFP performance and atmospheric stability becomes weakly negative without the strongly stable data (Fig. 9d). Therefore, even when the simulated winds are close to observations, the WFP tends to underestimate power during high WS, low TI, and stable conditions. In contrast, the WFP tends to over-predict power in calm, unstable, and turbulent conditions, with the caveat that a small number of unstable cases are considered here. The WFP scheme appears to overestimate wake loss within a grid cell in stable and windy conditions and underestimate wake effects in an unstable and well mixed atmosphere. Certainly the interactions between WD and wind-farm layout affect the power-bias relationships, and further sensitivity tests can provide more insight into the WFP performance, particularly in intra-cell WS reduction. We demonstrate that inter-cell wake effects are not the critical factor to power error (Fig. 12b); hence, the inability of the WFP to simulate intra-cell wake effects can explain the biases when many of the turbines experience accurately simulated ambient flow.
In contrast, WD has no clear influence on the WFP skill ( Fig. 8b) in this case, although the irregular shape of the wind farm adds uncertainty to this relationship. Similarly, the skill of the WFP for this case is insensitive to the number of virtual turbines per cell, and the downwind position of turbines against inflow (Figs. 11 and 12). Compared to the power overestimation of downwind turbines in the idealized cases described in , both the upwind and downwind turbine-containing cells presented in this study have consistent positive biases on power production (Fig. 12). Our findings suggest that the WFP is skilful in simulating power of aggregate wind turbines and can represent the impact of inter-cell wakes on power. In the end, the primary limitation of the WFP is rooted in the ambient simulated WS in the WRF model.

Conclusion
The WFP scheme in the WRF model (version 3.8.1) provides a convenient way to represent wind farms and their meteorological impacts in the NWP models. However, its power predictions have not been verified for onshore wind farms or in a range of WS conditions. Herein, we evaluate the performance of the WFP in various atmospheric conditions to guide users of the WFP and to suggest future WFP advancements.
Using data from the CWEX-13 campaign, we select a 4day period, from 24 to 27 August 2013, for our case study, due to the consistent nocturnal LLJ occurrences. We use measurements from a profiling lidar, a scanning lidar, and a surface flux station to verify the ambient flows simulated by the WRF model. The wind farm of interest, located in central Iowa, consists of 200 1.5 MW wind turbines.
We explore the role of vertical resolution in the operation of the WRF WFP. We evaluate two vertical grids with 12 and 22 m resolution near the surface. We find that the finer vertical resolution produces simulations that agree better with observed WS, WD, and power than the simulations with coarser vertical resolution. Further, because the WFP accounts for wake effects on power production of downwind turbines, the use of the WFP enables more accurate power prediction, whereas simulations without the WFP generally over-predict power production. Statistically, the WFP simulations with a fine vertical grid, regardless of the boundary conditions, are the most skilful in simulating power.
The skill of the WFP varies with meteorological conditions. When the model simulates WS close to the observations, the WFP predicts power properly, making WS the critical factor in improving the WFP. Rapid temporal fluctuations in WS introduce errors in power simulations, especially during ramp events. Further, in windy, stable, and less turbulent conditions, the WFP tends to overestimate wake effects and thus underestimates power production. However, the WFP performance demonstrates no clear dependence on the number of turbines per model grid cell or the downwind distance of turbines with respect to the upwind ones.
In conclusion, we demonstrate the value of the WRF WFP and the importance of using a fine vertical grid. Since WS greatly affects the skill of the WFP, subsequent research could include evaluating the WFP for an even larger range of WS, especially at WS beyond the turbine cut-out speed (which would be 25 m s −1 in this case; no such high WSs were observed during the CWEX-13 campaign). Evaluating the performance of other wind-farm layouts in locations with complex terrain is also needed. Modifications in the inflow WS considered by the WFP, for example, considering the rotor equivalent wind speed (REWS) (Wagner et al., 2009), may bring promising improvements. More accurate power forecasts will help shape a more competitive wind-energy industry and further facilitate grid integration of wind energy (MacDonald et al., 2016).
Competing interests. The authors declare that they have no conflict of interest.