Mapping ecological indicators of human impact with statistical and machine learning methods: Tests on the California coast
Coastal ecosystems are exposed to multiple anthropogenic stressors such as fishing, pollution, and climate change. Ecosystem-based coastal management requires understanding where the combination of multiple stressors has large cumulative effects and where actions to address impacts are most urgently needed. However, the effects of multiple stressors on coastal and marine ecosystems are often non-linear and interactive. This complexity is not captured by commonly used spatial models for mapping human impacts. Flexible statistical and machine learning models such as random forests have therefore been used as an alternative modeling approach to identify important stressors and to make spatial predictions of their combined effects. Tests of such models' predictive skill, however, have been limited. We therefore tested how well ten statistical and machine learning methods predicted three ecological indicators of coastal marine ecosystem condition (kelp biodiversity, fish biomass, and rocky intertidal biodiversity) off California, USA. Spatial data representing anthropogenic stressors and ocean uses as well as natural gradients were used as predictors. The models' prediction errors were estimated by double spatial block cross-validation. The best models achieved mean squared errors about 25% lower than those of a null model for kelp biodiversity and fish biomass; none of the tested models predicted rocky intertidal biodiversity well. The models captured general trends, but not the local variability of the indicators. For kelp biodiversity, the best performing method was principal components regression. For fish biomass, the best performing method was boosted regression trees. However, after tuning, this model did not include any interactions between stressors, and ridge regression (a constrained linear model) performed almost as well.
Although flexible machine learning methods are in theory required to represent the complex relationships between stressors and ecosystem state revealed by experimental ecologists, this flexibility could not be harnessed with our data: the more flexible models overfitted because of small sample sizes and a low signal-to-noise ratio. The main challenge for harnessing the flexibility of statistical and machine learning methods to link ecological indicators and anthropogenic stressors is therefore obtaining more suitable data. In particular, better data describing the spatial and temporal distribution of human uses and stressors are needed. We conclude by discussing methodological implications for future research.
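
To illustrate the validation scheme named in the abstract, the following is a minimal sketch of double (nested) spatial block cross-validation with a comparison against a null model. It is not the study's implementation. Everything in it is an assumption made for illustration: the data are synthetic, there is a single hypothetical stressor predictor, sites are grouped into contiguous 100 km blocks along a one-dimensional coast, and a one-predictor ridge regression stands in for the ten methods tested. The outer loop holds out one block at a time to estimate prediction error; the inner loop tunes the ridge penalty using only the remaining blocks.

```python
import random
from statistics import mean


def fit_ridge_1d(x, y, lam):
    """Ridge regression with one centered predictor; lam shrinks the slope."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return lambda a: my + slope * (a - mx)


def block_cv_mse(rows, blocks, lam):
    """Leave-one-block-out mean squared error for a fixed penalty lam."""
    errors = []
    for held_out in sorted(set(blocks)):
        train = [r for r, b in zip(rows, blocks) if b != held_out]
        test = [r for r, b in zip(rows, blocks) if b == held_out]
        model = fit_ridge_1d([r[0] for r in train], [r[1] for r in train], lam)
        errors.extend((model(x) - y) ** 2 for x, y in test)
    return mean(errors)


def double_block_cv(rows, blocks, lambdas):
    """Outer loop estimates error; inner loop tunes lam on training blocks only."""
    model_sq, null_sq = [], []
    for held_out in sorted(set(blocks)):
        tr_rows = [r for r, b in zip(rows, blocks) if b != held_out]
        tr_blocks = [b for b in blocks if b != held_out]
        test = [r for r, b in zip(rows, blocks) if b == held_out]
        # Inner spatial block CV: the held-out block is never seen here.
        best_lam = min(lambdas, key=lambda lam: block_cv_mse(tr_rows, tr_blocks, lam))
        model = fit_ridge_1d([r[0] for r in tr_rows], [r[1] for r in tr_rows], best_lam)
        # Null model: predict the mean of the training indicator values.
        null_pred = mean(r[1] for r in tr_rows)
        for x, y in test:
            model_sq.append((model(x) - y) ** 2)
            null_sq.append((null_pred - y) ** 2)
    return mean(model_sq), mean(null_sq)


# Synthetic example (assumed values, not data from the study).
random.seed(0)
coast_km = [i * 10.0 for i in range(60)]        # site positions along the coast
blocks = [int(km // 100) for km in coast_km]    # six contiguous 100 km blocks
stressor = [random.random() for _ in coast_km]  # hypothetical stressor intensity
indicator = [2.0 - 1.5 * s + random.gauss(0, 0.3) for s in stressor]
rows = list(zip(stressor, indicator))

model_mse, null_mse = double_block_cv(rows, blocks, [0.0, 0.1, 1.0, 10.0])
skill = 1 - model_mse / null_mse  # fraction of null-model MSE removed
print(f"model MSE={model_mse:.3f}, null MSE={null_mse:.3f}, skill={skill:.2f}")
```

In this sketch, a skill of 0.25 would correspond to a mean squared error 25% lower than that of the null model. Holding out whole spatial blocks, rather than random sites, is intended to keep spatially autocorrelated neighbors from appearing in both the training and test sets.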