3 Key considerations
Authors: Natalie G. Nelson, Shih-Ni Prim, Sheila Saia, Khara Grieger, Anders Huseth
Here, we offer some key questions natural resource management practitioners can ask before using or developing a machine learning model to inform day-to-day decision-making. Importantly, these questions are based on our own opinions and experiences, and should not be taken as definitive criteria. The questions are provided to help you start thinking critically through the process of using machine learning model outputs.
Why was a machine learning model used instead of a process-based model or more easily interpretable statistical model?
Because process-based or mechanistic models capture the underlying physical, biological, and chemical processes driving agricultural and environmental phenomena, they are, in theory, less vulnerable to making spurious predictions than machine learning models. However, process-based model development is often time- and resource-intensive, and some agricultural and environmental phenomena are not well predicted or explained by process-based models. Generally, the more a system is dominated by physical processes, the more predictable its behaviors are with process-based models (Haefner 2005), because physical processes are often well represented by established mathematical formulations. On the other hand, for behaviors that are largely driven by system biology or ecology, process-based models can perform poorly because biological and ecological processes are more challenging to robustly and accurately represent with mathematical equations. For example, it is reasonable to assume that it will be easier to simulate changes in a lake's water levels than the amount of algae in the lake.
In cases when a physical or process-based model could have been developed instead of a machine learning model, it is reasonable to question whether the machine learning model is suitable to support decision-making. Similarly, if an easily interpretable model like a linear regression model can be used with adequate performance, that is generally preferable, as model users will have greater understanding of how the model operates.
How much data are needed to develop a machine learning model?
There is no strict minimum dataset size for developing a machine learning model, but machine learning models should generally only be developed from large datasets (e.g., thousands or millions of data points). Since machine learning models learn patterns in data, there must be enough data from which patterns can be learned. Importantly, the data should also be accurate, to ensure the machine learning model is developed from quality information. Because of these dataset size requirements, in environmental and agricultural systems we often see machine learning models developed with data from sensors and imagery, as such data collection systems produce large volumes of data.
What data were used to train the model?
This is probably the most important question to ask. Because machine learning models make predictions from patterns in data, the data provided to train the model dictate the types of predictions a machine learning model can make. If a training dataset is narrow (e.g., it includes a short time period, a time period with little environmental variation, a limited number of locations, etc.), the model's predictions will not generalize. If a machine learning model was developed with data from corn fields in the Midwest, will its predictions apply to corn fields in Georgia? In particular, ask yourself (or the model developers): what is missing from the training data? For example, if you're interested in predicting the effects of drought, how many drought periods are included in the data, and how severe were the droughts? Even if the training data are spatially extensive, think critically about whether the data may be biased or unrepresentative in ways that limit how well the model generalizes.
For example, research has shown that surface water quality monitoring in the Southeast U.S. disproportionately occurs in more affluent areas (Oates et al. 2024), meaning that areas of greater social vulnerability have fewer monitoring stations. If a machine learning model were developed using all available water quality data from the Southeast to predict water quality in unmonitored areas, it might give the impression that the training data comprehensively represent the region. However, the model would still be influenced by any underlying biases in the training data. If a model developer is unable to disclose the data sources (e.g., because of data ownership/privacy concerns), they should still be able to provide information on general data characteristics such that you can assess whether the training data apply to your system of interest.
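One simple way to put this question into practice is to compare the conditions at your site against the range of conditions in the training data; predictions for inputs outside that range are extrapolations and deserve extra scrutiny. The sketch below illustrates the idea with hypothetical predictor names and made-up values (none of these numbers come from a real dataset):

```python
import numpy as np

# Hypothetical training data summaries for two predictors
# (values are illustrative only, not from a real model).
train = {
    "rainfall_in": np.array([0.0, 0.5, 1.2, 2.0, 3.1]),
    "temp_f": np.array([55.0, 62.0, 70.0, 78.0, 85.0]),
}

# Conditions at the new site where we want a prediction.
new_site = {"rainfall_in": 4.5, "temp_f": 72.0}

# Flag any predictor value that falls outside the training range,
# since the model never "saw" such conditions during training.
within_range = {}
for name, values in train.items():
    lo, hi = values.min(), values.max()
    within_range[name] = lo <= new_site[name] <= hi
    status = "within" if within_range[name] else "OUTSIDE"
    print(f"{name}: {new_site[name]} is {status} the training range [{lo}, {hi}]")
```

In this made-up example, the new site's rainfall exceeds anything in the training data, so a prediction there would rest on patterns the model never learned. A range check like this is a coarse screen; it does not detect subtler biases, such as the monitoring-station disparities described above.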
How was the model tested or evaluated? Were any measures of uncertainty reported?
To understand how well a machine learning model performs, it is important to understand the model's error, or the difference between the model's predictions and true measurements. Error can be reported in many ways. One common metric is the Root Mean Square Error (RMSE), which describes the typical magnitude of the model's error. The RMSE is reported in the same units as the target variable, making it straightforward to interpret. Metrics like the coefficient of determination, or R2, are also commonly reported. Understanding a model's error is key to assessing its accuracy and reliability.

In addition to error, modelers will ideally report measures of uncertainty. Uncertainty describes a range of plausible model outcomes. For example, let's say you are running a machine learning model that predicts daily plant water demand, and one of the predictors is rainfall. If you expect there could be 1-1.5" of rain today, you could run the model with 1" of rain or with 1.5" of rain; both are plausible. The difference in predicted plant water demand between the two scenarios describes the uncertainty in the model output. Consideration of uncertainty is a hallmark of responsible modeling.
Are the model developers transparent about limitations and model performance?
Be wary of salesmanship. Responsible modelers will clearly articulate limitations of the training data and areas of poor model performance. A model developer should be able to tell you which types of outcomes the model predicts well and which it does not.
How did the modelers choose predictors? Do they seem to understand some of the underlying science or processes of the agricultural or environmental system they are making predictions for?
Ideally, when predicting agricultural or environmental system behaviors in a machine learning model, there should be sound scientific reasoning regarding the selection of predictors. Judicious selection of predictors can also help to avoid misleading predictions.
What do the modelers know about predictor importance?
Once a machine learning model is developed, there are methods available for identifying the predictors that carry the most weight, or are the most "important," in the model. It can be helpful to understand which predictors are the most important, as you can then use your own understanding of environmental and agricultural system dynamics to corroborate whether the most important variables make sense. If an obscure variable is the most important in the model, it's reasonable to question why, and whether its prominence could negatively affect model performance.
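One widely used approach is permutation importance: shuffle one predictor at a time and measure how much model performance degrades; predictors whose shuffling hurts performance the most are the most important. The sketch below demonstrates the idea on a synthetic dataset with invented predictor names (a simple linear model is used here so the example needs no machine learning library):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: the outcome is driven mostly by rainfall and only
# weakly by an "obscure" predictor. All values are simulated.
n = 200
rainfall = rng.uniform(0, 2, n)
obscure = rng.normal(0, 1, n)
outcome = 2.0 * rainfall + 0.1 * obscure + rng.normal(0, 0.2, n)

# Fit a simple linear model (stand-in for any trained model).
X = np.column_stack([rainfall, obscure])
design = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)

def predict(X):
    return X @ coef[:2] + coef[2]

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

baseline = r2(outcome, predict(X))

# Permutation importance: shuffle each predictor and record the drop
# in R2. Bigger drops mean the model relies more on that predictor.
drops = {}
for j, name in enumerate(["rainfall", "obscure_index"]):
    X_shuffled = X.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    drops[name] = baseline - r2(outcome, predict(X_shuffled))
    print(f"{name}: R2 drop = {drops[name]:.3f}")
```

As expected in this simulated example, shuffling rainfall collapses model performance while shuffling the obscure predictor barely matters, matching what a domain expert would anticipate. When importance rankings contradict domain knowledge, that is a signal to dig deeper.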
How will the outputs of the model be used?
Machine learning models should not be used to replace human decision-making in natural resources management, but they can be used as a supportive tool. Machine learning models can synthesize large amounts of information and identify patterns that are not readily apparent to the average person, making them great decision-support tools. However, keep in mind that training data are almost always biased in some way, and those biases will propagate to the predictions.
How will the model be run, or how will the outputs be shared?
Developing a machine learning model and creating a user-friendly system for applying and/or accessing outputs from a machine learning model are entirely different tasks. In some cases, a model can be developed and then shared via computer code, but end users may struggle to run the model using provided computer code, depending on their familiarity with programming. If you are working with a partner (e.g., university, consulting group) to produce a model, ensure there is a plan for transitioning the model such that you can use it with ease. Our team has created a resource for those interested in creating apps for disseminating models, Ten Simple Rules for Researchers Who Want to Develop Web Apps, which may provide helpful tips if your team is interested in exploring the use of web apps for model and model output dissemination.