2 What is machine learning?

Authors: Natalie G. Nelson, Shih-Ni Prim, Sheila Saia, Khara Grieger, Anders Huseth

Machine learning methods teach computer models to make predictions from patterns in data (Zhi et al., 2024). To find patterns in data, numerous mathematical calculations are required. Computers are essential for developing machine learning models, and advances in computing have given rise to machine learning (Jordan and Mitchell, 2015).

2.1 How is machine learning useful for natural resources management?

Historically, environmental and agricultural systems were primarily modeled with “process-based” or “mechanistic” models, which simulate key system processes based on their underlying physics, chemistry, or biology (Haefner, 2005). Models were often used to fill gaps in direct measurements of various environmental and agricultural phenomena. In the past, collecting any type of environmental or agricultural data was grueling, time-consuming, and done by people. Today, some variables still require tough and tedious work to measure, but many are now readily measured with sensors and other automated instruments. As an example of how process-based models are used, let’s consider the flow of water through an open channel. If you were estimating the velocity of water through an open channel using a process-based approach, you might use Manning’s Equation, which estimates water velocity as a function of the channel’s physical properties, such as its slope, roughness, and cross-sectional geometry. To apply Manning’s Equation, you would have to take measurements of the channel’s dimensions.
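As a rough sketch of the process-based approach, Manning’s Equation in SI units can be written as V = (1/n) R^(2/3) S^(1/2), where n is the Manning roughness coefficient, R is the hydraulic radius (flow area divided by wetted perimeter), and S is the channel slope. The function names, the rectangular channel shape, and the example values below are illustrative assumptions, not measurements from a real channel.

```python
def hydraulic_radius_rectangular(width_m: float, depth_m: float) -> float:
    """Hydraulic radius R = flow area / wetted perimeter for a rectangular channel."""
    area = width_m * depth_m
    wetted_perimeter = width_m + 2.0 * depth_m
    return area / wetted_perimeter


def manning_velocity(roughness_n: float, hydraulic_radius_m: float, slope: float) -> float:
    """Mean water velocity (m/s) from Manning's Equation in SI units."""
    if roughness_n <= 0 or hydraulic_radius_m <= 0 or slope < 0:
        raise ValueError("roughness and hydraulic radius must be positive; slope must be non-negative")
    return (1.0 / roughness_n) * hydraulic_radius_m ** (2.0 / 3.0) * slope ** 0.5


# Hypothetical channel: 2 m wide, 0.5 m deep, roughness 0.03, slope 0.001 (m/m).
radius = hydraulic_radius_rectangular(width_m=2.0, depth_m=0.5)
velocity = manning_velocity(roughness_n=0.03, hydraulic_radius_m=radius, slope=0.001)
print(f"Hydraulic radius: {radius:.3f} m, estimated velocity: {velocity:.2f} m/s")
```

Every input here is a physical property of the channel that someone would need to measure or look up, which is the defining feature of the process-based approach.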

However, with machine learning, you do not have to simulate environmental and agricultural variables using underlying processes. Instead, you can create a model that makes predictions from patterns in data. Machine learning models require that you have some measurements of your response or target variable, which is the variable you are seeking to estimate. In our current example, water velocity in the channel is the target variable. You also need measurements of predictor variables, or variables that you will use to predict your response.

As one hypothetical example, a machine learning model of water velocity in an open channel might make predictions based on images instead of physical channel properties. A camera could be pointed toward the channel to take time-lapse photos every few minutes. A machine learning model could then be developed to predict water velocity based on the appearance of the water in the images. For a real-world example of this, check out Chapman et al. (2024), “Stage and discharge prediction from documentary time-lapse imagery,” in PLoS ONE.

To create such a machine learning model, water velocity measurements would be needed to create a training set, or data used to train or develop the machine learning model. The water velocity measurements would be collected at the same time as the images, so the images could then be directly related to water velocity measurements (i.e., when the image looks like this, the water velocity is that). The predictor variables would be derived from the images. A machine learning algorithm could be selected to search for patterns between the images – e.g., the shading/color of individual image pixels, relationships between neighboring pixels – and the water velocity measurements. Once these patterns are established, they can be used to estimate water velocity from new images as they are collected. Importantly, the image-based machine learning model knows nothing about the underlying physics controlling water velocity in the channel; it has simply learned that certain patterns in the images correspond to higher or lower water velocities.
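The idea of learning from paired images and measurements can be sketched with a k-nearest-neighbors regressor. This is only one of many possible algorithms and is not necessarily the one used in any study cited here. The sketch assumes each image has already been reduced to two hypothetical features (mean brightness and the fraction of white-water pixels); the feature names and all numbers are invented for illustration.

```python
import math

# Hypothetical training set: (mean_brightness, whitewater_fraction) per image,
# paired with a velocity measurement (m/s) taken at the same time.
# All values are invented for illustration; they are not real measurements.
training_features = [
    (0.20, 0.01),
    (0.25, 0.03),
    (0.40, 0.10),
    (0.55, 0.22),
    (0.60, 0.30),
    (0.75, 0.45),
]
training_velocities = [0.10, 0.15, 0.40, 0.80, 1.00, 1.50]


def predict_velocity(features, k=3):
    """Average the velocities of the k training images most similar to `features`."""
    distances = [
        (math.dist(features, known), velocity)
        for known, velocity in zip(training_features, training_velocities)
    ]
    distances.sort(key=lambda pair: pair[0])  # most similar images first
    nearest = distances[:k]
    return sum(velocity for _, velocity in nearest) / len(nearest)


# Features derived from a new image (also hypothetical).
new_image_features = (0.58, 0.25)
print(f"Predicted velocity: {predict_velocity(new_image_features):.2f} m/s")
```

Nothing in this model refers to slope, roughness, or channel geometry. It simply predicts that a new image resembles the training images closest to it, along with their measured velocities.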

Machine learning models can readily be expanded to include many different types of predictor variables. For example, the image-based machine learning model for water velocity could be extended to include additional predictors like rainfall, irrigation, and time of year. This flexibility allows machine learning models to consider many diverse streams of information when learning patterns from which to make predictions. For additional examples of machine learning models developed for natural resource management applications, see chapter 4.

2.2 Pros and cons

Because they make predictions from correlative patterns between the predictor and response variables, machine learning models have parallels to simpler statistical models like linear regression. However, machine learning models can consist of hundreds, thousands, or millions of individual equations, while a simple linear regression model consists of only one equation. Because of their many equations, machine learning models are commonly considered “black boxes”: peering into a machine learning model can feel like looking into a pitch-black box, where you don’t precisely know what’s inside. Methods now exist to help us illuminate machine learning black boxes, and the field of “interpretable” or “explainable” machine learning has made great strides toward better understanding the inner workings of these models (Samek et al., 2021). Still, machine learning models are substantially more challenging to interpret than most alternative models.
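To make the “one equation” point concrete, here is a minimal sketch of a simple linear regression fit by ordinary least squares. The fitted model is fully described by a single equation, y = intercept + slope × x, which is why it is comparatively easy to inspect. The predictor (rainfall), the response (velocity), and all data values are hypothetical.

```python
def fit_line(x_values, y_values):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: daily rainfall (mm) and measured water velocity (m/s).
rainfall_mm = [0.0, 5.0, 10.0, 20.0]
velocity_m_s = [0.20, 0.30, 0.45, 0.60]

intercept, slope = fit_line(rainfall_mm, velocity_m_s)
print(f"velocity = {intercept:.3f} + {slope:.3f} * rainfall")
```

The two printed numbers are the entire model. A machine learning model trained on the same data could involve far more fitted quantities, which is what makes it harder to read directly.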

While challenges to interpretation are a clear pitfall of machine learning models, their complicated structures allow them to pick up on relatively subtle or nuanced relationships in datasets, making them effective predictive tools (Thessen, 2016). In many cases, machine learning models outperform process-based models (i.e., their predictions can be more accurate than the predictions from other models). The ability to create strong predictions is arguably the hallmark strength of machine learning. The previously described flexibility to include many diverse data types (e.g., images, sensor data, weather station observations) as predictors is also a key strength. However, because they do not necessarily account for underlying processes, machine learning models are at risk of making inaccurate or misleading predictions, particularly if developed irresponsibly. Further, even with an accurate model, the data that feed into it need to be appropriate for the machine learning task. In chapter 3, we include questions you can ask to evaluate whether a model was developed responsibly, assess whether it is vulnerable to making misleading predictions, and understand whether the data are suitable for the model.

2.3 Classes of machine learning models

In the previous example on image-based predictions of water velocity, we described a supervised learning approach. Simply put, supervised learning applies when there are known answers for the model to learn from; that is, when measurements of the response variable are available (Sarker, 2021). In the water velocity example, water velocity measurements were used during model development so that the model could learn which patterns in the images predict water velocity. Once a supervised learning model is developed, its predictions can be compared with measurements, ideally ones that were not used for training, to assess how well the model predicts.
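One common way to compare predictions with measurements is the root mean square error (RMSE). RMSE is only one of several possible metrics, and the velocity values below are hypothetical.

```python
import math


def rmse(predicted, measured):
    """Root mean square error between paired predictions and measurements."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical velocities (m/s): measurements held back from training,
# and the model's predictions for the same moments in time.
measured_velocity = [0.30, 0.75, 1.20]
predicted_velocity = [0.35, 0.70, 1.10]

print(f"RMSE: {rmse(predicted_velocity, measured_velocity):.3f} m/s")
```

RMSE is expressed in the same units as the response variable, so a smaller value means the predictions sit closer to the measurements on average.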

Unsupervised learning, on the other hand, is used for tasks in which there are no known answers for the model to learn from. For example, instead of using the images of the stream to estimate water velocity, an unsupervised machine learning model could be used to group, or cluster, images that share similarities. Unsupervised learning is often used as a computer-assisted way of exploring patterns in data (Sarker, 2021). Sometimes, the groups or clusters uncovered by an unsupervised learning model can be assigned labels that are of use for other modeling or analysis efforts.
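Clustering can be sketched with k-means, one widely used unsupervised algorithm among many. As before, each image is assumed to have been reduced to two hypothetical features, and all values are invented. Note that no velocity measurements appear anywhere in this example.

```python
import math


def assign_to_nearest(points, centers):
    """Return, for each point, the index of its nearest cluster center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
        for point in points
    ]


def k_means(points, initial_centers, iterations=10):
    """Plain k-means: alternate between assigning points and moving centers."""
    centers = [tuple(center) for center in initial_centers]
    for _ in range(iterations):
        labels = assign_to_nearest(points, centers)
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assign_to_nearest(points, centers), centers


# Hypothetical image features: (mean_brightness, whitewater_fraction).
image_features = [
    (0.20, 0.01), (0.22, 0.02), (0.25, 0.03),
    (0.70, 0.40), (0.75, 0.45), (0.72, 0.42),
]

# Starting centers are chosen by hand here; real workflows often pick them randomly.
labels, centers = k_means(image_features, initial_centers=[image_features[0], image_features[3]])
print("Cluster label per image:", labels)
```

The cluster numbers carry no meaning on their own. A person might later inspect the images in each group and decide, for instance, that one group looks like calm water and the other like turbulent water.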

2.4 An example of supervised and unsupervised learning

The National Land Cover Database or Dataset (NLCD) is created by the U.S. Geological Survey and its partners to map land cover across the contiguous U.S. The data are updated every few years, creating a historical record of land cover change across the country over time. In the 1970s and 1980s, land cover mapping was performed by manually delineating areas from aerial photographs. In 1992, the NLCD premiered a land cover data product that was created from Landsat satellite imagery. Landsat imagery has pixels that are 30 meters by 30 meters. To create the 1992 product, Landsat imagery was clustered into 100 groups using an unsupervised learning approach, and people then evaluated the 100 groups and manually assigned them to land cover categories (e.g., forest, developed area). Later, to create the 2001 NLCD product, a supervised learning approach was used in which areas of known land cover were used to train a model to predict land cover categories based on Landsat imagery. Read more about the history of the NLCD program in Chapter 18 of The Nature of Geographic Information (DiBiase et al.).

2.5 Algorithms

Within the two broad classifications of machine learning (supervised and unsupervised), there is an overwhelming number of machine learning algorithms, or specific computer model frameworks for implementing machine learning. The sheer number of machine learning algorithms attests to the power and versatility of machine learning. This primer does not catalog different machine learning algorithms. As a starting point, you can see some algorithm examples in “Which machine learning algorithm should I use?” by SAS blogs and the “Machine Learning Cheat Sheet” by DataCamp.

2.6 A deeper dive

If you would like to learn more about machine learning, check out An Introduction to Statistical Learning by James et al. (2013). This book provides a wealth of information on machine learning models, and the authors assume that readers are mainly interested in applying, rather than deeply studying, machine learning models. Free PDF versions of the book are available online, with editions that present examples in R and in Python.