Machine Learning: Modeling with Random Forest Using Python
In my previous post, I introduced stepwise regression to select the best model. I suggested that grain yield = -4616.47 + 10.53 * stem biomass + 41.03 * height, indicating that stem biomass and height are the most important variables affecting grain yield.
■ Stepwise Regression: A Practical Approach for Model Selection using R
Now, I’ll find the best model using machine learning. This is a small dataset, which might not be suitable for machine learning, but it serves as an example to demonstrate the process.
import pandas as pd

# Example dataset: plant height, stem biomass, average grain weight (agw),
# grain number (gn), and grain yield for 20 plots
data = {
'height': [86, 102, 104, 99, 97, 97, 109, 104, 99, 91, 97, 99, 107,
104, 97, 97, 102, 99, 97, 104],
'stem_biomass': [351.2, 327.1, 263.7, 436.8, 358.6, 400.0, 421.4,
655.5, 367.0, 424.1, 295.7, 558.7, 480.6, 459.4,
291.9, 391.5, 461.7, 644.3, 488.2, 425.0],
'agw': [30.6, 49.1, 25.4, 29.2, 26.4, 28.5, 24.7, 23.8, 29.2, 28.7,
28.7, 26.7, 23.8, 25.3, 29.4, 28.6, 29.2, 20.8, 27.7, 27.6],
'gn': [14488, 11556, 10389, 23492, 22889, 20322, 16820, 42889, 17564,
20569, 15636, 32545, 23712, 30183, 15980, 20807, 20719, 34173,
23543, 28911],
'grain_yield': [2844.2, 2710.7, 2604.3, 3757.8, 2393.8, 3373.2,
4163.7, 6199.6, 2480.7, 3433.0, 2776.9, 5549.6,
5440.7, 4420.8, 2982.7, 3825.2, 4487.7, 6445.9,
4661.0, 4329.4]
}
df = pd.DataFrame(data)
df.head(5)
   height  stem_biomass   agw     gn  grain_yield
0      86         351.2  30.6  14488       2844.2
1     102         327.1  49.1  11556       2710.7
2     104         263.7  25.4  10389       2604.3
3      99         436.8  29.2  23492       3757.8
4      97         358.6  26.4  22889       2393.8
A DataFrame df is created from a dictionary data containing the features height, stem_biomass, agw, gn, and the target variable grain_yield.
# Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split, cross_val_score
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
The dataset df is split into training (train_df) and testing (test_df) sets, with 30% of the data held out for testing and a fixed random state to ensure reproducibility.
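As a quick sanity check, you can confirm the resulting sizes; with 20 rows and test_size=0.3, this yields 14 training rows and 6 testing rows:

# Sanity check: 14 rows for training, 6 for testing
print(train_df.shape)  # (14, 5)
print(test_df.shape)   # (6, 5)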
Random Forest
1. Creating and Evaluating the RandomForest Model:
# RandomForest Model
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

# Evaluating the model with cross-validation (default scoring for a regressor is R2)
scores = cross_val_score(rf, train_df.drop(columns='grain_yield'),
                         train_df['grain_yield'], cv=4, n_jobs=-1, verbose=1)
print("Cross-validation scores:", scores)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
Cross-validation scores: [-0.31028438 0.54567032 0.53145928 0.64151576]
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 4.8s finished
A RandomForestRegressor rf is created and evaluated with 4-fold cross-validation on the training data. For a regressor, cross_val_score uses the R² score by default, so each printed value is the R² on one held-out fold; a negative score, as on the first fold here, means those predictions were worse than simply predicting the mean, which is not unusual for such a small dataset. These scores give you an idea of the model's performance across different folds of the training data, help you understand the variability in performance, and show whether the model generalizes well to unseen data.
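If you want a single summary number, a minimal sketch like the one below averages the fold scores. Since we compare RMSE later, you can also cross-validate on RMSE directly through the scoring parameter (assuming a reasonably recent scikit-learn that supports 'neg_root_mean_squared_error'):

# Summarize the fold-level R2 scores
import numpy as np
print(f"Mean R2: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Optionally cross-validate on RMSE instead of the default R2
# (sklearn returns negated errors, so flip the sign)
rmse_cv = -cross_val_score(rf, train_df.drop(columns='grain_yield'),
                           train_df['grain_yield'], cv=4,
                           scoring='neg_root_mean_squared_error')
print("Cross-validated RMSE per fold:", rmse_cv)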
2. Fitting the Model:
# Fitting the model
rf.fit(train_df.drop(columns='grain_yield'), train_df['grain_yield'])
Next, the model is fitted on the full training set. Note that cross_val_score fits clones of the estimator, so this step is still required to train rf itself.
3. Summarizing Feature Importances:
# Summarizing the model
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
'feature': train_df.drop(columns='grain_yield').columns,
'importance': importances
}).sort_values(by='importance', ascending=False)
print(feature_importance_df)
        feature  importance
1  stem_biomass    0.632753
3            gn    0.161368
2           agw    0.115241
0        height    0.090638
The importance of each feature is extracted and displayed in a DataFrame, sorted by importance. This random forest model is a regression model that predicts grain_yield from several input features: height, stem biomass (stem_biomass), average grain weight (agw), and grain number (gn). The variable importance indicates how much each variable contributes to the prediction; features with higher importance values are more influential in predicting the target variable. This can help us understand which factors matter most for grain yield in this dataset.
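A bar chart often communicates these importances more clearly than a table. Here is a minimal sketch with matplotlib (assuming it is installed), reusing the feature_importance_df built above:

# Plot the built-in feature importances as a horizontal bar chart
import matplotlib.pyplot as plt
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'])
plt.gca().invert_yaxis()  # put the most important feature on top
plt.xlabel('Importance')
plt.title('Random Forest Feature Importances')
plt.tight_layout()
plt.show()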
4. Making Predictions and Calculating RMSE:
Predictions are made on both the training and testing sets. The Root Mean Squared Error (RMSE) for both sets is calculated and printed.
import numpy as np
from sklearn.metrics import mean_squared_error
# Predicting and calculating RMSE on training set
train_predictions = rf.predict(train_df.drop(columns='grain_yield'))
train_rmse = np.sqrt(mean_squared_error(train_df['grain_yield'], train_predictions))
print("Train RMSE:", train_rmse)
# Predicting and calculating RMSE on testing set
test_predictions = rf.predict(test_df.drop(columns='grain_yield'))
test_rmse = np.sqrt(mean_squared_error(test_df['grain_yield'], test_predictions))
print("Test RMSE:", test_rmse)
Train RMSE: 257.42317721428986
Test RMSE: 482.2626543598101
The predictions on both the training and testing sets can be used to analyze how well the model is performing. By comparing the predicted values to the actual grain_yield values, you can identify any patterns or discrepancies.
The root mean squared error (RMSE) on the training set indicates how well the model fits the training data. A lower RMSE value means the model predictions are closer to the actual values in the training set. In the same way, the RMSE on the testing set indicates how well the model performs on unseen data. Comparing the training and testing RMSE values can help you detect overfitting. If the training RMSE is much lower than the testing RMSE, the model might be overfitting the training data.
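To complement RMSE, a short sketch like this compares R² on both sets and lists predicted vs. actual values on the test set, reusing the objects already created above:

# Compare R2 on the training and testing sets
from sklearn.metrics import r2_score
print("Train R2:", r2_score(train_df['grain_yield'], train_predictions))
print("Test R2:", r2_score(test_df['grain_yield'], test_predictions))

# Inspect predicted vs. actual grain yield on the test set
print(pd.DataFrame({'actual': test_df['grain_yield'],
                    'predicted': test_predictions.round(1)}))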
5. Calculating Permutation Importance:
# Permutation importance
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, test_df.drop(columns='grain_yield'),
                                test_df['grain_yield'], n_repeats=10,
                                random_state=42, n_jobs=-1)
sorted_idx = result.importances_mean.argsort()
importance_df = pd.DataFrame({
'feature': test_df.drop(columns='grain_yield').columns[sorted_idx],
'importance': result.importances_mean[sorted_idx]
}).sort_values(by='importance', ascending=False)
print(importance_df)
        feature  importance
3  stem_biomass    0.775397
2            gn    0.145292
1           agw    0.065678
0        height    0.005817
The permutation importance results provide another measure of feature importance by evaluating the effect of shuffling each feature on the model’s performance. This can give you a more robust understanding of feature importance and help confirm the results from the built-in feature importance measure of the random forest.
In other words, permutation importance evaluates a feature by measuring how much the model's prediction error increases after that feature's values are shuffled. Computed here on the test set, it gives a more accurate representation of each feature's importance with respect to the model's performance on unseen data.
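Because each feature is shuffled n_repeats=10 times, permutation_importance also reports the spread of the importances across repeats; a minimal sketch to print the mean together with its standard deviation:

# Report mean +/- standard deviation of the permutation importances
features = test_df.drop(columns='grain_yield').columns
for i in sorted_idx[::-1]:  # most important first
    print(f"{features[i]}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")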
Key Differences between ‘Model Feature Importance’ and ‘Permutation Importance’
- Model Feature Importance:
- Calculated during training.
- Measures the importance of each feature in reducing impurity across all trees in the forest.
- Does not account for feature interactions and how the model performs on unseen data.
- Permutation Importance:
- Calculated on the test set.
- Measures the impact on the model’s performance when each feature is shuffled.
- Provides a more realistic measure of feature importance in the context of generalization to new data.
If you use rf.feature_importances_ to evaluate the test set, you are not considering how the features interact with each other or how they contribute to the model's performance on unseen data. This can give you a biased view of feature importance because it is based on the training data.
Therefore, it would be better to use permutation importance for evaluating feature importance on the test set, because it more accurately reflects feature importance in the context of model performance on unseen data. Model feature importance is still useful for understanding how features contribute during training, but interpret it cautiously, as it may not fully capture a feature's effect on new data. In short, while both methods provide valuable insights, permutation importance is generally preferred on the test set.
Which Importance to Use
- For Model Interpretation: If your goal is to understand how the model makes predictions during training, use rf.feature_importances_. This is useful for understanding the model's internal workings and how it uses the features provided.
- For Generalization and Real-World Performance: If your goal is to understand which features are important for the model's performance on new, unseen data, use permutation importance. This is more reflective of the model's behavior in practical applications.
Practical Recommendation
- Use permutation importance for evaluating feature importance in the context of the model’s performance on new data. This helps ensure that your conclusions about feature importance are applicable to real-world scenarios and not just specific to the training data.
- Consider both importances: It can also be insightful to consider both importances together to get a complete picture, as sketched below. The discrepancies can provide valuable insights into how the model may be overfitting to certain features in the training data and how robust these features are when applied to new data.
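A minimal sketch for that side-by-side view, reusing the two DataFrames built above:

# Merge the built-in and permutation importances for direct comparison
comparison_df = feature_importance_df.merge(
    importance_df, on='feature', suffixes=('_model', '_permutation'))
print(comparison_df)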
Conclusion: Feature Relationships and Potential Overfitting
By examining the feature importance and the predictions, we can gain insights into the relationships between the features and the target variable. This can help us understand the dynamics of the data and potentially identify areas for improvement or further investigation.
Comparing the training and testing RMSE values helps you detect overfitting. If the model performs well on the training data but poorly on the testing data, it may indicate that the model is overfitting and not generalizing well to new data.
Overall, this random forest model can provide valuable insights into the factors affecting grain yield and help us make informed decisions based on the predictions and feature importance analysis.
In stepwise regression, the most important factors were stem biomass and crop height, but in the Random Forest model they were stem biomass and grain number. As a crop physiologist, I find stem biomass and grain number more convincing than crop height.
Full code summary: https://github.com/agronomy4future/python_code/blob/main/Machine_Learning_RandomForest.ipynb