Machine Learning: Predicting Values with Multiple Models- Part II

August 4, 2024 JK

In my previous post, I predicted grain weight from length and width of grains using Random Forest.

■ Machine Learning: Predicting Values with Multiple Models- Part I

Now, my next question is how the model accuracy changes when grain area and genotype are added. If you followed my previous post closely, you should be able to understand the code below.

■ Data upload

import pandas as pd
import requests
from io import StringIO

github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/ML_Practice_Grain_Dimension.csv"
response=requests.get(github)
df=pd.read_csv(StringIO(response.text))

df['Genotype'] = df['Genotype'].replace({'cv1': 0, 'cv2': 1, 'cv3': 2, 'cv4': 3, 'cv5': 4})

print(df.head(5))
   Genotype  Length_mm  Width_mm  Area_mm_2  GW_mg
0         0       4.86      2.12       7.56  14.44
1         0       5.01      1.94       7.25  13.67
2         0       5.54      1.97       7.99  15.55
3         0       3.84      1.99       5.86  10.32
4         0       5.30      2.06       8.04  15.66

■ Data Splitting

# input
input= df.drop(columns=['GW_mg'])

print(input.head(5))
   Genotype  Length_mm  Width_mm  Area_mm_2
0         0       4.86      2.12       7.56
1         0       5.01      1.94       7.25
2         0       5.54      1.97       7.99
3         0       3.84      1.99       5.86
4         0       5.30      2.06       8.04

# target
target= df['GW_mg']

print(target.head(5))
0    14.44
1    13.67
2    15.55
3    10.32
4    15.66

Unlike in the previous post, the input now includes genotype and grain area in addition to length and width.

# Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(input, target, test_size=0.3, random_state=42)
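With test_size=0.3, roughly 30% of the rows are held out for testing, and random_state=42 fixes the shuffle so the same rows land in each set on every run. The sketch below checks both properties on a made-up 200-row table. The column names mirror this post, but the values are random and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data for illustration only (not the grain dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Length_mm': rng.uniform(4, 8, 200),
    'Width_mm': rng.uniform(2, 4, 200),
})
demo['GW_mg'] = rng.uniform(10, 50, 200)

X = demo.drop(columns=['GW_mg'])
y = demo['GW_mg']

# Two splits with the same seed should select identical rows
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.3, random_state=42)

print("train rows:", len(X_train_a), "test rows:", len(X_test_a))
print("same test rows in both splits:", X_test_a.index.equals(X_test_b.index))
```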

■ Machine Learning Algorithm

1. Random Forest

from sklearn.ensemble import RandomForestRegressor
rf= RandomForestRegressor()
rf_model= rf.fit(train_input, train_target)
accuracy= rf_model.score(test_input, test_target)

print("Accuracy:", accuracy)
Accuracy: 0.9999991892253699
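For scikit-learn regressors, score() returns the coefficient of determination (R²), so the "accuracy" printed here is R² rather than a classification accuracy. As a sanity check, the sketch below computes R² = 1 - SS_res / SS_tot by hand and compares it with score(). The data are made up (an assumed linear relationship plus noise) and serve only to illustrate the calculation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: hypothetical "area" values and a noisy linear "weight"
rng = np.random.default_rng(1)
X = rng.uniform(5, 20, size=(100, 1))
y = 2.5 * X[:, 0] + rng.normal(0, 1.0, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("manual R2:", r2_manual)
print("score()  :", model.score(X, y))
```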

2. K-nearest neighbors

from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor()
knr_model = knr.fit(train_input, train_target)
accuracy = knr_model.score(test_input, test_target)

print("Accuracy:", accuracy)
Accuracy: 0.9999346802743253

3. Linear Regression

from sklearn.linear_model import LinearRegression
lr= LinearRegression()
lr_model = lr.fit(train_input, train_target)
accuracy = lr_model.score(test_input, test_target)

print("Accuracy:", accuracy)
Accuracy: 0.9977757135849683

4. Decision Tree

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
dt_model= dt.fit(train_input, train_target)
accuracy = dt_model.score(test_input, test_target)

print("Accuracy:", accuracy)
Accuracy: 0.9999985475334275

5. Ridge regression

from sklearn.linear_model import Ridge
ridge = Ridge()
ridge_model= ridge.fit(train_input, train_target)
accuracy = ridge_model.score(test_input, test_target)

print("Accuracy:", accuracy)
Accuracy: 0.997775729560088

6. Lasso regression

from sklearn.linear_model import Lasso
lasso = Lasso()
lasso_model= lasso.fit(train_input, train_target)
accuracy = lasso_model.score(test_input, test_target)

print("Accuracy:", accuracy)
Accuracy: 0.9969952312548194

Let’s compare the accuracy obtained from length and width alone with the accuracy obtained from length, width, area, and genotype. Every model improved with the extra inputs, and the linear models improved the most (Linear Regression rose from 0.878 to 0.998). I’ll choose the Random Forest model again.

Model	Accuracy (using length, width)	Accuracy (using length, width, area and genotype)
Random Forest	0.967	>0.999
K-nearest neighbors	0.963	>0.999
Linear Regression	0.878	0.998
Decision Tree	0.955	>0.999
Ridge	0.878	0.998
Lasso	0.848	0.997
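The six fits above all follow the same fit-then-score pattern, so they can also be run in one loop. The sketch below is one possible way to do that, not the code used to produce the table. To stay self-contained it uses made-up data (a hypothetical grain weight roughly proportional to area), so its scores will differ from the values reported here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

# Made-up data for illustration only (not the grain dataset)
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    'Genotype': rng.integers(0, 5, n),
    'Length_mm': rng.uniform(4, 8, n),
    'Width_mm': rng.uniform(2, 4, n),
})
demo['Area_mm_2'] = 0.75 * demo['Length_mm'] * demo['Width_mm']  # assumed shape factor
demo['GW_mg'] = 2.0 * demo['Area_mm_2'] + rng.normal(0, 0.5, n)  # assumed relationship

X = demo.drop(columns=['GW_mg'])
y = demo['GW_mg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    'Random Forest': RandomForestRegressor(random_state=42),
    'K-nearest neighbors': KNeighborsRegressor(),
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
}

# Fit each model on the training set and score it (R2) on the test set
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)

results = pd.Series(scores, name='R2').sort_values(ascending=False)
print(results.round(3))
```

Setting random_state on the tree-based models keeps their scores identical from run to run.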

■ To verify the model

Now, I’ll use a new dataset to compare the actual grain weight with the grain weight predicted by the random forest model.

github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/ML_Practice_Grain_Dimension_Verifying.csv"
response=requests.get(github)
df1=pd.read_csv(StringIO(response.text))

print(df1.head(5))
  Genotype  Length_mm  Width_mm  Area_mm_2  GW_mg
0      cv3       6.40      3.40      15.50  37.46
1      cv3       7.46      3.68      19.81  51.82
2      cv3       7.14      3.75      19.50  50.76
3      cv3       6.60      3.40      15.90  38.75
4      cv3       7.02      3.70      18.92  48.77

df1['Genotype'] = df1['Genotype'].replace({'cv3': 2, 'cv4': 3, 'cv5': 4})

print(df1.head(5))
   Genotype  Length_mm  Width_mm  Area_mm_2  GW_mg
0         2       6.40      3.40      15.50  37.46
1         2       7.46      3.68      19.81  51.82
2         2       7.14      3.75      19.50  50.76
3         2       6.60      3.40      15.90  38.75
4         2       7.02      3.70      18.92  48.77
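Because the model only sees integer codes, the new data must use exactly the same codes as the training data. One way to guard against a mismatch is to keep the mapping in a single dictionary and apply it with Series.map, which returns NaN for any label missing from the dictionary. In the sketch below, GENOTYPE_CODES and encode_genotype are hypothetical helper names, not part of the original code, and the small DataFrame is made up for illustration.

```python
import pandas as pd

# Same codes as used for the training data
GENOTYPE_CODES = {'cv1': 0, 'cv2': 1, 'cv3': 2, 'cv4': 3, 'cv5': 4}

def encode_genotype(frame, column='Genotype'):
    """Return a copy of frame with genotype labels replaced by integer codes."""
    codes = frame[column].map(GENOTYPE_CODES)
    # Labels absent from the dictionary come back as NaN
    unknown = frame.loc[codes.isna(), column].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped genotype labels: {list(unknown)}")
    out = frame.copy()
    out[column] = codes.astype(int)
    return out

# Made-up rows for illustration
demo = pd.DataFrame({
    'Genotype': ['cv3', 'cv4', 'cv5'],
    'Length_mm': [6.40, 6.91, 7.20],
})
encoded = encode_genotype(demo)
print(encoded)
```

Raising an error on an unknown label makes a typo or a new cultivar visible immediately instead of passing an unencoded string to the model.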

Next, I’ll separate this new dataset into input and target. The whole dataset serves as a test set; the model is not retrained on it.

test_input1= df1.drop(columns=['GW_mg'])

print(test_input1.head(5))
   Genotype  Length_mm  Width_mm  Area_mm_2
0         2       6.40      3.40      15.50
1         2       7.46      3.68      19.81
2         2       7.14      3.75      19.50
3         2       6.60      3.40      15.90
4         2       7.02      3.70      18.92

GW_mg = df1['GW_mg']
test_target1 = GW_mg

print(test_target1.head(5))
0    37.46
1    51.82
2    50.76
3    38.75
4    48.77

Random Forest

from sklearn.ensemble import RandomForestRegressor
rf= RandomForestRegressor(random_state=42)
rf_model= rf.fit(train_input, train_target)
accuracy= rf_model.score(test_input1, test_target1)

print("Accuracy:", accuracy)
Accuracy: 0.9989566381418276

I’ll add the grain weight predicted by the Random Forest model.

# Predict with the Random Forest model fitted above
y_pred= rf_model.predict(test_input1)
test_input1['prediction']= y_pred

print(test_input1.head(5))
   Genotype  Length_mm  Width_mm  Area_mm_2  prediction
0         2       6.40      3.40      15.50   37.269362
1         2       7.46      3.68      19.81   51.501729
2         2       7.14      3.75      19.50   50.442392
3         2       6.60      3.40      15.90   38.539158
4         2       7.02      3.70      18.92   48.479086

test_input1['Grain_weight']= test_target1

print(test_input1.head(5))
   Genotype  Length_mm  Width_mm  Area_mm_2  prediction  Grain_weight
0         2       6.40      3.40      15.50   37.269362         37.46
1         2       7.46      3.68      19.81   51.501729         51.82
2         2       7.14      3.75      19.50   50.442392         50.76
3         2       6.60      3.40      15.90   38.539158         38.75
4         2       7.02      3.70      18.92   48.479086         48.77
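A single score does not show how large the errors are in milligrams. As a rough check, the sketch below takes the five predicted and actual values printed above and computes the mean absolute error (MAE) and root mean squared error (RMSE) by hand. Five rows are far too few to characterize the model; this only illustrates the calculation.

```python
import math

# First five rows printed above
predicted = [37.269362, 51.501729, 50.442392, 38.539158, 48.479086]
actual = [37.46, 51.82, 50.76, 38.75, 48.77]

# Error per row: actual minus predicted (mg)
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE: {mae:.3f} mg, RMSE: {rmse:.3f} mg")
```

In these five rows every prediction falls slightly below the measured weight.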

Now test_input1 contains both the actual and the predicted grain weight. I’ll download this data to my PC.

import pandas as pd
from google.colab import files

# Save the DataFrame to a CSV file
test_input1.to_csv('Grain_weight_2024.csv', index=False)

# Use the files module to download the file
files.download('Grain_weight_2024.csv')
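The google.colab module is only available inside Google Colab. Outside Colab, to_csv alone already writes the file to disk. A minimal sketch that works in both settings wraps the download step in try/except. The DataFrame here is a two-row stand-in for test_input1, and writing to a temporary directory is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for test_input1 (first two rows printed earlier in this post)
result = pd.DataFrame({
    'Genotype': [2, 2],
    'prediction': [37.269362, 51.501729],
    'Grain_weight': [37.46, 51.82],
})

# Save the DataFrame to a CSV file
out_path = Path(tempfile.mkdtemp()) / 'Grain_weight_2024.csv'
result.to_csv(out_path, index=False)

try:
    # Only importable inside Google Colab
    from google.colab import files
    files.download(str(out_path))
except ImportError:
    print(f"Saved to {out_path}")

# Read the file back to confirm it was written as expected
reloaded = pd.read_csv(out_path)
print(reloaded)
```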

On the new dataset, the Random Forest model scored 0.999, so the predicted grain weight closely matches the actual grain weight.
