Assignment 4
In this assignment you will examine the seeds data set. Link: https://support.scinet.utoronto.ca/~alexey/seeds.csv
This data consists of measurements of geometrical properties of kernels belonging to three different varieties of wheat.
Create a file 'tree.py' that contains the following functionality:
- Use the following commands to load the 'seeds' dataset into your script and divide it into the features and target variables:
import pandas as pd
# Load the data into pandas DataFrame
seeds = pd.read_csv("seeds.csv")
# Remove the 'label' column
myseeds = seeds.drop(['label'], axis=1)
# Extract the column names
feature_names = myseeds.columns.values
# Extract features and targets as numpy arrays
features = myseeds.values
target = seeds['label'].values
If you do not have 'pandas' installed, you can install it using 'pip' or 'conda':
$ conda install pandas
$ pip install pandas
- Following the example in class, randomly split the 'seeds' data into training and testing data sets, 80% and 20% respectively.
- Create a decision tree for your training data. Adjust the optional arguments 'min_samples_leaf' and 'max_depth' of 'DecisionTreeClassifier' to achieve the best accuracy on both training and testing data.
- Print out the confusion matrix for the testing data.
- Create a plot of the decision tree. It does not need to be professional quality. Save the plot as the 'my_tree.pdf' file, and include it with your submitted files.
- Using the 'bar' command from the 'matplotlib' package ('matplotlib.pyplot.bar'), show the importance of the features used in determining the tree splits. The importances can be accessed through 'model.feature_importances_'; use the 'feature_names' variable defined above to label the features properly. Save the chart as the file 'feature_importance.pdf'.
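As a rough illustration of the split, fit and confusion-matrix steps, here is a minimal sketch. It is not the class example. To stay runnable without 'seeds.csv', it builds a synthetic stand-in for 'features' and 'target'; the 210-row, 7-feature, 3-class shape is an assumption. The candidate values for 'max_depth' and 'min_samples_leaf' and the 'random_state' seeds are hypothetical choices as well.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 'features' and 'target' arrays (assumption:
# 210 rows, 7 numeric features, 3 classes). In tree.py, use the arrays
# produced by the loading commands instead.
rng = np.random.default_rng(0)
target = np.repeat([1, 2, 3], 70)
features = rng.normal(loc=target[:, None], scale=0.5, size=(210, 7))

# Random 80% / 20% split; random_state only makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Try a few hypothetical settings and report accuracy on both data sets.
results = []
for max_depth in (2, 3, 5):
    for min_samples_leaf in (1, 5, 10):
        candidate = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
            random_state=0,
        )
        candidate.fit(X_train, y_train)
        train_acc = candidate.score(X_train, y_train)
        test_acc = candidate.score(X_test, y_test)
        results.append((max_depth, min_samples_leaf, train_acc, test_acc))
        print(f"max_depth={max_depth} min_samples_leaf={min_samples_leaf} "
              f"train={train_acc:.3f} test={test_acc:.3f}")

# Refit with one chosen setting (hypothetical values) and print the
# confusion matrix for the testing data: rows are true labels, columns
# are predicted labels, in the order given by 'labels'.
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test), labels=[1, 2, 3])
print(cm)
```

A large gap between training and testing accuracy usually points to overfitting. Smaller 'max_depth' or larger 'min_samples_leaf' values tend to narrow that gap.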
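The two plotting steps could be sketched as follows. This uses 'sklearn.tree.plot_tree' to draw the tree, which is only one option and may differ from the approach shown in class. The synthetic data, the placeholder feature names ('feature_0' to 'feature_6') and the tree settings are assumptions made so that the sketch runs without 'seeds.csv'.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for 'features', 'target' and 'feature_names'
# (assumption: 7 numeric features, 3 classes). In tree.py, use the
# variables produced by the loading commands instead.
rng = np.random.default_rng(0)
feature_names = np.array([f"feature_{i}" for i in range(7)])
target = np.repeat([1, 2, 3], 70)
features = rng.normal(loc=target[:, None], scale=0.5, size=(210, 7))

# Hypothetical tree settings, fitted here only to have something to plot.
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
model.fit(features, target)

# Draw the fitted tree and save it as a PDF.
tree_path = Path("my_tree.pdf")
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(
    model,
    feature_names=list(feature_names),
    class_names=[str(c) for c in model.classes_],
    filled=True,
    ax=ax,
)
fig.savefig(tree_path)
plt.close(fig)

# Bar chart of the feature importances, labelled with the feature names.
importance_path = Path("feature_importance.pdf")
fig, ax = plt.subplots()
ax.bar(list(feature_names), model.feature_importances_)
ax.set_ylabel("Importance")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
fig.savefig(importance_path)
plt.close(fig)
```

Features that the tree never splits on get an importance of zero, so some bars may be missing from the chart.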
Submit your 'tree.py' script file and the generated charts to the 'Assignment Dropbox'. Note that there is no need to submit the 'seeds' dataset.
The assignment will be graded on a 10-point basis.
The due date is December 12, 2019 at 11:55pm, with a 0.5-point penalty per day for late submissions until the cut-off date of December 19, 2019 at 11:00am.
Last modified: Tuesday, 3 October 2023, 9:07 AM