In this assignment you will examine the seeds data set. Link: https://support.scinet.utoronto.ca/~alexey/seeds.csv

This data consists of measurements of geometrical properties of kernels belonging to three different varieties of wheat.

Create a file 'tree.py' that contains the following functionality:

  1. Use the following commands to load the 'seeds' dataset into your script and divide it into the features and target variables:
    import pandas as pd
    # Load the data into pandas DataFrame
    seeds = pd.read_csv("seeds.csv")
    # Remove the 'label' column
    myseeds = seeds.drop(['label'], axis=1)
    # Extract the column names
    feature_names = myseeds.columns.values
    # Extract features and targets as numpy arrays
    features = myseeds.values
    target = seeds['label'].values
    If you do not have 'pandas' installed, you can install it using 'pip' or 'conda':
    $ conda install pandas
    $ pip install pandas
  2. Following the example in class, randomly split the 'seeds' data into training and testing data sets, 80% and 20% respectively.
  3. Create a decision tree for your training data. Adjust the optional arguments 'min_samples_leaf' and 'max_depth' of 'DecisionTreeClassifier' to achieve the best accuracy on both training and testing data.
  4. Print out the confusion matrix for the testing data.
  5. Create a plot of the decision tree. It does not need to be professional quality. Save the plot to the file 'my_tree.pdf' and include it with your submitted files.
  6. Using the 'bar' command from the 'matplotlib' package (matplotlib has no 'barplot' command), show the importance of the features used in determining the tree splits. The importances can be accessed through 'model.feature_importances_'; use the 'feature_names' variable defined in step 1 to label the features. Save the chart as the file 'feature_importance.pdf'.
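
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: it runs on small synthetic data instead of 'seeds.csv', the helper names 'fit_and_evaluate' and 'save_plots' are hypothetical, and the values chosen for 'min_samples_leaf' and 'max_depth' are arbitrary starting points that you are expected to tune yourself.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree


def fit_and_evaluate(features, target, min_samples_leaf=5, max_depth=3, seed=0):
    """Split 80/20 at random, fit a decision tree, and report its performance.

    The default hyperparameters are arbitrary placeholders, not tuned values.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(
        min_samples_leaf=min_samples_leaf, max_depth=max_depth, random_state=seed
    )
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # Rows are true classes, columns are predicted classes (testing data only).
    cm = confusion_matrix(y_test, model.predict(X_test), labels=np.unique(target))
    return model, train_acc, test_acc, cm


def save_plots(model, feature_names):
    """Write the tree plot and the feature-importance bar chart as PDF files."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, so no display is required
    import matplotlib.pyplot as plt

    names = list(feature_names)  # a plain list of strings for the labels

    fig, ax = plt.subplots(figsize=(12, 8))
    plot_tree(model, feature_names=names, filled=True, ax=ax)
    fig.savefig("my_tree.pdf")
    plt.close(fig)

    fig, ax = plt.subplots()
    ax.bar(names, model.feature_importances_)
    ax.set_ylabel("Feature importance")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig("feature_importance.pdf")
    plt.close(fig)


# Synthetic stand-in for the seeds data (an assumption for illustration only):
# three classes of 70 samples each, with four made-up feature columns.
rng = np.random.default_rng(0)
target = np.repeat([1, 2, 3], 70)
features = rng.normal(size=(210, 4)) + 3.0 * target[:, None]
feature_names = np.array(["f0", "f1", "f2", "f3"])

model, train_acc, test_acc, cm = fit_and_evaluate(features, target)
print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
print("Confusion matrix (testing data):")
print(cm)
```

Calling 'save_plots(model, feature_names)' afterwards would write 'my_tree.pdf' and 'feature_importance.pdf' to the current directory. In your own 'tree.py', the synthetic arrays would be replaced by the 'features', 'target' and 'feature_names' variables from step 1.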

Submit your 'tree.py' script file and the generated charts to the 'Assignment Dropbox'. Note that there is no need to submit the 'seeds' dataset.

The assignment will be graded on a 10-point basis.

The due date is December 12, 2019 at 11:55pm, with a 0.5-point penalty per day for late submission until the cut-off date of December 19, 2019 at 11:00am.

Last modified: Tuesday, 3 October 2023, 9:07 AM