BCH2203 - Winter 2024: Assignment 5: Seed classification

Opened: Monday, 1 April 2024, 12:00 AM

Due: Monday, 8 April 2024, 11:59 PM

Get the data file seeds_dataset.txt from the zip file that you can download from https://archive.ics.uci.edu/dataset/236/seeds. The file is in tab-separated value format. Each row is a sample of wheat seeds, and the columns have the following meaning: area, perimeter, compactness, length_kernel, width, asymmetry, length_groove, and label. The first 7 columns are features, while the last column contains 1, 2, or 3 depending on the species of wheat, i.e., Kama, Rosa, or Canadian wheat, respectively; this is the target value.
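As a rough illustration of the file layout described above, here is a minimal sketch of a reader that uses only the standard library. The function name load_seeds and the COLUMNS list are made up for this example. It splits on any run of whitespace rather than on single tabs, on the assumption that some rows may contain repeated tab characters; if your copy of the file is strictly tab-separated, this behaves the same.

```python
from pathlib import Path

# Column names as listed in the assignment text; the file itself has no header.
COLUMNS = ["area", "perimeter", "compactness", "length_kernel",
           "width", "asymmetry", "length_groove", "label"]


def load_seeds(path="seeds_dataset.txt"):
    """Return (features, labels): a list of 7-float rows and a list of int labels."""
    features, labels = [], []
    for line in Path(path).read_text().splitlines():
        # split() with no argument splits on any run of whitespace,
        # so repeated tabs do not produce empty fields.
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} fields, got {len(fields)}: {line!r}")
        features.append([float(value) for value in fields[:-1]])
        labels.append(int(float(fields[-1])))  # accepts "1" as well as "1.0"
    return features, labels
```

Calling load_seeds() with no argument reads seeds_dataset.txt from the current directory, which matches the assumption the assignment allows.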

Write a Python script that reads this file (you may assume it lives in the current directory) and builds decision trees to predict the label from only 3 features. Since there are 7 features available, the 3 features can be chosen in different ways. In fact, there are 35 ways to choose them, and your script should build a tree for each of these. Use the same maximum depth and minimum number of samples per leaf for all cases.
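The count of 35 is the binomial coefficient "7 choose 3", i.e. 7!/(3! 4!) = (7*6*5)/(3*2*1) = 35. A short sketch of how the subsets could be enumerated with itertools.combinations follows. The names FEATURES, subsets and TREE_PARAMS are chosen for this example only, and the depth and leaf-size values are placeholders, not recommended settings; the keys max_depth and min_samples_leaf assume scikit-learn's DecisionTreeClassifier, which the assignment does not prescribe.

```python
from itertools import combinations
from math import comb

# The 7 feature names from the assignment text (the label column is excluded).
FEATURES = ["area", "perimeter", "compactness", "length_kernel",
            "width", "asymmetry", "length_groove"]

# Every unordered choice of 3 distinct features, in a fixed, reproducible order.
subsets = list(combinations(FEATURES, 3))

# Sanity check: 7 choose 3 = 35.
assert len(subsets) == comb(len(FEATURES), 3) == 35

# One shared setting reused for all 35 trees (placeholder values).
TREE_PARAMS = {"max_depth": 4, "min_samples_leaf": 5}

if __name__ == "__main__":
    for names in subsets[:3]:
        print(names)
```

Keeping the hyperparameters in one dictionary makes it hard to give different subsets different settings by accident.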

The script should use the usual separation into training and test data to score the accuracy of each tree, pick out which decision tree is the most accurate, and print the names of the three features used in that tree. In a sense, this tells us which features are the most defining ones for the species.
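One possible shape for the scoring loop is sketched below, assuming NumPy and scikit-learn are available (the assignment does not name a library). The function name best_feature_triple, the 30% test fraction, the stratified split, the fixed random seed and the default depth and leaf size are all choices made for this illustration, not requirements. It is one way to approach the task, not a reference solution.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def best_feature_triple(X, y, names, max_depth=4, min_samples_leaf=5, seed=0):
    """Return (names_of_best_triple, test_accuracy) over all 3-feature subsets."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Split once, so every subset is trained and scored on the same rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    best_score, best_names = -1.0, None
    for cols in combinations(range(X.shape[1]), 3):
        cols = list(cols)
        # Same depth and leaf size for every subset, as the assignment asks.
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      min_samples_leaf=min_samples_leaf,
                                      random_state=seed)
        tree.fit(X_train[:, cols], y_train)
        score = tree.score(X_test[:, cols], y_test)  # accuracy on held-out rows
        if score > best_score:  # strict '>' keeps the first subset on ties
            best_score = score
            best_names = tuple(names[c] for c in cols)
    return best_names, best_score
```

With 210 rows the test set is small, so several subsets may tie or differ by a single sample, and the winner can change with the random split. Averaging over several seeds, or using cross-validation, would give a steadier ranking, though that goes beyond what the assignment asks for.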