- Consider the spambase data set, where emails are classified as spam or not, and 57 feature variables are measured on each of them (see the full description on p. 259 of the book).
(a) Split the data set into training and test sets (roughly a 70/30 split). Compute a logistic classifier using the training data. (There might be perfect separation between the groups, but that should not matter as long as you don't get NA coefficients.)
(b) Find the misclassification table for the test data and compute the misclassification rate.
- Consider the pendigits data set, which consists of samples of handwritten digits 0, 1, ..., 9. The feature variables in this case are the (x, y) coordinates of the pen tip, discretized at eight time points (see section 7.2.1 of the book for more details).
(a) Split the data set into training and test sets (roughly a 70/30 split). Compute the multinomial logistic classifier using the training data.
(b) Construct the misclassification table for the test data and compute the misclassification rate. Which digit seems to be the hardest to classify correctly?
Multivariate Statistical Analysis
Full Answer Section
1. Spambase Data Set: Logistic Classification
(b) Misclassification Table and Rate
Python
from sklearn.metrics import confusion_matrix, accuracy_score
# Predict on the test set (model, X_test, and y_test come from the part (a) fit under "Sample Answer" below)
y_pred = model.predict(X_test)
# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Compute the misclassification rate
misclassification_rate = 1 - accuracy_score(y_test, y_pred)
print("Misclassification Rate:", misclassification_rate)
2. Pendigits Data Set: Multinomial Logistic Classification
(a) Data Splitting and Multinomial Logistic Classifier
Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
import numpy as np
# Load the pendigits dataset from the UCI repository
# (adjust the URLs/paths below if you are reading local copies instead)
url_train = "https://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits.tra"
url_test = "https://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits.tes"
names = ["x" + str(i) for i in range(16)] + ["digit"]
train_data = pd.read_csv(url_train, names=names)
test_data = pd.read_csv(url_test, names=names)
# Combine the train and test data, then split it again.
data = pd.concat([train_data, test_data], ignore_index=True)
# Split into features (X) and target (y)
X = data.drop('digit', axis=1)
y = data['digit']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=10000)
model.fit(X_train, y_train)
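A quick sanity check on the 70/30 split and the class balance never hurts; a short sketch using the objects defined above:
Python
# Sanity check: split sizes and per-digit frequencies in the training set
print("Train:", X_train.shape, "Test:", X_test.shape)
print(y_train.value_counts(normalize=True).sort_index().round(3))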
(b) Misclassification Table and Rate, Hardest Digit to Classify
Python
# Predict on the test set
y_pred = model.predict(X_test)
# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Compute the misclassification rate
misclassification_rate = 1 - accuracy_score(y_test, y_pred)
print("Misclassification Rate:", misclassification_rate)
# Find the hardest digit: lowest per-class accuracy along the diagonal
class_correct = np.diag(conf_matrix)
class_totals = conf_matrix.sum(axis=1)
class_accuracy = class_correct / class_totals
hardest_digit = np.argmin(class_accuracy)
print("Hardest Digit to Classify:", hardest_digit)
Explanation:
- Spambase:
- We load the data, split it, and train a standard logistic regression model.
- The confusion matrix helps us see true positives, true negatives, false positives, and false negatives (see the unpacking sketch after this list).
- The misclassification rate is simply 1 - accuracy.
- Pendigits:
- We load the data, split it, and train a multinomial logistic regression model because we have multiple classes (digits 0-9).
- We use the 'lbfgs' solver, which is suitable for multinomial logistic regression.
- To find the hardest digit, we calculate the accuracy for each digit (diagonal of the confusion matrix divided by the row sum) and then find the digit with the lowest accuracy.
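For the two-class spambase problem, the four cells of the confusion matrix can be unpacked directly; a minimal sketch assuming the conf_matrix from part 1(b) (scikit-learn's convention is rows = true class, columns = predicted class):
Python
# Unpack the 2x2 confusion matrix (rows = true, columns = predicted)
tn, fp, fn, tp = conf_matrix.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)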
Important Notes:
- You might need to install pandas and scikit-learn (sklearn) if you don't have them already (pip install pandas scikit-learn).
- Adjust the paths in the pd.read_csv() calls to the correct location of your data files.
- Logistic regression can take a long time to converge on these data sets, so increasing the max_iter parameter is sometimes needed; standardizing the features also helps (see the sketch after these notes).
- The random_state parameter in train_test_split ensures the data is split the same way each time the code is run.
- The hardest digit to classify may change slightly depending on the random state used when splitting the data.
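If convergence warnings persist even with a large max_iter, standardizing the features before fitting usually lets lbfgs converge much faster. A minimal sketch using scikit-learn's Pipeline, assuming the X_train and y_train from either part (a):
Python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Standardize each feature to mean 0, variance 1 before the logistic fit
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
model.fit(X_train, y_train)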
Sample Answer
1. Spambase Data Set: Logistic Classification
(a) Data Splitting and Logistic Classifier
Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# Load the spambase dataset from the UCI repository
# (adjust the path/URL if you are reading a local copy instead)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
names = ["word_freq_" + str(i) for i in range(48)] + \
["char_freq_" + str(i) for i in range(6)] + \
["capital_run_length_" + str(i) for i in range(3)] + ["spam"]
data = pd.read_csv(url, names=names)
# Split into features (X) and target (y)
X = data.drop('spam', axis=1)
y = data['spam']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression(max_iter=10000) # Increase max_iter if needed
model.fit(X_train, y_train)
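The problem's caveat about perfect separation is worth checking: scikit-learn's default L2 penalty keeps the coefficients finite even under separation, but verifying that no coefficient came out as NA/NaN is cheap. A minimal sketch assuming the fitted model from above:
Python
import numpy as np
# Verify the fitted coefficients are all finite (no NA coefficients)
assert np.isfinite(model.coef_).all(), "non-finite coefficients found"
print("All coefficients finite; max |coef| =", np.abs(model.coef_).max())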