Problem C: Confirming the Buzz about Hornets
In September 2019, a colony of Vespa mandarinia (also known as the Asian giant hornet) was discovered on Vancouver Island in British Columbia, Canada. The nest was quickly destroyed, but the news of the event spread rapidly throughout the area. Since that time, several confirmed sightings of the pest have occurred in neighboring Washington State, as well as a multitude of mistaken sightings. See Figure 1 below for a map of detections, hornet watches, and public sightings.
Vespa mandarinia is the largest species of hornet in the world, and the occurrence of the nest was alarming. Additionally, the giant hornet is a predator of European honeybees, invading and destroying their nests. A small number of the hornets are capable of destroying a whole colony of European honeybees in a short time. At the same time, they are voracious predators of other insects that are considered agricultural pests.
The life cycle of this hornet is similar to that of many other wasps. Fertilized queens emerge in the spring and begin a new colony. In the fall, new queens leave the nest and spend the winter in the soil, waiting for the spring. A new queen has a range estimated at 30 km for establishing her nest. More detailed information on Asian hornets is included in the problem attachments and can also be found online.
Due to the potentially severe impact on local honeybee populations, the presence of Vespa mandarinia can cause a good deal of anxiety. The State of Washington has created helplines and a website for people to report sightings of these hornets. Based on these reports from the public, the state must decide how to prioritize its limited resources to follow up with additional investigation. While some reports have been determined to be Vespa mandarinia, many other sightings have turned out to be other types of insects.
The primary questions for this problem are “How can we interpret the data provided by the public reports?” and “What strategies can we use to prioritize these public reports for additional investigation given the limited resources of government agencies?”
Address and discuss whether or not the spread of this pest over time can be predicted, and with what level of precision.
Most reported sightings mistake other hornets for Vespa mandarinia. Use only the data set file provided, and (possibly) the image files provided, to create, analyze, and discuss a model that predicts the likelihood of a mistaken classification.
Use your model to discuss how your classification analyses lead to prioritizing investigation of the reports most likely to be positive sightings.
Address how you could update your model given additional new reports over time, and how often the updates should occur.
Using your model, what would constitute evidence that the pest has been eradicated in Washington State?
Finally, your report should include a two-page memorandum that summarizes your results for the Washington State Department of Agriculture.
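As a minimal sketch of the prioritization question, suppose some classifier has already assigned each report a probability of being a positive ID. The report IDs, the probabilities, and the `prioritize` helper below are all hypothetical. The point is only that a fixed investigation budget goes to the highest-probability reports first.

```python
# Hypothetical report IDs and probabilities, for illustration only.
reports = [
    ("report-a", 0.02),
    ("report-b", 0.71),
    ("report-c", 0.35),
    ("report-d", 0.90),
]


def prioritize(scored_reports, budget):
    """Return the `budget` reports with the highest probability of a positive ID."""
    ranked = sorted(scored_reports, key=lambda item: item[1], reverse=True)
    return ranked[:budget]


to_investigate = prioritize(reports, budget=2)
print([report_id for report_id, _ in to_investigate])
```

The budget stands in for the limited resources mentioned in the problem statement. A real ranking could also weigh location and date, which this sketch ignores.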
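- Data integration
For the available images, each file name corresponds to the "FileName" field, and its label is given by "Lab Status". However, not every labeled "FileName" has a corresponding image. We therefore first merge the loaded tables (the code joins them on "GlobalID") so that every remaining row has both a "FileName" and a "Lab Status".
- Data transformation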
import pandas as pd
import numpy as np
import os
from PIL import Image
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Table that maps each image file name to the GlobalID of its report.
filepath1 = r"2021MCM_ProblemC_ Images_by_GlobalID.xlsx"
data1 = pd.read_excel(filepath1)

# Change every file extension that is not "jpg" to ".jpg".
for i in range(len(data1)):
    name = data1.loc[i, "FileName"]
    if name.split(".")[-1] != "jpg":
        data1.loc[i, "FileName"] = name.split(".")[0] + ".jpg"

# NOTE: filepath2 is never defined in the original post. Set it to the path of
# the workbook that holds "GlobalID" and "Lab Status" before running this file.
data2 = pd.read_excel(filepath2)

# Inner join on GlobalID: keeps only rows that have both a FileName and a Lab Status.
merged = pd.merge(data1, data2, on="GlobalID")


class MyDataset(Dataset):
    """Image folder whose labels are looked up in the merged table by file name."""

    def __init__(self, data_path):
        self.data_path = data_path
        self.img_list = os.listdir(self.data_path)

    def __getitem__(self, index):
        img_title = self.img_list[index]
        # First matching row for this file; raises IndexError if the file is not in the table.
        ind = merged[merged["FileName"] == img_title].index.tolist()
        status = merged["Lab Status"][ind[0]]
        if status == "Unprocessed":
            img_label = np.array([0])
        elif status == "Unverified":
            img_label = np.array([1])
        elif status == "Positive ID":
            img_label = np.array([2])
        else:  # any other status (presumably "Negative ID")
            img_label = np.array([3])
        img_path = os.path.join(self.data_path, img_title)
        img = np.array(Image.open(img_path))  # (height, width, channels)
        return img, img_label

    def __len__(self):
        return len(self.img_list)


if __name__ == '__main__':
    # Smoke test: load one batch and push it through the network.
    train_path = r'D:\Spyder\MCM_data'
    dataset = MyDataset(train_path)
    train_size = int(len(dataset) * 0.8)
    test_size = len(dataset) - train_size
    train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
    train_loader = DataLoader(train_dataset, shuffle=False, batch_size=64)
    test_loader = DataLoader(test_dataset, shuffle=False, batch_size=64)

    # Net is the CNN class from the training script that follows. It is not defined
    # in this file, so define or import it before running this block.
    model = Net()
    for batch_idx, (inputs, tar) in enumerate(train_loader):
        # The loader yields [batch, height, width, channels]; Conv2d expects
        # float input shaped [batch, channels, height, width].
        inputs = inputs.permute(0, 3, 1, 2).float()
        output = model(inputs)
        _, predicted = torch.max(output.data, dim=1)
        print(predicted.size()[0])
        break  # one batch is enough
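To make the effect of the merge step above concrete, here is a small standard-library sketch of what an inner join on "GlobalID" does (inner is the default for `pd.merge`), followed by the same status-to-integer encoding used in `MyDataset`. The IDs, file names, and the helpers `inner_join` and `encode_status` are made up for illustration and are not part of the original scripts.

```python
# Toy stand-ins for the two spreadsheets; every ID and file name is hypothetical.
images = [
    {"GlobalID": "A1", "FileName": "ATT1_hornet.jpg"},
    {"GlobalID": "B2", "FileName": "ATT2_wasp.jpg"},
    {"GlobalID": "C3", "FileName": "ATT3_bee.jpg"},  # image without a report row
]
reports = [
    {"GlobalID": "A1", "Lab Status": "Positive ID"},
    {"GlobalID": "B2", "Lab Status": "Negative ID"},
    {"GlobalID": "D4", "Lab Status": "Unverified"},  # report without an image row
]

# Same integer codes as the if/elif chain in MyDataset; anything else maps to 3.
STATUS_CODES = {"Unprocessed": 0, "Unverified": 1, "Positive ID": 2}


def inner_join(left, right, key):
    """Keep only the rows whose key value appears in both tables."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def encode_status(status):
    """Map a Lab Status string to its class index."""
    return STATUS_CODES.get(status, 3)


merged_rows = inner_join(images, reports, "GlobalID")
labels = {row["FileName"]: encode_status(row["Lab Status"]) for row in merged_rows}
print(labels)
```

Rows C3 and D4 drop out because each appears in only one table. This is why the merged table can be smaller than either input.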
import torch
from torch.utils.data import DataLoader
from time import *
import torch.nn as nn
import torch.nn.functional as F
import torchsnooper
import torch.optim as optim
from Dataset_make import MyDataset
import matplotlib.pyplot as plt

# Prepare the dataset
batch_size = 64
path = r'D:\Spyder\Data_cut'
dataset = MyDataset(path)
train_size = int(len(dataset) * 0.8)
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # The original note gives the input as (batch_size, 3, 600, 450). The 350
        # input features of self.fc, however, correspond to a 200 x 150 (or
        # 150 x 200) image after the four conv + pool stages (5 * 10 * 7);
        # a 600 x 450 image would yield 5 * 35 * 25 = 4375 features.
        self.conv1 = torch.nn.Conv2d(3, 10, kernel_size=5)
        self.conv2 = torch.nn.Conv2d(10, 20, kernel_size=5)
        self.conv3 = torch.nn.Conv2d(20, 5, kernel_size=3)
        self.conv4 = torch.nn.Conv2d(5, 5, kernel_size=3)
        self.pooling = torch.nn.MaxPool2d(2)
        self.fc = torch.nn.Linear(350, 4)  # four Lab Status classes

    # @torchsnooper.snoop()
    def forward(self, x):
        # Defines the computation run on every call; every subclass must override it.
        batch_size = x.size(0)
        x = F.relu(self.pooling(self.conv1(x)))
        x = F.relu(self.pooling(self.conv2(x)))
        x = F.relu(self.pooling(self.conv3(x)))
        x = F.relu(self.pooling(self.conv4(x)))
        x = x.view(batch_size, -1)  # flatten to (batch_size, 350)
        x = self.fc(x)
        return x


model = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.00001, momentum=0.5)


def train(epoch):
    running_loss = 0.0
    for batch_idx, data in enumerate(train_loader):
        inputs, target = data
        # [batch, height, width, channels] -> float [batch, channels, height, width]
        inputs = inputs.type(torch.FloatTensor).permute(0, 3, 1, 2).to(device)
        # Labels arrive as (batch, 1); CrossEntropyLoss expects a 1-D LongTensor.
        target = target.type(torch.LongTensor).view(-1).to(device)
        optimizer.zero_grad()  # reset the gradients of all model parameters to zero
        # forward + backward + update
        outputs = model(inputs)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if batch_idx % 10 == 9:
            # running_loss is the sum over the last 10 batches
            print('[%d, %5d] loss: %.3f' % (epoch + 1, batch_idx + 1, running_loss))
            running_loss = 0.0
    torch.save(model.state_dict(), 'cnn_50.pkl')


def test():
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            inputs, target = data
            inputs = inputs.type(torch.FloatTensor).permute(0, 3, 1, 2).to(device)
            target = target.type(torch.LongTensor).view(-1).to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, dim=1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    print('Accuracy on test set: %d %% [%d/%d]' % (100 * correct / total, correct, total))
    return correct / total


if __name__ == '__main__':
    auc = []  # despite the name, this stores the test accuracy after each epoch
    begin_time = time()
    for epoch in range(10):
        train(epoch)
        auc.append(test())
    end_time = time()
    run_time = end_time - begin_time
    plt.plot(auc)
    font = {'family': 'Times New Roman', 'weight': 'normal', 'size': 30}
    font2 = {'family': 'Times New Roman', 'weight': 'normal', 'size': 20}
    plt.xlabel("Epoch", font2)
    plt.ylabel("Accuracy", font2)
    plt.title("Test Accuracy per Epoch", font)
    plt.grid()
    plt.show()

    # Shape check kept from the original (disabled):
    # model = Net()
    # for batch_idx, data in enumerate(train_loader):
    #     inputs, target = data
    #     inputs = inputs.type(torch.FloatTensor)
    #     inputs = inputs.permute(0, 3, 1, 2)
    #     outputs = model(inputs)
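The input size of the final linear layer in `Net` has to match the number of features left after the four conv + pool stages. The sketch below works that number out by hand, assuming the layers use the PyTorch defaults seen in the script (stride 1 and no padding for `Conv2d`, stride equal to kernel size for `MaxPool2d`). The helper names are illustrative and not part of the original code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Length of one spatial dimension after Conv2d (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Length of one spatial dimension after MaxPool2d(kernel) with default stride."""
    return size // kernel


def flattened_features(height, width, conv_kernels=(5, 5, 3, 3), out_channels=5):
    """Feature count entering the linear layer for the four-stage Net above."""
    for k in conv_kernels:
        height = pool_out(conv_out(height, k))
        width = pool_out(conv_out(width, k))
    return out_channels * height * width


print(flattened_features(200, 150))  # 350
print(flattened_features(600, 450))  # 4375
```

Under these assumptions, `Linear(350, 4)` fits 200 x 150 images. Feeding 600 x 450 images would require `Linear(4375, 4)` instead, or resizing the images first.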