Kaggle in Practice: Plant Seedlings Classification

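In early 2018 I moved from the Tianchi big data competitions to Kaggle. I found more interesting projects here, and more experts sharing their experience and code, and I have learned a great deal.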

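After warming up with the digit recognition and Titanic prediction competitions, the first competition I entered in earnest was Plant Seedlings Classification. I will document it in the following parts.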

1. Competition description

2. Evaluation metric

3. Transfer learning

4. Keras training

5. Competition results

------------------------------------------------------------------------------------------

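1. Competition description

The main task of the competition is to distinguish weeds from crop seedlings, with the aim of better crop yields and better environmental management. The dataset was released by the Aarhus University Signal Processing group in collaboration with the University of Southern Denmark. It contains images of approximately 960 unique plants belonging to 12 species, taken at several growth stages. The figure below shows some samples from the downloaded training set.

The dataset covers twelve species: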

Black-grass
Charlock
Cleavers
Common Chickweed
Common wheat
Fat Hen
Loose Silky-bent
Maize
Scentless Mayweed
Shepherds Purse
Small-flowered Cranesbill
Sugar beet

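2. Evaluation metric

Submissions are scored with MeanFScore; for details see https://en.wikipedia.org/wiki/F1_score.

Given the positive/negative rates for each class k, the score is computed from the micro-averaged precision and recall:

$$\mathrm{Precision}_{micro} = \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}, \qquad \mathrm{Recall}_{micro} = \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}$$

The F1 score is the harmonic mean of precision and recall:

$$\mathrm{MeanFScore} = F1_{micro} = \frac{2 \cdot \mathrm{Precision}_{micro} \cdot \mathrm{Recall}_{micro}}{\mathrm{Precision}_{micro} + \mathrm{Recall}_{micro}}$$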
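As a quick numeric check of the metric, here is a minimal sketch that computes an F1 score from predicted and true labels. It assumes the micro-averaged form, where true positives, false positives and false negatives are summed over all classes before precision and recall are taken. The function name `micro_f1` and the toy labels are illustrative and not part of the competition code.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then combine."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # the predicted class gains a false positive
            fn[truth] += 1   # the true class gains a false negative
    tp_sum = sum(tp.values())
    fp_sum = sum(fp.values())
    fn_sum = sum(fn.values())
    if tp_sum == 0:
        return 0.0           # nothing correct: avoid dividing by zero
    precision = tp_sum / (tp_sum + fp_sum)
    recall = tp_sum / (tp_sum + fn_sum)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 of 4 seedlings are labelled correctly
y_true = ["Maize", "Maize", "Charlock", "Cleavers"]
y_pred = ["Maize", "Charlock", "Charlock", "Cleavers"]
score = micro_f1(y_true, y_pred)
print(score)
```

When every image has exactly one true label and one predicted label, each mistake counts once as a false positive and once as a false negative, so the micro-averaged F1 equals plain accuracy. That is consistent with using `accuracy_score` for local validation.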

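3. Transfer learning

Since I had no high-end GPU available for deep learning training at the time, I decided to try transfer learning and see how well it worked.

First, import the necessary libraries: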

%matplotlib inline
import datetime as dt
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 10]
plt.rcParams['font.size'] = 16
import numpy as np
import os
import pandas as pd
import seaborn as sns
from keras.applications import xception
from keras.preprocessing import image
from mpl_toolkits.axes_grid1 import ImageGrid
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from tqdm import tqdm

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

start = dt.datetime.now()

Next, use the Keras Pretrained Models dataset (download it from https://www.kaggle.com/gaborfodor/seedlings-pretrained-keras-models/data). The pretrained model files must be copied into the Keras cache directory (~/.keras/models):

!ls ../input/keras-pretrained-models/

# Make sure the Keras cache directory and its models subdirectory exist
cache_dir = os.path.expanduser(os.path.join('~', '.keras'))
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)
models_dir = os.path.join(cache_dir, 'models')
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

!cp ../input/keras-pretrained-models/xception* ~/.keras/models/

!ls ~/.keras/models
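The `!cp` and `!ls` lines are notebook shell commands, so they only work inside Jupyter on a Unix-like system. A minimal sketch of the same copy step in plain Python follows; the helper name `copy_pretrained_weights` and its parameters are hypothetical, chosen only for illustration.

```python
import shutil
from pathlib import Path


def copy_pretrained_weights(src_dir, dst_dir, pattern="xception*"):
    """Copy files matching `pattern` from src_dir into dst_dir.

    Returns the sorted list of copied file names.
    """
    src = Path(src_dir)
    dst = Path(dst_dir).expanduser()          # resolve a leading '~'
    dst.mkdir(parents=True, exist_ok=True)    # create the cache dir if needed
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            shutil.copy(path, dst / path.name)
            copied.append(path.name)
    return copied
```

In the notebook above this would be called as `copy_pretrained_weights('../input/keras-pretrained-models', '~/.keras/models')`, assuming the input directory is laid out as shown by the `!ls` output.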

Next, take a look at the dataset:

!ls ../input/plant-seedlings-classification

CATEGORIES = ['Black-grass', 'Charlock', 'Cleavers', 'Common Chickweed', 'Common wheat', 'Fat Hen', 'Loose Silky-bent','Maize', 'Scentless Mayweed', 'Shepherds Purse', 'Small-flowered Cranesbill', 'Sugar beet']
NUM_CATEGORIES = len(CATEGORIES)

SAMPLE_PER_CATEGORY = 200
SEED = 1987
data_dir = '../input/plant-seedlings-classification/'
train_dir = os.path.join(data_dir, 'train')
test_dir = os.path.join(data_dir, 'test')
sample_submission = pd.read_csv(os.path.join(data_dir, 'sample_submission.csv'))

sample_submission.head(2)

# Print the number of training images in each category
for category in CATEGORIES:
    print('{} {} images'.format(category, len(os.listdir(os.path.join(train_dir, category)))))

# Build one row per training image: relative path, numeric label, category name
train = []
for category_id, category in enumerate(CATEGORIES):
    for file in os.listdir(os.path.join(train_dir, category)):
        train.append(['train/{}/{}'.format(category, file), category_id, category])
train = pd.DataFrame(train, columns=['file', 'category_id', 'category'])
train.head(2)
train.shape
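`os.listdir` returns entries in arbitrary order and includes any stray non-image files. The following sketch builds the same list of rows with sorting and an extension filter. The function name `index_training_images` is hypothetical, and the assumption that the images are `.png` files should be checked against the downloaded data.

```python
import os


def index_training_images(train_dir, categories, extensions=(".png",)):
    """Return (relative_path, category_id, category) tuples in a stable order."""
    records = []
    for category_id, category in enumerate(categories):
        folder = os.path.join(train_dir, category)
        for name in sorted(os.listdir(folder)):        # fixed order across runs
            if name.lower().endswith(extensions):      # skip non-image files
                records.append(
                    ("train/{}/{}".format(category, name), category_id, category)
                )
    return records
```

The result can be passed to `pd.DataFrame(records, columns=['file', 'category_id', 'category'])` to get the same table as above.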

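Next, subsample the training set (at most SAMPLE_PER_CATEGORY images per class, then shuffle) and build the list of test files: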

train = pd.concat([train[train['category'] == c][:SAMPLE_PER_CATEGORY] for c in CATEGORIES])
train = train.sample(frac=1)
train.index = np.arange(len(train))
train.head(2)
train.shape
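Note that `train.sample(frac=1)` is called without a seed, so the shuffle order changes from run to run. To make the logic explicit, here is a minimal pandas-free sketch of the same cap-then-shuffle step with a fixed seed. The helper name `cap_and_shuffle` is hypothetical, and it assumes the category name is the last field of each record.

```python
import random


def cap_and_shuffle(records, per_category, seed):
    """Keep the first `per_category` records of each category, then shuffle."""
    kept = []
    counts = {}
    for record in records:
        label = record[-1]                     # category name is the last field
        if counts.get(label, 0) < per_category:
            kept.append(record)
            counts[label] = counts.get(label, 0) + 1
    rng = random.Random(seed)                  # private RNG: same seed, same order
    rng.shuffle(kept)
    return kept


# Toy records: 5 Maize images and 2 Charlock images, capped at 3 per class
records = [("m%d.png" % i, 0, "Maize") for i in range(5)]
records += [("c%d.png" % i, 1, "Charlock") for i in range(2)]
subset = cap_and_shuffle(records, per_category=3, seed=1987)
print(len(subset))
```

With pandas, passing `random_state=SEED` to `train.sample` gives a reproducible shuffle as well.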

test = []
for file in os.listdir(test_dir):
    test.append(['test/{}'.format(file), file])
test = pd.DataFrame(test, columns=['filepath', 'file'])
test.head(2)
test.shape