TensorFlow 2.0 Tutorial: Classify Structured Data

April 18, 2023. Source: 海萨

1. Know your data and your purpose

Data background: We will use a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

Data source: https://storage.googleapis.com/applied-dl/heart.csv

Learning goals: Using TensorFlow 2.0, this tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). We will use Keras to define the model, and feature columns as a bridge to map from columns in a CSV to features used to train the model. This tutorial contains complete code to:

Load a CSV file using Pandas.
Build an input pipeline to batch and shuffle the rows using tf.data.
Map from columns in the CSV to features used to train the model using feature columns.
Build, train, and evaluate a model using Keras.

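This tutorial uses the following libraries:

tensorflow, numpy, pandas, and sklearn (scikit-learn), so these libraries, or specific modules and classes from them, need to be imported. If an import fails, install the missing package with pip install.

For example: pip install tensorflow==2.0.0-alpha0 # installs TensorFlow 2.0

pip install numpy pandas scikit-learn # installs numpy, pandas, and sklearn (the PyPI package is named scikit-learn)

Note: if a TensorFlow version other than 2.0.0-alpha0 is already installed, remove it with pip uninstall tensorflow before installing this one. Also, Keras is integrated into TensorFlow 2.0, so this tutorial does not load the standalone keras library with import keras; import it from the tensorflow module instead, for example from tensorflow import keras.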
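Before installing or reinstalling anything, it can help to see which versions are already present. The following is a minimal sketch that uses only the standard library's importlib.metadata. The helper name installed_version is hypothetical and not part of the original tutorial, and the names passed to it are PyPI distribution names (scikit-learn rather than sklearn); a TensorFlow build installed under a different distribution name would be reported as not installed.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a PyPI package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this tutorial relies on.
for name in ("tensorflow", "numpy", "pandas", "scikit-learn"):
    version = installed_version(name)
    print(name, version if version is not None else "not installed")
```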

Before starting, load the required modules:

import numpy as np

import pandas as pd

import tensorflow as tf

from tensorflow import feature_column

from tensorflow.keras import layers

from sklearn.model_selection import train_test_split  # splits a dataset into training and test sets

2. Data preparation

2.1 Reading the data

This tutorial loads the data with Python's pandas library, reading the CSV directly from its URL (https://storage.googleapis.com/applied-dl/heart.csv) into a dataframe.

URL = 'https://storage.googleapis.com/applied-dl/heart.csv'

dataframe = pd.read_csv(URL)

or

dataframe = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')

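2.2 Inspecting the data

dataframe.head()  # show the first few rows to check that the data loaded correctly

In pandas the signature is dataframe.head(n=5): head returns the first 5 rows by default, and you can request a different number of rows by passing an integer, for example dataframe.head(10).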
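To see what head returns without needing network access, here is a small sketch that reads a CSV from an in-memory string. The three columns and all values are made up for illustration and are not rows from heart.csv.

```python
import io

import pandas as pd

# Made-up sample data; the real heart.csv has more columns and rows.
csv_text = """age,chol,target
30,200,0
40,210,1
50,220,0
60,230,1
70,240,0
80,250,1
"""

sample = pd.read_csv(io.StringIO(csv_text))

default_preview = sample.head()   # first 5 rows by default
short_preview = sample.head(2)    # first 2 rows only

print(default_preview)
print(short_preview.shape)
```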

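2.3 Training, validation, and test sets

We use the train_test_split function from sklearn.model_selection to split the dataframe loaded from the CSV into three subsets: train, val, and test.

train_test_split(dataframe, test_size=0.25) splits the data randomly (shuffle defaults to True). When neither test_size nor train_size is given, the test subset defaults to 25% of the rows.

In this tutorial, the first step uses train_test_split to split dataframe into train and test, with test taking 20% of the rows, i.e. test_size=0.2.

The second step uses train_test_split again to split train into train and val, with val taking 20% of the remaining rows (about 16% of the original dataset), i.e. test_size=0.2.

The code is as follows: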

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
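As a check on the arithmetic, the same two-step split can be run on a synthetic 100-row dataframe. This is a sketch under assumptions: the dataframe contents are made up, and random_state=0 is added here only to make the shuffle reproducible (the tutorial's code does not set it).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease dataframe: 100 rows, one feature, one label.
demo = pd.DataFrame({"feature": range(100), "target": [i % 2 for i in range(100)]})

# Step 1: hold out 20% of all rows as the test set.
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=0)
# Step 2: hold out 20% of the remaining rows as the validation set.
demo_train, demo_val = train_test_split(demo_train, test_size=0.2, random_state=0)

print(len(demo_train), "train examples")
print(len(demo_val), "validation examples")
print(len(demo_test), "test examples")
```

With 100 rows this gives 64 training, 16 validation, and 20 test rows, and no row appears in more than one subset.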

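2.4 Building the input pipeline

We wrap the train/val/test dataframes with tf.data. This lets us use TensorFlow feature columns as a bridge that maps the columns of the original dataframe to the features used to train the model.

To do this, define a conversion function as follows: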

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')  # the label column
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

Now convert the train/val/test subsets from dataframes to datasets:

batch_size = 5 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
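To see what the pipeline produces, one batch can be pulled from a dataset and inspected. The sketch below repeats df_to_dataset so that it runs on its own and applies it to a small made-up dataframe. The column names age and chol and all values are illustrative assumptions, not data from heart.csv.

```python
import pandas as pd
import tensorflow as tf


def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')  # the label column
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds


# Made-up dataframe with two feature columns and a binary label.
toy = pd.DataFrame({
    'age': [30, 40, 50, 60, 70, 80, 35, 45, 55, 65],
    'chol': [200, 210, 220, 230, 240, 250, 205, 215, 225, 235],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# shuffle=False keeps the rows in their original order for this demonstration.
toy_ds = df_to_dataset(toy, shuffle=False, batch_size=5)

# Each element is a (features, labels) pair; features is a dict keyed by column name.
for feature_batch, label_batch in toy_ds.take(1):
    print('Feature names:', list(feature_batch.keys()))
    print('A batch of ages:', feature_batch['age'].numpy())
    print('A batch of targets:', label_batch.numpy())
```

Each batch is a dictionary that maps column names to tensors of column values, paired with a tensor of labels, which is the form that feature columns consume.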

(Unfinished)

    Original author: 海萨
    Original article: https://zhuanlan.zhihu.com/p/62144722