Datasets for machine learning and how to import them.

Data sets

According to Wikipedia, a data set (or dataset) is a collection of data. Most commonly, a data set corresponds to the contents of a single database table or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the data set in question.

In machine learning we care about the features (or attributes) and the labels of any dataset for which we are going to build a model. Below is an example that explains features/attributes and labels/classes.

Features or Attributes and Class or Labels

Features or Attributes

In the snip of the Iris data shown below, the features sepal length, sepal width, petal length and petal width are used to describe whether a flower is setosa, virginica or versicolor; this outcome is called the class or label.

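As a minimal sketch of what such a snip looks like, the Iris data bundled with scikit-learn can be loaded into a pandas DataFrame so the four feature columns and the label column are visible side by side (this assumes scikit-learn and pandas are installed):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the built-in Iris data set
iris = load_iris()

# Put the four features (attributes) into a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the class/label column, mapping 0/1/2 to setosa/versicolor/virginica
df["species"] = [iris.target_names[t] for t in iris.target]

# The first few rows: four feature columns plus the label column
print(df.head())
```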

When we build our machine learning models, we first split the data into a training dataset, a validation dataset and a test dataset. We use the training dataset to train the model: during training we feed the model both features and labels so it can learn from the data. During validation and testing we feed only the features/attributes of the validation and test datasets and have the model predict the class/label of the corresponding examples.
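As a hedged sketch, one common way to carve out these three sets with scikit-learn is to call train_test_split twice; the 60/20/20 proportions below are just an illustrative choice, not a rule used later in the course:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target  # features and labels

# First split off 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remaining 80% into training and validation sets (60%/20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

# The model sees features *and* labels during training,
# but only features at validation/test time, when it must predict the labels.
```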

Datasets used in this course

We will use the Iris data set from the UCI Machine Learning Repository to get started with machine learning and try out the most popular supervised learning algorithms in the scikit-learn library.

In the Iris data set, each Iris flower belongs to one of three types, Iris setosa, Iris virginica and Iris versicolor, which are the labels/classes, and is described by four different attributes/features: sepal length, sepal width, petal length and petal width.
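As an illustration of what "trying out a supervised algorithm" looks like (the choice of k-nearest neighbours here is only an example, not the specific algorithm covered later), a classifier can be fit on the four Iris features to predict the class:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Fit a k-nearest-neighbours classifier on features + labels
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Predict labels for unseen feature rows and report the accuracy
print(clf.score(X_test, y_test))
```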

To learn data shaping and best practices, we will also use the Audiology data set from the UCI repository.

The Audiology dataset has 226 instances of hearing-status diagnoses for different people, described by 69 attributes, and all of the data are in character form rather than numerical form. We will learn how to convert the data into numerical form, reduce the dimensionality from 69 features to a suitable number, and feed the result to our model to make predictions.
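A hedged sketch of that general workflow is below. The file name, the assumption that the label sits in the last column, and the choice of 10 components are placeholders for illustration, not the exact values used later in the course:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical: assume the audiology data has been loaded into a DataFrame
# whose last column is the diagnosis label and the rest are character-valued attributes.
df = pd.read_csv("audiology.csv")   # placeholder file name
X_raw = df.iloc[:, :-1]             # the character-valued attributes
y = df.iloc[:, -1]                  # the diagnosis label

# Convert character/categorical attributes into numerical columns (one-hot encoding)
X_numeric = pd.get_dummies(X_raw)

# Reduce the feature space to a smaller number of components (10 is an arbitrary choice)
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_numeric)

print(X_numeric.shape, "->", X_reduced.shape)
```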
