Data loaders and file formats in Machine Learning

Machine Learning | 05 July 2018

#100DaysOfMLCode

The first step in any machine learning problem is to load the data. In this blog post, we will look at the different data loaders and file formats available in Python for machine learning.

Common file formats used to store data in machine learning as well as deep learning are .csv, .h5 and .pickle. Apart from these, there are some other ways to handle large data files as discussed here.

CSV files

CSV is the most common file format in machine learning. CSV stands for Comma-Separated Values. Each row in the file represents an instance, and each column represents an attribute or feature.

When working with CSV files, you need to consider the following factors.

  • Does your csv file have a header row with column names?
  • Does your csv file contain comments marked with a special symbol?
  • Does your csv file use a delimiter other than the standard , delimiter?
  • Does your csv file use quotes around string values that contain spaces?
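Each of these factors maps to an argument of the loaders covered later in this post. As a quick sketch, here is how a hypothetical sample.csv with ; delimiters and # comment lines (both assumptions for illustration) could be handled with pandas:

```python
import pandas as pd

# hypothetical sample.csv: ';' delimiter, '#' comment lines, a header row
with open("sample.csv", "w") as f:
    f.write("# toy dataset\n")
    f.write("a;b;label\n")
    f.write("1.0;2.0;x\n")
    f.write("3.0;4.0;y\n")

# sep     -> non-standard delimiter
# comment -> fully commented lines are ignored
# header  -> row 0 (after comments) holds the column names
data = pd.read_csv("sample.csv", sep=";", comment="#", header=0)
print(list(data.columns))  # prints ['a', 'b', 'label']
print(data.shape)          # prints (2, 3)
```

The same questions apply whichever loader you pick; only the argument names change.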

For this tutorial, we will use the famous Iris dataset. You can download the iris.data csv file here: open that page, select all (Ctrl+A), copy (Ctrl+C), paste into a text editor (Ctrl+V) and save it as iris_dataset.csv. Below are the first 10 rows of the Iris dataset in .csv format.

iris_dataset.csv
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

Load CSV using Python

Python's built-in csv module provides a reader() function to read .csv files. You can specify the delimiter used in your csv file as well as the quoting behaviour, as shown below. You can also convert the rows read by the csv reader into a numpy array and use that for machine learning. The example below prints the shape of the Iris dataset using Python's csv reader and numpy.

load_using_python.py
import csv
import numpy as np

# load iris dataset using the csv package
file_path = "iris_dataset.csv"
with open(file_path, "r") as f:
    reader = csv.reader(f, delimiter=",", quoting=csv.QUOTE_NONE)
    d = [row for row in reader if row]  # skip any blank trailing lines

# use numpy to print the shape
data = np.array(d)
print(data.shape) # prints (150, 5)

Load CSV using Numpy

Numpy offers two handy functions to load csv files. numpy.loadtxt() works only if all the values in the csv file share the same datatype. numpy.genfromtxt() can handle multiple datatypes in a single csv file when you pass the argument dtype=None.

load_using_numpy.py
import numpy as np

# load iris dataset using numpy; pass encoding so the class
# column is read as str rather than bytes
file_path = "iris_dataset.csv"
data = np.genfromtxt(file_path, delimiter=",", dtype=None, encoding="utf-8")
print(data) # prints the iris dataset as a structured array
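For comparison, numpy.loadtxt() can read the same kind of data if we restrict it to the homogeneous numeric columns. A minimal sketch (an inline two-row sample stands in for iris_dataset.csv here; pass the file path the same way for the real file):

```python
import io
import numpy as np

# loadtxt requires a single dtype, so read only the four
# numeric columns (0-3) and skip the string class column
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
data = np.loadtxt(io.StringIO(sample), delimiter=",", usecols=(0, 1, 2, 3))
print(data.shape)  # prints (2, 4)
```

With the full 150-row file, the same call would give shape (150, 4).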

Load CSV using Pandas

Pandas offers a more comprehensive and easier way to load csv files. The pandas.read_csv() function loads the csv data as a pandas dataframe, which can be used directly for visualization and machine learning. In the example below, I have specified header names for each column in the csv file and assigned them using the names argument of pd.read_csv().

load_using_pandas.py
import pandas as pd

# load iris dataset using pandas
file_path = "iris_dataset.csv"
header_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
data = pd.read_csv(file_path, names=header_names)
print(data.shape) # prints (150, 5)
print(type(data)) # prints <class 'pandas.core.frame.DataFrame'>
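Once loaded, the dataframe can be split into feature and label arrays for machine learning. A minimal sketch, using a tiny frame with the same column layout assumed above:

```python
import pandas as pd

# a tiny frame with the same column layout as the iris dataframe
header_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
rows = [[5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
        [4.9, 3.0, 1.4, 0.2, "Iris-setosa"]]
data = pd.DataFrame(rows, columns=header_names)

# features: the four numeric columns; labels: the class column
X = data.drop(columns=["class"]).values
y = data["class"].values
print(X.shape, y.shape)  # prints (2, 4) (2,)
```

The resulting numpy arrays X and y can be fed straight into most machine learning libraries.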

Advantages of CSV

  • Human-readable format.
  • Easy to edit manually.
  • Easy to implement and parse.
  • Fast to handle for smaller datasets.

Disadvantages of CSV

  • Slow to load large datasets.
  • Slow to save large datasets.
  • No standard way to represent binary data and control characters.
  • No distinction between numeric and text values.
  • No way to represent complex hierarchical data.

HDF5 files

An alternative to the CSV file format is the HDF5 file format, which is now widely used in Deep Learning. HDF5 is a hierarchical data format to store and manage large chunks of data. It resembles a filesystem in that it can store complex hierarchical data. Please read this introduction to the HDF5 file format before you start using it.

HDF5 includes two major object types.

  1. Groups - Container like structures to hold datasets and other groups.
  2. Datasets - Multi-dimensional arrays of a homogeneous type (You could think of a standard numpy array).

Save HDF5 using h5py

You can save any number of numpy arrays in the .h5 file format and load them again using a python library called h5py. If you installed Python via Anaconda, you should have it by default. Otherwise, install it using pip (pip install h5py).

The following example creates two random numpy arrays and saves a demo_data.h5 file locally. Notice that we pass the numpy array as the data argument to the create_dataset() function. You can create as many datasets as you want, giving each a unique name.

If you execute the below code, you will get a demo_data.h5 file of about 9.6MB on your local disk (1,250,000 float64 values × 8 bytes ≈ 10MB).

save_h5.py
import numpy as np
import h5py

# create two random numpy arrays
a1 = np.random.random(size = (500, 500))
a2 = np.random.random(size = (1000, 1000))

# save it as a HDF5 file
with h5py.File("demo_data.h5", "w") as f:
	f.create_dataset("dataset1", data=a1)
	f.create_dataset("dataset2", data=a2)

Load HDF5 using h5py

You can list the keys in a .h5 file by calling the keys() function directly on the file object. This returns the names of the datasets and groups you have created.

You can access a stored dataset using the get() function with the dataset's name passed as the argument. After loading the dataset, you can easily use the np.array() function to convert it back to a numpy array.

The following example opens the demo_data.h5 file, displays its keys, loads dataset1, converts it back to a numpy array and prints its shape.

load_h5.py
import numpy as np
import h5py

# load the saved .h5 file
with h5py.File("demo_data.h5", "r") as f:
	
	# display the list of keys in this file
	keys = list(f.keys())
	print("{0} keys in this file: {1}".format(len(keys), keys))
	# prints 2 keys in this file: ['dataset1', 'dataset2']
	
	# load the numpy array
	data = f.get("dataset1")
	a1   = np.array(data)
	print("Shape of dataset1: {0}".format(a1.shape)) 
	# prints Shape of dataset1: (500, 500)
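One advantage of HDF5 worth noting here: you can read just a slice of a dataset without loading the whole array into memory, because h5py datasets support numpy-style indexing. A small sketch (the file name demo_slice.h5 is an assumption for illustration):

```python
import numpy as np
import h5py

# write a small dataset, then read back only part of it
with h5py.File("demo_slice.h5", "w") as f:
    f.create_dataset("dataset1", data=np.arange(100).reshape(10, 10))

with h5py.File("demo_slice.h5", "r") as f:
    first_rows = f["dataset1"][:2]  # only these rows are read from disk
    print(first_rows.shape)  # prints (2, 10)
```

For large datasets this avoids the all-or-nothing loading that CSV forces on you.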

Save group using h5py

Similar to datasets, which hold multi-dimensional numpy arrays, you can create groups to hold datasets and other groups. To do that, use the create_group() function and pass in a unique group name.

The following example creates two random numpy arrays and two groups, each holding one dataset. Running this example generates a demo_group.h5 file of about 3.9MB on your local disk.

save_group_h5.py
import numpy as np
import h5py

# create two random numpy arrays
a1 = np.random.random(size= (500, 500))
a2 = np.random.random(size= (500, 500))

# create groups and datasets
with h5py.File("demo_group.h5", "w") as f:
	group1 = f.create_group("group1")
	group1.create_dataset("dataset1", data=a1)

	group2 = f.create_group("group2")
	group2.create_dataset("dataset2", data=a2)

Load group using h5py

To load a group from a .h5 file, use the same get() function with the group's name as the argument. Note that keys() returns only the names of the objects at the root of the file, while items() returns (name, object) pairs, which makes it useful for inspecting groups.

The following example opens the demo_group.h5 file, lists the groups it contains using the items() function and prints out the contents of group1.

load_group_h5.py
import numpy as np
import h5py

# read groups and datasets
with h5py.File("demo_group.h5", "r") as f:

	# get the list of items in the file
	groups = list(f.items())
	print("{0} items in this file - {1}".format(len(groups), groups))

	# load group1 and print the items
	group1 = f.get("group1")
	group1_items = list(group1.items())
	print("{0} item in group1 - {1}".format(len(group1_items), group1_items))

	# load dataset1 from group1 and print its shape
	data1 = group1.get("dataset1")
	a1    = np.array(data1)
	print("Shape of dataset1 in group1: {0}".format(a1.shape))
output
2 items in this file - [('group1', <HDF5 group "/group1" (1 members)>), ('group2', <HDF5 group "/group2" (1 members)>)]
1 item in group1 - [('dataset1', <HDF5 dataset "dataset1": shape (500, 500), type "<f8">)]
Shape of dataset1 in group1: (500, 500)
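If you want to see every group and dataset in a file without chaining get() calls, h5py also provides visititems(), which walks the whole hierarchy and calls a function with each object's full path name. A short sketch that recreates the same layout in a separate demo_visit.h5 file (the file name is an assumption, so the article's demo_group.h5 is left untouched):

```python
import numpy as np
import h5py

# recreate a small file with the same group/dataset layout
with h5py.File("demo_visit.h5", "w") as f:
    f.create_group("group1").create_dataset("dataset1", data=np.zeros((5, 5)))
    f.create_group("group2").create_dataset("dataset2", data=np.zeros((5, 5)))

# visititems() calls the given function once for every group
# and dataset, passing the full path name and the object itself
with h5py.File("demo_visit.h5", "r") as f:
    f.visititems(lambda name, obj: print(name))
    # prints the full path of each of the four objects,
    # e.g. group1, group1/dataset1, group2, group2/dataset2
```

This is handy for exploring an unfamiliar .h5 file before writing loading code for it.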

If you found something useful to add to this article, spotted a bug in the code, or would like to improve any of the points mentioned, feel free to write it down in the comments. Hope you found something useful here.

Happy learning!