Dataset basics and concepts — PyMVPA 2.4.0 documentation (2023)


This tutorial part is also available for download as an IPython notebook: [ipynb]

A Dataset is the basic data container in PyMVPA. It serves as the primary form of data storage, but also as a common container for results returned by most algorithms. In this tutorial part we will take a look at what a dataset consists of, and how it works.

Most datasets in PyMVPA are represented as a two-dimensional array, where the first axis is the samples axis, and the second axis represents the features of the samples. In the simplest case, a dataset only contains data that is a matrix of numerical values.

>>> from mvpa2.tutorial_suite import *
>>> data = [[ 1,  1, -1],
...         [ 2,  0,  0],
...         [ 3,  1,  1],
...         [ 4,  0, -1]]
>>> ds = Dataset(data)
>>> ds.shape
(4, 3)
>>> len(ds)
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 1,  1, -1],
       [ 2,  0,  0],
       [ 3,  1,  1],
       [ 4,  0, -1]])

In the above example, every row vector in the data matrix becomes an observation, a sample, in the dataset, and every column vector represents an individual variable, a feature. The concepts of samples and features are essential for a dataset, hence we take a closer look.

The dataset assumes that the first axis of the data is to be used to define individual samples. If the dataset is created using a one-dimensional vector it will therefore have as many samples as elements in the vector, and only one feature.

>>> one_d = [ 0, 1, 2, 3 ]
>>> one_ds = Dataset(one_d)
>>> one_ds.shape
(4, 1)

On the other hand, if a dataset is created from multi-dimensional data, only its second axis represents the features:

>>> import numpy as np
>>> m_ds = Dataset(np.random.random((3, 4, 2, 3)))
>>> m_ds.shape
(3, 4, 2, 3)
>>> m_ds.nfeatures
4

In this case we have a dataset with three samples and four features, where each feature is a 2x3 matrix. In case somebody is wondering now why not simply treat each value in the data array as its own feature (yielding 24 features) – stay tuned, as this is going to be of importance later on.


What we have seen so far does not really warrant the use of a dataset over a plain array or a matrix with samples. However, in the MVPA context we often need to know more about each sample than just the values of its features. For example, in order to train a supervised-learning algorithm to discriminate two classes of samples we need per-sample target values to label each sample with its respective class. Such information can then be used in order to, for example, split a dataset into specific groups of samples. For this type of auxiliary information a dataset can also contain collections of three types of attributes: a sample attribute, a feature attribute, and a dataset attribute.

For samples

Each sample in a dataset can have an arbitrary number of additional attributes. They are stored as vectors of the same length as the number of samples in a collection, and are accessible via the sa attribute. A collection is similar to a standard Python dict, and hence adding sample attributes works just like adding elements to a dictionary:

>>> ds.sa['some_attr'] = [ 0., 1, 1, 3 ]
>>> ds.sa.keys()
['some_attr']

However, sample attributes are not directly stored as plain data, but for various reasons as a so-called Collectable that in turn embeds a NumPy array with the actual attribute:

>>> type(ds.sa['some_attr'])
<class 'mvpa2.base.collections.ArrayCollectable'>
>>> ds.sa['some_attr'].value
array([ 0.,  1.,  1.,  3.])

This “complication” is done to be able to extend attributes with additional functionality that is often needed and can offer a significant speed-up of processing. For example, sample attributes carry a list of their unique values. This list is only computed once (upon first request) and can subsequently be accessed directly without repeated and expensive searches.
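The caching idea can be sketched with a tiny stand-in class (a hypothetical illustration of the concept, not PyMVPA's actual ArrayCollectable implementation):

```python
import numpy as np

class CachedAttribute(object):
    """Minimal sketch of an attribute that caches its unique values."""
    def __init__(self, value):
        self.value = np.asarray(value)
        self._unique = None              # not computed yet

    @property
    def unique(self):
        # the (potentially expensive) search runs only on first access;
        # later accesses reuse the cached result
        if self._unique is None:
            self._unique = np.unique(self.value)
        return self._unique

attr = CachedAttribute([0., 1, 1, 3])
print(attr.unique)   # [0. 1. 3.]
```

Every subsequent read of attr.unique returns the cached array without touching the data again, which is exactly the kind of shortcut the Collectable wrapper makes possible.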

However, for most interactive uses of PyMVPA this type of access to attributes’ .value is relatively cumbersome (too much typing), therefore collections support direct access by name:

>>> ds.sa.some_attr
array([ 0.,  1.,  1.,  3.])

Another purpose of the sample attribute collection is to preserve data integrity, by disallowing improper attributes:

>>> ds.sa['invalid'] = 4
Traceback (most recent call last):
  ...
ValueError: ArrayCollectable only takes sequences as value.
>>> ds.sa['invalid'] = [ 1, 2, 3, 4, 5, 6 ]
Traceback (most recent call last):
  ...
ValueError: Collectable 'invalid' with length [6] does not match the required length [4] of collection '<SampleAttributesCollection: some_attr>'.

But other than basic plausibility checks, no further constraints on values of sample attributes exist. As long as the length of the attribute vector matches the number of samples in the dataset, and the attribute values can be stored in a NumPy array, any value is allowed. Consequently, it is even possible to have n-dimensional arrays, not just vectors, as attributes – as long as their first axis matches the number of samples in the dataset. Moreover, it is perfectly possible and supported to store literal (non-numerical) attributes. It should also be noted that each attribute may have its own individual data type, hence it is possible to have literal and numeric attributes in the same dataset.

>>> ds.sa['literal'] = ['one', 'two', 'three', 'four']
>>> sorted(ds.sa.keys())
['literal', 'some_attr']
>>> for attr in ds.sa:
...     print "%s: %s" % (attr, ds.sa[attr].value.dtype.name)
literal: string40
some_attr: float64

For features

Feature attributes are almost identical to sample attributes; the only difference is that instead of having one attribute value per sample, feature attributes have one value per (guess what? ...) feature. Moreover, they are stored in a separate collection in the dataset that is called fa:

>>> ds.nfeatures
3
>>> ds.fa['my_fav'] = [0, 1, 0]
>>> ds.fa['responsible'] = ['me', 'you', 'nobody']
>>> sorted(ds.fa.keys())
['my_fav', 'responsible']

For the entire dataset

Lastly, there can also be attributes, not per-sample or per-feature, but for the dataset as a whole: so-called dataset attributes. Both assigning such attributes and accessing them later on work in exactly the same way as for the other two types of attributes, except that dataset attributes are stored in their own collection, which is accessible via the a property of the dataset. However, in contrast to sample and feature attributes, no constraints on the type or size are imposed – anything can be stored. Let’s store a list with the names of all files in the current directory, just because we can:

>>> from glob import glob
>>> ds.a['pointless'] = glob("*")
>>> 'setup.py' in ds.a.pointless
True

Slicing, resampling, feature selection

At this point we can already construct a dataset from simple arrays and enrich it with an arbitrary number of additional attributes. But just having a dataset isn’t enough. We often need to be able to select subsets of a dataset for further processing.

Slicing a dataset (i.e. selecting specific subsets) is very similar to slicing a NumPy array. It actually works almost identically. A dataset supports Python’s slice syntax, but also selection by boolean masks and indices. The following three slicing operations result in equivalent output datasets, by always selecting every other sample in the dataset:

>>> # original
>>> ds.samples
array([[ 1,  1, -1],
       [ 2,  0,  0],
       [ 3,  1,  1],
       [ 4,  0, -1]])
>>>
>>> # Python-style slicing
>>> ds[::2].samples
array([[ 1,  1, -1],
       [ 3,  1,  1]])
>>>
>>> # Boolean mask array
>>> mask = np.array([True, False, True, False])
>>> ds[mask].samples
array([[ 1,  1, -1],
       [ 3,  1,  1]])
>>>
>>> # Slicing by index -- Python indexing starts with 0 !!
>>> ds[[0, 2]].samples
array([[ 1,  1, -1],
       [ 3,  1,  1]])


Search the NumPy documentation for the difference between “basic slicing” and “advanced indexing”. The aspect of memory consumption, especially, applies to dataset slicing as well, and being aware of this fact might help to write more efficient analysis scripts. Which of the three slicing approaches above is the most memory-efficient? Which of the three slicing approaches above might lead to unexpected side-effects if the output dataset gets modified?

All three slicing styles are equally applicable to the selection of feature subsets within a dataset. Remember, features are represented on the second axis of a dataset.

>>> ds[:, [1,2]].samples
array([[ 1, -1],
       [ 0,  0],
       [ 1,  1],
       [ 0, -1]])

By applying a selection by indices to the second axis, we can easily get the last two features of our example dataset. Please note that the : is supplied for the first axis slicing. This is the Python way to indicate take everything along this axis, thus including all samples.

As you can guess, it is also possible to select subsets of samples and features at the same time.

>>> subds = ds[[0,1], [0,2]]
>>> subds.samples
array([[ 1, -1],
       [ 2,  0]])

If you have prior experience with NumPy you might be confused now. What you might have expected is this:

>>> ds.samples[[0,1], [0,2]]
array([1, 0])

The above code applies the same slicing directly to the NumPy array of .samples, and the result is fundamentally different. For NumPy arrays this style of slicing allows selection of specific elements by their indices on each axis of an array. For PyMVPA’s datasets this mode is not very useful; instead, we typically want to select rows and columns, i.e. samples and features given by their indices.


Try to select samples [0,1] and features [0,2] simultaneously using dataset slicing. Now apply the same slicing to the samples array itself (ds.samples) – make sure that the result doesn’t surprise you and find a pure NumPy way to achieve similar selection.

One last interesting thing to look at, in the context of dataset slicing, is the attributes. What happens to them when a subset of samples and/or features is chosen? Our original dataset had both sample and feature attributes:

>>> print ds.sa.some_attr
[ 0.  1.  1.  3.]
>>> print ds.fa.responsible
['me' 'you' 'nobody']

Now let’s look at what they became in the subset dataset we previously created:

>>> print subds.sa.some_attr
[ 0.  1.]
>>> print subds.fa.responsible
['me' 'nobody']

We see that both attributes are still there and, moreover, the corresponding subsets have been selected. This makes it convenient to select subsets of the dataset matching specific values of sample or feature attributes, or both:

>>> subds = ds[ds.sa.some_attr == 1., ds.fa.responsible == 'me']
>>> print subds.shape
(2, 1)

To simplify such selections based on the values of attributes, it is possible to specify the desired selection as a dictionary for either the samples or the features dimension, where each key corresponds to an attribute name, and each value specifies a list of desired attribute values. Specifying multiple keys for either dimension can be used to obtain the intersection of matching elements:

>>> subds = ds[{'some_attr': [1., 0.], 'literal': ['two']}, {'responsible': ['me', 'you']}]
>>> print subds.sa.some_attr, subds.sa.literal, subds.fa.responsible
[ 1.] ['two'] ['me' 'you']


Check the documentation of the select() method that can also be used to implement such a selection, but provides an additional argument strict. Modify the example above to select non-existing elements via [], and compare the result to the output of select() with strict=False.

Load fMRI data

Enough theoretical foreplay – let’s look at a concrete example of loading an fMRI dataset. PyMVPA has several helper functions to load data from specialized formats, and the one for fMRI data is fmri_dataset(). The example dataset we are going to look at is a single subject from Haxby et al. (2001). For more convenience and less typing, we have a shortcut for the path of the directory with the fMRI data: tutorial_data_path.

In the simplest case, we now let fmri_dataset do its job, by just pointing it to the fMRI data file. The data is stored as a NIfTI file that has all volumes of one experiment concatenated into a single file.

>>> bold_fname = os.path.join(tutorial_data_path, 'haxby2001', 'sub001',
...                           'BOLD', 'task001_run001', 'bold.nii.gz')
>>> ds = fmri_dataset(bold_fname)
>>> len(ds)
121
>>> ds.nfeatures
163840
>>> ds.shape
(121, 163840)

We can notice two things. First – it worked! Second, we obtained a two-dimensional dataset with 121 samples (these are volumes in the NIfTI file), and over 160k features (these are voxels in the volume). The voxels are represented as a one-dimensional vector, and it seems that they have lost their association with the 3D-voxel-space. However, this is not the case, as we will see later. PyMVPA represents data in this simple format to make it compatible with a vast range of generic algorithms that expect data to be a simple matrix.
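The flattening itself is easy to emulate with plain NumPy (a sketch assuming 40x64x64 voxels per volume, which is consistent with the 163840 features reported above):

```python
import numpy as np

# An fMRI-like time series: 121 volumes, each 40x64x64 voxels
# (assumed shapes, matching the example dataset above).
bold = np.zeros((121, 40, 64, 64), dtype=np.int16)

# Flattening every volume into a 1-D feature vector yields exactly
# the samples x features matrix layout used by the dataset.
flat = bold.reshape(len(bold), -1)
print(flat.shape)    # (121, 163840)
```

Note that 40 * 64 * 64 = 163840, i.e. every voxel of a volume becomes one feature of the corresponding sample.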

We loaded all data from that NIfTI file, but usually we would be interested in a subset only, i.e. “brain voxels”. fmri_dataset is capable of performing data masking. We just need to specify a mask image. Such a mask image is generated in pretty much any fMRI analysis pipeline – may it be a full-brain mask computed during skull-stripping, or an activation map from a functional localizer. We are going to use the original GLM-based localizer mask of ventral temporal cortex from Haxby et al. (2001). Let’s reload the dataset:

>>> mask_fname = os.path.join(tutorial_data_path, 'haxby2001', 'sub001',
...                           'masks', 'orig', 'vt.nii.gz')
>>> ds = fmri_dataset(bold_fname, mask=mask_fname)
>>> len(ds)
121
>>> ds.nfeatures
577

As expected, we get the same number of samples, but now only 577 features – voxels corresponding to non-zero elements in the mask image. Now, let’s explore this dataset a little further.


Explore the dataset attribute collections. What kind of information do they contain?

Besides samples, the dataset offers a number of attributes that enhance the data with information that is present in the NIfTI image file header. Each sample has information about its volume index in the time series and the actual acquisition time (relative to the beginning of the file). Moreover, the original voxel index (sometimes referred to as ijk) for each feature is available too. Finally, the dataset also contains information about the dimensionality of the input volumes, voxel size, and any other NIfTI-specific information, since it also includes a dump of the full NIfTI image header.

>>> ds.sa.time_indices[:5]
array([0, 1, 2, 3, 4])
>>> ds.sa.time_coords[:5]
array([  0. ,   2.5,   5. ,   7.5,  10. ])
>>> ds.fa.voxel_indices[:5]
array([[ 6, 23, 24],
       [ 7, 18, 25],
       [ 7, 18, 26],
       [ 7, 18, 27],
       [ 7, 19, 25]])
>>> ds.a.voxel_eldim
(3.5, 3.75, 3.75)
>>> ds.a.voxel_dim
(40, 64, 64)
>>> 'imghdr' in ds.a
True

In addition to all this information, the dataset also carries a key additional attribute: the mapper. A mapper is an important concept in PyMVPA, and hence has its own tutorial chapter.

>>> print ds.a.mapper
<Chain: <Flatten>-<StaticFeatureSelection>>

Having all these attributes as part of a dataset is often useful, but in some cases (e.g. when it comes to efficiency, and/or very large datasets) one might want to have a leaner dataset with just the information that is really necessary. One way to achieve this is to strip all unwanted attributes. The Dataset class’ copy() method can help with that.

>>> stripped = ds.copy(deep=False, sa=['time_coords'], fa=[], a=[])
>>> print stripped
<Dataset: 121x577@int16, <sa: time_coords>>

We can see that all attributes besides time_coords have been filtered out. Setting the deep argument to False causes the copy function to reuse the data from the source dataset to generate the new stripped one, without duplicating all data in memory – meaning both datasets now share the sample data, and any change made to ds will also affect stripped.
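This sharing behaviour is analogous to what plain NumPy references do; a small PyMVPA-independent sketch:

```python
import numpy as np

samples = np.arange(6).reshape(2, 3)

shared = samples               # like deep=False: same underlying data
independent = samples.copy()   # like deep=True: its own buffer

samples[0, 0] = 99
print(shared[0, 0])        # 99 -- visible through the shallow "copy"
print(independent[0, 0])   # 0  -- the deep copy is unaffected
```

If you intend to modify one dataset while keeping the other pristine, pay the memory cost of a deep copy instead.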

Intermediate storage

Some data preprocessing can take a long time. One would rather prevent having to do it over and over again, and instead just store the preprocessed data into a file for subsequent analyses. PyMVPA offers functionality to store a large variety of objects, including datasets, into HDF5 files. A variant of this format is also used by recent versions of MATLAB to store data.

For HDF5 support, PyMVPA depends on the h5py package. If it is available, any dataset can be saved to a file by simply calling save() with the desired filename.

>>> import tempfile, shutil
>>> # create a temporary directory
>>> tempdir = tempfile.mkdtemp()
>>> ds.save(os.path.join(tempdir, 'mydataset.hdf5'))

HDF5 is a flexible format that also supports, for example, data compression. To enable it, you can pass additional arguments to save() that are supported by h5py’s Group.create_dataset(). Instead of using save() one can also use the h5save() function in a similar way. Saving the same dataset with maximum gzip-compression looks like this:

>>> ds.save(os.path.join(tempdir, 'mydataset.gzipped.hdf5'), compression=9)
>>> h5save(os.path.join(tempdir, 'mydataset.gzipped.hdf5'), ds, compression=9)

Loading datasets from a file is easy too. h5load() takes a filename as an argument and returns the stored dataset. Compressed data will be handled transparently.

>>> loaded = h5load(os.path.join(tempdir, 'mydataset.hdf5'))
>>> np.all(ds.samples == loaded.samples)
True
>>> # cleanup the temporary directory, and everything it includes
>>> shutil.rmtree(tempdir, ignore_errors=True)

Note that this type of dataset storage is not appropriate for long-term archival of data, as it relies on a stable software environment. For long-term storage, use other formats.

