Intro: Datasets

Several test datasets with simulated data are available for experimenting with the different functions:

>>> from eelbrain import *
>>> ds = datasets.get_uts()

To find out what a Dataset contains, use ds.head() to see the first 10 cases, or look at its string representation by typing the Dataset’s name:

>>> ds.head()
A    B    rm    ind   Y          YBin   YCat
--------------------------------------------
a0   b0   R00   R00   2.0977     c1     c1
a0   b0   R01   R01   1.8942     c1     c1
a0   b0   R02   R02   0.77358    c2     c2
a0   b0   R03   R03   2.554      c1     c3
a0   b0   R04   R04   1.0135     c1     c2
a0   b0   R05   R05   -3.5303    c2     c2
a0   b0   R06   R06   1.6037     c1     c1
a0   b0   R07   R07   0.71308    c1     c1
a0   b0   R08   R08   -0.84538   c1     c2
a0   b0   R09   R09   2.6804     c2     c3

>>> ds
<Dataset n_cases=60 {'A':F, 'B':F, 'rm':F, 'ind':F, 'Y':V, 'YBin':F, 'YCat':F, 'uts':Vnd}>

The latter also lists 'uts':Vnd, which is not visible in the table. It is not shown because it is an NDVar, i.e. a multidimensional variable, which cannot easily be fit into a 2D table. Examine its content by retrieving it from the Dataset:

>>> ds['uts']
<NDVar 'uts': 60 case, 100 time>

In addition to retrieving variables by name, individual cases (rows) can be retrieved as dictionaries:

>>> ds[0]
{'A': 'a0',
 'B': 'b0',
 'rm': 'R00',
 'ind': 'R00',
 'Y': 2.097726673341931,
 'YBin': 'c1',
 'YCat': 'c1',
 'uts': <NDVar 'uts': 100 time>}
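
The difference between retrieving a variable by name and retrieving a case by index can be sketched in plain Python. This is only an illustration: the dict of lists stands in for a Dataset, and `get_case` is a hypothetical helper, not part of the library.

```python
# Toy column-oriented table; a dict of lists stands in for a Dataset here.
table = {
    "A": ["a0", "a0", "a0"],
    "rm": ["R00", "R01", "R02"],
    "Y": [2.0977, 1.8942, 0.77358],
}


def get_case(table, index):
    """Collect the value at `index` from every column into one dict."""
    return {name: column[index] for name, column in table.items()}


print(table["Y"])          # column access: one variable, all cases
print(get_case(table, 0))  # row access: one case, all variables
```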

A subset of rows can be accessed by indexing with an array, such as a boolean array with one value per case:

>>> ds['rm'] == 'R00'
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False])

>>> print(ds[ds['rm'] == 'R00'])
A    B    rm    ind   Y        YBin   YCat
------------------------------------------
a0   b0   R00   R00   2.0977   c1     c1
a0   b1   R00   R15   1.919    c1     c2
a1   b0   R00   R30   1.326    c1     c1
a1   b1   R00   R45   2.0916   c1     c2
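
What the boolean index does can be sketched without the library. The snippet below is a minimal plain-Python illustration on a toy table (a dict of lists with made-up row order), not how Dataset indexing is actually implemented.

```python
# Toy column-oriented table; a dict of lists stands in for a Dataset here.
table = {
    "rm": ["R00", "R01", "R00", "R01"],
    "Y": [2.0977, 1.8942, 1.919, 0.77832],
}

# Comparing a column with a value gives one boolean per case (row).
mask = [value == "R00" for value in table["rm"]]

# Applying the mask to every column keeps the rows where it is True.
subset = {
    name: [v for v, keep in zip(column, mask) if keep]
    for name, column in table.items()
}

print(mask)
print(subset)
```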

The Dataset.sub() method accepts the index expression as a code string. The following has the same effect as the indexing above but is more readable, especially when multiple variables are involved:

>>> print(ds.sub("rm == 'R00'"))
A    B    rm    ind   Y        YBin   YCat
------------------------------------------
a0   b0   R00   R00   2.0977   c1     c1
a0   b1   R00   R15   1.919    c1     c2
a1   b0   R00   R30   1.326    c1     c1
a1   b1   R00   R45   2.0916   c1     c2

Multiple Datasets containing the same variables can be combined using the combine() function:

>>> ds0 = ds.sub("rm == 'R00'")
>>> ds1 = ds.sub("rm == 'R01'")
>>> ds10 = combine([ds1, ds0])
>>> print(ds10)
A    B    rm    ind   Y          YBin   YCat
--------------------------------------------
a0   b0   R01   R01   1.8942     c1     c1
a0   b1   R01   R16   0.77832    c1     c3
a1   b0   R01   R31   -0.85264   c2     c1
a1   b1   R01   R46   1.0406     c1     c2
a0   b0   R00   R00   2.0977     c1     c1
a0   b1   R00   R15   1.919      c1     c2
a1   b0   R00   R30   1.326      c1     c1
a1   b1   R00   R45   2.0916     c1     c2
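
The row order in the output above follows the order of the list passed to combine(). A rough plain-Python sketch of that behaviour is shown below; `combine_tables` is a hypothetical helper operating on dicts of lists, and it is not the library's combine() function.

```python
def combine_tables(tables):
    """Concatenate column-oriented tables that share the same variables."""
    names = list(tables[0])
    for t in tables[1:]:
        if set(t) != set(names):
            raise ValueError("all tables must contain the same variables")
    # Concatenate each column in the order the tables were given.
    return {name: [v for t in tables for v in t[name]] for name in names}


# Toy stand-ins for the two subsets (only two variables, two cases each).
ds0 = {"rm": ["R00", "R00"], "Y": [2.0977, 1.919]}
ds1 = {"rm": ["R01", "R01"], "Y": [1.8942, 0.77832]}

combined = combine_tables([ds1, ds0])
print(combined["rm"])
print(combined["Y"])
```

Because ds1 comes first in the list, its cases come first in the result, just as in the table above.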
