Intro: Datasets
Several test datasets with simulated data are available for experimenting with different functionality:
>>> from eelbrain import *
>>> ds = datasets.get_uts()
To find out what a Dataset contains, use ds.head() to see the first 10 cases, or look at the "official" string representation by typing the Dataset’s name:
>>> ds.head()
A    B    rm    ind   Y          YBin   YCat
--------------------------------------------
a0   b0   R00   R00   2.0977     c1     c1
a0   b0   R01   R01   1.8942     c1     c1
a0   b0   R02   R02   0.77358    c2     c2
a0   b0   R03   R03   2.554      c1     c3
a0   b0   R04   R04   1.0135     c1     c2
a0   b0   R05   R05   -3.5303    c2     c2
a0   b0   R06   R06   1.6037     c1     c1
a0   b0   R07   R07   0.71308    c1     c1
a0   b0   R08   R08   -0.84538   c1     c2
a0   b0   R09   R09   2.6804     c2     c3
>>> ds
<Dataset n_cases=60 {'A':F, 'B':F, 'rm':F, 'ind':F, 'Y':V, 'YBin':F, 'YCat':F, 'uts':Vnd}>
The latter also lists 'uts':Vnd, which is not visible in the table. It is omitted because it is an NDVar, i.e. a multidimensional variable, which cannot simply be fitted into a 2d table. Examine its content by retrieving it from the Dataset:
>>> ds['uts']
<NDVar 'uts': 60 case, 100 time>
In addition to retrieving variables by name, individual cases (rows) can be retrieved as dictionaries:
>>> ds[0]
{'A': 'a0',
'B': 'b0',
'rm': 'R00',
'ind': 'R00',
'Y': 2.097726673341931,
'YBin': 'c1',
'YCat': 'c1',
'uts': <NDVar 'uts': 100 time>}
A subset of rows can be accessed using boolean index arrays:
>>> ds['rm'] == 'R00'
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False])
>>> print(ds[ds['rm'] == 'R00'])
A    B    rm    ind   Y        YBin   YCat
------------------------------------------
a0   b0   R00   R00   2.0977   c1     c1
a0   b1   R00   R15   1.919    c1     c2
a1   b0   R00   R30   1.326    c1     c1
a1   b1   R00   R45   2.0916   c1     c2
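The selection above relies on boolean indexing: the comparison yields one True/False value per case, and indexing with that array keeps only the True cases. The following minimal sketch illustrates the mechanism with plain NumPy arrays rather than Eelbrain objects; the column names and values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in columns (plain NumPy arrays, not Eelbrain objects)
rm = np.array(['R00', 'R01', 'R02', 'R00', 'R01', 'R02'])
y = np.array([2.1, 1.9, 0.8, 1.9, 0.8, -0.3])

# Comparing an array with a scalar yields one boolean per case
mask = rm == 'R00'

# Indexing with the boolean array keeps only the cases where mask is True
selected_y = y[mask]
n_selected = int(mask.sum())
```

The mask must have exactly one entry per case, which is why it is usually derived from a variable of the same Dataset.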
The Dataset.sub() method allows writing the index expression as a code string; the following code therefore has the same effect as the above but is more readable (especially when multiple variables are involved):
>>> print(ds.sub("rm == 'R00'"))
A    B    rm    ind   Y        YBin   YCat
------------------------------------------
a0   b0   R00   R00   2.0977   c1     c1
a0   b1   R00   R15   1.919    c1     c2
a1   b0   R00   R30   1.326    c1     c1
a1   b1   R00   R45   2.0916   c1     c2
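To see why a code string reads better once several variables are involved, the following sketch evaluates an index expression with the column names as variables. It uses a dict of plain NumPy arrays, and sub_sketch is a hypothetical helper written for this illustration; it is not how Eelbrain implements Dataset.sub().

```python
import numpy as np

# Hypothetical stand-in columns (plain NumPy arrays, not an Eelbrain Dataset)
columns = {
    'A': np.array(['a0', 'a0', 'a1', 'a1']),
    'rm': np.array(['R00', 'R01', 'R00', 'R01']),
    'Y': np.array([2.1, 1.9, 1.3, -0.9]),
}


def sub_sketch(columns, expression):
    """Evaluate expression with column names as variables; return matching rows."""
    # Only use eval() on trusted expressions
    index = eval(expression, {}, dict(columns))
    return {name: values[index] for name, values in columns.items()}


# Explicit indexing repeats the container name for every variable ...
explicit = (columns['A'] == 'a1') & (columns['rm'] == 'R00')

# ... whereas the code string only mentions the variable names
subset = sub_sketch(columns, "(A == 'a1') & (rm == 'R00')")
```

Each comparison is parenthesized because & binds more tightly than == in Python.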
Multiple Datasets containing the same variables can be combined using the combine() function:
>>> ds0 = ds.sub("rm == 'R00'")
>>> ds1 = ds.sub("rm == 'R01'")
>>> ds10 = combine([ds1, ds0])
>>> print(ds10)
A    B    rm    ind   Y          YBin   YCat
--------------------------------------------
a0   b0   R01   R01   1.8942     c1     c1
a0   b1   R01   R16   0.77832    c1     c3
a1   b0   R01   R31   -0.85264   c2     c1
a1   b1   R01   R46   1.0406     c1     c2
a0   b0   R00   R00   2.0977     c1     c1
a0   b1   R00   R15   1.919      c1     c2
a1   b0   R00   R30   1.326      c1     c1
a1   b1   R00   R45   2.0916     c1     c2
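In the output above, the cases appear in the order in which the Datasets were listed, so the R01 cases precede the R00 cases. Conceptually this amounts to concatenating each variable across the inputs. The sketch below shows that idea with dicts of plain NumPy arrays; combine_sketch is a hypothetical helper for illustration only and is not Eelbrain's combine().

```python
import numpy as np


def combine_sketch(parts):
    """Stack dicts of equally named columns, in the order given (hypothetical helper)."""
    names = list(parts[0])
    for part in parts:
        # Combining only makes sense if every part has the same variables
        if set(part) != set(names):
            raise ValueError("all parts must contain the same variables")
    return {name: np.concatenate([part[name] for part in parts]) for name in names}


# Made-up stand-ins for two subsets of the same data
part0 = {'rm': np.array(['R00', 'R00']), 'Y': np.array([2.1, 1.9])}
part1 = {'rm': np.array(['R01', 'R01']), 'Y': np.array([1.9, 0.8])}

# part1 is listed first, so its cases come first in the result
combined = combine_sketch([part1, part0])
```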