ehrapy.preprocessing.encode
- ehrapy.preprocessing.encode(edata, autodetect=False, encodings='one-hot', *, layer=None)
Encode categoricals of a data object.
Categorical values are either passed explicitly via parameters or autodetected on the fly. The original, unencoded categorical values are kept in obs and uns. The encoding mode used for each variable is stored in edata.var['encoding_mode'], and variable names in var are updated according to the encoding mode used. A variable name starting with ehrapycat_ indicates an encoded column (or part of one).
- Autodetect mode:
In this mode, every column that contains non-numerical values is encoded. Binary columns, meaning columns that contain only 1s and 0s (as integers or floats), are encoded as well.
- Available encodings are:
For 3D longitudinal layers of shape (n_obs, n_vars, n_time), the encoder is fit on values stacked across the time axis so the category space is consistent over time. The encoded result keeps the time axis, and obs stores the first-timepoint value of each encoded categorical column.
- Parameters:
edata (EHRData) – Central data object.
autodetect (bool | dict, default: False) – Whether to autodetect categorical values that will be encoded.
encodings (dict[str, list[str]] | str | None, default: 'one-hot') – Only needed if autodetect is set to False. Either a dict mapping each encoding mode to the list of column names it applies to, or a single encoding mode that is applied to all columns.
- Return type:
- Returns:
A data object with the encoded values in .X if layer is None, otherwise a data object with the encoded values in layer.
Examples
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> edata_encoded = ep.pp.encode(edata, autodetect=True, encodings="one-hot")
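As a rough illustration of the autodetect rule described above (non-numerical columns and binary 0/1 columns are selected), the following pure-Python sketch applies that rule to a small hypothetical table. It does not reproduce ehrapy's actual detection code: the helper names and the toy table are assumptions made for illustration, and missing values are not handled.

```python
from numbers import Number


def is_non_numerical_column(values):
    """Return True if at least one value is not a number (for example a string)."""
    return any(not isinstance(v, Number) for v in values)


def is_binary_column(values):
    """Return True if every value is a number equal to 0 or 1 (int or float)."""
    return all(isinstance(v, Number) and v in (0, 1) for v in values)


def columns_to_encode(table):
    """Select columns following the autodetect rule sketched in the text.

    `table` is a hypothetical dict mapping column name -> list of values.
    """
    return [
        name
        for name, values in table.items()
        if is_non_numerical_column(values) or is_binary_column(values)
    ]


# Toy data (illustrative only, not taken from any real dataset).
table = {
    "age": [63, 41, 77],             # numeric, not binary -> left as is
    "sex": ["F", "M", "F"],          # non-numerical -> selected
    "icu_flag": [1.0, 0.0, 1.0],     # binary floats -> selected
    "weight_kg": [70.5, 82.0, 64.3], # numeric, not binary -> left as is
}

selected = columns_to_encode(table)
print(selected)  # ['sex', 'icu_flag']
```

Note that under this rule a numeric column holding only 0s and 1s is treated as categorical even if it was meant as a count, so it can be worth checking which columns were picked up before relying on the result.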
>>> # Example using custom encodings per columns:
>>> import ehrdata as ed
>>> import ehrapy as ep
>>> edata = ed.dt.mimic_2()
>>> # encode col1 and col2 using label encoding and encode col3 using one hot encoding
>>> edata_encoded = ep.pp.encode(
...     edata, autodetect=False, encodings={"label": ["col1", "col2"], "one-hot": ["col3"]}
... )
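The behaviour described for 3D longitudinal layers (fitting on values stacked across the time axis, keeping the time axis in the result, and storing the first-timepoint value) can be sketched conceptually in plain Python. This is only an illustration under assumptions: it handles a single categorical variable as an (n_obs, n_time) nested list, uses a simple label encoding, and the function names are hypothetical rather than part of ehrapy.

```python
def fit_categories(column_over_time):
    """Collect the sorted categories seen for one variable at any timepoint.

    `column_over_time` is a nested list of shape (n_obs, n_time).
    """
    stacked = [value for per_obs in column_over_time for value in per_obs]
    return sorted(set(stacked))


def label_encode(column_over_time, categories):
    """Replace each category by its index, keeping the (n_obs, n_time) layout."""
    index = {category: i for i, category in enumerate(categories)}
    return [[index[value] for value in per_obs] for per_obs in column_over_time]


# One categorical variable for 3 observations at 2 timepoints (toy data).
# "high" appears only at the second timepoint.
severity = [
    ["low", "low"],
    ["mid", "high"],
    ["low", "mid"],
]

# Fit once on all timepoints so every timepoint shares one category space.
categories = fit_categories(severity)
encoded = label_encode(severity, categories)

# The unencoded first-timepoint value per observation, analogous to what
# the text says is kept in obs.
first_timepoint = [per_obs[0] for per_obs in severity]

print(categories)       # ['high', 'low', 'mid']
print(encoded)          # [[1, 1], [2, 0], [1, 2]]
print(first_timepoint)  # ['low', 'mid', 'low']
```

Had the categories been fit on the first timepoint alone, "high" would have had no code at the second timepoint, which is the inconsistency that fitting on the stacked values avoids.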