ehrapy.preprocessing.encode(adata, autodetect=False, encodings='one-hot')

Encode categoricals of an AnnData object.

Categorical values are either passed via parameters or autodetected on the fly. They are also stored in obs and uns to preserve the original, unencoded values. The encoding mode used for each variable is stored in uns under the var_to_encoding key. Variable names in var are updated according to the encoding modes used: a variable name starting with ehrapycat_ indicates an encoded column (or part of one).

Autodetect mode:

This mode is a convenience for data with many columns that need to be encoded. Every column that contains non-numerical values is encoded. Every binary column is encoded as well, that is, every column containing only 1s and 0s (either integers or floats). Missing values do not influence the result.
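
The rule described above can be illustrated with a minimal, standard-library sketch. This is not ehrapy's implementation: the helper names is_missing and would_be_autodetected are hypothetical, and treating None and NaN as the only missing values is an assumption of this example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def would_be_autodetected(values):
    """Return True if a column matches the autodetect rule described above."""
    # Missing values are dropped first, so they cannot influence the result.
    present = [v for v in values if not is_missing(v)]
    # Rule 1: any non-numerical value marks the column for encoding.
    if any(not isinstance(v, (int, float)) for v in present):
        return True
    # Rule 2: binary columns (only 0s and 1s, ints or floats) are encoded too.
    return len(present) > 0 and set(present) <= {0, 1}


print(would_be_autodetected(["M", "F", None]))         # non-numerical column
print(would_be_autodetected([0, 1.0, float("nan")]))   # binary column
print(would_be_autodetected([63.5, 71.2, 58.0]))       # plain numerical column
```

Under these assumptions, the first two columns would be picked up and the third would be left untouched.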

Available encodings are:
  1. one-hot

  2. label

  3. count

  4. hash
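
As a rough illustration of what the first three modes do to a single categorical column, here is a standard-library sketch. It shows the general techniques only and makes no claim about how ehrapy implements them or orders the resulting columns; the example values are made up, and hash encoding is left out.

```python
from collections import Counter

# A made-up categorical column
values = ["ICU", "ward", "ICU", "ER"]
categories = sorted(set(values))  # alphabetical order is a choice of this sketch

# one-hot: one 0/1 indicator column per category
one_hot = [[int(v == c) for c in categories] for v in values]

# label: each category is replaced by an integer code
codes = {c: i for i, c in enumerate(categories)}
label = [codes[v] for v in values]

# count: each category is replaced by how often it occurs in the column
freq = Counter(values)
count = [freq[v] for v in values]

print(one_hot)  # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(label)    # [1, 2, 1, 0]
print(count)    # [2, 1, 2, 1]
```

One-hot encoding widens the data by one column per category, while label and count encoding keep a single column per variable.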

Parameters:

  • adata (AnnData) – An AnnData object.

  • autodetect (bool | dict) – Whether to autodetect categorical values that will be encoded.

  • encodings (dict[str, dict[str, list[str]]] | dict[str, list[str]] | str | None) – Only needed if autodetect is set to False. Either a dict that maps each encoding mode to the names of the columns it applies to, or a single encoding mode that is applied to all columns.

Return type:

AnnData

Returns:

An AnnData object with the encoded values in X.


>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2()
>>> adata_encoded = ep.pp.encode(adata, autodetect=True, encodings="one-hot")
>>> # Example using custom encodings per columns:
>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2()
>>> # encode col1 and col2 using label encoding and encode col3 using one hot encoding
>>> adata_encoded = ep.pp.encode(adata,
...                              autodetect=False,
...                              encodings={'label': ['col1', 'col2'], 'one-hot': ['col3']})
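
The per-column dict in the last example is keyed by encoding mode, which can be easier to reason about once inverted into a column-to-mode lookup. The sketch below does that inversion with a hypothetical helper, column_to_mode, which is not part of ehrapy. It covers only the str and flat dict[str, list[str]] forms of encodings, not the nested dict form.

```python
def column_to_mode(encodings, columns=()):
    """Map each column name to its encoding mode (hypothetical helper)."""
    if isinstance(encodings, str):
        # A single mode given as a string applies to every listed column.
        return {col: encodings for col in columns}
    mapping = {}
    for mode, cols in encodings.items():
        for col in cols:
            # A column listed under two modes is ambiguous, so fail early.
            if col in mapping:
                raise ValueError(f"{col!r} is assigned to more than one encoding")
            mapping[col] = mode
    return mapping


print(column_to_mode({"label": ["col1", "col2"], "one-hot": ["col3"]}))
# {'col1': 'label', 'col2': 'label', 'col3': 'one-hot'}
```

Checking the inverted mapping before calling encode is one way to catch a column that was accidentally listed under two modes.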