API Reference

datasets

IO module for DREXML.

drexml.datasets.fetch_file(key, env, version='latest')

Retrieve file from the environment.

Parameters:

key (str) – Key of the file to retrieve.
env (dict) – Environment.
version (str) – Version of the file to retrieve.

Returns:

Path to the file.

Return type:

pathlib.Path

Raises:

NotImplementedError – Not implemented yet.

drexml.datasets.get_data(disease, debug)

Load disease data and metadata.

Parameters:

disease (path-like) – Path to disease config file.
debug (bool) – Debug flag, by default False.
scale (bool, optional) – Scaling flag, by default False.

Returns:

pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
pandas.DataFrame – Circuit definition binary matrix.
pandas.DataFrame – KDT definition binary matrix.

drexml.datasets.get_disease_data(disease)

Get data for a disease.

Parameters:

disease (pathlib.Path) – Path to the disease configuration file.

Returns:

pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
pandas.DataFrame – Circuit definition binary matrix.
pandas.DataFrame – KDT definition binary matrix.

drexml.datasets.get_gda(disease_id, k_top=40)

Retrieve the list of genes associated with a disease according to the Disgenet curated list of gene-disease associations.

Parameters:

disease_id (str) – Disease ID.
k_top (int) – Retrieve at most k_top genes based on the GDA score.

Returns:

List of gene IDs.

Return type:

list
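
The top-k selection described above can be sketched as follows. This is a minimal illustration, not the drexml implementation: the record layout and the field names `disease_id`, `gene_id` and `score` are hypothetical stand-ins for the Disgenet table, and the helper name `top_genes` is made up for this example.

```python
def top_genes(records, disease_id, k_top=40):
    """Return at most k_top gene IDs for disease_id, highest score first."""
    # Keep only the associations for the requested disease.
    hits = [r for r in records if r["disease_id"] == disease_id]
    # list.sort is stable, so tied scores keep their input order.
    hits.sort(key=lambda r: r["score"], reverse=True)
    return [r["gene_id"] for r in hits[:k_top]]


# Hypothetical toy associations (not real Disgenet data).
records = [
    {"disease_id": "D1", "gene_id": "g1", "score": 0.2},
    {"disease_id": "D1", "gene_id": "g2", "score": 0.9},
    {"disease_id": "D2", "gene_id": "g3", "score": 0.7},
    {"disease_id": "D1", "gene_id": "g4", "score": 0.5},
]
print(top_genes(records, "D1", k_top=2))  # ['g2', 'g4']
```

Because the slice `hits[:k_top]` never raises, a disease with fewer than k_top associations simply yields a shorter list.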

drexml.datasets.get_index_name_options(key)

Returns a list of possible index names based on the input key.

Parameters:: key (str) – The key for the data frame.
Returns:: A list of possible index names based on the input key.
Return type:: list of str

Examples

>>> get_index_name_options("circuits")
['hipathia_id', 'hipathia', 'circuits_id', 'index']

Notes

This function returns a list of possible index names based on the input key. If the key is “circuits”, it returns a list of four possible index names. If the key is “genes”, it returns a list of three possible index names. Otherwise, it returns a list with only one element, “index”.
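
The branching described in the notes can be sketched as below. The "circuits" names come from the example above; the three "genes" names are hypothetical placeholders, since the notes only state how many there are. The function name `index_name_options` is likewise made up for this sketch.

```python
def index_name_options(key):
    """Return candidate index names for a data frame key."""
    if key == "circuits":
        # Taken from the documented example.
        return ["hipathia_id", "hipathia", "circuits_id", "index"]
    if key == "genes":
        # Hypothetical placeholders: the notes only say there are three names.
        return ["gene_id", "gene", "index"]
    # Any other key falls back to the generic index name.
    return ["index"]


print(index_name_options("circuits"))
print(index_name_options("anything_else"))  # ['index']
```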

drexml.datasets.load_atc()

Load the ATC table.

Returns:: ATC table.
Return type:: pd.DataFrame

drexml.datasets.load_df(path, key=None)

Load a dataframe from file. Currently supported formats: TSV, compressed TSV, and feather.

Parameters:: path (pathlib.Path) – Path to file.
Returns:: Dataframe.
Return type:: pandas.DataFrame
Raises:: NotImplementedError – Not implemented yet.

drexml.datasets.load_disgenet()

Download if necessary and load the Disgenet curated list of gene-disease associations.

Returns:: Disgenet curated dataset of gene-disease associations.
Return type:: pd.DataFrame

drexml.datasets.load_drugbank()

Download if necessary and load the drugbank table.

Returns:: Drugbank table.
Return type:: pd.DataFrame

drexml.datasets.load_physiological_circuits()

Load the list of physiological circuits.

Returns:: List of physiological circuit IDs.
Return type:: list

drexml.datasets.preprocess_activities(frame)

Preprocess an activities data frame.

Parameters:: frame (pandas.DataFrame) – The activities data frame to preprocess.
Returns:: The preprocessed activities data frame.
Return type:: pandas.DataFrame

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"-": [1, 2], "Activity 1": [3, 4]})
>>> preprocess_activities(df)
   .  Activity.1
0  1          3
1  2          4

Notes

This function replaces hyphens and spaces in the column names of the input data frame with periods and returns the resulting data frame.
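
The renaming rule in the notes can be illustrated on a plain list of column names. This is a sketch under the assumption that only hyphens and spaces are affected; `normalize_names` is a hypothetical helper, not part of drexml.

```python
def normalize_names(names):
    """Replace hyphens and spaces in each name with periods."""
    # One translation table handles both characters in a single pass.
    table = str.maketrans({"-": ".", " ": "."})
    return [name.translate(table) for name in names]


print(normalize_names(["-", "Activity 1"]))  # ['.', 'Activity.1']
```

With pandas, the same mapping could be applied to a frame's columns, for instance through `DataFrame.rename(columns=...)` with a callable.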

drexml.datasets.preprocess_frame(res, env, key)

Preprocess the input data frame.

Parameters:

res (pandas.DataFrame) – The input data frame.
env (dict) – The environment variables.
key (str) – The key for the data frame.

Returns:

The preprocessed data frame.

Return type:

pandas.DataFrame

drexml.datasets.preprocess_genes(frame, genes_column)

Preprocess a gene expression data frame.

Parameters:

frame (pandas.DataFrame) – The gene expression data frame to preprocess.
genes_column (str) – The name of the column containing gene information.

Returns:

The preprocessed gene expression data frame.

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"Gene": ["A", "B", "C"], "Value": [1, 2, 3]})
>>> preprocess_genes(df, "Gene")
  Gene  Value
0    A      1
1    B      2
2    C      3

Notes

This function selects rows from the input data frame based on the values in the specified genes column and returns the resulting data frame.

drexml.datasets.preprocess_gexp(frame)

Preprocess a gene expression data frame.

Parameters:: frame (pandas.DataFrame) – The gene expression data frame to preprocess.
Returns:: The preprocessed gene expression data frame.
Return type:: pandas.DataFrame

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"X1": [1, 2], "X2": [3, 4]})
>>> preprocess_gexp(df)
   1  2
0  1  3
1  2  4

Notes

This function removes the “X” prefix from the column names of the input data frame and returns the resulting data frame.
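
A minimal sketch of the prefix removal, working on a list of column names. It assumes that only a single leading "X" is stripped and that names without the prefix are left untouched; `strip_x_prefix` is a hypothetical helper name.

```python
def strip_x_prefix(names):
    """Drop one leading 'X' from each name, if present."""
    return [name[1:] if name.startswith("X") else name for name in names]


print(strip_x_prefix(["X1", "X2", "7157"]))  # ['1', '2', '7157']
```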

drexml.datasets.preprocess_map(frame, disease_seed_genes, circuits_column, use_physio, circuits_dict=None)

Preprocess a map data frame.

Parameters:

frame (pandas.DataFrame) – The map data frame to preprocess.
disease_seed_genes (str) – The comma separated list of disease seed genes.
circuits_column (str) – The name of the column containing circuit information.

Returns:

The list of circuits.

Return type:

list of str

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"in_disease": [True, False], "hipathia": ["A", "B"]})
>>> preprocess_map(df, "A,B", "in_disease")
['A', 'B']

Notes

This function replaces hyphens and spaces in the index names of the input data frame with periods and returns the resulting list of circuits.

models

Model definition.

drexml.models.extract_estimator(model)

Extract the final estimator from a sklearn pipeline.

Parameters:: model (sklearn Pipeline, Estimator, Optimizer) – Fitted model.
Returns:: The final estimator.
Return type:: sklearn Estimator

drexml.models.get_model(n_features, n_targets, n_jobs, debug, n_iters=0, use_imputer=False)

Create a model.

Parameters:

n_features (int) – Number of features (KDTs / gene input targets).
n_targets (int) – Number of targets (circuits).
n_jobs (int) – The number of jobs to run in parallel.
debug (bool) – Debug flag.
n_iters (int, optional) – Number of iterations for hyperparameter optimization, by default 0.
use_imputer (bool, optional) – Flag to fit an imputer, by default False.

Returns:

The model to be fitted.

Return type:

sklearn.ensemble.RandomForestRegressor

drexml.models.get_rf_space(): Retrieve the minimal hyperparameter space for a Random Forest whose number of base learners is used as an expandable resource during optimization.

pystab

Implementation of Nogueira’s stability measure. See: [1] S. Nogueira, K. Sechidis, and G. Brown, “On the Stability of Feature Selection Algorithms,” Journal of Machine Learning Research, vol. 18, no. 174, pp. 1–54, 2018.

class drexml.pystab.NogueiraTest(estimator, upper, lower, var, error, alpha)

alpha: Alias for field number 5

error: Alias for field number 4

estimator: Alias for field number 0

lower: Alias for field number 2

upper: Alias for field number 1

var: Alias for field number 3

drexml.pystab.fdr(p_vals)

False Discovery Rate p values adjustment.

Parameters:: p_vals (array like (n_runs, )) – The list of p values.
Returns:: FDR-adjusted p values.
Return type:: array (n_runs, )
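
The reference does not say which FDR procedure is used. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up adjustment, written in plain Python; `bh_adjust` is a hypothetical name.

```python
def bh_adjust(p_vals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_vals)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_vals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity and a cap of 1.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_vals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


adj = bh_adjust([0.01, 0.04, 0.03, 0.5])
print([round(p, 4) for p in adj])  # [0.04, 0.0533, 0.0533, 0.5]
```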

drexml.pystab.nogueria_test(pop_mat, alpha=0.05, as_dict=False)

Let X be a feature space of dimension n_features and pop_mat a binary matrix of dimension (n_samples, n_features) representing n_samples runs of a feature selection algorithm over X (with respect to a response). This function computes the Nogueira stability estimate, error, variance and confidence interval.

Parameters:

pop_mat (2d-array like) – A (n_samples, n_features) binary matrix. Each row is one run of the FS algorithm applied to an n_features space; a 1 in position (i, j) means that feature j was selected in the i-th run.
alpha (scalar) – Level of significance for the CI.

Returns:

A named tuple with the results of the stability test.

Return type:

NogueiraTest
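
For orientation, the point estimate from the cited paper [1] can be sketched in plain Python: one minus the average per-feature selection variance, divided by the variance expected when the same average number of features is picked at random. The sketch omits the error, variance and confidence interval, needs at least two runs, and is undefined when no feature or every feature is always selected. `nogueira_stability` is a hypothetical name, not the drexml function.

```python
def nogueira_stability(pop_mat):
    """Nogueira stability estimate for a binary (n_runs, n_features) matrix."""
    n_runs = len(pop_mat)
    n_features = len(pop_mat[0])
    # Selection frequency of each feature across runs.
    freqs = [sum(row[j] for row in pop_mat) / n_runs for j in range(n_features)]
    # Unbiased sample variance of each Bernoulli selection indicator.
    variances = [n_runs / (n_runs - 1) * p * (1 - p) for p in freqs]
    # Average number of selected features per run.
    k_bar = sum(sum(row) for row in pop_mat) / n_runs
    denom = (k_bar / n_features) * (1 - k_bar / n_features)
    return 1 - (sum(variances) / n_features) / denom


print(nogueira_stability([[1, 0, 1], [1, 0, 1]]))  # 1.0
print(nogueira_stability([[1, 0], [0, 1]]))        # -1.0
```

Identical selections across runs give the maximum value of 1, while two runs that select disjoint halves of a two-feature space give the minimum for two runs.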

explain

Explainability module for multi-task framework.

drexml.explain.build_stability_dict(z_mat, scores, alpha=0.05)

Adapt NogueiraTest to old version of drexml (use dicts).

Parameters:

z_mat (ndarray [n_model_samples, n_features]) – The stability matrix.
scores (ndarray [n_model_samples]) – The metric scores over the test sets.
alpha (float, optional) – Significance level for Nogueira’s test, by default 0.05

Returns:

Dictionary with the Nogueira test results and test metric scores.

Return type:

dict

drexml.explain.compute_corr_sign(x, y)

Compute the correlation sign.

Parameters:

x (ndarray [n_samples, n_features]) – The feature dataset.
y (ndarray [n_samples, n_tasks]) – The task dataset.

Returns:

SHAP feature-task (linear) interaction sign.

Return type:

ndarray [n_features, n_tasks]

drexml.explain.compute_shap_fs(relevances, model=None, X=None, Y=None, q='r2', by_circuit=False)

Compute the feature selection scores.

Parameters:

relevances (pandas.DataFrame [n_features, n_tasks]) – The relevance scores.
model (sklearn.base.BaseEstimator, optional) – The model to explain the data.
X (pandas.DataFrame [n_samples, n_features], optional) – The feature dataset to explain, by default None.
Y (pandas.DataFrame [n_samples, n_tasks], optional) – The task dataset to explain, by default None
q (float or str, optional) – Either a metric string to discriminate fs tasks or predefined quantile, by default “r2”
by_circuit (bool, optional) – Feature selection by circuit or globally, by default False

Returns:

The feature selection scores.

Return type:

pandas.Series [n_features]

drexml.explain.compute_shap_relevance(shap_values, X, Y)

Convert the SHAP values to relevance scores.

Parameters:

shap_values (ndarray [n_samples_new, n_features, n_tasks]) – The SHAP values.
X (pandas.DataFrame [n_samples, n_features]) – The feature dataset to explain.
Y (pandas.DataFrame [n_samples, n_tasks]) – The task dataset to explain.

Returns:

The task-wise feature relevance scores.

Return type:

pandas.DataFrame [n_features, n_tasks]

drexml.explain.compute_shap_values_(x, explainer, check_add, gpu_id=None)

Partial function to compute the SHAP values.

Parameters:

x (ndarray [n_samples, n_features]) – The feature dataset.
explainer (shap.TreeExplainer or shap.GPUTreeExplainer) – The SHAP explainer.
check_add (bool) – Check if the SHAP values add up to the model output.
gpu_id (int) – The GPU ID.

Returns:

shap_values – The SHAP values.

Return type:

ndarray [n_samples, n_features, n_tasks]

drexml.explain.get_quantile_by_circuit(model, X, Y, threshold=0.5)

Get the selection quantile of the model by circuit (or globally). Select features whose relevance score is above said quantile.

Parameters:

model (sklearn.base.BaseEstimator) – Fitted model.
X (pandas.DataFrame [n_samples, n_features]) – The feature dataset to explain.
Y (pandas.DataFrame [n_samples, n_tasks]) – The task dataset to explain.
threshold (float, optional) – Threshold to use to discriminate ill-conditioned circuits when performing feature selection, by default 0.5

Returns:

Quantile to use.

Return type:

float

drexml.explain.matcorr(features, targets)

Fast correlation matrix computation.

Parameters:

features (ndarray [n_samples, n_features]) – A matrix of observations.
targets (ndarray [n_samples, n_tasks]) – A matrix of predictions.

Returns:

The correlation matrix.

Return type:

ndarray
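
As a reference point, the quantity can be sketched in plain Python, assuming that the result holds the Pearson correlation between every feature column and every target column, with shape (n_features, n_tasks). The source does not state the shape or the estimator, and a vectorized implementation would be used in practice; `column_corr` is a hypothetical name.

```python
from math import sqrt


def column_corr(features, targets):
    """Pearson correlation of each feature column with each target column."""

    def centered_columns(mat):
        # Transpose the row-major matrix and subtract each column mean.
        n = len(mat)
        cols = []
        for col in zip(*mat):
            mean = sum(col) / n
            cols.append([v - mean for v in col])
        return cols

    f_cols = centered_columns(features)
    t_cols = centered_columns(targets)
    corr = []
    for f in f_cols:
        f_norm = sqrt(sum(v * v for v in f))
        row = []
        for t in t_cols:
            t_norm = sqrt(sum(v * v for v in t))
            # Constant columns have zero norm and raise ZeroDivisionError here.
            row.append(sum(a * b for a, b in zip(f, t)) / (f_norm * t_norm))
        corr.append(row)
    return corr


features = [[1, 2], [2, 1], [3, 0]]
targets = [[2], [4], [6]]
print([[round(v, 6) for v in row] for row in column_corr(features, targets)])
# [[1.0], [-1.0]]
```

Taking the sign of each entry gives a feature-task direction of the kind returned by compute_corr_sign.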

utils

Utilities module.

drexml.utils.build_circuits_fname(config)

Build circuits filename.

Parameters:: config (dict) – Config dict.
Returns:: Filename.
Return type:: str

drexml.utils.build_gene_exp_fname(config)

Build gene_exp filename.

Parameters:: config (dict) – Config dict.
Returns:: Filename.
Return type:: str

drexml.utils.build_genes_fname(config)

Build genes filename.

Parameters:: config (dict) – Config dict.
Returns:: Filename.
Return type:: str

drexml.utils.build_pathvals_fname(config)

Build pathvals filename.

Parameters:: config (dict) – Config dict.
Returns:: Filename.
Return type:: str

drexml.utils.check_cli_arg_is_bool(arg)

Check if argument is a boolean.

Parameters:: arg (str) – Argument.
Returns:: Argument.
Return type:: bool
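
A string-to-bool check of this kind could look like the sketch below. The accepted spellings and the ValueError on anything else are assumptions made for illustration; the source does not list them, and `parse_bool` is a hypothetical name.

```python
def parse_bool(arg):
    """Convert a command-line string to a bool, rejecting anything else."""
    lowered = str(arg).strip().lower()
    # Assumed spellings; the real function may accept a different set.
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise ValueError(f"Expected a boolean-like value, got {arg!r}")


print(parse_bool("True"), parse_bool(" no "))  # True False
```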

drexml.utils.check_gputree_availability(): Check if GPUTree has been correctly compiled.

drexml.utils.convert_names(dataset, keys, axis)

Convert names in the dataset.

Parameters:

dataset (pandas.DataFrame) – Dataset.
keys (list) – List of keys.
axis (list) – List of axes.

Returns:

Dataset.

Return type:

pandas.DataFrame

Raises:

NotImplementedError – If key is not supported.

Examples

>>> import pandas as pd
>>> dataset = pd.DataFrame({"circuits": ["C1", "C2"], "genes": [1, 2]})
>>> keys = ["circuits", "genes"]
>>> axis = [0, 1]
>>> convert_names(dataset, keys, axis)
   circuits  genes
0        C1      1
1        C2      2

drexml.utils.ensure_zenodo(name, record_id='6020480')

Ensure file availability and download it from zenodo

Parameters:

name (str) – file name
record_id (str) – deposition identifier

Returns:

path – PosixPath to downloaded file

Return type:

path-like

drexml.utils.get_cuda_lib(): Get CUDA library name.

drexml.utils.get_cuda_version(): Get CUDA version.

drexml.utils.get_latest_record(record_id)

Get latest zenodo record ID from a given deposition identifier

Parameters:: record_id (str) – deposition identifier
Returns:: latest record ID
Return type:: str

drexml.utils.get_number_cuda_devices(): Get number of CUDA devices.

drexml.utils.get_number_cuda_devices_(): Get number of CUDA devices.

drexml.utils.get_out_path(disease)

Construct the path where the model must be saved.

Returns:: The desired path.
Return type:: pathlib.Path

drexml.utils.get_resource_path(fname)

Get the path to the example disease env file.

Returns:: Path to file.
Return type:: pathlib.PosixPath

drexml.utils.get_stab(data_folder, n_splits, n_cpus, debug, n_iters)

Get stab data.

Parameters:

data_folder (path-like) – Path to data folder.
n_splits (int) – Number of splits.
n_cpus (int) – Number of CPUs.
debug (bool) – Debug flag, by default False.
n_iters (int) – Number of hyperparameter optimization iterations.

Returns:

drexml.models.Model – Model.
list – List of splits.
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).

drexml.utils.get_version(): Get drexml version.

drexml.utils.parse_stab(argv)

Parse stab arguments.

Parameters:: argv (list) – List of arguments.

Returns:

path-like – Path to data folder.
int – Number of hyperparameter optimizations.
int – Number of GPUs.
int – Number of CPUs.
int – Number of splits.
bool – Debug flag.

drexml.utils.read_activity_normalizer(config)

Read activity_normalizer from config file. It expects a boolean.

Parameters:: config (dict) – Parsed config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

drexml.utils.read_circuits_column(config)

Read circuits column.

Parameters:: config (dict) – Config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

drexml.utils.read_disease_config(disease)

Read disease config file.

Parameters:: disease (str) – Path to disease config file.
Returns:: Config dictionary.
Return type:: dict

drexml.utils.read_disease_id(config)

Read disease id from config file. It expects a UMLS disease id.

Parameters:: config (dict) – Parsed config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.
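
A format check for a UMLS concept identifier could be sketched as follows. UMLS concept unique identifiers consist of the letter C followed by seven digits; whether drexml validates the value this way is an assumption, as are the config key `disease_id` and the helper name `check_disease_id`.

```python
import re

# UMLS concept unique identifiers: "C" followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def check_disease_id(config):
    """Return a copy of config with a validated, stripped disease id."""
    # "disease_id" is an assumed key name for this sketch.
    disease_id = str(config.get("disease_id", "")).strip()
    if not CUI_PATTERN.fullmatch(disease_id):
        raise ValueError(f"Unsupported disease id format: {disease_id!r}")
    return {**config, "disease_id": disease_id}


print(check_disease_id({"disease_id": " C0015625 "}))
# {'disease_id': 'C0015625'}
```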

drexml.utils.read_path_based(config, key, data_path)

Read path based.

Parameters:

config (dict) – Config dict.
key (str) – Key in config dict.
data_path (path-like) – Storage path.

Returns:

Updated config dict.

Return type:

dict

Raises:

ValueError – Raise error if key is not present in config dict.
FileNotFoundError – Raise error if path does not exist.

drexml.utils.read_seed_genes(config)

Read seed genes from config file. It expects a comma-separated list of Entrez IDs.

Parameters:: config (dict) – Parsed config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.
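
Parsing such a comma-separated value could be sketched as below. The config key `seed_genes`, the choice to keep the IDs as strings, and the helper name `parse_seed_genes` are assumptions for illustration; the check relies only on Entrez gene IDs being positive integers.

```python
def parse_seed_genes(config):
    """Return a copy of config with seed genes parsed into a list of IDs."""
    # "seed_genes" is an assumed key name for this sketch.
    raw = str(config.get("seed_genes", ""))
    genes = [tok.strip() for tok in raw.split(",") if tok.strip()]
    # Entrez gene IDs are integers, so reject anything non-numeric.
    bad = [g for g in genes if not g.isdigit()]
    if bad:
        raise ValueError(f"Unsupported seed gene format: {bad}")
    return {**config, "seed_genes": genes}


print(parse_seed_genes({"seed_genes": "2175, 2176,2177"})["seed_genes"])
# ['2175', '2176', '2177']
```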

drexml.utils.read_use_physio(config)

Read use_physio from config file. It expects a boolean.

Parameters:: config (dict) – Parsed config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

drexml.utils.read_version_based(config, key, version_dict)

Read version based.

Parameters:

config (dict) – Config dict.
key (str) – Key in config dict.
version_dict (dict) – Version dict.

Returns:

Updated config dict.

Return type:

dict

Raises:

ValueError – Raise error if format is unsupported.

drexml.utils.update_circuits(config)

Update circuits key from config.

Parameters:: config (dict) – Config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

Notes

If circuits is not provided, it will be built from the other keys.

If circuits is provided, it will be checked if it is a path.

If circuits is a path, it will be checked if it is a zenodo resource.

drexml.utils.update_gene_exp(config)

Update gene_exp key from config.

Parameters:: config (dict) – Config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

Notes

If gene_exp is not provided, it will be built from the other keys.

If gene_exp is provided, it will be checked if it is a path.

If gene_exp is a path, it will be checked if it is a zenodo resource.

drexml.utils.update_genes(config)

Update genes key from config.

Parameters:: config (dict) – Config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

Notes

If genes is not provided, it will be built from the other keys.

If genes is provided, it will be checked if it is a path.

If genes is a path, it will be checked if it is a zenodo resource.

drexml.utils.update_pathvals(config)

Update pathvals key from config.

Parameters:: config (dict) – Config dict.
Returns:: Updated config dict.
Return type:: dict
Raises:: ValueError – Raise error if format is unsupported.

Notes

If pathvals is not provided, it will be built from the other keys.

If pathvals is provided, it will be checked if it is a path.

If pathvals is a path, it will be checked if it is a zenodo resource.

plotting

Plotting module for DREXML.

class drexml.plotting.RepurposingResult(sel_mat: pd.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>, score_mat: pd.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>, stab_mat: pd.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>)

Class for storing the results of the DREXML analysis.

filter_scores(remove_unstable=True)

Filter the scores to only the selected genes and stable circuits.

Parameters:: remove_unstable (bool, optional) – Remove unstable circuits, by default True
Returns:: scores_filt – Filtered scores.
Return type:: pandas.DataFrame

get_stable_circuits()

Get the stable circuits.

Returns:: stable_circuits – List of stable circuits.
Return type:: list

plot_gene_profile(gene: str, output_folder=None)

Plot the gene profile.

Parameters:

gene (str) – Gene name.
output_folder (str, optional) – Output folder, by default None

Return type:

None.

plot_metrics(width=3.3, output_folder=None)

Read the drexml results TSV file and plot it. The y-axis shows the confidence interval for the mean R^2, whereas the x-axis shows the 95% interval for Nogueira’s stability estimate.

Parameters:

width (float, optional) – Width of the plot.
output_folder (str, optional) – Path to the output folder. If None, the output folder is the same as the input folder.

Return type:

None.

plot_relevance_heatmap(remove_unstable=True, output_folder=None)

Plot the relevance heatmap of the scores.

Parameters:

remove_unstable (bool, optional) – Remove unstable circuits, by default True
output_folder (str, optional) – Output folder, by default None

Return type:

None.

score_mat: alias of DataFrame

sel_mat: alias of DataFrame

stab_mat: alias of DataFrame