API Reference
datasets
IO module for DREXML.
- drexml.datasets.fetch_file(key, env, version='latest')
Retrieve file from the environment.
- Parameters:
key (str) – Key of the file to retrieve.
env (dict) – Environment.
version (str) – Version of the file to retrieve.
- Returns:
Path to the file.
- Return type:
pathlib.Path
- Raises:
NotImplementedError – Not implemented yet.
- drexml.datasets.get_data(disease, debug)
Load disease data and metadata.
- Parameters:
disease (path-like) – Path to disease config file.
debug (bool) – Debug flag, by default False.
scale (bool, optional) – Scale flag, by default False.
- Returns:
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
pandas.DataFrame – Circuit definition binary matrix.
pandas.DataFrame – KDT definition binary matrix.
- drexml.datasets.get_disease_data(disease)
Get data for a disease.
- Parameters:
disease (pathlib.Path) – Path to the disease configuration file.
- Returns:
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
pandas.DataFrame – Circuit definition binary matrix.
pandas.DataFrame – KDT definition binary matrix.
- drexml.datasets.get_gda(disease_id, k_top=40)
Retrieve the list of genes associated with a disease according to the Disgenet curated list of gene-disease associations.
- Parameters:
disease_id (str) – Disease ID.
k_top (int) – Retrieve at most k_top genes based on the GDA score.
- Returns:
List of gene IDs.
- Return type:
list
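The k_top cut-off can be pictured as a sort-and-truncate over scored associations. The sketch below is not the library's implementation: the helper name top_k_genes and the (gene ID, score) pairs are hypothetical and only illustrate the selection rule described above.

```python
def top_k_genes(associations, k_top=40):
    """Return at most k_top gene IDs, highest score first.

    `associations` is a hypothetical list of (gene_id, score) pairs.
    """
    ranked = sorted(associations, key=lambda pair: pair[1], reverse=True)
    return [gene_id for gene_id, _ in ranked[:k_top]]


# Made-up gene IDs and scores, for illustration only.
example = [("g1", 0.2), ("g2", 0.9), ("g3", 0.5)]
print(top_k_genes(example, k_top=2))  # ['g2', 'g3']
```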
- drexml.datasets.get_index_name_options(key)
Returns a list of possible index names based on the input key.
- Parameters:
key (str) – The key for the data frame.
- Returns:
A list of possible index names based on the input key.
- Return type:
list of str
Examples
>>> get_index_name_options("circuits")
["hipathia_id", "hipathia", "circuits_id", "index"]
Notes
This function returns a list of possible index names based on the input key. If the key is “circuits”, it returns a list of four possible index names. If the key is “genes”, it returns a list of three possible index names. Otherwise, it returns a list with only one element, “index”.
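One plausible use of such a candidate list is to pick the first name that actually occurs among a table's columns. The sketch below assumes that usage; pick_index_column is a hypothetical helper, not part of drexml, and the column names in the call are invented.

```python
def pick_index_column(columns, options):
    """Return the first candidate in `options` that is present in `columns`."""
    for name in options:
        if name in columns:
            return name
    return None


# Candidate names copied from the "circuits" example above.
circuit_options = ["hipathia_id", "hipathia", "circuits_id", "index"]
print(pick_index_column(["symbol", "hipathia", "in_disease"], circuit_options))
```

Ordering the candidates from most to least specific lets the generic "index" name act as a fallback.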
- drexml.datasets.load_atc()
Load the ATC table.
- Returns:
ATC table.
- Return type:
pd.DataFrame
- drexml.datasets.load_df(path, key=None)
Load a data frame from file. Supported formats at the moment: TSV, compressed TSV, or Feather.
- Parameters:
path (pathlib.Path) – Path to file.
- Returns:
Dataframe.
- Return type:
pandas.DataFrame
- Raises:
NotImplementedError – Not implemented yet.
- drexml.datasets.load_disgenet()
Download if necessary and load the Disgenet curated list of gene-disease associations.
- Returns:
Disgenet curated dataset of gene-disease associations.
- Return type:
pd.DataFrame
- drexml.datasets.load_drugbank()
Download if necessary and load the drugbank table.
- Returns:
Drugbank table.
- Return type:
pd.DataFrame
- drexml.datasets.load_physiological_circuits()
Load the list of physiological circuits.
- Returns:
List of physiological circuit IDs.
- Return type:
list
- drexml.datasets.preprocess_activities(frame)
Preprocess an activities data frame.
- Parameters:
frame (pandas.DataFrame) – The activities data frame to preprocess.
- Returns:
The preprocessed activities data frame.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"-": [1, 2], "Activity 1": [3, 4]})
>>> preprocess_activities(df)
   .  Activity.1
0  1           3
1  2           4
Notes
This function replaces hyphens and spaces in the column names of the input data frame with periods and returns the resulting data frame.
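The renaming rule in the note can be sketched on plain column names. This is a minimal illustration, assuming each hyphen or space maps to exactly one period; normalize_names is a hypothetical helper rather than the library's code.

```python
import re


def normalize_names(names):
    """Replace every hyphen or space in each name with a period."""
    return [re.sub(r"[- ]", ".", str(name)) for name in names]


# Same column names as in the example above.
print(normalize_names(["-", "Activity 1"]))  # ['.', 'Activity.1']
```

With a pandas data frame, the same helper could be applied as frame.columns = normalize_names(frame.columns).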
- drexml.datasets.preprocess_frame(res, env, key)
Preprocess the input data frame.
- Parameters:
res (pandas.DataFrame) – The input data frame.
env (dict) – The environment variables.
key (str) – The key for the data frame.
- Returns:
The preprocessed data frame.
- Return type:
pandas.DataFrame
- drexml.datasets.preprocess_genes(frame, genes_column)
Preprocess a gene expression data frame.
- Parameters:
frame (pandas.DataFrame) – The gene expression data frame to preprocess.
genes_column (str) – The name of the column containing gene information.
- Returns:
The preprocessed gene expression data frame.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"Gene": ["A", "B", "C"], "Value": [1, 2, 3]})
>>> preprocess_genes(df, "Gene")
  Gene  Value
0    A      1
1    B      2
2    C      3
Notes
This function selects rows from the input data frame based on the values in the specified genes column and returns the resulting data frame.
- drexml.datasets.preprocess_gexp(frame)
Preprocess a gene expression data frame.
- Parameters:
frame (pandas.DataFrame) – The gene expression data frame to preprocess.
- Returns:
The preprocessed gene expression data frame.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"X1": [1, 2], "X2": [3, 4]})
>>> preprocess_gexp(df)
   1  2
0  1  3
1  2  4
Notes
This function removes the “X” prefix from the column names of the input data frame and returns the resulting data frame.
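The prefix removal can be sketched on a list of column names. The helper strip_x_prefix is hypothetical, and it assumes that only a single leading "X" is dropped; how the library treats names that legitimately start with "X" is not stated here.

```python
def strip_x_prefix(names):
    """Drop a single leading "X" from each name, if present."""
    return [name[1:] if name.startswith("X") else name for name in names]


# Same column names as in the example above.
print(strip_x_prefix(["X1", "X2"]))  # ['1', '2']
```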
- drexml.datasets.preprocess_map(frame, disease_seed_genes, circuits_column, use_physio, circuits_dict=None)
Preprocess a map data frame.
- Parameters:
frame (pandas.DataFrame) – The map data frame to preprocess.
disease_seed_genes (str) – The comma separated list of disease seed genes.
circuits_column (str) – The name of the column containing circuit information.
- Returns:
The list of circuits.
- Return type:
list of str
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"in_disease": [True, False], "hipathia": ["A", "B"]})
>>> preprocess_map(df, "A,B", "in_disease")
['A', 'B']
Notes
This function replaces hyphens and spaces in the index names of the input data frame with periods and returns the resulting list of circuits.
models
Model definition.
- drexml.models.extract_estimator(model)
Extract the final estimator from a sklearn pipeline.
- Parameters:
model (sklearn Pipeline, Estimator, Optimizer) – Fitted model.
- Returns:
The final estimator.
- Return type:
sklearn Estimator
- drexml.models.get_model(n_features, n_targets, n_jobs, debug, n_iters=0, use_imputer=False)
Create a model.
- Parameters:
n_features (int) – Number of features (KDTs / gene input targets).
n_targets (int) – Number of targets (circuits).
n_jobs (int) – The number of jobs to run in parallel.
debug (bool) – Debug flag.
n_iters (int, optional) – Number of iterations for hyperparameter optimization, by default 0.
use_imputer (bool, optional) – Flag to fit an imputer, by default False.
- Returns:
The model to be fitted.
- Return type:
sklearn.ensemble.RandomForestRegressor
- drexml.models.get_rf_space()
Retrieve a minimal hyperparameter space for a Random Forest whose number of base learners is used as an expandable resource while optimizing.
pystab
Implementation of Nogueira’s stability measure. See: [1] S. Nogueira, K. Sechidis, and G. Brown, “On the Stability of Feature Selection Algorithms,” Journal of Machine Learning Research, vol. 18, no. 174, pp. 1–54, 2018.
- class drexml.pystab.NogueiraTest(estimator, upper, lower, var, error, alpha)
- alpha
Alias for field number 5
- error
Alias for field number 4
- estimator
Alias for field number 0
- lower
Alias for field number 2
- upper
Alias for field number 1
- var
Alias for field number 3
- drexml.pystab.fdr(p_vals)
False Discovery Rate p values adjustment.
- Parameters:
p_vals (array like (n_runs, )) – The list of p values.
- Returns:
FDR-adjusted p values.
- Return type:
array (n_runs, )
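The entry does not say which FDR procedure is applied. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up adjustment, a common choice; bh_adjust is a hypothetical name and the p values are made up.

```python
def bh_adjust(p_vals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_vals)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_vals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_vals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p values, for illustration only.
print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```

Walking from the largest p value downwards keeps the adjusted values monotone in the original ranking and caps them at 1.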
- drexml.pystab.nogueria_test(pop_mat, alpha=0.05, as_dict=False)
Let X be a feature space of dimension n_features and pop_mat a binary matrix of dimension (n_samples, n_features) representing n_samples runs of a feature selection algorithm over X (with respect to a response). This function computes the Nogueira stability estimate, error, variance and confidence interval.
- Parameters:
pop_mat (2d-array like) – A (n_samples, n_features) binary matrix, each row is a sample of the FS algorithm applied on an n_features space, where a 1 in position (i,j) means that the feature j has been selected for the i-th run.
alpha (scalar) – Level of significance for the CI.
- Returns:
A named tuple with the results of the stability test.
- Return type:
NogueiraTest
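The point estimate from the cited paper can be sketched in a few lines. The code below covers only the stability estimate, not the variance, error or confidence interval, and it is not the library's implementation; nogueira_stability is a hypothetical helper that assumes at least two runs and an average selection size strictly between 0 and n_features.

```python
def nogueira_stability(pop_mat):
    """Nogueira et al. (2018) stability estimate for a binary selection matrix."""
    n_runs = len(pop_mat)
    n_features = len(pop_mat[0])
    # Selection frequency of each feature across runs.
    freqs = [sum(row[j] for row in pop_mat) / n_runs for j in range(n_features)]
    # Unbiased sample variance of each feature's selection indicator.
    variances = [n_runs / (n_runs - 1) * p * (1 - p) for p in freqs]
    mean_variance = sum(variances) / n_features
    # Average number of features selected per run.
    k_bar = sum(sum(row) for row in pop_mat) / n_runs
    denominator = (k_bar / n_features) * (1 - k_bar / n_features)
    return 1 - mean_variance / denominator


# Three identical runs: the selection is perfectly stable.
print(nogueira_stability([[1, 0, 1], [1, 0, 1], [1, 0, 1]]))  # 1.0
```

An estimate of 1 means every run selected the same features; lower values indicate selections that vary between runs.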
explain
Explainability module for multi-task framework.
- drexml.explain.build_stability_dict(z_mat, scores, alpha=0.05)
Adapt NogueiraTest to old version of drexml (use dicts).
- Parameters:
z_mat (ndarray [n_model_samples, n_features]) – The stability matrix.
scores (ndarray [n_model_samples]) – The metric scores over the test sets.
alpha (float, optional) – Significance level for Nogueira’s test, by default 0.05
- Returns:
Dictionary with the Nogueira test results and test metric scores.
- Return type:
dict
- drexml.explain.compute_corr_sign(x, y)
Compute the correlation sign.
- Parameters:
x (ndarray [n_samples, n_features]) – The feature dataset.
y (ndarray [n_samples, n_tasks]) – The task dataset.
- Returns:
SHAP feature-task (linear) interaction sign.
- Return type:
ndarray [n_features, n_tasks]
- drexml.explain.compute_shap_fs(relevances, model=None, X=None, Y=None, q='r2', by_circuit=False)
Compute the feature selection scores.
- Parameters:
relevances (pandas.DataFrame [n_features, n_tasks]) – The relevance scores.
model (sklearn.base.BaseEstimator, optional) – The model to explain the data.
X (pandas.DataFrame [n_samples, n_features], optional) – The feature dataset to explain, by default None.
Y (pandas.DataFrame [n_samples, n_tasks], optional) – The task dataset to explain, by default None
q (float or str, optional) – Either a metric string to discriminate fs tasks or predefined quantile, by default “r2”
by_circuit (bool, optional) – Feature selection by circuit or globally, by default False
- Returns:
The feature selection scores.
- Return type:
pandas.Series [n_features]
- drexml.explain.compute_shap_relevance(shap_values, X, Y)
Convert the SHAP values to relevance scores.
- Parameters:
shap_values (ndarray [n_samples_new, n_features, n_tasks]) – The SHAP values.
X (pandas.DataFrame [n_samples, n_features]) – The feature dataset to explain.
Y (pandas.DataFrame [n_samples, n_tasks]) – The task dataset to explain.
- Returns:
The task-wise feature relevance scores.
- Return type:
pandas.DataFrame [n_features, n_tasks]
- drexml.explain.compute_shap_values_(x, explainer, check_add, gpu_id=None)
Partial function to compute the SHAP values.
- Parameters:
x (ndarray [n_samples, n_features]) – The feature dataset.
explainer (shap.TreeExplainer or shap.GPUTreeExplainer) – The SHAP explainer.
check_add (bool) – Check if the SHAP values add up to the model output.
gpu_id (int) – The GPU ID.
- Returns:
shap_values – The SHAP values.
- Return type:
ndarray [n_samples, n_features, n_tasks]
- drexml.explain.get_quantile_by_circuit(model, X, Y, threshold=0.5)
Get the selection quantile of the model by circuit (or globally). Select features whose relevance score is above said quantile.
- Parameters:
model (sklearn.base.BaseEstimator) – Fitted model.
X (pandas.DataFrame [n_samples, n_features]) – The feature dataset to explain.
Y (pandas.DataFrame [n_samples, n_tasks]) – The task dataset to explain.
threshold (float, optional) – Threshold to use to discriminate ill-conditioned circuits when performing feature selection, by default 0.5
- Returns:
Quantile to use.
- Return type:
float
- drexml.explain.matcorr(features, targets)
Fast correlation matrix computation.
- Parameters:
features (ndarray [n_samples, n_features]) – A matrix of observations.
targets (ndarray [n_samples, n_tasks]) – A matrix of predictions.
- Returns:
The correlation matrix.
- Return type:
ndarray
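As a reference for what such a matrix contains, the sketch below computes the Pearson correlation of every feature column with every target column using plain Python. It is a slow, explicit version for illustration, not the vectorised routine documented above; the function names and the [n_features, n_tasks] output layout are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cross_correlation(features, targets):
    """Correlate each feature column with each target column (rows are samples)."""
    feature_cols = list(zip(*features))
    target_cols = list(zip(*targets))
    return [[pearson(f, t) for t in target_cols] for f in feature_cols]


# Three samples, two features, one task (made-up numbers).
X = [[1, 2], [2, 1], [3, 0]]
Y = [[2], [4], [6]]
print(cross_correlation(X, Y))
```

In this toy input the first feature rises with the target and the second falls, so the two entries have opposite signs.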
utils
Utilities module.
- drexml.utils.build_circuits_fname(config)
Build circuits filename.
- Parameters:
config (dict) – Config dict.
- Returns:
Filename.
- Return type:
str
- drexml.utils.build_gene_exp_fname(config)
Build gene_exp filename.
- Parameters:
config (dict) – Config dict.
- Returns:
Filename.
- Return type:
str
- drexml.utils.build_genes_fname(config)
Build genes filename.
- Parameters:
config (dict) – Config dict.
- Returns:
Filename.
- Return type:
str
- drexml.utils.build_pathvals_fname(config)
Build pathvals filename.
- Parameters:
config (dict) – Config dict.
- Returns:
Filename.
- Return type:
str
- drexml.utils.check_cli_arg_is_bool(arg)
Check if argument is a boolean.
- Parameters:
arg (str) – Argument.
- Returns:
Argument.
- Return type:
bool
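A string-to-boolean check of this kind might look like the sketch below. The accepted spellings and the choice of ValueError are assumptions made for illustration; parse_bool_arg is a hypothetical helper, and the library's own rules may differ.

```python
def parse_bool_arg(arg):
    """Convert a CLI string to bool, rejecting anything unrecognised."""
    truthy = {"true", "1", "yes"}  # assumed spellings
    falsy = {"false", "0", "no"}  # assumed spellings
    value = str(arg).strip().lower()
    if value in truthy:
        return True
    if value in falsy:
        return False
    raise ValueError(f"Cannot interpret {arg!r} as a boolean")


print(parse_bool_arg("True"))  # True
```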
- drexml.utils.check_gputree_availability()
Check if GPUTree has been correctly compiled.
- drexml.utils.convert_names(dataset, keys, axis)
Convert names in the dataset.
- Parameters:
dataset (pandas.DataFrame) – Dataset.
keys (list) – List of keys.
axis (list) – List of axis.
- Returns:
Dataset.
- Return type:
pandas.DataFrame
- Raises:
NotImplementedError – If key is not supported.
Examples
>>> dataset = pd.DataFrame({"circuits": ["C1", "C2"], "genes": [1, 2]})
>>> keys = ["circuits", "genes"]
>>> axis = [0, 1]
>>> convert_names(dataset, keys, axis)
  circuits  genes
0       C1      1
1       C2      2
- drexml.utils.ensure_zenodo(name, record_id='6020480')
Ensure file availability and download it from Zenodo.
- Parameters:
name (str) – file name
record_id (str) – deposition identifier
- Returns:
path – PosixPath to downloaded file
- Return type:
path-like
- drexml.utils.get_cuda_lib()
Get CUDA library name.
- drexml.utils.get_cuda_version()
Get CUDA version.
- drexml.utils.get_latest_record(record_id)
Get the latest Zenodo record ID from a given deposition identifier.
- Parameters:
record_id (str) – deposition identifier
- Returns:
latest record ID
- Return type:
str
- drexml.utils.get_number_cuda_devices()
Get number of CUDA devices.
- drexml.utils.get_number_cuda_devices_()
Get number of CUDA devices.
- drexml.utils.get_out_path(disease)
Construct the path where the model must be saved.
- Returns:
The desired path.
- Return type:
pathlib.Path
- drexml.utils.get_resource_path(fname)
Get path to example disease env path.
- Returns:
Path to file.
- Return type:
pathlib.PosixPath
- drexml.utils.get_stab(data_folder, n_splits, n_cpus, debug, n_iters)
Get stab data.
- Parameters:
data_folder (path-like) – Path to data folder.
n_splits (int) – Number of splits.
n_cpus (int) – Number of CPUs.
debug (bool) – Debug flag, by default False.
n_iters (int) – Number of hyperparameter optimization iterations.
- Returns:
drexml.models.Model – Model.
list – List of splits.
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
- drexml.utils.get_version()
Get drexml version.
- drexml.utils.parse_stab(argv)
Parse stab arguments.
- Parameters:
argv (list) – List of arguments.
- Returns:
path-like – Path to data folder.
int – Number of hyperparameter optimizations.
int – Number of GPUs.
int – Number of CPUs.
int – Number of splits.
bool – Debug flag.
- drexml.utils.read_activity_normalizer(config)
Read activity_normalizer from config file. It expects a boolean.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_circuits_column(config)
Read circuits column.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_disease_config(disease)
Read disease config file.
- Parameters:
disease (str) – Path to disease config file.
- Returns:
Config dictionary.
- Return type:
dict
- drexml.utils.read_disease_id(config)
Read disease id from config file. It expects a disease id using the UMLS.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
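UMLS concept identifiers (CUIs) consist of the letter C followed by seven digits, so a format check could be sketched as below. Whether drexml validates the ID this way is an assumption; is_umls_cui is a hypothetical helper and the sample values are only there to exercise the pattern.

```python
import re

# A UMLS CUI is "C" followed by exactly seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def is_umls_cui(value):
    """Return True when `value` looks like a UMLS concept identifier."""
    return bool(CUI_PATTERN.fullmatch(str(value).strip()))


print(is_umls_cui("C0015625"))  # True
print(is_umls_cui("15625"))     # False
```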
- drexml.utils.read_path_based(config, key, data_path)
Read path based.
- Parameters:
config (dict) – Config dict.
key (str) – Key in config dict.
data_path (path-like) – Storage path.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if key is not present in config dict.
FileNotFoundError – Raise error if path does not exist.
- drexml.utils.read_seed_genes(config)
Read seed genes from config file. It expects a comma-separated list of Entrez IDs.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
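Parsing a comma-separated list of Entrez IDs can be sketched as follows. The helper parse_seed_genes is hypothetical, and the rule that every entry must be purely numeric is an assumption based on Entrez gene IDs being integers; the IDs in the call are arbitrary.

```python
def parse_seed_genes(raw):
    """Split a comma-separated string into a list of Entrez ID strings."""
    genes = [item.strip() for item in str(raw).split(",") if item.strip()]
    # Reject anything that is not a plain integer identifier.
    invalid = [gene for gene in genes if not gene.isdigit()]
    if invalid:
        raise ValueError(f"Unsupported seed gene format: {invalid}")
    return genes


print(parse_seed_genes("2175, 2176,2177"))  # ['2175', '2176', '2177']
```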
- drexml.utils.read_use_physio(config)
Read use_physio from config file. It expects a boolean.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_version_based(config, key, version_dict)
Read version based.
- Parameters:
config (dict) – Config dict.
key (str) – Key in config dict.
version_dict (dict) – Version dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.update_circuits(config)
Update circuits key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
Notes
If circuits is not provided, it will be built from the other keys.
If circuits is provided, it will be checked if it is a path.
If circuits is a path, it will be checked if it is a zenodo resource.
- drexml.utils.update_gene_exp(config)
Update gene_exp key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
Notes
If gene_exp is not provided, it will be built from the other keys.
If gene_exp is provided, it will be checked if it is a path.
If gene_exp is a path, it will be checked if it is a zenodo resource.
- drexml.utils.update_genes(config)
Update genes key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
Notes
If genes is not provided, it will be built from the other keys.
If genes is provided, it will be checked if it is a path.
If genes is a path, it will be checked if it is a zenodo resource.
- drexml.utils.update_pathvals(config)
Update pathvals key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
dict
- Raises:
ValueError – Raise error if format is unsupported.
Notes
If pathvals is not provided, it will be built from the other keys.
If pathvals is provided, it will be checked if it is a path.
If pathvals is a path, it will be checked if it is a zenodo resource.
plotting
Plotting module for DREXML.
- class drexml.plotting.RepurposingResult(sel_mat: pandas.core.frame.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>, score_mat: pandas.core.frame.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>, stab_mat: pandas.core.frame.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>)
Class for storing the results of the DREXML analysis.
- filter_scores(remove_unstable=True)
Filter the scores to only the selected genes and stable circuits.
- Parameters:
remove_unstable (bool, optional) – Remove unstable circuits, by default True
- Returns:
scores_filt – Filtered scores.
- Return type:
pandas.DataFrame
- get_stable_circuits()
Get the stable circuits.
- Returns:
stable_circuits – List of stable circuits.
- Return type:
list
- plot_gene_profile(gene: str, output_folder=None)
Plot the gene profile.
- Parameters:
gene (str) – Gene name.
output_folder (str, optional) – Output folder, by default None
- Return type:
None.
- plot_metrics(width=3.3, output_folder=None)
Read the drexml results TSV file and plot it. The confidence interval for the mean R^2 goes on the y-axis, whereas the x-axis shows the 95% interval for Nogueira’s stability estimate.
- Parameters:
width (float, optional) – Width of the plot.
output_folder (str, optional) – Path to the output folder. If None, the output folder is the same as the input folder.
- Return type:
None.
- plot_relevance_heatmap(remove_unstable=True, output_folder=None)
Plot the relevance heatmap of the scores.
- Parameters:
remove_unstable (bool, optional) – Remove unstable circuits, by default True
output_folder (str, optional) – Output folder, by default None
- Return type:
None.
- score_mat
alias of
DataFrame
- sel_mat
alias of
DataFrame
- stab_mat
alias of
DataFrame