API Reference
IO module for DREXML.
- drexml.datasets.fetch_file(key, env, version='latest')
Retrieve file from the environment.
- Parameters:
key (str) – Key of the file to retrieve.
env (dict) – Environment.
version (str) – Version of the file to retrieve.
- Returns:
Path to the file.
- Return type:
- Raises:
NotImplementedError – Not implemented yet.
- drexml.datasets.get_data(disease, debug)
Load disease data and metadata.
- Parameters:
disease (path-like) – Path to disease config file.
debug (bool) – Debug flag, by default False.
scale (bool, optional) – _description_, by default False.
- Returns:
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
pandas.DataFrame – Circuit definition binary matrix.
pandas.DataFrame – KDT definition binary matrix.
- drexml.datasets.get_disease_data(disease)
Get data for a disease.
- Parameters:
disease (pathlib.Path) – Path to the disease configuration file.
- Returns:
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
pandas.DataFrame – Circuit definition binary matrix.
pandas.DataFrame – KDT definition binary matrix.
- drexml.datasets.get_gda(disease_id, k_top=40)
Retrieve the list of genes associated with a disease according to the Disgenet curated list of gene-disease associations.
- Parameters:
disease_id (str) – Disease ID.
k_top (int) – Retrieve at most k_top genes based on the GDA score.
- Returns:
List of gene IDs.
- Return type:
- drexml.datasets.get_index_name_options(key)
Returns a list of possible index names based on the input key.
- Parameters:
key (str) – The key for the data frame.
- Returns:
A list of possible index names based on the input key.
- Return type:
list of str
>>> get_index_name_options("circuits")
["hipathia_id", "hipathia", "circuits_id", "index"]
This function returns a list of possible index names based on the input key. If the key is “circuits”, it returns a list of four possible index names. If the key is “genes”, it returns a list of three possible index names. Otherwise, it returns a list with only one element, “index”.
- drexml.datasets.load_atc()
Load the ATC table.
- Returns:
ATC table.
- Return type:
- drexml.datasets.load_df(path, key=None)
Load a data frame from file. Supported formats at the moment: tsv, compressed tsv or feather.
- Parameters:
path (pathlib.Path) – Path to file.
- Returns:
- Return type:
- Raises:
NotImplementedError – Not implemented yet.
- drexml.datasets.load_disgenet()
Download if necessary and load the Disgenet curated list of gene-disease associations.
- Returns:
Disgenet curated dataset of gene-disease associations.
- Return type:
- drexml.datasets.load_drugbank()
Download if necessary and load the drugbank table.
- Returns:
Drugbank table.
- Return type:
- drexml.datasets.load_physiological_circuits()
Load the list of physiological circuits.
- Returns:
List of physiological circuit IDs.
- Return type:
- drexml.datasets.preprocess_activities(frame)
Preprocess an activities data frame.
- Parameters:
frame (pandas.DataFrame) – The activities data frame to preprocess.
- Returns:
The preprocessed activities data frame.
- Return type:
>>> import pandas as pd
>>> df = pd.DataFrame({"-": [1, 2], "Activity 1": [3, 4]})
>>> preprocess_activities(df)
   .  Activity.1
0  1           3
1  2           4
This function replaces hyphens and spaces in the column names of the input data frame with periods and returns the resulting data frame.
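The renaming rule described above can be sketched on a plain list of column names. This is only an illustration of the documented rule (hyphens and spaces become periods); `normalize_activity_names` is a hypothetical helper, not part of drexml, and the real function works on a pandas.DataFrame.

```python
def normalize_activity_names(names):
    """Replace hyphens and spaces in each name with periods.

    Hypothetical helper mirroring the rule documented for
    preprocess_activities; it is not the drexml implementation.
    """
    return [str(name).replace("-", ".").replace(" ", ".") for name in names]


# Same column names as in the doctest above.
print(normalize_activity_names(["-", "Activity 1"]))  # ['.', 'Activity.1']
```

With a data frame at hand, the same mapping could be applied by assigning the result back, for example `frame.columns = normalize_activity_names(frame.columns)`.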
- drexml.datasets.preprocess_frame(res, env, key)
Preprocess the input data frame.
- Parameters:
res (pandas.DataFrame) – The input data frame.
env (dict) – The environment variables.
key (str) – The key for the data frame.
- Returns:
The preprocessed data frame.
- Return type:
- drexml.datasets.preprocess_genes(frame, genes_column)
Preprocess a gene expression data frame.
- Parameters:
frame (pandas.DataFrame) – The gene expression data frame to preprocess.
genes_column (str) – The name of the column containing gene information.
- Returns:
The preprocessed gene expression data frame.
- Return type:
>>> import pandas as pd
>>> df = pd.DataFrame({"Gene": ["A", "B", "C"], "Value": [1, 2, 3]})
>>> preprocess_genes(df, "Gene")
   Gene  Value
0     A      1
1     B      2
2     C      3
This function selects rows from the input data frame based on the values in the specified genes column and returns the resulting data frame.
- drexml.datasets.preprocess_gexp(frame)
Preprocess a gene expression data frame.
- Parameters:
frame (pandas.DataFrame) – The gene expression data frame to preprocess.
- Returns:
The preprocessed gene expression data frame.
- Return type:
>>> import pandas as pd
>>> df = pd.DataFrame({"X1": [1, 2], "X2": [3, 4]})
>>> preprocess_gexp(df)
   1  2
0  1  3
1  2  4
This function removes the “X” prefix from the column names of the input data frame and returns the resulting data frame.
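As a rough illustration of the prefix removal described above, the sketch below applies it to a plain list of column names. `strip_x_prefix` is a hypothetical name, and treating only a single leading "X" as the prefix is an assumption; the drexml implementation may differ in detail.

```python
def strip_x_prefix(names):
    """Drop one leading "X" from each name and leave other names untouched.

    Hypothetical helper illustrating the rule documented for
    preprocess_gexp; it is not the drexml implementation.
    """
    return [name[1:] if name.startswith("X") else name for name in names]


# Column names from the doctest above, plus one without the prefix.
print(strip_x_prefix(["X1", "X2", "gene"]))  # ['1', '2', 'gene']
```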
- drexml.datasets.preprocess_map(frame, disease_seed_genes, circuits_column, use_physio, circuits_dict=None)
Preprocess a map data frame.
- Parameters:
frame (pandas.DataFrame) – The map data frame to preprocess.
disease_seed_genes (str) – The comma separated list of disease seed genes.
circuits_column (str) – The name of the column containing circuit information.
- Returns:
The list of circuits.
- Return type:
list of str
>>> import pandas as pd
>>> df = pd.DataFrame({"in_disease": [True, False], "hipathia": ["A", "B"]})
>>> preprocess_map(df, "A,B", "in_disease")
['A', 'B']
This function replaces hyphens and spaces in the index names of the input data frame with periods and returns the resulting list of circuits.
Model definition.
- drexml.models.extract_estimator(model)
Extract the final estimator from a sklearn pipeline.
- Parameters:
model (sklearn Pipeline, Estimator, Optimizer) – Fitted model.
- Returns:
The final estimator.
- Return type:
sklearn Estimator
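One way to picture the unwrapping is the duck-typed sketch below. It assumes that "Optimizer" refers to a fitted hyperparameter search object exposing `best_estimator_`, and that pipelines expose `steps` as scikit-learn's Pipeline does. `final_estimator` and the two stand-in classes are hypothetical and only there so the example runs without scikit-learn installed.

```python
def final_estimator(model):
    """Unwrap search objects and pipelines until a plain estimator is left.

    Hypothetical sketch; not the drexml implementation.
    """
    # Fitted search objects (e.g. GridSearchCV) expose best_estimator_.
    if hasattr(model, "best_estimator_"):
        return final_estimator(model.best_estimator_)
    # Pipelines keep their stages as a list of (name, estimator) pairs.
    if hasattr(model, "steps"):
        return final_estimator(model.steps[-1][1])
    # Anything else is treated as the estimator itself.
    return model


# Stand-in classes so the sketch runs without scikit-learn installed.
class FakePipeline:
    def __init__(self, steps):
        self.steps = steps


class FakeSearch:
    def __init__(self, best_estimator):
        self.best_estimator_ = best_estimator


forest = object()
wrapped = FakeSearch(FakePipeline([("scale", object()), ("rf", forest)]))
print(final_estimator(wrapped) is forest)  # True
```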
- drexml.models.get_model(n_features, n_targets, n_jobs, debug, n_iters=0, use_imputer=False)
Create a model.
- Parameters:
n_features (int) – Number of features (KDTs / gene input targets).
n_targets (int) – Number of targets (circuits).
n_jobs (int) – The number of jobs to run in parallel.
debug (bool) – Debug flag.
n_iters (int, optional) – Number of iterations for hyperparameter optimization, by default 0.
use_imputer (bool, optional) – Flag to fit an imputer, by default False.
- Returns:
The model to be fitted.
- Return type:
- drexml.models.get_rf_space()
Retrieve a minimal hyperparameter space for a Random Forest whose number of base learners is going to be used as an expandable resource while optimizing.
Implementation of Nogueira’s stability measure. See: [1] S. Nogueira, K. Sechidis, and G. Brown, “On the Stability of Feature Selection Algorithms,” Journal of Machine Learning Research, vol. 18, no. 174, pp. 1–54, 2018.
- class drexml.pystab.NogueiraTest(estimator, upper, lower, var, error, alpha)
- alpha
Alias for field number 5
- error
Alias for field number 4
- estimator
Alias for field number 0
- lower
Alias for field number 2
- upper
Alias for field number 1
- var
Alias for field number 3
- drexml.pystab.fdr(p_vals)
False Discovery Rate p values adjustment.
- Parameters:
p_vals (array like (n_runs, )) – The list of p values.
- Returns:
FDR-adjusted p values.
- Return type:
array (n_runs, )
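The entry above does not state which adjustment is applied. The sketch below assumes the Benjamini-Hochberg step-up procedure, a common choice for FDR control; `fdr_bh` is a hypothetical name and the code is not taken from drexml.

```python
def fdr_bh(p_vals):
    """Benjamini-Hochberg adjusted p values, returned in the input order.

    Assumption: the documented fdr() uses this procedure; the source does
    not confirm it.
    """
    n = len(p_vals)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_vals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_vals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


print(fdr_bh([0.01, 0.04, 0.03, 0.005]))
```

For this input the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02, in the original order.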
- drexml.pystab.nogueria_test(pop_mat, alpha=0.05, as_dict=False)
Let X be a feature space of dimension n_features and pop_mat a binary matrix of dimension (n_samples, n_features) representing n_samples runs of a feature selection algorithm over X (with respect to a response). This function computes the Nogueira stability estimate, error, variance and confidence interval.
- Parameters:
pop_mat (2d-array like) – A (n_samples, n_features) binary matrix, each row is a sample of the FS algorithm applied on a n_features space, where a 1 in position (i,j) means that the feature j has been selected for the i-th run.
alpha (scalar) – Level of significance for the CI.
- Returns:
A named tuple with the results of the stability test.
- Return type:
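For orientation, the point estimate from the paper cited above [1] can be sketched in a few lines of plain Python. The sketch covers only the stability estimate, not the variance, error or confidence interval; `nogueira_stability` is a hypothetical name and this is not the drexml code.

```python
def nogueira_stability(pop_mat):
    """Point estimate of Nogueira's stability for a binary selection matrix.

    pop_mat is a list of rows, one per run, with 1 where a feature was
    selected. Hypothetical sketch following the formula in [1].
    """
    n_runs = len(pop_mat)
    n_features = len(pop_mat[0])
    if n_runs < 2:
        raise ValueError("at least two runs are needed")
    # Selection frequency of each feature across runs.
    freqs = [sum(row[j] for row in pop_mat) / n_runs for j in range(n_features)]
    # Unbiased sample variance of each feature's selection indicator.
    variances = [n_runs / (n_runs - 1) * p * (1 - p) for p in freqs]
    # Average number of selected features per run.
    k_bar = sum(sum(row) for row in pop_mat) / n_runs
    denom = (k_bar / n_features) * (1 - k_bar / n_features)
    if denom == 0:
        raise ValueError("undefined when no feature or every feature is always selected")
    return 1 - (sum(variances) / n_features) / denom


# Identical selections in every run are perfectly stable.
print(nogueira_stability([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]))  # 1.0
```

Two runs that select disjoint features out of two, `[[1, 0], [0, 1]]`, give the lowest possible value for two runs, -1.0.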
Explainability module for multi-task framework.
- drexml.explain.build_stability_dict(z_mat, scores, alpha=0.05)
Adapt NogueiraTest to old version of drexml (use dicts).
- Parameters:
z_mat (ndarray [n_model_samples, n_features]) – The stability matrix.
scores (ndarray [n_model_samples]) – The metric scores over the test sets.
alpha (float, optional) – Significance level for Nogueira’s test, by default 0.05
- Returns:
Dictionary with the Nogueira test results and test metric scores.
- Return type:
- drexml.explain.compute_corr_sign(x, y)
Compute the correlation sign.
- Parameters:
x (ndarray [n_samples, n_features]) – The feature dataset.
y (ndarray [n_samples, n_tasks]) – The task dataset.
- Returns:
SHAP feature-task (linear) interaction sign.
- Return type:
ndarray [n_features, n_tasks]
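A minimal sketch of the idea, assuming the sign is that of the Pearson correlation between each feature column and each task column. `corr_sign` is a hypothetical name, and nested lists stand in for the ndarrays so the example needs no third-party packages.

```python
def corr_sign(x, y):
    """Sign of the Pearson correlation between each feature and each task.

    x is [n_samples][n_features], y is [n_samples][n_tasks]; the result is
    [n_features][n_tasks] with entries in {-1, 0, 1}. Hypothetical sketch,
    not the drexml implementation.
    """
    n_samples = len(x)
    n_features, n_tasks = len(x[0]), len(y[0])
    x_means = [sum(row[j] for row in x) / n_samples for j in range(n_features)]
    y_means = [sum(row[k] for row in y) / n_samples for k in range(n_tasks)]
    signs = []
    for j in range(n_features):
        row_signs = []
        for k in range(n_tasks):
            # The correlation has the same sign as the covariance, so the
            # normalisation by the standard deviations can be skipped.
            cov = sum(
                (x[i][j] - x_means[j]) * (y[i][k] - y_means[k])
                for i in range(n_samples)
            )
            # A constant column yields a zero covariance and hence sign 0.
            row_signs.append((cov > 0) - (cov < 0))
        signs.append(row_signs)
    return signs


x = [[1], [2], [3]]
y = [[2, 3], [4, 2], [6, 1]]
print(corr_sign(x, y))  # [[1, -1]]
```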
- drexml.explain.compute_shap_fs(relevances, model=None, X=None, Y=None, q='r2', by_circuit=False)
Compute the feature selection scores.
- Parameters:
relevances (pandas.DataFrame [n_features, n_tasks]) – The relevance scores.
model (sklearn.base.BaseEstimator, optional) – The model to explain the data.
X (pandas.DataFrame [n_samples, n_features], optional) – The feature dataset to explain, by default None.
Y (pandas.DataFrame [n_samples, n_tasks], optional) – The task dataset to explain, by default None
q (float or str, optional) – Either a metric string to discriminate fs tasks or predefined quantile, by default “r2”
by_circuit (bool, optional) – Feature selection by circuit or globally, by default False
- Returns:
The feature selection scores.
- Return type:
pandas.Series [n_features]
- drexml.explain.compute_shap_relevance(shap_values, X, Y)
Convert the SHAP values to relevance scores.
- Parameters:
shap_values (ndarray [n_samples_new, n_features, n_tasks]) – The SHAP values.
X (pandas.DataFrame [n_samples, n_features]) – The feature dataset to explain.
Y (pandas.DataFrame [n_samples, n_tasks]) – The task dataset to explain.
- Returns:
The task-wise feature relevance scores.
- Return type:
pandas.DataFrame [n_features, n_tasks]
- drexml.explain.compute_shap_values_(x, explainer, check_add, gpu_id=None)
Partial function to compute the SHAP values.
- Parameters:
x (ndarray [n_samples, n_features]) – The feature dataset.
explainer (shap.TreeExplainer or shap.GPUTreeExplainer) – The SHAP explainer.
check_add (bool) – Check if the SHAP values add up to the model output.
gpu_id (int) – The GPU ID.
- Returns:
shap_values – The SHAP values.
- Return type:
ndarray [n_samples, n_features, n_tasks]
- drexml.explain.get_quantile_by_circuit(model, X, Y, threshold=0.5)
Get the selection quantile of the model by circuit (or globally). Select features whose relevance score is above said quantile.
- Parameters:
model (sklearn.base.BaseEstimator) – Fitted model.
X (pandas.DataFrame [n_samples, n_features]) – The feature dataset to explain.
Y (pandas.DataFrame [n_samples, n_tasks]) – The task dataset to explain.
threshold (float, optional) – Threshold to use to discriminate ill-conditioned circuits when performing feature selection, by default 0.5
- Returns:
Quantile to use.
- Return type:
- drexml.explain.matcorr(features, targets)
Fast correlation matrix computation.
- Parameters:
features (ndarray [n_samples, n_features]) – A matrix of observations.
targets (ndarray [n_samples, n_tasks]) – A matrix of predictions.
- Returns:
The correlation matrix.
- Return type:
Utilities module.
- drexml.utils.build_circuits_fname(config)
Build circuits filename.
- Parameters:
config (dict) – Config dict.
- Returns:
- Return type:
- drexml.utils.build_gene_exp_fname(config)
Build gene_exp filename.
- Parameters:
config (dict) – Config dict.
- Returns:
- Return type:
- drexml.utils.build_genes_fname(config)
Build genes filename.
- Parameters:
config (dict) – Config dict.
- Returns:
- Return type:
- drexml.utils.build_pathvals_fname(config)
Build pathvals filename.
- Parameters:
config (dict) – Config dict.
- Returns:
- Return type:
- drexml.utils.check_cli_arg_is_bool(arg)
Check if argument is a boolean.
- Parameters:
arg (str) – Argument.
- Returns:
- Return type:
- drexml.utils.check_gputree_availability()
Check if GPUTree has been correctly compiled.
- drexml.utils.convert_names(dataset, keys, axis)
Convert names in the dataset.
- Parameters:
dataset (pandas.DataFrame) – Dataset.
keys (list) – List of keys.
axis (list) – List of axis.
- Returns:
- Return type:
- Raises:
NotImplementedError – If key is not supported.
>>> dataset = pd.DataFrame({"circuits": ["C1", "C2"], "genes": [1, 2]})
>>> keys = ["circuits", "genes"]
>>> axis = [0, 1]
>>> convert_names(dataset, keys, axis)
   circuits  genes
0        C1      1
1        C2      2
- drexml.utils.ensure_zenodo(name, record_id='6020480')
Ensure the file is available, downloading it from Zenodo if needed.
- Parameters:
name (str) – file name
record_id (str) – deposition identifier
- Returns:
path – PosixPath to downloaded file
- Return type:
- drexml.utils.get_cuda_lib()
Get CUDA library name.
- drexml.utils.get_cuda_version()
Get CUDA version.
- drexml.utils.get_latest_record(record_id)
Get latest zenodo record ID from a given deposition identifier
- Parameters:
record_id (str) – deposition identifier
- Returns:
latest record ID
- Return type:
- drexml.utils.get_number_cuda_devices()
Get number of CUDA devices.
- drexml.utils.get_number_cuda_devices_()
Get number of CUDA devices.
- drexml.utils.get_out_path(disease)
Construct the path where the model must be saved.
- Returns:
The desired path.
- Return type:
- drexml.utils.get_resource_path(fname)
Get path to example disease env path.
- Returns:
Path to file.
- Return type:
pathlib.PosixPath
- drexml.utils.get_stab(data_folder, n_splits, n_cpus, debug, n_iters)
Get stab data.
- Parameters:
data_folder (path-like) – Path to data folder.
n_splits (int) – Number of splits.
n_cpus (int) – Number of CPUs.
debug (bool) – Debug flag, by default False.
n_iters (int) – Number of hyperparameter optimization iterations.
- Returns:
drexml.models.Model – Model.
list – List of splits.
pandas.DataFrame – Gene expression data.
pandas.DataFrame – Circuit activation data (hipathia).
- drexml.utils.get_version()
Get drexml version.
- drexml.utils.parse_stab(argv)
Parse stab arguments.
- Parameters:
argv (list) – List of arguments.
- Returns:
path-like – Path to data folder.
int – Number of hyperparameter optimizations.
int – Number of GPUs.
int – Number of CPUs.
int – Number of splits.
bool – Debug flag.
- drexml.utils.read_activity_normalizer(config)
Read activity_normalizer from config file. It expects a boolean.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_circuits_column(config)
Read circuits column.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_disease_config(disease)
Read disease config file.
- Parameters:
disease (str) – Path to disease config file.
- Returns:
Config dictionary.
- Return type:
- drexml.utils.read_disease_id(config)
Read disease id from config file. It expects a disease id using the UMLS.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_path_based(config, key, data_path)
Read path based.
- Parameters:
config (dict) – Config dict.
key (str) – Key in config dict.
data_path (path-like) – Storage path.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if key is not present in config dict.
FileNotFoundError – Raise error if path does not exist.
- drexml.utils.read_seed_genes(config)
Read seed genes from config file. It expects a comma-separated list of Entrez IDs.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
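The expected input format can be illustrated with a small parser. This is a sketch under stated assumptions: the config key name `seed_genes`, the helper name `parse_seed_genes`, and the choice to store the IDs as a list of strings are all hypothetical and not confirmed by the entry above.

```python
def parse_seed_genes(config, key="seed_genes"):
    """Return a copy of config with the seed genes parsed into a list of strings.

    Hypothetical sketch; the key name and the output type are assumptions.
    """
    raw = config.get(key)
    if not isinstance(raw, str):
        raise ValueError(f"{key} must be a comma-separated string, got {raw!r}")
    genes = [token.strip() for token in raw.split(",") if token.strip()]
    # Entrez gene IDs are positive integers, so reject anything else.
    if not genes or not all(gene.isdigit() for gene in genes):
        raise ValueError(f"{key} is not a comma-separated list of Entrez IDs: {raw!r}")
    updated = dict(config)
    updated[key] = genes
    return updated


print(parse_seed_genes({"seed_genes": "2175, 2176,2177"}))
# {'seed_genes': ['2175', '2176', '2177']}
```

Returning a copy keeps the caller's dictionary unchanged, which makes the "updated config dict" contract easier to reason about.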
- drexml.utils.read_use_physio(config)
Read use_physio from config file. It expects a boolean.
- Parameters:
config (dict) – Parsed config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.read_version_based(config, key, version_dict)
Read version based.
- Parameters:
config (dict) – Config dict.
key (str) – Key in config dict.
version_dict (dict) – Version dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
- drexml.utils.update_circuits(config)
Update circuits key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
If circuits is not provided, it will be built from the other keys.
If circuits is provided, it will be checked if it is a path.
If circuits is a path, it will be checked if it is a zenodo resource.
- drexml.utils.update_gene_exp(config)
Update gene_exp key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
If gene_exp is not provided, it will be built from the other keys.
If gene_exp is provided, it will be checked if it is a path.
If gene_exp is a path, it will be checked if it is a zenodo resource.
- drexml.utils.update_genes(config)
Update genes key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
If genes is not provided, it will be built from the other keys.
If genes is provided, it will be checked if it is a path.
If genes is a path, it will be checked if it is a zenodo resource.
- drexml.utils.update_pathvals(config)
Update pathvals key from config.
- Parameters:
config (dict) – Config dict.
- Returns:
Updated config dict.
- Return type:
- Raises:
ValueError – Raise error if format is unsupported.
If pathvals is not provided, it will be built from the other keys.
If pathvals is provided, it will be checked if it is a path.
If pathvals is a path, it will be checked if it is a zenodo resource.
Plotting module for DREXML.
- class drexml.plotting.RepurposingResult(sel_mat: pandas.core.frame.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>, score_mat: pandas.core.frame.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>, stab_mat: pandas.core.frame.DataFrame | pathlib.Path | str = <class 'pandas.core.frame.DataFrame'>)
Class for storing the results of the DREXML analysis.
- filter_scores(remove_unstable=True)
Filter the scores to only the selected genes and stable circuits.
- Parameters:
remove_unstable (bool, optional) – Remove unstable circuits, by default True
- Returns:
scores_filt – Filtered scores.
- Return type:
- get_stable_circuits()
Get the stable circuits.
- Returns:
stable_circuits – List of stable circuits.
- Return type:
- plot_gene_profile(gene: str, output_folder=None)
Plot the gene profile.
- Parameters:
gene (str) – Gene name.
output_folder (str, optional) – Output folder, by default None
- Return type:
- plot_metrics(width=3.3, output_folder=None)
Read the drexml results TSV file and plot it. The confidence interval for the mean R^2 goes on the y-axis, whereas the x-axis shows the 95% interval for Nogueira’s stability estimate.
- Parameters:
width (float, optional) – Width of the plot.
output_folder (str, optional) – Path to the output folder. If None, the output folder is the same as the input folder.
- Return type:
- plot_relevance_heatmap(remove_unstable=True, output_folder=None)
Plot the relevance heatmap of the scores.
- Parameters:
remove_unstable (bool, optional) – Remove unstable circuits, by default True
output_folder (str, optional) – Output folder, by default None
- Return type:
- score_mat
alias of pandas.core.frame.DataFrame
- sel_mat
alias of pandas.core.frame.DataFrame
- stab_mat
alias of pandas.core.frame.DataFrame