metaloci.misc package

Miscellaneous functions for METALoci.

Submodules

metaloci.misc.misc module

Miscellaneous functions for METALoci.

metaloci.misc.misc.bedparser(gene_file_f: Path, name: str, extend: int, resolution: int, strand: bool = False) → tuple[dict, dict, dict, str]

Parses a bed file and returns the information of the genes, excluding artifacts.

Parameters:
  • gene_file_f (Path) – Path to the file that contains the genes.

  • name (str) – Name of the project.

  • extend (int) – Extent of the region to be analyzed.

  • resolution (int) – Resolution at which to split the genome.

  • strand (bool, optional) – Whether the bed file contains strand information, by default False.

Returns:

  • id_chrom (dict) – Dictionary linking the gene id to the chromosome.

  • id_tss (dict) – Dictionary linking the gene id to the transcription start site (TSS).

  • id_name (dict) – Dictionary linking the gene id to the name.

  • filename (str) – Name of the end file.
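
As an illustration of the three returned dictionaries, here is a minimal sketch of a BED reader. It is not METALoci's bedparser: the function name `parse_bed_sketch`, the use of the name column as the gene id, and the rule that a minus-strand gene takes its end coordinate as TSS are all assumptions made for this example. Artifact filtering, `extend`, `resolution` and the output file are left out.

```python
import io


def parse_bed_sketch(handle, strand=False):
    """Hypothetical BED reader (not METALoci's bedparser)."""
    id_chrom, id_tss, id_name = {}, {}, {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        # Assumption: with strand information (6th column), a gene on the
        # minus strand has its TSS at the end coordinate.
        if strand and len(fields) > 5 and fields[5] == "-":
            tss = end
        else:
            tss = start
        gene_id = name  # assumption: the name column doubles as the gene id
        id_chrom[gene_id] = chrom
        id_tss[gene_id] = tss
        id_name[gene_id] = name
    return id_chrom, id_tss, id_name


bed_text = "chr1\t100\t500\tGENE_A\t0\t+\nchr2\t2000\t2600\tGENE_B\t0\t-\n"
id_chrom, id_tss, id_name = parse_bed_sketch(io.StringIO(bed_text), strand=True)
```

Each dictionary is keyed by the same gene id, so the chromosome, TSS and name of a gene can be looked up independently.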

metaloci.misc.misc.binsearcher(id_tss: dict, id_chrom: dict, id_name: dict, bin_genome: DataFrame) → DataFrame

Finds the index of the bin in which each gene is located.

Parameters:
  • id_tss (dict) – Dictionary linking the gene id to the TSS.

  • id_chrom (dict) – Dictionary linking the gene id to the chromosome.

  • id_name (dict) – Dictionary linking the gene id to the name.

  • bin_genome (pd.DataFrame) – DataFrame containing the bins of the genome.

Returns:

data – DataFrame containing the information of the genes and the bin index.

Return type:

pd.DataFrame
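
The lookup itself can be pictured with a small standalone sketch. It does not use METALoci's data structures: `find_bin_index` and the sorted list `bin_starts` are hypothetical stand-ins for one chromosome of the binned genome, and the bins are assumed to be contiguous and sorted by start coordinate.

```python
from bisect import bisect_right


def find_bin_index(bin_starts, tss):
    """Return the index of the last bin whose start is <= tss.

    Assumes bin_starts is sorted and the bins are contiguous.
    """
    return bisect_right(bin_starts, tss) - 1


resolution = 10_000
# Hypothetical bin start coordinates for a single 100 kb chromosome.
bin_starts = list(range(0, 100_000, resolution))

idx = find_bin_index(bin_starts, 34_567)
```

With fixed-width bins that start at 0, the same index is given by integer division, `tss // resolution`; the binary search also covers bins of unequal width.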

metaloci.misc.misc.check_names(hic_file: Path, data: Path, coords: Path, resolution: int = None) → list

Checks if the chromosome names in the signal, cool/mcool/hic and chromosome sizes files are the same.

Parameters:
  • hic_file (Path) – Path to the cooler file.

  • data (Path) – Path to the signal file.

  • coords (Path) – Path to the chromosome sizes file.

  • resolution (int, optional) – Resolution to choose on the mcool file.

Returns:

chrom_list – List of chromosomes in the cooler file.

Return type:

list
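
A consistency check of this kind boils down to comparing sets of names. The sketch below is only an illustration with plain lists; `compare_chrom_names` is a hypothetical helper, and reading the names out of the cooler, signal and chromosome sizes files is not shown.

```python
def compare_chrom_names(hic_chroms, signal_chroms, sizes_chroms):
    """Report names that differ from the Hi-C file's chromosome names.

    Hypothetical helper: each value lists the names present in only one
    of the two compared sources (symmetric difference).
    """
    reference = set(hic_chroms)
    return {
        "signal": sorted(set(signal_chroms) ^ reference),
        "sizes": sorted(set(sizes_chroms) ^ reference),
    }


mismatches = compare_chrom_names(
    hic_chroms=["chr1", "chr2"],
    signal_chroms=["1", "2"],        # missing the "chr" prefix
    sizes_chroms=["chr1", "chr2"],
)
```

An empty list means the two sources agree; here the signal names would be flagged because of the missing "chr" prefix.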

metaloci.misc.misc.clean_matrix(mlobject: MetalociObject) → MetalociObject

Cleans a given Hi-C matrix. It checks whether the matrix has too many zeroes on the diagonal, removes values that are zero on the diagonal but not in the rest of the matrix, adds pseudocounts to zeroes depending on the minimum value, scales all values depending on the minimum value, and computes the log10 of all values.

Parameters:

mlobject (mlo.MetalociObject) – METALoci object with a matrix in it.

Returns:

mlobject – METALoci object with the clean matrix assigned.

Return type:

mlo.MetalociObject
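
The last three steps (pseudocounts, scaling, log10) can be sketched on a plain nested list. This is not the library's implementation: the documentation only says the pseudocount and the scaling depend on the minimum value, so the choices below (pseudocount equal to the smallest non-zero value, scaling by dividing by that same value) are assumptions, and the diagonal checks are omitted.

```python
import math


def clean_matrix_sketch(matrix):
    """Hypothetical cleaning of a small contact matrix (list of lists)."""
    # Assumption: the "min value" is the smallest non-zero entry.
    min_value = min(v for row in matrix for v in row if v > 0)
    cleaned = []
    for row in matrix:
        new_row = []
        for value in row:
            # Assumption: zeroes receive the minimum value as pseudocount.
            if value <= 0:
                value = min_value
            # Assumption: scale so the minimum becomes 1, then take log10.
            new_row.append(math.log10(value / min_value))
        cleaned.append(new_row)
    return cleaned


matrix = [[10.0, 0.0], [0.0, 1000.0]]
cleaned = clean_matrix_sketch(matrix)
```

Under these assumptions every former zero maps to 0 after the logarithm, and no entry is negative.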

metaloci.misc.misc.create_version_log(subprogram: str, work_dir)

Creates a log of the METALoci version used in each subprogram.

Parameters:
  • subprogram (str) – Subprogram to log the version.

  • work_dir (str) – Path to the working directory.

metaloci.misc.misc.get_poi_data(line: Series, args: Series)

Extracts data from the METALoci objects and parses it into a table.

Parameters:
  • line (pd.Series) – Row of the DataFrame.

  • args (pd.Series) – Arguments from the command line.

metaloci.misc.misc.gtfparser(gene_file: Path, name: str, extend: int, resolution: int) → tuple[dict, dict, dict, str]

Parses a GTF file and returns the information of the genes, excluding artifacts.

Parameters:
  • gene_file (Path) – Path to the file that contains the genes.

  • name (str) – Name of the project.

  • extend (int) – Extent of the region to be analyzed.

  • resolution (int) – Resolution at which to split the genome.

Returns:

  • id_chrom (dict) – Dictionary linking the gene id to the chromosome.

  • id_tss (dict) – Dictionary linking the gene id to the TSS.

  • id_name (dict) – Dictionary linking the gene id to the name.

  • filename (str) – Name of the end file.

metaloci.misc.misc.has_exactly_one_line(file_path: str) → bool

Checks whether a file has exactly one line.

Parameters:

file_path (str) – Path to the file to check.

Returns:

True if the file has exactly one line, False otherwise.

Return type:

bool
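
One way such a check could be written without reading the whole file is to ask for at most two lines. The sketch below is an assumption about the approach, not the library's code; `has_exactly_one_line_sketch` and `_write_temp` are hypothetical names used only here.

```python
import os
import tempfile


def has_exactly_one_line_sketch(file_path):
    """Return True if the file holds one line and nothing after it."""
    with open(file_path) as handle:
        if handle.readline() == "":      # empty file: no first line
            return False
        return handle.readline() == ""   # True only if no second line


def _write_temp(text):
    """Write text to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    handle.write(text)
    handle.close()
    return handle.name


paths = [
    _write_temp("chr1\t100\t200\n"),                    # one line
    _write_temp("chr1\t100\t200\nchr2\t300\t400\n"),    # two lines
    _write_temp(""),                                    # empty
]
results = [has_exactly_one_line_sketch(path) for path in paths]
for path in paths:
    os.remove(path)
```

Reading at most two lines keeps the check cheap even for very large files.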

metaloci.misc.misc.signal_binnarize(data: DataFrame, sum_type: str) → DataFrame

Parses the signal data frame with the appropriate summarising method.

Parameters:
  • data (pd.DataFrame) – Signal DataFrame.

  • sum_type (str) – Method used to binnarize the signal.

Returns:

data – Binnarized signal data frame.

Return type:

pd.DataFrame

metaloci.misc.misc.signal_normalization(region_signal: DataFrame, pseudocounts: float = None, norm: str = None) → ndarray

Normalize signal values.

Parameters:
  • region_signal (pd.DataFrame) – Subset of the signal for a given region and for a signal type.

  • pseudocounts (float, optional) – Pseudocounts to add if the signal is 0, by default corresponds to the median of the signal for the region.

  • norm (str, optional) – Type of normalization to use. Values can be “max” (divide each value by the maximum value in the signal), “sum” (divide each value by the sum of all values), or “01” ((value - min(signal)) / (max(signal) - min(signal))), by default “01”.

Returns:

signal – Array of normalized signal values for a region and a signal type.

Return type:

np.ndarray
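
The three normalization modes described for `norm` can be written out on a plain list of numbers. This is a sketch of the formulas only: `normalize_sketch` is a hypothetical name, it works on a list rather than a DataFrame, and pseudocount handling is not included.

```python
def normalize_sketch(values, norm="01"):
    """Apply one of the three documented normalization formulas."""
    if norm == "max":
        top = max(values)
        return [v / top for v in values]
    if norm == "sum":
        total = sum(values)
        return [v / total for v in values]
    if norm == "01":
        low, high = min(values), max(values)
        # Min-max scaling: (value - min) / (max - min)
        return [(v - low) / (high - low) for v in values]
    raise ValueError(f"unknown normalization: {norm!r}")


signal = [2.0, 4.0, 6.0, 8.0]
by_max = normalize_sketch(signal, "max")
by_sum = normalize_sketch(signal, "sum")
by_01 = normalize_sketch(signal, "01")
```

For this input, “max” puts the largest value at 1, “sum” makes the values add up to 1, and “01” maps the smallest value to 0 and the largest to 1. Note that “01” is undefined for a constant signal, since max(signal) - min(signal) is then 0.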

metaloci.misc.misc.ucscparser(gene_file: Path, name: str, extend: int, resolution: int) → tuple[dict, dict, dict, str]

Parses a UCSC file and returns the information of the genes, excluding artifacts.

Parameters:
  • gene_file (Path) – Path to the file that contains the genes.

  • name (str) – Name of the project.

  • extend (int) – Extent of the region to be analyzed.

  • resolution (int) – Resolution at which to split the genome.

Returns:

  • id_chrom (dict) – Dictionary linking the gene id to the chromosome.

  • id_tss (dict) – Dictionary linking the gene id to the TSS.

  • id_name (dict) – Dictionary linking the gene id to the name.

  • filename (str) – Name of the end file.

metaloci.misc.misc.write_bad_region(mlobject, work_dir)

Writes the bad regions, after quality checking, to a file. Thread-safe and multi-process-safe using file locking.

Parameters:
  • mlobject (mlo.MetalociObject) – METALoci object.

  • work_dir (str) – Path to the working directory.
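
For readers unfamiliar with file locking, the general pattern looks like the sketch below. It is an illustration under assumptions, not the library's code: it uses the Unix-only `fcntl.flock` advisory lock from the standard library, and the helper name `append_line_locked`, the file name and the region strings are made up for the example.

```python
import fcntl
import os
import tempfile


def append_line_locked(path, line):
    """Append one line while holding an exclusive advisory lock (Unix only)."""
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            handle.write(line + "\n")
            handle.flush()                   # push the data out before unlocking
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


# Hypothetical file name and region identifiers.
log_path = os.path.join(tempfile.mkdtemp(), "bad_regions.txt")
append_line_locked(log_path, "chr1:100000-200000")
append_line_locked(log_path, "chr2:500000-600000")

with open(log_path) as handle:
    written = handle.read().splitlines()
```

Because every writer takes the same lock before appending, lines from concurrent workers cannot interleave, provided all writers cooperate by using the lock.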

metaloci.misc.misc.write_bed(mlobject: MetalociObject, signal_type: str, neighbourhood: int, BFACT: float, args=None, silent: bool = False) → None

Writes the bed file with the locations of the metalocis.

Parameters:
  • mlobject (mlo.MetalociObject) – METALoci object.

  • signal_type (str) – Signal type to be used.

  • neighbourhood (int) – Neighbourhood to be used.

  • BFACT (float) – BFACT value to be used.

  • args (argparse.Namespace) – Arguments from the command line.

  • silent (bool, optional) – Variable that controls the verbosity of the function (useful for multiprocessing), by default False.

metaloci.misc.misc.write_moran_data(mlobject: MetalociObject, args, scan: bool = False, silent: bool = False) → None

Writes the Moran data to a file.

Parameters:
  • mlobject (mlo.MetalociObject) – METALoci object with the Moran data.

  • args (argparse.Namespace) – Arguments from the command line.

  • scan (bool, optional) – Whether the data is part of a scan (default: False).

  • silent (bool, optional) – Controls verbosity (default: False).