tgutils package¶

Submodules¶

tgutils.application module¶

Utilities for main functions.

class tgutils.application.FileLockLoggerAdapter(logger: logging.Logger, path: str)¶

Bases: logging.LoggerAdapter

A logger adapter that performs a file lock around each logged message.

If used consistently in multiple applications, this ensures that logging does not get garbled, even when running across multiple machines.

__init__(logger: logging.Logger, path: str) → None¶: Create a logger adapter that locks the specified directory path.

log(*args, **kwargs) → Any¶: Log a message while locking the directory.

class tgutils.application.Loop(*, start: str, progress: str, completed: str, log_every: int = 1, log_with: Optional[int] = None, expected: Optional[int] = None)¶

Bases: object

Log progress for a (possibly parallel) loop.

__init__(*, start: str, progress: str, completed: str, log_every: int = 1, log_with: Optional[int] = None, expected: Optional[int] = None) → None¶: Initialize self. See help(type(self)) for accurate signature.

completed = None¶: The format of the completion message.

done() → None¶: Indicate the loop has completed.

expected = None¶: The expected number of increments.

local_every = None¶: Granularity of parallel counting.

log_every = None¶: Emit a log message once every this many iterations (typically a power of 10).

log_with = None¶: The value in the log message is divided by this amount (typically a power of 1000).

progress = None¶: The format of the progress message.

shared_counter = None¶: The shared memory iteration counter.

start = None¶: The format of the start message.

step(fraction: Optional[float] = None) → None¶

Indicate a loop iteration.

Ideally, this is called at the end of the iteration, to indicate that the iteration has completed. If the loop code is complex (for example, it contains continue), place the call at the start of the loop body instead.
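The placement advice can be illustrated with a small stand-in. This is only a sketch: `SketchLoop` is a hypothetical, single-process substitute for `Loop`, it collects messages in a list instead of logging them, and the `{}` placeholder in the format strings and the integer division by `log_with` are assumptions, not documented behavior.

```python
from typing import Iterable, List, Optional


class SketchLoop:
    """Hypothetical single-process stand-in for tgutils.application.Loop."""

    def __init__(self, *, start: str, progress: str, completed: str,
                 log_every: int = 1, log_with: Optional[int] = None) -> None:
        self.progress = progress
        self.completed = completed
        self.log_every = log_every
        self.log_with = log_with or 1  # assumption: None means "do not divide"
        self.count = 0
        self.messages: List[str] = [start]

    def step(self) -> None:
        """Count one iteration, emitting a message every `log_every` steps."""
        self.count += 1
        if self.count % self.log_every == 0:
            self.messages.append(self.progress.format(self.count // self.log_with))

    def done(self) -> None:
        """Emit the completion message with the final count."""
        self.messages.append(self.completed.format(self.count // self.log_with))


def count_even(values: Iterable[int], loop: SketchLoop) -> int:
    evens = 0
    for value in values:
        # Called first: the `continue` below would skip a call placed at the end.
        loop.step()
        if value % 2:
            continue
        evens += 1
    loop.done()
    return evens


loop = SketchLoop(start="Scanning", progress="Scanned {}", completed="Total {}",
                  log_every=2)
evens = count_even(range(5), loop)
```

With the call at the top of the body, odd values that hit `continue` are still counted.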

tgutils.application.each_file_line(path: str, loop: Optional[tgutils.application.Loop] = None) → Iterator[Tuple[int, str]]¶: Loop on each file line.

tgutils.application.indexed_range(index: int, *, size: int, invocations: int = 0) → range¶

Return a range of indices for an indexed invocation.

Each invocation index will get its own range, where the range sizes will be the same (as much as possible) for each invocation.

If the number of invocations is zero, it is assumed to be the number of available parallel processes, that is, that there will be one invocation per parallel process (at most size).
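One way to obtain near-equal ranges is integer arithmetic on the boundaries. The sketch below is not the library's implementation: `sketch_indexed_range` is a hypothetical name, the exact way the remainder is distributed is an assumption, and the CPU count is used as a stand-in for "available parallel processes".

```python
import os


def sketch_indexed_range(index: int, *, size: int, invocations: int = 0) -> range:
    """Return the part of range(size) handled by invocation number `index`."""
    if invocations == 0:
        # Assumption: approximate the available parallel processes by the
        # CPU count, capped at `size` so no invocation is left empty.
        invocations = min(os.cpu_count() or 1, size)
    invocations = max(invocations, 1)
    # Scaling the boundaries spreads the remainder across invocations,
    # so range sizes differ by at most one.
    start = index * size // invocations
    stop = (index + 1) * size // invocations
    return range(start, stop)


parts = [list(sketch_indexed_range(i, size=10, invocations=3)) for i in range(3)]
```

Here `parts` splits ten indices into consecutive chunks of sizes 3, 3 and 4, with every index covered exactly once.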

tgutils.application.lock_file(lock_path: str, lock_fd: int) → Iterator[None]¶: Perform some action while holding a file lock.
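As an illustration of the general pattern, a lock-holding context manager can be written with `fcntl.flock`. This is a sketch under assumptions: the names `sketch_lock_file` and `append_line_locked` are hypothetical, the sketch takes only a file descriptor, and the actual locking mechanism used by `lock_file` is not described here and may differ (in particular, `flock` is Unix-only and not always reliable on network file systems).

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def sketch_lock_file(lock_fd: int) -> Iterator[None]:
    """Hold an exclusive advisory lock on `lock_fd` for the duration of the block."""
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    try:
        yield
    finally:
        # Always release, even if the body raised.
        fcntl.flock(lock_fd, fcntl.LOCK_UN)


def append_line_locked(path: str, line: str) -> None:
    """Append one line to `path` while holding the lock on that file."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        with sketch_lock_file(fd):
            os.write(fd, (line + "\n").encode("utf-8"))
    finally:
        os.close(fd)


# Usage: two appends to the same file, each performed under the lock.
log_path = os.path.join(tempfile.mkdtemp(), "shared.log")
append_line_locked(log_path, "first")
append_line_locked(log_path, "second")
```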

tgutils.application.main(parser: argparse.ArgumentParser, functions: Optional[List[str]] = None, *, adapter: Optional[Callable[argparse.Namespace, None]] = None) → None¶

A generic main function for configurable functions.

See dynamake.application.main().

tgutils.application.maximal_open_files() → None¶: Ensure we can use the maximal number of open files at the same time.
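On Unix systems, the usual way to do this is to raise the soft `RLIMIT_NOFILE` limit to the hard limit using the standard `resource` module. The following is a sketch of that technique, not necessarily what `maximal_open_files` does; the name `sketch_maximal_open_files` and the returned value are illustrative additions.

```python
import resource


def sketch_maximal_open_files() -> int:
    """Raise the soft open-files limit to the hard limit; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms refuse to raise the soft limit all the way;
            # in that case the previous limit is kept.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft_before = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
soft_after = sketch_maximal_open_files()
```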

tgutils.application.reset_application() → None¶: Reset the global state (for tests).

tgutils.application.tg_qsub_logger(logger: logging.Logger) → logging.Logger¶: Wrap a logger so that messages will not get interleaved with other program invocations and/or the messages from the tg_qsub script.

tgutils.application.tgutils_adapter(args: argparse.Namespace) → None¶: Perform last minute adaptation of the program following parsing the command line options.

tgutils.cache module¶

Simple caching of expensive values.

class tgutils.cache.Cache¶

Bases: typing.Generic

Cache of expensive values.

__init__() → None¶: Initialize self. See help(type(self)) for accurate signature.

lookup(key: Key, compute_value: Callable[Value]) → Value¶: Lookup a value by its key, computing it only if this is the first lookup.

static reset() → None¶: Clear all the caches (for tests).
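The compute-on-first-lookup behavior can be sketched with a plain dictionary. `SketchCache` is a hypothetical stand-in, it assumes `compute_value` takes no arguments, and it does not reproduce the global `reset()` of the real class.

```python
from typing import Callable, Dict, Generic, Hashable, List, TypeVar

Key = TypeVar("Key", bound=Hashable)
Value = TypeVar("Value")


class SketchCache(Generic[Key, Value]):
    """Hypothetical minimal cache: compute each value at most once per key."""

    def __init__(self) -> None:
        self._data: Dict[Key, Value] = {}

    def lookup(self, key: Key, compute_value: Callable[[], Value]) -> Value:
        # Only the first lookup of a key invokes the (expensive) computation.
        if key not in self._data:
            self._data[key] = compute_value()
        return self._data[key]


calls: List[str] = []


def expensive() -> int:
    calls.append("computed")
    return 42


cache: SketchCache[str, int] = SketchCache()
first = cache.lookup("answer", expensive)
second = cache.lookup("answer", expensive)
```

After both lookups, `expensive` has run only once and both results are the same value.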

tgutils.load_yaml module¶

Load data from YAML files.

tgutils.load_yaml.load_dictionary(path: str, data: Any = None, *, allowed_keys: Optional[Dict[str, type]] = None, required_keys: Optional[Dict[str, type]] = None, key_type: type = <class 'str'>, value_type: Optional[type] = None) → Dict[Any, Any]¶

Load a dictionary with string keys from a YAML or JSON file.

Parameters:
	path – The path of the YAML/JSON file.
	data – Optional data loaded from the file. If this is `None`, the file is loaded instead.
	allowed_keys – An optional dictionary of allowed keys, where the value is the expected type of the loaded value. If not `None`, other keys are rejected (unless listed in required_keys).
	required_keys – An optional dictionary of required keys, where the value is the expected type of the loaded value. If not `None`, specified keys that are missing from the loaded data are an error.
	key_type – The expected type of the keys, `str` by default.
	value_type – An optional type. If not `None`
Returns:	The loaded dictionary.
Return type:	Dict[str, Any]
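The interplay between allowed_keys and required_keys can be sketched as a standalone check over already-loaded data. This is an illustration of the rules described above, not the library's code: `sketch_check_keys` is a hypothetical name, and raising `RuntimeError` with these messages is an assumption.

```python
from typing import Any, Dict, Optional


def sketch_check_keys(
    data: Dict[str, Any],
    *,
    allowed_keys: Optional[Dict[str, type]] = None,
    required_keys: Optional[Dict[str, type]] = None,
) -> None:
    """Validate loaded data against the allowed and required key tables."""
    required = required_keys or {}
    for key in required:
        if key not in data:
            raise RuntimeError(f"missing required key: {key}")
    for key, value in data.items():
        if key in required:
            expected: Optional[type] = required[key]
        elif allowed_keys is None:
            expected = None  # no restriction on additional keys
        elif key in allowed_keys:
            expected = allowed_keys[key]
        else:
            raise RuntimeError(f"unexpected key: {key}")
        if expected is not None and not isinstance(value, expected):
            raise RuntimeError(
                f"key {key} has type {type(value).__name__} "
                f"instead of {expected.__name__}"
            )


# Passes: "name" is required, "size" is allowed, both have the expected types.
sketch_check_keys(
    {"name": "sample", "size": 3},
    allowed_keys={"size": int},
    required_keys={"name": str},
)
```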

tgutils.load_yaml.verify_type(path: str, element_kind: str, element_identifier: str, value: Any, expected_type: Optional[type]) → None¶

Verify the type of an element loaded from a YAML/JSON file.

If the value has an unexpected type, raises a RuntimeError.

Parameters:
	path – The path of the loaded YAML/JSON file.
	element_kind – The kind of element this is (for the error message).
	element_identifier – The identifier of the element (unique within its kind).
	value – The loaded value of the element.
	expected_type – The expected Python class the value should be an instance of.
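A check of this shape can be sketched with `isinstance`. The function name `sketch_verify_type`, the treatment of `None` as "accept anything", and the wording of the error message are assumptions made for illustration.

```python
from typing import Any, Optional


def sketch_verify_type(
    path: str,
    element_kind: str,
    element_identifier: str,
    value: Any,
    expected_type: Optional[type],
) -> None:
    """Raise RuntimeError if `value` is not an instance of `expected_type`."""
    # Assumption: an expected type of None means any value is accepted.
    if expected_type is None or isinstance(value, expected_type):
        return
    raise RuntimeError(
        f"{path}: the {element_kind} {element_identifier} "
        f"has type {type(value).__name__} instead of {expected_type.__name__}"
    )


# Passes silently: 8 is an int.
sketch_verify_type("config.yaml", "key", "cpus", 8, int)

try:
    sketch_verify_type("config.yaml", "key", "cpus", "eight", int)
    error_message = ""
except RuntimeError as error:
    error_message = str(error)
```

Note that `isinstance` follows Python's class hierarchy, so in this sketch a `bool` value would also pass a check against `int`.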

tgutils.make module¶

Utilities for using DynaMake.

tgutils.make.parallel_jobs() → int¶

Return the number of jobs to use for a parallel sub-process in the current context (can be passed to --jobs).

This assumes all the actions of the innermost tg_require in the current context are executed, and tries to utilize all the available CPUs for them.

tgutils.make.reset_make() → None¶: Reset the persistent context (for tests).

tgutils.make.tg_require(*paths) → None¶

Require all the specified paths with a parallel context.

This sets up the invocation context(s) of all the actions needed to build these files, and any of their dependencies, such that parallel_size contains the number of paths and parallel_index contains the index of the specific path. If nested, the context of the innermost call is used.

The parallel_size and parallel_index context can then be embedded in the run_prefix of the actions, to be passed to qsubber which uses this information to optimize the assignment of CPUs to SunGrid jobs.

For example, suppose you wrote the following in DynaMake.yaml:

- when:
    is_parallel: True
    step: my_expensive_multi_processing_step
  then:
    run_prefix:
      'qsubber -v -I {parallel_index} -S {parallel_size} -j job-{action_id} -s 8- --'

Then qsubber will allocate at least 8 CPUs for each action invoked by my_expensive_multi_processing_step. If there are only a few such invocations (say, up to one per cluster server), it may assign more CPUs to each invocation (up to all the CPUs on each server). If there are many invocations, it will assign fewer CPUs to each, to ensure as many invocations as possible run in parallel.

This is due to the unfortunate fact that speedup gained by using more CPUs is not linear; that is, a 16-CPU action takes longer than half the time it takes using 8 CPUs. Therefore, if all we have is a 16-CPU machine, we are better off running two 8-CPU actions in parallel than one 16-CPU action followed by another.
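The arithmetic behind this can be made concrete with a toy model. The sketch below assumes Amdahl's law with a 90% parallelizable action; both the model and the fraction are illustrative assumptions, not measurements of any real step.

```python
def amdahl_time(cpus: int, parallel_fraction: float = 0.9) -> float:
    """Run time relative to a single CPU, under Amdahl's law (assumed model)."""
    return (1.0 - parallel_fraction) + parallel_fraction / cpus


t8 = amdahl_time(8)    # roughly 0.21 of the single-CPU time
t16 = amdahl_time(16)  # roughly 0.16, which is more than half of t8

# Two actions on one 16-CPU machine:
side_by_side = t8        # both run at once, 8 CPUs each
back_to_back = 2 * t16   # one after the other, 16 CPUs each
```

Under these assumptions, running the two actions side by side finishes in about two thirds of the time needed to run them back to back.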

This is overly convoluted, sub-optimal, and very specific to the way we distribute actions on the SunGrid cluster in the Tanay Group lab. The cluster manager should arguably do much better without all these complications. However, all we have is qsub.

tgutils.numpy module¶

Numpy utilities.

Import this as np instead of importing the numpy module. It exports the same symbols, with the addition of strongly-typed phantom classes for tracking the exact dimensions and type of each variable using mypy. It also provides some additional utilities (I/O).

tgutils.numpy.A = ~A¶: Type variable for arrays.

tgutils.numpy.ARRAY_OF_DTYPE = {'bool': <class 'tgutils.numpy.ArrayBool'>, 'float32': <class 'tgutils.numpy.ArrayFloat32'>, 'float64': <class 'tgutils.numpy.ArrayFloat64'>, 'int16': <class 'tgutils.numpy.ArrayInt16'>, 'int32': <class 'tgutils.numpy.ArrayInt32'>, 'int64': <class 'tgutils.numpy.ArrayInt64'>, 'int8': <class 'tgutils.numpy.ArrayInt8'>, 'str': <class 'tgutils.numpy.ArrayStr'>}¶: The phantom type for an array by its data type name.

class tgutils.numpy.ArrayBool¶
