TGUtils - Utilities for Tanay Group lab code¶
This package contains common utilities used by the Tanay Group Python lab code (for example,
metacell). These utilities are generally useful and not associated with any specific project.
Phantom Types¶
The vanilla np.ndarray, pd.Series and pd.DataFrame types are very generic. They are the
same regardless of the element data type and, in the case of np.ndarray, regardless of the
number of dimensions (array vs. matrix).
To understand the code, it is helpful to keep track of a more detailed data type - whether the
variable is an array or a matrix, and what the element data type is. To facilitate this, tgutils
provides what is known as “Phantom Types”. These types can be used in mypy type declarations,
even though the actual data type of the variables remains the vanilla Numpy/Pandas data type.
See tgutils.numpy and tgutils.pandas for the list of provided phantom types. To
use them, instead of the vanilla:
import numpy as np
import pandas as pd
Write the modified:
import tgutils.numpy as np
import tgutils.pandas as pd
This will provide access to the vanilla symbols using the np. and/or pd. prefixes, and
will also provide access to the enhanced functionality described below.
For example, instead of writing:
def compute_some_integers() -> np.ndarray:
    ...

def compute_some_floats() -> np.ndarray:
    ...

def compute_stuff(foo: np.ndarray, bar: np.ndarray) -> ...:
    ...

def workflow() -> ...:
    foo = compute_some_integers()
    bar = compute_some_floats()
    compute_stuff(foo, bar)
You should write:
def compute_some_integers() -> np.ArrayInt32:
    ...

def compute_some_floats() -> np.MatrixFloat64:
    ...

def compute_stuff(foo: np.ArrayInt32, bar: np.MatrixFloat64) -> ...:
    ...

def workflow() -> ...:
    foo = compute_some_integers()
    bar = compute_some_floats()
    compute_stuff(foo, bar)
This will allow the reader to understand the exact data types involved. Even better, it will allow
mypy to verify that you actually pass the correct data type to each function invocation.
For example, if you by mistake write compute_stuff(bar, foo) then mypy will complain that
the data types do not match - even though, under the covers, both foo and bar have exactly
the same data type at run-time: np.ndarray.
To further help with mypy type checking, the tgutils package includes a stubs directory
containing very partial quick-and-dirty type stubs for numpy and pandas (ideally, some brave
soul(s) would tackle the very difficult issue of providing proper stubs for these libraries,
allowing for the removal of the tgutils stubs). Importing tgutils.setup_mypy() module
set MYPYPATH to this stubs directory, which is also a hack (see the metacell package for an
example of using this in your setup.py file).
Type Operations¶
Control over the data types is also important when performing computations. It affects performance,
memory consumption and even the semantics of some operations. For example, integer elements can
never be NaN while floating point elements can, boolean elements have their own logic, and
string elements are different from numeric elements.
To help with this, tgutils provides two functions, am and be. Both these functions
return the requested data type, but am is just an assertion while be is a cast operation.
That is, writing ArrayInt32.am(foo) will return foo as an ArrayInt32, or will raise an
error if foo is not an array of int32; while writing ArrayInt32.be(foo) will always
return an ArrayInt32, which is either foo if it is an array of int32, or a copy of
foo whose elements are the conversion of the elements of foo to int32.
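The distinction between the two can be sketched as follows. This is an illustrative re-implementation of the idea in plain numpy, not the actual tgutils code:

```python
import numpy as np

def am_int32(data: np.ndarray) -> np.ndarray:
    # "am": an assertion - return the data unchanged, or fail loudly.
    assert isinstance(data, np.ndarray) and data.dtype == np.int32, \
        "expected an array of int32"
    return data

def be_int32(data: np.ndarray) -> np.ndarray:
    # "be": a cast - return the data itself if it is already int32,
    # otherwise return a converted copy.
    if isinstance(data, np.ndarray) and data.dtype == np.int32:
        return data
    return data.astype(np.int32)

floats = np.array([0.5, 1.5, 2.5])
ints = be_int32(floats)  # a converted copy
same = be_int32(ints)    # the very same object, no copy
```

In the actual phantom types these are class methods (ArrayInt32.am(...), ArrayInt32.be(...)), so the element type and dimensionality are carried by the type rather than by the function name.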
De/serialization¶
The phantom types also provide read and write operations for efficiently storing data on the disk.
That is, writing ArrayInt32.read(path) will read an array of int32 elements from the
specified path, and ArrayInt32.write(foo, path) will write an array of int32 elements
into the specified path.
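The shape of this API can be sketched as follows; numpy's own .npy format stands in here for whatever on-disk format tgutils actually uses, which is not described above:

```python
import os
import tempfile
import numpy as np

def write_array(data: np.ndarray, path: str) -> None:
    # Write the array to the specified path (.npy format is a stand-in).
    np.save(path, data)

def read_array(path: str) -> np.ndarray:
    # Read the array back from the specified path.
    return np.load(path)

path = os.path.join(tempfile.mkdtemp(), "foo.npy")
original = np.arange(10, dtype=np.int32)
write_array(original, path)
restored = read_array(path)
```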
DynaMake¶
Import tgutils.make instead of dynamake.make. This will achieve the following:
Using Qsub¶
The tgutils.tg_qsub script deals with submitting jobs to run on the SunGrid cluster in the
Tanay Group lab.
A tgutils.make.tg_require() function allows collecting context used to optimize the slot
allocation of tg_qsub, maximizing cluster utilization and minimizing wait times. This has
no effect unless the collected context values are explicitly used in the run_prefix and/or
run_suffix action wrapper of some step.
This is a convoluted and sub-optimal mechanism, but it has significant performance benefits in the specific environment it was designed for.
Applications¶
Import tgutils.application instead of dynamake.application. This will achieve the following:
Resources¶
By default, the Python process is restricted in the number of simultaneous open files. This
is raised by tgutils to the maximum allowed by the operating system.
Numpy Errors¶
By default, numpy ignores several kinds of numeric errors. This is modified by tgutils
to raise an appropriate exception. This increases the robustness of the results.
Numpy Random Number Generation¶
By default, dynamake only handles the Python random number generator. This is extended by
tgutils so that the numpy random number generator is seeded with the same seeds as the
Python random number generator, even in parallel calls. This seeding ensures results are replicable
(when using the same non-zero seed).
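The essence of this seeding behavior can be sketched as follows (the actual tgutils hook also propagates the seed into parallel calls):

```python
import random
import numpy as np

def seed_both(seed: int) -> None:
    # Seed Python's and numpy's random number generators together,
    # so that results using either generator are replicable.
    random.seed(seed)
    np.random.seed(seed)

seed_both(123)
a = np.random.rand(3)
seed_both(123)
b = np.random.rand(3)
# a and b are identical, since both runs start from the same seed.
```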
Logging¶
The default Python logging that prints to stderr works well for a single application. However,
when running multiple applications in parallel, log messages may get interleaved resulting in
garbled output.
This is solved by tgutils using the tgutils.application.tg_qsub_logger(), which wraps
the default logger with a tgutils.application.FileLockLoggerAdapter. This uses a file
lock operation around each emitted log message to ensure it is atomic. The lock file is chosen
to be compatible with the one used by the tgutils.tg_qsub script, so that log messages
from this script will also be protected.
Parallel¶
When running a large number of very small tasks, it is possible to let multiprocessing.Pool
schedule each task onto the much smaller number of available processes. However, this is less efficient. An
alternative is to use tgutils.application.indexed_range() which will partition the large
range of task indices into equal-sized sub-ranges, one per process. Reporting progress can be
done using the tgutils.application.ParallelCounter class.
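The partitioning can be sketched as follows; the function name matches tgutils, but the exact signature is an assumption here:

```python
def indexed_range(index: int, *, size: int, invocations: int) -> range:
    # Illustrative sketch: split `size` task indices into `invocations`
    # near-equal contiguous sub-ranges, and return the one for `index`.
    base = size // invocations
    extra = size % invocations
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return range(start, stop)

# Each of 3 processes gets one contiguous slice of the 10 tasks:
parts = [indexed_range(i, size=10, invocations=3) for i in range(3)]
# parts == [range(0, 4), range(4, 7), range(7, 10)]
```

Each process then loops over its whole sub-range in a single invocation, instead of paying the pool's scheduling overhead once per tiny task.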
Other Utilities¶
Tests¶
The tgutils.tests module provides TestWithReset, which properly
resets all the global state for each test, and TestWithFiles, which also creates
a fresh temporary directory for each test. You can create new files using
tgutils.tests.write_file() and verify file contents using
tgutils.tests.TestWithFiles.expect_file().
Caching¶
You can use the tgutils.cache.Cache class for a lightweight generic cache mechanism.
It uses weak references to hold onto expensive-to-compute data.
YAML¶
You can use tgutils.load_yaml.load_dictionary() for a lightweight verification of loaded
YAML data.