TGUtils - Utilities for Tanay Group lab code¶
This package contains common utilities used by the Tanay Group Python lab code (for example,
metacell). These utilities are generally useful and not associated with any specific project.
Phantom Types¶
The vanilla np.ndarray, pd.Series and pd.DataFrame types are very generic. They are the
same regardless of the element data type used, and in the case of np.ndarray, the number of
dimensions (array vs. matrix).
To understand the code, it is helpful to keep track of a more detailed data type - whether the
variable is an array or a matrix, and what the element data type is. To facilitate this, tgutils
provides what is known as “Phantom Types”. These types can be used in mypy type declarations,
even though the actual data type of the variables remains the vanilla Numpy/Pandas data type.
See tgutils.numpy and tgutils.pandas for the list of provided phantom types. To
use them, instead of the vanilla:
import numpy as np
import pandas as pd
Write the modified:
import tgutils.numpy as np
import tgutils.pandas as pd
This will provide access to the vanilla symbols using the np. and/or pd. prefixes, and
will also provide access to the enhanced functionality described below.
For example, instead of writing:
def compute_some_integers() -> np.ndarray:
    ...

def compute_some_floats() -> np.ndarray:
    ...

def compute_stuff(foo: np.ndarray, bar: np.ndarray) -> ...:
    ...

def workflow() -> ...:
    foo = compute_some_integers()
    bar = compute_some_floats()
    compute_stuff(foo, bar)
You should write:
def compute_some_integers() -> np.ArrayInt32:
    ...

def compute_some_floats() -> np.MatrixFloat64:
    ...

def compute_stuff(foo: np.ArrayInt32, bar: np.MatrixFloat64) -> ...:
    ...

def workflow() -> ...:
    foo = compute_some_integers()
    bar = compute_some_floats()
    compute_stuff(foo, bar)
This will allow the reader to understand the exact data types involved. Even better, it will allow
mypy to verify that you actually pass the correct data type to each function invocation.
For example, if you write compute_stuff(bar, foo) by mistake, then mypy will complain that
the data types do not match - even though, under the covers, both foo and bar have exactly
the same data type at run-time: np.ndarray.
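The idea can be sketched without tgutils using typing.NewType. This is only an illustration of phantom types in general, not how tgutils defines its own types: the standard library array.array stands in for np.ndarray, and the names IntArray and FloatArray are hypothetical.

```python
from array import array
from typing import NewType

# Hypothetical phantom types: distinct for mypy, identical at run-time.
IntArray = NewType("IntArray", array)
FloatArray = NewType("FloatArray", array)


def compute_some_integers() -> IntArray:
    return IntArray(array("i", [1, 2, 3]))


def compute_some_floats() -> FloatArray:
    return FloatArray(array("d", [0.5, 1.5]))


def compute_stuff(foo: IntArray, bar: FloatArray) -> float:
    return sum(foo) + sum(bar)


foo = compute_some_integers()
bar = compute_some_floats()
total = compute_stuff(foo, bar)

# mypy would reject compute_stuff(bar, foo), yet both values are plain
# array objects at run-time, so the interpreter itself would not notice.
same_runtime_type = type(foo) is type(bar) is array
```

The type distinction exists only for the type checker; no wrapper object is created and no run-time cost is paid.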
To further help with mypy type checking, the tgutils package includes a stubs directory
containing very partial quick-and-dirty type stubs for numpy and pandas (ideally, some brave
soul(s) would tackle the very difficult issue of providing proper stubs for these libraries,
allowing for the removal of the tgutils stubs). Importing the tgutils.setup_mypy() module
sets MYPYPATH to this stubs directory, which is also a hack (see the metacell package for an
example of using this in your setup.py file).
Type Operations¶
Control over the data types is also important when performing computations. It affects performance,
memory consumption and even the semantics of some operations. For example, integer elements can
never be NaN while floating point elements can, boolean elements have their own logic, and
string elements are different from numeric elements.
To help with this, tgutils provides two functions, am and be. Both functions
return the requested data type, but am is just an assertion while be is a cast operation.
That is, writing ArrayInt32.am(foo) will return foo as an ArrayInt32, or will raise an
error if foo is not an array of int32; while writing ArrayInt32.be(foo) will always
return an ArrayInt32, which is either foo itself if it is already an array of int32, or a
copy of foo whose elements are the conversion of the elements of foo to int32.
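The difference between asserting and casting can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the tgutils implementation: it uses array.array with the "i" typecode (a signed C int, typically 32 bits) in place of a Numpy array of int32.

```python
from array import array


class Int32ArraySketch:
    """Hypothetical stand-in for a phantom type with am/be operations."""

    typecode = "i"  # array typecode for a signed C int

    @classmethod
    def am(cls, data: array) -> array:
        """Assert that data already has the expected element type."""
        if not isinstance(data, array) or data.typecode != cls.typecode:
            raise TypeError("expected an array with typecode " + cls.typecode)
        return data

    @classmethod
    def be(cls, data) -> array:
        """Return data itself if it matches, otherwise a converted copy."""
        if isinstance(data, array) and data.typecode == cls.typecode:
            return data
        return array(cls.typecode, (int(element) for element in data))


integers = array("i", [1, 2, 3])
floats = array("d", [1.0, 2.5])

unchanged = Int32ArraySketch.am(integers)  # same object, no copy
converted = Int32ArraySketch.be(floats)    # new array of converted elements

try:
    Int32ArraySketch.am(floats)
    am_rejected = False
except TypeError:
    am_rejected = True
```

In this sketch am never copies, so it is cheap enough to use as a checked annotation, while be may allocate a new array.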
De/serialization¶
The phantom types also provide read and write operations for efficiently storing data on the disk.
That is, writing ArrayInt32.read(path) will read an array of int32 elements from the
specified path, and ArrayInt32.write(foo, path) will write an array of int32 elements
into the specified path.
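A typed read/write round trip can be sketched as follows. The helpers write_int32 and read_int32 are hypothetical, and the raw-bytes layout is an assumption made for illustration; it is not claimed to be the on-disk format tgutils uses.

```python
import os
import tempfile
from array import array


def write_int32(data: array, path: str) -> None:
    """Hypothetical writer: dump the raw element bytes to path."""
    with open(path, "wb") as file:
        data.tofile(file)


def read_int32(path: str) -> array:
    """Hypothetical reader: load the raw bytes back as signed C ints."""
    data = array("i")
    with open(path, "rb") as file:
        data.frombytes(file.read())
    return data


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "foo.bin")
    original = array("i", [10, -20, 30])
    write_int32(original, path)
    restored = read_int32(path)
```

Because the element type is fixed by the reader and writer, the file needs no per-element type information.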
DynaMake¶
Import tgutils.make instead of dynamake.make. This will achieve the following:
Using Qsub¶
The tgutils.tg_qsub script deals with submitting jobs to run on the SunGrid cluster in the
Tanay Group lab.
The tgutils.make.tg_require() function collects context used to optimize the slot
allocation of tg_qsub, in order to maximize cluster utilization and minimize wait times. This has
no effect unless the collected context values are explicitly used in the run_prefix and/or
run_suffix action wrapper of some step.
This is a convoluted and sub-optimal mechanism, but it has significant performance benefits in
the specific environment it was designed for.
Applications¶
Import tgutils.application instead of dynamake.application. This will achieve the following:
Resources¶
By default, the Python process is restricted in the number of simultaneous open files. This
is raised by tgutils to the maximum allowed by the operating system.
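On Unix systems, one way to raise this limit is the standard library resource module. The following is a minimal sketch under that assumption; the helper name raise_open_files_limit is hypothetical and this is not claimed to be the exact code tgutils runs.

```python
import resource


def raise_open_files_limit() -> int:
    """Raise the soft limit on open files as far as the hard limit allows."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except (ValueError, OSError):
        # Some systems refuse the request; keep the current limit then.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft_before = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
soft_after = raise_open_files_limit()
```

An unprivileged process may raise its soft limit up to the hard limit, but cannot raise the hard limit itself.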
Numpy Errors¶
By default, numpy only warns about, or silently ignores, several kinds of numeric errors. This
is modified by tgutils to raise an appropriate exception. This increases the robustness of the
results.
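In plain Numpy, this behavior is controlled by np.seterr. The sketch below shows the effect of asking for exceptions on a division by zero; whether tgutils uses exactly these settings is an assumption.

```python
import numpy as np

# Ask numpy to raise FloatingPointError instead of warning or ignoring.
old_settings = np.seterr(all="raise")
try:
    np.array([1.0]) / np.array([0.0])
    raised = False
except FloatingPointError:
    raised = True
finally:
    # Restore the previous behavior so the example has no lasting effect.
    np.seterr(**old_settings)
```

Failing immediately makes it far harder for an unnoticed inf or NaN to propagate into the final results.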
Numpy Random Number Generation¶
By default, dynamake only handles the Python random number generator. This is extended by
tgutils so that the numpy random number generator is seeded with the same seeds as the
Python random number generator, even in parallel calls. This seeding ensures results are replicable
(when using the same non-zero seed).
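The effect of seeding both generators together can be sketched as follows. The helpers seed_everything and draw are hypothetical, and the sketch covers only a single process; it does not reproduce how dynamake or tgutils distribute seeds to parallel calls.

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed both the Python and the numpy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def draw() -> tuple:
    """Take one sample from each generator."""
    return random.random(), float(np.random.rand())


seed_everything(17)
first = draw()
seed_everything(17)
second = draw()  # identical to first, since both generators were re-seeded
```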
Logging¶
The default Python logging that prints to stderr works well for a single application. However,
when running multiple applications in parallel, log messages may get interleaved resulting in
garbled output.
This is solved by tgutils using the tgutils.application.tg_qsub_logger(), which wraps
the default logger with a tgutils.application.FileLockLoggerAdapter. This uses a file
lock operation around each emitted log message to ensure it is atomic. The lock file is chosen
to be compatible with the one used by the tgutils.tg_qsub script, so that log messages
from this script will also be protected.
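The locking idea can be sketched with the standard library logging and fcntl modules (Unix only). The class LockedLoggerAdapter and the lock file name are hypothetical; this shows the general technique rather than the actual FileLockLoggerAdapter code.

```python
import fcntl
import io
import logging
import os
import tempfile


class LockedLoggerAdapter(logging.LoggerAdapter):
    """Hypothetical adapter that holds a file lock while emitting a message."""

    def __init__(self, logger: logging.Logger, lock_path: str) -> None:
        super().__init__(logger, {})
        self.lock_path = lock_path

    def log(self, level, msg, *args, **kwargs):
        with open(self.lock_path, "a") as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)  # wait for exclusive access
            try:
                super().log(level, msg, *args, **kwargs)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)


# Log into an in-memory stream so the example has no side effects.
stream = io.StringIO()
base_logger = logging.getLogger("locked_logging_sketch")
base_logger.setLevel(logging.INFO)
base_logger.propagate = False
base_logger.addHandler(logging.StreamHandler(stream))

with tempfile.TemporaryDirectory() as directory:
    logger = LockedLoggerAdapter(base_logger, os.path.join(directory, "log.lock"))
    logger.info("step %s done", 1)

output = stream.getvalue()
```

Any other process that takes the same lock file before writing will wait for the current message to be fully emitted, which is what keeps lines from interleaving.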
Parallel¶
When running a large number of very small tasks, it is possible to let multiprocessing.Pool run
each task on the much smaller number of available threads. However, the per-task overhead makes
this less efficient. An alternative is to use tgutils.application.indexed_range() which will
partition the large range of task indices into equal-sized sub-ranges, one per process.
Reporting progress can be done using the tgutils.application.ParallelCounter class.
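The partitioning itself amounts to simple integer arithmetic. The sketch below uses a hypothetical sub_range helper whose signature is an assumption and does not mirror indexed_range(); the sub-ranges are processed serially here, where a real application would hand one to each worker process.

```python
from typing import List


def sub_range(index: int, parts: int, size: int) -> range:
    """Return the index-th of `parts` nearly equal slices of range(size)."""
    start = index * size // parts
    stop = (index + 1) * size // parts
    return range(start, stop)


def process_sub_range(indices: range) -> int:
    """Stand-in for the per-process work: handle many small tasks in one call."""
    return sum(index * index for index in indices)


parts = 3
sub_ranges: List[range] = [sub_range(index, parts, 10) for index in range(parts)]
results = [process_sub_range(indices) for indices in sub_ranges]
```

Each process is invoked once for a whole sub-range, so the dispatch cost is paid per process rather than per task.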
Other Utilities¶
Tests¶
The tgutils.tests module provides TestWithReset which properly
resets all the global state for each test, and TestWithFiles which also creates
a fresh temporary directory for each test. You can create new files using
tgutils.tests.write_file() and verify file contents using
tgutils.tests.TestWithFiles.expect_file().
Caching¶
You can use the tgutils.cache.Cache class for a lightweight generic cache mechanism.
It uses weak references to hold onto expensive-to-compute data.
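A weak-reference cache can be sketched with weakref.WeakValueDictionary. The classes Expensive and WeakCacheSketch are hypothetical and only illustrate the mechanism; the interface of tgutils.cache.Cache may differ.

```python
import gc
import weakref


class Expensive:
    """Stand-in for expensive-to-compute data (must support weak references)."""

    def __init__(self, key: str) -> None:
        self.key = key


class WeakCacheSketch:
    """Hypothetical cache that keeps values only while someone else uses them."""

    def __init__(self) -> None:
        self._data = weakref.WeakValueDictionary()
        self.computed = 0  # number of times a value had to be (re)computed

    def lookup(self, key: str) -> Expensive:
        value = self._data.get(key)
        if value is None:
            value = Expensive(key)
            self.computed += 1
            self._data[key] = value
        return value


cache = WeakCacheSketch()
first = cache.lookup("foo")
second = cache.lookup("foo")
hit_while_alive = first is second
computed_while_alive = cache.computed

# Drop all strong references; the cache entry may now be reclaimed.
del first, second
gc.collect()
third = cache.lookup("foo")
computed_after_release = cache.computed
```

The cache never keeps data alive on its own, so memory is reclaimed as soon as the last user lets go, at the price of recomputing the value on the next lookup.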
YAML¶
You can use tgutils.load_yaml.load_dictionary() for a lightweight verification of loaded
YAML data.