Fast pairwise mutual information for Python and C++
A fast pairwise mutual information calculation for Python and C++, optimised for small repeated integer counts such as those seen in single-cell RNA sequencing data. It uses the copula method to combine normal or negative binomial marginal distributions with a kd-tree for joint density estimation, so that sparse, zero-inflated data are naturally incorporated.
The mutual information between two discrete variables $X$ and $Y$ is

$$ I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} $$

With the assumption that we know the marginal distributions of $X$ and $Y$, the dependence structure can be separated from the marginals via a copula: each variable is mapped through its marginal CDF to a uniform variate, and the mutual information is computed from the joint density in this transformed space.
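For a concrete illustration of this definition, here is a minimal plug-in estimator (a self-contained sketch, not part of this package) that computes the mutual information directly from a joint probability table:

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of X
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y
    nz = joint > 0                          # skip log(0) terms
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz])))

# independent variables: zero mutual information
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0

# perfectly dependent variables: log(2) nats
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # ≈ 0.693
```

This direct table-based estimator is only practical for small discrete alphabets; the copula-plus-kd-tree approach described below is what makes the calculation fast for real data.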
In Python, pairwise mutual information calculations are provided for both normal (`mi_normal`) and negative binomial (`mi_negative_binomial`) marginal distributions. These functions are overloaded to accept either scalar or vector (e.g., `np.array`) floating-point or integer-valued observations. Several optimisations are provided for integer-valued data.
These functions have to be passed the parameters of the corresponding marginal distributions. For example, for the normal distribution:
```python
import numpy as np
from fast_mutual_information import mi_normal

# mean and standard deviation for the two marginal distributions
mean1, std_dev1 = 0.0, 1.0
mean2, std_dev2 = 0.0, 1.0

# data: an (Nsamples, 2) array of observations
data = np.random.normal(0, 1, (1000, 2))

mi, chi2 = mi_normal(mean1, std_dev1, mean2, std_dev2, data)
```

and for the negative binomial distribution (in the NB2 parameterisation):
```python
import numpy as np
from fast_mutual_information import mi_negative_binomial

# means and concentrations for the negative binomial marginals
means = np.array([10.0, 10.0])
concentrations = np.array([10.0, 10.0])

# data: an integer array of shape (Nsamples, Nfeatures)
# (counts with mean 10: n = 10, p = 0.5)
data = np.random.negative_binomial(10, 0.5, (1000, 2)).astype(np.int32)

mi_matrix, chi2_matrix = mi_negative_binomial(data, means, concentrations)
```

The vector functions take an (Nsamples, Nfeatures) array of data and return an (Nfeatures, Nfeatures) matrix of mutual information values, of which only the upper-triangular part is filled.
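Since only the upper triangle is filled, a full symmetric matrix can be recovered by mirroring it. A sketch, with a small dummy matrix standing in for the function's output:

```python
import numpy as np

# stand-in for the upper-triangular output of mi_negative_binomial
mi_matrix = np.array([[0.0, 0.8, 0.1],
                      [0.0, 0.0, 0.3],
                      [0.0, 0.0, 0.0]])

# mirror the upper triangle into the lower triangle; the diagonal
# is zero in this toy example, so a transpose-and-add suffices
mi_full = mi_matrix + mi_matrix.T

print(mi_full[2, 1])  # 0.3, same as mi_matrix[1, 2]
```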
These are parallelised using OpenMP, so be sure to set OMP_NUM_THREADS to a reasonable value:
```bash
export OMP_NUM_THREADS=4
```

Run time comparison on an M3 Pro, with OMP_NUM_THREADS=8.
For Python, install from PyPI using pip:

```bash
pip install fast_mutual_information
```

or, using the latest wheel for your platform from GitHub:

```bash
pip install fast_mutual_information-0.1.0-cp310-cp310-macosx_11_0_arm64.whl
```

or install from source (see below). Building from source is the best way to ensure that the package uses as many features of your CPU as possible, as the pre-built libraries assume a generic CPU architecture.
When building from source, this package requires CMake, Boost, Eigen, OpenMP, and PyBind11. PyBind11 is included with the package, however the rest are best installed using the package manager for your system. On Mac these can be installed using brew:
```bash
brew install cmake
brew install eigen
brew install libomp
brew install boost
```

and OpenMP requires the path to be set:

```bash
export OpenMP_ROOT=$(brew --prefix)/opt/libomp
```

On Ubuntu, these can be installed using apt-get:

```bash
sudo apt-get install cmake
sudo apt-get install python3-dev
sudo apt-get install libboost-all-dev
sudo apt-get install libeigen3-dev
```
This package also requires the fast_negative_binomial package, which by default is included as a submodule.
To check out the repository, along with the PyBind11 and FastNB submodules:

```bash
git clone --recurse-submodules git@github.com:dpohanlon/fast-mi.git
cd fast-mi
```

To build the C++ library with CMake:

```bash
mkdir build
cd build
cmake ../
make -j4
```

To build with the optional benchmarks (requires Google Benchmark), configure CMake with the argument -DENABLE_BENCHMARK=true:

```bash
cmake ../ -DENABLE_BENCHMARK=true
```

With the repository checked out, build and install the Python package using pip:

```bash
pip install .
```
If you've made modifications, reformat with clang-format:

```bash
clang-format -i -style=file fast_mutual_information/*pp
```

Several optimisations have been implemented to accelerate the mutual information computation, especially when working with repeated integer values. The package constructs a kd-tree from the transformed data to reduce the computational complexity of the pairwise distance evaluations, which significantly speeds up the mutual information calculation.
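As an illustrative sketch of the kd-tree idea (using SciPy's `cKDTree` rather than the package's own tree), neighbour queries replace the quadratic all-pairs distance scan that a naive density estimate would require:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
points = rng.normal(size=(2000, 2))

# build the tree once, then answer radius queries without
# evaluating all n^2 pairwise distances
tree = cKDTree(points)

r = 0.2
neighbour_counts = tree.query_ball_point(points, r, return_length=True)

# a simple density estimate proportional to the local neighbour count
density = neighbour_counts / (len(points) * np.pi * r**2)
print(neighbour_counts.shape)  # (2000,)
```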
Mutual information is computed in the transformed space via a copula. This decouples the marginals from the dependency structure, enabling flexible specification of marginal distributions (e.g., normal, uniform, negative binomial). The copula functions are configurable, allowing for on-the-fly calculations.
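The copula (probability integral) transform can be sketched as follows, using SciPy rather than the package internals: pushing each variable through its marginal CDF yields uniform marginals while leaving the dependence structure intact.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# correlated data with known standard-normal marginals
cov = [[1.0, 0.8], [0.8, 1.0]]
x = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# copula transform: evaluate the marginal CDF at each observation;
# each column is now approximately Uniform(0, 1)
u = norm.cdf(x)

# the marginals are uniform, but the dependence survives
print(np.corrcoef(u[:, 0], u[:, 1])[0, 1])  # still strongly correlated
```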
In the integer case, the kd-tree is built on the raw input data rather than on the uniform variates required for the copula, in order to exploit a 'run length' representation of the data. This keeps track of repeated values, so that the computation need only happen once for each unique value. When the PDF of the copula is computed for the MI calculation, the CDF transformation is performed on the fly.
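The run-length idea can be sketched with a toy stand-alone example (not the package's internal code): group the repeated integer observations, evaluate the expensive per-value computation once per unique value, and broadcast the results back to the full array.

```python
import numpy as np

# small repeated integer counts, as in scRNA-seq data
x = np.array([0, 0, 0, 1, 0, 2, 1, 0, 0, 2], dtype=np.int32)

# 'run length' representation: the unique values, an index mapping
# each observation to its unique value, and the repeat counts
values, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)

def expensive_transform(v):
    # stand-in for a per-value computation (e.g. a CDF evaluation)
    return np.sqrt(v + 1.0)

# compute once per unique value (3 evaluations instead of 10) ...
per_value = expensive_transform(values)

# ... then broadcast back to the full-length array
result = per_value[inverse]

print(values)  # [0 1 2]
print(counts)  # [6 2 2]
```

For zero-inflated count data, where a large fraction of observations share the same few values, this trades ten evaluations for three in the toy case above, and far more in practice.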
