marsi.nearest_neighbors package

Submodules

marsi.nearest_neighbors.model module

class marsi.nearest_neighbors.model.ClFail[source]

Bases: object

class marsi.nearest_neighbors.model.KNN(fingerprint, k, mode)[source]

Bases: object

K-Nearest Neighbors runner object.

It is assigned to a model and runs the knn function.

fp

numpy.array – A numpy.array with the fingerprint values.

k

int – The maximum number of neighbors to retrieve.

mode

str – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.

Methods

__call__(nn)
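
The runner pattern described above (an object that stores the query and is later called with a model) might look roughly like the sketch below. This is an illustration only, not marsi's implementation: ToyModel and KNNRunner are hypothetical names, and the toy model uses a simple fraction-of-mismatching-bits distance in place of the real fingerprint metric.

```python
class ToyModel:
    """Hypothetical stand-in for a nearest-neighbors model."""

    def __init__(self, entries):
        # entries maps an identifier to a fingerprint (tuple of 0/1 bits)
        self.entries = entries

    def knn(self, fingerprint, k, mode="native"):
        # Fraction of mismatching bits, used here only for illustration
        def dist(a, b):
            return sum(x != y for x, y in zip(a, b)) / len(a)

        scored = sorted((dist(fingerprint, fp), key) for key, fp in self.entries.items())
        return {key: d for d, key in scored[:k]}


class KNNRunner:
    """Hypothetical runner: stores the query, then is called once per model."""

    def __init__(self, fingerprint, k, mode):
        self.fp = fingerprint
        self.k = k
        self.mode = mode

    def __call__(self, nn):
        return nn.knn(self.fp, self.k, mode=self.mode)


models = [
    ToyModel({"a": (1, 0, 1, 0), "b": (1, 1, 1, 1)}),
    ToyModel({"c": (0, 0, 0, 0), "d": (1, 0, 1, 1)}),
]
runner = KNNRunner((1, 0, 1, 0), k=1, mode="native")
# A callable runner can be handed to map(), which calls it once per model
results = list(map(runner, models))
```

Because the runner is a plain callable, the same object can be handed to any map-style executor, which is presumably why the view parameter appears on the distributed methods.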
class marsi.nearest_neighbors.model.RNN(fingerprint, radius, mode)[source]

Bases: object

R-Nearest Neighbors runner object.

It is assigned to a model and runs the rnn function.

fp

numpy.array – A numpy.array with the fingerprint values.

radius

float – A distance radius in the interval ]0, 1].

mode

str – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.

Methods

__call__(nn)
class marsi.nearest_neighbors.model.Distance(fingerprint, mode)[source]

Bases: object

Distance runner object.

It is assigned to a model and runs the distance function.

fp

numpy.array – A numpy.array with the fingerprint values.

mode

str – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.

Methods

__call__(nn)
class marsi.nearest_neighbors.model.DistributedNearestNeighbors(nns)[source]

Bases: object

Nearest Neighbors distributed implementation.

index

numpy.array – The index of all entries across multiple models.

Attributes

index

Methods

distance_matrix([mode]) Generates a distance matrix between all elements in the models.
distances(fingerprint[, mode, view]) Retrieves the distances between a fingerprint and all elements in the model.
feature(index) Retrieves the fingerprint at a given index.
k_nearest_neighbors(fingerprint[, k, mode, view]) Retrieves the K nearest neighbors to a fingerprint.
radius_nearest_neighbors(fingerprint[, …]) Retrieves the nearest neighbors to a fingerprint within a distance radius.
k_nearest_neighbors(fingerprint, k=5, mode='native', view=<cameo.parallel.SequentialView object>)[source]

Retrieves the K nearest neighbors to a fingerprint.

Parameters:
  • fingerprint (list, np.array, tuple) – A fingerprint to use as query.
  • k (int) – The number of neighbors to retrieve.
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
  • view (cameo.parallel.ParallelView, cameo.parallel.SequentialView) – A parallel mode runner.
Returns:

A dictionary with the InChI Key as key and the distance as value.

Return type:

dict
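
As a rough illustration of a K-nearest-neighbors query over several models, the sketch below gathers candidates from every model and keeps the k smallest distances. It assumes a Jaccard distance over sets of "on" bit positions; the metric used by this class is not stated here (only DBNearestNeighbors lists metric='jaccard'). The keys and the helper names jaccard_distance and k_nearest are hypothetical.

```python
import heapq


def jaccard_distance(a, b):
    """Jaccard distance between two sets of 'on' bit positions."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def k_nearest(models, query, k=5):
    """Collect candidates from every model and keep the k smallest distances."""
    candidates = []
    for model in models:  # each model is a dict: key -> set of on-bits
        for key, fp in model.items():
            candidates.append((jaccard_distance(query, fp), key))
    return {key: d for d, key in heapq.nsmallest(k, candidates)}


models = [
    {"KEY-A": {1, 2, 3}, "KEY-B": {1, 2, 3, 4}},
    {"KEY-C": {7, 8}, "KEY-D": {1, 2}},
]
neighbors = k_nearest(models, {1, 2, 3}, k=2)
# neighbors == {"KEY-A": 0.0, "KEY-B": 0.25}
```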

radius_nearest_neighbors(fingerprint, radius=0.25, mode='native', view=<cameo.parallel.SequentialView object>)[source]

Retrieves the nearest neighbors to a fingerprint within a distance radius.

Parameters:
  • fingerprint (list, np.array, tuple) – A fingerprint to use as query.
  • radius (float) – A distance radius in the interval ]0, 1].
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
  • view (cameo.parallel.ParallelView, cameo.parallel.SequentialView) – A parallel mode runner.
Returns:

A dictionary with the InChI Key as key and the distance as value.

Return type:

dict
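
A radius query can be sketched as a simple filter on distance. The example below is a minimal sketch under stated assumptions: it uses a Jaccard distance on sets of "on" bits, treats the radius bound as inclusive, and rejects radii outside ]0, 1]. None of these details are confirmed by this reference, and radius_neighbors is a hypothetical helper name.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of 'on' bit positions."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def radius_neighbors(entries, query, radius=0.25):
    """Return key -> distance for entries within `radius` of the query."""
    if not 0 < radius <= 1:
        raise ValueError("radius must be in the interval ]0, 1]")
    result = {}
    for key, fp in entries.items():
        d = jaccard_distance(query, fp)
        if d <= radius:  # inclusive bound is an assumption
            result[key] = d
    return result


entries = {"KEY-A": {1, 2, 3}, "KEY-B": {1, 2, 3, 4}, "KEY-C": {1, 9}}
hits = radius_neighbors(entries, {1, 2, 3}, radius=0.25)
# hits == {"KEY-A": 0.0, "KEY-B": 0.25}
```

Unlike the K-nearest-neighbors query, the number of results is not bounded: a large radius on a dense collection can return most of the entries.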

distances(fingerprint, mode='native', view=<cameo.parallel.SequentialView object>)[source]

Retrieves the distances between a fingerprint and all elements in the model.

Parameters:
  • fingerprint (list, np.array, tuple) – A fingerprint to use as query.
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
  • view (cameo.parallel.ParallelView, cameo.parallel.SequentialView) – A parallel mode runner.
Returns:

A dictionary with the InChI Key as key and the distance as value.

Return type:

dict

index
distance_matrix(mode='native')[source]

Generates a distance matrix between all elements in the models.

Parameters:mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
Returns:The distance matrix.
Return type:numpy.array
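
To make the shape of the result concrete, here is a small sketch of an all-pairs distance matrix. It is not the library's code: it assumes a Jaccard distance on sets of "on" bits and returns nested lists rather than a numpy.array so that the example has no dependencies.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of 'on' bit positions."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def distance_matrix(fingerprints):
    """Symmetric n x n matrix of pairwise distances with a zero diagonal."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so each pair is computed once
            d = jaccard_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


fps = [{1, 2}, {1, 2, 3, 4}, {5}]
matrix = distance_matrix(fps)
# matrix == [[0.0, 0.5, 1.0], [0.5, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

The matrix grows quadratically with the number of entries, so for large collections memory is likely to be the limiting factor.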
feature(index)[source]

Retrieves the fingerprint at a given index. The index is global for the ensemble of models.

Returns:The fingerprint.
Return type:numpy.array
Raises:IndexError
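
One way a global index over an ensemble can be resolved is to map it to a model number and a local offset using cumulative model sizes. The sketch below shows that idea only; locate is a hypothetical helper, and how marsi resolves the index internally is not described here.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(model_sizes, index):
    """Map a global index to (model number, local index)."""
    total = sum(model_sizes)
    if not 0 <= index < total:
        raise IndexError("index %d out of range for %d entries" % (index, total))
    ends = list(accumulate(model_sizes))  # cumulative end offset of each model
    model_number = bisect_right(ends, index)
    start = ends[model_number - 1] if model_number else 0
    return model_number, index - start


sizes = [3, 2, 4]  # three models holding 3, 2 and 4 entries
position = locate(sizes, 4)
# position == (1, 1): the second entry of the second model
```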
class marsi.nearest_neighbors.model.NearestNeighbors(index, features, features_lengths, use_cl=False, opencl_context=None)[source]

Bases: marsi.nearest_neighbors.model_ext.CNearestNeighbors

Attributes

cl_context
data_frame Create a DataFrame with this model.
features
features_lengths
index
program
start_positions

Methods

distances(fingerprint[, mode])
distances_cl(*args, **kwargs)
distances_py
input_buffer(*args, **kwargs)
knn(fingerprint, k[, mode]) K-Nearest Neighbors
max_memory_allocation_size()
output_buffer(*args, **kwargs)
queue()
rnn(fingerprint, radius[, mode]) Radius-Nearest Neighbors
run_kernel(*args, **kwargs)
cl_context
queue()[source]
program
knn(fingerprint, k, mode='native')[source]

K-Nearest Neighbors

Parameters:
  • fingerprint (ndarray) – The fingerprint to search for.
  • k (int) – The number of neighbors to return.
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
Returns:

A dictionary mapping each index to its distance (index –> distance).

Return type:

dict

rnn(fingerprint, radius, mode='native')[source]

Radius-Nearest Neighbors

Parameters:
  • fingerprint (ndarray) – The fingerprint to search for.
  • radius (float) – The maximum distance of neighbors to return.
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
Returns:

A dictionary mapping each index to its distance (index –> distance).

Return type:

dict

distances(fingerprint, mode='native')[source]
max_memory_allocation_size()[source]
distances_cl(*args, **kwargs)
input_buffer(*args, **kwargs)
output_buffer(*args, **kwargs)
run_kernel(*args, **kwargs)
index
data_frame

Create a DataFrame with this model.

Returns:A data frame.
Return type:pandas.DataFrame
class marsi.nearest_neighbors.model.DBNearestNeighbors(index, session, fingerprint_format, metric='jaccard')[source]

Bases: object

Attributes

data_frame Create a DataFrame with this model.
features
index
neighbors

Methods

distances(fingerprint[, mode])
knn(fingerprint, k[, mode]) K-Nearest Neighbors
rnn(fingerprint, radius[, mode]) Radius-Nearest Neighbors
neighbors
knn(fingerprint, k, mode='native')[source]

K-Nearest Neighbors

Parameters:
  • fingerprint (ndarray) – The fingerprint to search for.
  • k (int) – The number of neighbors to return.
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
Returns:

A dictionary mapping each index to its distance (index –> distance).

Return type:

dict

rnn(fingerprint, radius, mode='native')[source]

Radius-Nearest Neighbors

Parameters:
  • fingerprint (ndarray) – The fingerprint to search for.
  • radius (float) – The maximum distance of neighbors to return.
  • mode (str) – ‘native’ to run the Python implementation, or ‘cl’ to run the OpenCL implementation if available.
Returns:

A dictionary mapping each index to its distance (index –> distance).

Return type:

dict

distances(fingerprint, mode='native')[source]
index
features
data_frame

Create a DataFrame with this model.

Returns:A data frame.
Return type:pandas.DataFrame

marsi.nearest_neighbors.model_ext module

class marsi.nearest_neighbors.model_ext.CNearestNeighbors

Bases: object

Attributes

features
features_lengths
start_positions

Methods

distances_py
distances_py()
features
features_lengths
start_positions

Module contents

marsi.nearest_neighbors.build_nearest_neighbors_model(database, fpformat='fp4', solubility='high', n_models=5, chunk_size=1000000.0, view=<class 'cameo.parallel.SequentialView'>)[source]

Loads a nearest neighbors (NN) model.

If a ‘default_model.pickle’ exists in data, the model is loaded from it. Otherwise, a model is built from the database, which can take several hours depending on the size of the database.

Parameters:
  • database (marsi.io.mongodb.CollectionWrapper) – A Database interface to the metabolites.
  • chunk_size (int) – Maximum number of entries per chunk.
  • fpformat (str) – The format of the fingerprint (see pybel.fps)
  • solubility (str) – One of high, medium, low or all.
  • view (cameo.parallel.SequentialView, cameo.parallel.MultiprocessingView) – A view to control parallelization.
  • n_models (int) – The number of NearestNeighbors models.
marsi.nearest_neighbors.load_nearest_neighbors_model(chunk_size=1000000.0, fpformat='fp4', solubility='all', session=<sqlalchemy.orm.session.Session object>, view=<cameo.parallel.SequentialView object>, model_size=100000, source='db', costum_query=None)[source]

Loads a nearest neighbors (NN) model.

If a ‘default_model.pickle’ exists in data, the model is loaded from it. Otherwise, a model is built from the database, which can take several hours depending on the size of the database.

Parameters:
  • chunk_size (int) – Maximum number of entries per chunk.
  • fpformat (str) – The format of the fingerprint (see pybel.fps)
  • solubility (str) – One of high, medium, low or all.
  • view (cameo.parallel.SequentialView, cameo.parallel.MultiprocessingView) – A view to control parallelization.
  • model_size (int) – The size of each NearestNeighbors model in the ensemble.
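
The model_size parameter suggests that the entries are split into fixed-size groups, one per model in the ensemble. A minimal sketch of that kind of partitioning follows. It is an assumption about the general technique, not a description of how marsi partitions its data, and chunked is a hypothetical helper.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


keys = ["KEY-%d" % i for i in range(7)]  # hypothetical entry identifiers
groups = list(chunked(keys, 3))
# Seven entries with a size of 3 give groups of 3, 3 and 1 entries
```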