Spark Python K-nn

Simple but memory efficient function for computation of K nearest neighbors.

Requires

Installs Numpy, scikit-learn.

Install

Add jakac:spark-python-knn:0.0.3 to requirements of your app:

$SPARK_HOME/bin/pyspark --packages jakac:spark-python-knn:0.0.3

Usage

```python
from gaussalgo.knn import compute_neighbors
import numpy as np

left = sc.parallelize([
        (1, np.array([0,0,1,1])),
        (2, np.array([0,1,1,1])),
        (3, np.array([0,0,1,1])),
        (4, np.array([1,1,1,1]))
])
right = sc.parallelize([
        (5, np.array([0,0,1,1])),
        (6, np.array([0,1,1,1])),
        (7, np.array([1,0,0,1])),
        (8, np.array([1,1,1,1]))
])

neighbors = compute_neighbors(left, right, top_n=2).collect()
returns: [(1, [(5, 0.99999999999999978), (6, 0.81649658092772603)]), (2, [(6, 1.0000000000000002), (8, 0.86602540378443882)]), (3, [(5, 0.99999999999999978), (6, 0.81649658092772603)]), (4, [(8, 1.0), (6, 0.86602540378443882)])]
```

Options: * top_n: int How many neighbors compute * metric: callable which for two vectors return their distance/similarity. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. * is_similarity: boolean If metric is similarity or distance * X_blocks, Y_block: int Repartition X or Y to specified number of partitions.

Tips

Computational complexity is still $X \times Y$, it uses brute force for K-nn search, but it do not need $X \times Y$ of memory. Memory usage is X_PARTITION_SIZE $\times$ NUM_X_PARTITIONS $\times$ Y_PARTITION_SIZE $\times$ NUM_Y_PARTITIONS, so choosing right number of partitions is crucial.

Contribution

Feel free to comment it or contribute to it.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
python		python
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Spark Python K-nn

Requires

Install

Usage

Tips

Contribution

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

jakac/spark-python-knn

Folders and files

Latest commit

History

Repository files navigation

Spark Python K-nn

Requires

Install

Usage

Tips

Contribution

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages