An HTTP server that finds near-duplicate or similar documents for a given document. It uses go-raft, so it can run as a cluster with other nodes and provide high availability.
Explanations of the minhash and locality-sensitive hashing algorithms used can be found here.
go get github.com/mauidude/deduper

godep go build

./deduper [data directory]

-host  The host the server will run on. Defaults to localhost.
-port  The port the server will run on. Defaults to 8080.
-leader  The host:port of the leader node, if running as a follower. Defaults to leader mode.
-debug  Enables debug output. Defaults to false.
The following options need to be tuned by testing against your document sizes and overall corpus size. If you change these values, you will need to re-add all of your documents.
-bands  The number of bands to use in the minhash algorithm. Defaults to 100.
-hashes  The number of hashes to use in the minhash algorithm. Defaults to 2.
-shingles  The shingle size to use on the text. Defaults to 2.
godep go test ./...

POST /documents/:id HTTP/1.1
[HTTP headers...]
[document body]
This will add the document to the index under the given id.
Writes can be sent to either a leader or a follower. Writes sent to a follower are proxied to the leader.
POST /documents/similar HTTP/1.1
[HTTP headers...]
[document body]
This request takes an optional threshold parameter in the query string. Only documents with
a similarity greater than or equal to that value are returned. The value must be between
0 and 1. The default is 0.8.
This will return a JSON array of matching documents and their similarity. Similarity is a
value between 0 and 1, where 1 means identical and 0 means no shared content.
[
  {
    "id": "mydocument.txt",
    "similarity": 0.934
  },
  {
    "id": "someotherdocument.txt",
    "similarity": 0.85
  }
]