Project Ideas Improve Copyright Detection Accuracy and Speed
Copyright detection is reasonably good, but it is the slowest scanner in ScanCode. It is based on NLTK part-of-speech (PoS) tagging and a copyright grammar. The reported start and end lines where a copyright is found are only approximate.
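To make the tag-then-parse idea concrete, here is a toy sketch in plain Python. It is not ScanCode's tagger or grammar and does not use NLTK: the tag names (`COPY`, `YR`, `NNP`, `O`), the patterns, and the functions `tag` and `find_statements` are all hypothetical, chosen only to show how tokens are first labeled and how a grammar then matches sequences of labels.

```python
import re

# Hypothetical, tiny tag set for illustration; a real tagger needs far more rules.
TAG_PATTERNS = [
    ("COPY", re.compile(r"^(?:copyright|\(c\)|©)$", re.IGNORECASE)),
    ("YR", re.compile(r"^(?:19|20)\d\d(?:-(?:19|20)\d\d)?,?$")),
    ("NNP", re.compile(r"^[A-Z][\w.&-]*,?$")),
]


def tag(tokens):
    """Give each token the first tag whose pattern matches, else "O" (other)."""
    tagged = []
    for token in tokens:
        for name, pattern in TAG_PATTERNS:
            if pattern.match(token):
                tagged.append((token, name))
                break
        else:
            tagged.append((token, "O"))
    return tagged


# Toy "grammar" over the tag sequence: one or more copyright markers,
# optional years, then one or more capitalized words (the holder).
STATEMENT = re.compile(r"(?:COPY )+(?:YR )*(?:NNP )+")


def find_statements(text):
    """Return the token runs whose tag sequence matches the toy grammar."""
    tagged = tag(text.split())
    tag_string = "".join(name + " " for _, name in tagged)
    results = []
    for match in STATEMENT.finditer(tag_string):
        # Each tag is followed by one space, so counting spaces in the
        # prefix converts a character position into a token index.
        first = tag_string[:match.start()].count(" ")
        last = tag_string[:match.end()].count(" ")
        results.append(" ".join(token for token, _ in tagged[first:last]))
    return results


print(find_statements("Copyright (c) 2019 Acme Corp. and others"))
```

The same toy grammar also accepts a heading such as `Copyright Notice`, which is a false positive and a small hint of why the real grammar is large.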
The goal of this project is to refactor copyright detection for speed and simplicity. Possible approaches include implementing a new parser (PEG?, etc.), re-implementing core elements in Rust with a Python binding, using a fork of NLTK, or adopting any other tool that makes detection faster and more accurate.
This would also include keeping track of the line numbers and offsets where copyrights are found in a text or binary file.
This requires embracing the surprising complexity and ambiguity of parsing what look, on the surface, like simple copyright statements. Luckily, and to help you in your quest, we have over 3000 tests available and potentially 9M scans that could be used as a basis to drive novel data-driven approaches.
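The line and offset tracking mentioned above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not how ScanCode works: the `COPYRIGHT` regex is deliberately naive, and `Detection`, `line_starts`, `line_of`, and `detect` are hypothetical names. The point is only that precomputing where each line starts lets any match offset be mapped to an exact line number.

```python
import re
from bisect import bisect_right
from typing import NamedTuple


class Detection(NamedTuple):
    text: str          # matched statement, whitespace-normalized
    start_line: int    # 1-based
    end_line: int      # 1-based
    start_offset: int  # 0-based character offset into the whole text
    end_offset: int    # exclusive


# Deliberately naive pattern, for illustration only; "\s+" lets a
# statement continue onto the next line.
COPYRIGHT = re.compile(r"copyright\s+\(c\)\s+\d{4}\s+[^\n]+", re.IGNORECASE)


def line_starts(text):
    """Return the character offset at which each line begins."""
    starts = [0]
    for newline in re.finditer(r"\n", text):
        starts.append(newline.end())
    return starts


def line_of(starts, offset):
    """Map a 0-based character offset to a 1-based line number."""
    return bisect_right(starts, offset)


def detect(text):
    """Find naive copyright statements with exact lines and offsets."""
    starts = line_starts(text)
    detections = []
    for match in COPYRIGHT.finditer(text):
        detections.append(Detection(
            text=" ".join(match.group().split()),
            start_line=line_of(starts, match.start()),
            # match.end() is exclusive, so step back one character.
            end_line=line_of(starts, match.end() - 1),
            start_offset=match.start(),
            end_offset=match.end(),
        ))
    return detections


sample = "Some tool\nCopyright (c) 2020\n    Example Inc.\nAll rights reserved.\n"
for detection in detect(sample):
    print(detection)
```

For a binary file, the same idea would apply to byte offsets rather than character offsets, and line numbers may not be meaningful.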
- Level
  - Advanced
- Tech
  - Python, Rust, Go?
- Mentors
  - @JonoYang https://github.com/JonoYang