Releases · scribe-org/Scribe-Data

Scribe-Data now has the ability to download the most recent or a specific Wikidata lexemes dump (#517).
- The user is prompted to download a dump for calls for all data (#518).
- Scribe-Data must now use a lexeme dump to download all Wikidata lexeme data (#519).
- The total command can be run against a Wikidata lexeme dump (#520, #524).
- Translations can be parsed from Wikidata dumps (#525).
Wikidata SPARQL queries are now autogenerated and maintained via Wikidata dumps (#513).
- Forms are separated into files based on their identifiers while ignoring maintainer set queries (#575).
- Queries have been expanded for all languages and data forms based on the Wikidata dump process.
The date of last modification for Wikidata lexemes has been added to query and dump parsing outputs (#562).
Interactive mode now functions throughout the CLI functionality where the user is presented with options for data extraction.
There is now a top-level interactive mode command for accessing all Scribe-Data functionality (#523).
Repeat forms are combined with vertical bars ("|") as a separator (#544, #573).
A workflow has been created to update the emoji data on a regular basis (#542).
Resulting data can be filtered based on data contracts (#581).
- Contracts can be checked against data to ensure that they're valid given the data's field names (#561).
The Wikipedia based autosuggestion functionality is now CLI based instead of using a Jupyter notebook (#206).
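
As a rough illustration of the repeat-form convention noted above (#544, #573), the sketch below joins multiple values found for the same form with a vertical bar. The helper name `combine_repeat_forms` and the input shape (a list of form name and value pairs) are assumptions made for this example and are not taken from Scribe-Data's implementation.

```python
def combine_repeat_forms(forms):
    """Join repeat values for each form with "|" (hypothetical helper)."""
    combined = {}
    for form_name, value in forms:
        if form_name not in combined:
            combined[form_name] = value
        elif value not in combined[form_name].split("|"):
            # A second, distinct value for the same form is appended after a bar.
            combined[form_name] += "|" + value
    return combined


result = combine_repeat_forms(
    [("plural", "cacti"), ("plural", "cactuses"), ("singular", "cactus")]
)
print(result)
```

Consumers of the data can recover the individual values by splitting a field on "|".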

⚖️ Legal

SPDX license identifiers have been added for all files (#553).

🐞 Bug Fixes

The version command was fixed to account for cases where the version has a v before it (#534).
The functionality to check for current data and prompt its deletion was centralized, and messages to the user were made clearer (#336).
If Wikidata queries can't be completed, Scribe-Data now gives much clearer error messages and directs the user to commands that use Wikidata dumps (#549).
General bug fixes for a more fluid developer experience.
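
The version fix above (#534) concerns version strings that carry a leading "v". One way such a case could be handled is sketched below. The function `normalize_version` is hypothetical and is shown only to illustrate the idea, not how the CLI actually does it.

```python
def normalize_version(version):
    """Strip one leading "v" so that "v4.1.0" and "4.1.0" compare equal."""
    version = version.strip()
    if version[:1] in ("v", "V"):
        return version[1:]
    return version


# Both spellings of the same release normalize to the same string.
same_release = normalize_version("v4.1.0") == normalize_version("4.1.0")
print(same_release)
```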

✅ Tests

Tests have been written for all new functionalities (#570).
CI testing now includes a coverage check that breaks if coverage falls below a given percentage.

📝 Documentation

Documentation has been expanded for all functionalities of the CLI.

♻️ Code Refactoring

All numpydoc docstrings have been fixed and unneeded code has been removed (#547).


09 Dec 23:14

andrewtavis

4.1.0

fc0b3a8

Scribe-Data 4.1.0

✨ Features

Queries for noun genders and other properties that require the Wikidata label service now return their English label rather than the auto label, which was returning just the Wikidata QID.
SPARQL queries for English and Portuguese prepositions were added to allow the CLI to query these types of data.
The convert functionality once again works for lists of languages and all of their data types.

🐞 Bug Fixes

SQLite conversion was fixed for all queries (#527).
The outputs of the data conversion process were improved: language names are now capitalized and repeat notices to the user were removed.
The CLI's get command now returns all data types if none is passed.
The Portuguese verbs query was fixed as it wasn't formatted correctly.
The emoji keyword functionality was fixed given the new lexeme ID based form of the data.
- Arguments were fixed that were breaking the functionality.
- Languages for the user were capitalized.
case has been renamed grammaticalCase in preposition queries to ensure that SQLite reserved keywords are not used.
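
The rename above matters because CASE is a reserved keyword in SQLite, so an unquoted column with that name is rejected. The following minimal sketch uses Python's standard `sqlite3` module to show this. The table name `prepositions` and the sample row are assumptions for illustration only.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# "case" is a reserved SQLite keyword, so an unquoted column of that name fails.
try:
    connection.execute("CREATE TABLE prepositions (preposition TEXT, case TEXT)")
    reserved_name_failed = False
except sqlite3.OperationalError:
    reserved_name_failed = True

# The renamed column is accepted without any quoting.
connection.execute(
    "CREATE TABLE prepositions (preposition TEXT, grammaticalCase TEXT)"
)
connection.execute("INSERT INTO prepositions VALUES (?, ?)", ("mit", "dative"))
rows = connection.execute(
    "SELECT preposition, grammaticalCase FROM prepositions"
).fetchall()
connection.close()
```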


28 Nov 18:27

andrewtavis

4.0.0

4aa722f

Scribe-Data 4.0.0

✨ Features

Queries for countless data types for countless languages were expanded and added ❤️
Scribe-Data is now a fully functional CLI.
- Querying Wikidata lexicographical data can be done via the get command (#159).
- The output type of queries can be JSON, CSV, TSV or SQLite, and converting between output types is also possible (#145, #146).
- Output paths can be set for query results (#144).
- The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself (#186, #157).
- Total Wikidata lexemes for languages and data types can be derived with the total command (#147).
- Interactive and total commands can be used via an interactive mode with the --interactive argument (#158, #203).
- Outputs were standardized to ensure that the CLI experience is consistent.
The machine translation process has been removed to make way for the Wiktionary based implementation (#292).
Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
CLI commands have an argument check that can suggest correct languages and data types (#341).
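
To give a sense of how an argument check could suggest a correct value (#341), the sketch below uses `difflib.get_close_matches` from the Python standard library. This is a swapped-in technique chosen for illustration. The release notes do not say how Scribe-Data implements the check, and both `SUPPORTED_LANGUAGES` and `suggest_language` are hypothetical names.

```python
from difflib import get_close_matches

# Hypothetical list of valid values; the real CLI derives these from its metadata.
SUPPORTED_LANGUAGES = ["english", "french", "german", "portuguese", "spanish"]


def suggest_language(user_input, valid_values=SUPPORTED_LANGUAGES):
    """Return the closest valid value for a mistyped argument, or None."""
    matches = get_close_matches(user_input.lower(), valid_values, n=1, cutoff=0.6)
    return matches[0] if matches else None


suggestion = suggest_language("germn")
print(suggestion)
```

The `cutoff` value controls how close an input must be before a suggestion is offered, and a stricter cutoff avoids misleading suggestions for unrelated input.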

🐞 Bug Fixes

Wikidata query process stages no longer trigger the tqdm progress bar when they're unsuccessful (#155).

✅ Tests

Tests have been written for the CLI to ensure that its functionality remains consistent.
Workflows were created to check that the Wikidata queries and project structure are consistent, which ensures package functionality (#339, #357).
- The project's queries and structure have been updated to match the rules developed for the checks.

📝 Documentation

The CLI's functionality has been fully documented (#152, #208).
Documentation was created to show how to write Scribe-Data queries (#395).

♻️ Code Refactoring

word_type has been switched to data_type throughout the codebase (#160).
Case, gender and annotation utility functions were removed as the formatting process that used them has changed.
The SPARQLWrapper access method has been extracted to the Wikidata utils and is imported into the files that need it (#164).
Export data paths have been converted to centrally saved variables to reduce hard coded string repetition.
Many files were renamed, including update_data.py being renamed query_data.py.
Paths within the package have been updated to work for all operating systems via pathlib (#125).
The language formatting scripts have been dramatically simplified given changes to export paths all being the same.
The update_files directory was removed in preparation for other means of showing data totals.
The language_data_extraction directory was moved under the Wikidata directory as it's only used for those processes now (#446).
The emoji keyword process was centralized to simplify project maintenance (#359).
PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user (#196).
The data formatting step was centralized such that we only have one for all languages (#142).
Sub-query processes are no longer hard coded, so the total possible sub-queries no longer need to be maintained within the query_data.py process.
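
The pathlib change noted above (#125) can be pictured with the short sketch below, which builds a path from parts rather than from a hard-coded string with separators. The `English` folder and `nouns.json` file name are assumptions for this example. Only the `scribe_data_json_export` directory name comes from these notes.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Joining with "/" lets pathlib pick the right separator for the current system.
export_dir = Path("scribe_data_json_export")
nouns_file = export_dir / "English" / "nouns.json"

# The same parts rendered with the conventions of each platform.
parts = ("scribe_data_json_export", "English", "nouns.json")
posix_path = str(PurePosixPath(*parts))
windows_path = str(PureWindowsPath(*parts))

print(posix_path)
print(windows_path)
```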


09 Jun 12:51

andrewtavis

3.3.0

d79ff42

Scribe-Data 3.3.0

✨ Features

The translation process has been updated to allow for translations from non-English languages (#72, #73, #74, #75, #76, #77, #78, #79).

📝 Documentation

The documentation has been given a new layout with the logo in the top left (#90).
The documentation now has links to the code at the top of each page (#91).

🐞 Bug Fixes

Annotation bugs were removed like repeat or empty values.
Perfect tenses of Portuguese verbs were fixed via finding the appropriate PID (#68).
- Note that the most common past perfect property is not the standard one, so this will need to be fixed.

♻️ Code Refactoring

pre-commit hooks have been added to the repo to improve the development experience (#137).
Code formatting was shifted from black to Ruff.
A Ruff based GitHub workflow was added to check the code formatting and lint the codebase on each pull request (#109).
The _update_files directory was renamed update_files as these files are used in non-internal manners now (#57).
A common function has been created to map Wikidata ids to noun genders (#69).
The project is now installed locally for development and command line usage, so usages of sys.path have been removed from files (#122).
The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from other sources like Wiktionary (#139).
- Translation files are moved to their own directory.
- The extract_transform directory has been removed and all files within it have been moved one level up.
- The languages directory has been renamed language_data_extraction.
- All files within wikidata/_resources have been moved to the resources directory.
- The gender and case annotations for data formatting have now been commonly defined.
- All language directory formatted_data files have now been moved to the scribe_data_json_export directory to prepare for outputs being directed to a directory outside of the package.
- Path computing has been refactored throughout the codebase, and unneeded functions for data transfers have been removed.
