Releases: scribe-org/Scribe-Data

Scribe-Data 5.1.4

20 Aug 20:16
28d6764

🐞 Bug Fixes

  • Allow the convert parser to accept multiple data types (#634).

Scribe-Data 5.1.3

20 Aug 20:16
28d6764

🐞 Bug Fixes

  • Fixed data conversion not handling multiple explicitly passed languages and data types (#632).

Scribe-Data 5.1.2

18 Aug 16:33

🐞 Bug Fixes

  • Fixed data conversion not handling multiple explicitly passed languages (#630).

Scribe-Data 5.1.1

17 Aug 14:48

🐞 Bug Fixes

  • The path to the contracts was fixed in data filtration to ensure that it's a pathlib.Path value (#627).
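
As a rough illustration of the kind of fix described above, a path argument can be normalized to a pathlib.Path before it is used. The helper name `resolve_contracts_dir` and its parameter are hypothetical and are not taken from the Scribe-Data codebase.

```python
from pathlib import Path
from typing import Union


def resolve_contracts_dir(contracts_dir: Union[str, Path]) -> Path:
    """Return contracts_dir as a pathlib.Path (hypothetical helper)."""
    # Path() accepts both str and Path inputs, so callers may pass either.
    return Path(contracts_dir)


contracts_path = resolve_contracts_dir("data_contracts")
print(isinstance(contracts_path, Path))
```

Normalizing at the boundary means later code can rely on Path methods such as `/` joins and `.exists()` regardless of what the caller passed.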

✅ Tests

  • The upgrade functionality of the CLI is now comprehensively tested (#624).

♻️ Code Refactoring

  • The upgrade message instructs the user to use the built-in upgrade functionality.

Scribe-Data 5.1.0

28 Jun 15:43

✨ Features

  • The upgrade command now upgrades the package via pip rather than bringing down GitHub files and installing them directly.

Scribe-Data 5.0.1

♻️ Code Refactoring

  • The requirement files have been updated to fix package install errors (#621).

⬆️ Dependencies

  • Update minimum Python version to 3.11.

Scribe-Data 5.0.1

26 Jun 03:15
61c763b

♻️ Code Refactoring

  • The requirement files have been updated to fix package install errors (#621).

Scribe-Data 5.0.0

03 May 14:40

✨ Features

  • Scribe-Data now has the ability to download the most recent or a specific Wikidata lexemes dump (#517).
    • The user is prompted to download a dump for calls for all data (#518).
    • Scribe-Data must now use a lexeme dump to download all Wikidata lexeme data (#519).
    • The total command can be run against a Wikidata lexeme dump (#520, #524).
    • Translations can be parsed from Wikidata dumps (#525).
  • Wikidata SPARQL queries are now autogenerated and maintained via Wikidata dumps (#513).
    • Forms are separated into files based on their identifiers while ignoring maintainer set queries (#575).
    • Queries have been expanded for all languages and data forms based on the Wikidata dump process.
  • The date of last modification for Wikidata lexemes has been added to query and dump parsing outputs (#562).
  • Interactive mode now works throughout the CLI, presenting the user with options for data extraction.
  • There is now a top-level interactive mode command for accessing all Scribe-Data functionality (#523).
  • Repeat forms are combined with vertical bars ("|") as a separator (#544, #573).
  • A workflow has been created to update the emoji data on a regular basis (#542).
  • Resulting data can be filtered based on data contracts (#581).
    • Contracts can be checked against data to ensure that they're valid given the data's field names (#561).
  • The Wikipedia based autosuggestion functionality is now CLI based instead of using a Jupyter notebook (#206).
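
The contract items above (checking a contract against the data's field names, then filtering by it) can be sketched as follows. The contract format here (a plain list of field names), the sample records, and both function names are assumptions for illustration only; they do not reflect Scribe-Data's actual contract files or API.

```python
from typing import Dict, List

# Hypothetical data: one record per lexeme, keyed by field name.
records = [
    {"lexemeID": "L1", "singular": "dog", "plural": "dogs"},
    {"lexemeID": "L2", "singular": "cat", "plural": "cats"},
]

# Hypothetical contract: the field names a consumer expects to receive.
contract_fields = ["singular", "plural"]


def missing_fields(contract: List[str], data: List[Dict[str, str]]) -> List[str]:
    """Return the contract fields that do not appear in any record."""
    available = set()
    for record in data:
        available.update(record)
    return [field for field in contract if field not in available]


def filter_by_contract(
    contract: List[str], data: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Keep only the fields named in the contract for each record."""
    return [{key: rec[key] for key in contract if key in rec} for rec in data]


# Validate first, then filter only if the contract matches the data.
if not missing_fields(contract_fields, records):
    filtered = filter_by_contract(contract_fields, records)
    print(filtered)
```

Running the check before filtering surfaces a misspelled or outdated field name early, rather than silently producing records with missing keys.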

⚖️ Legal

  • SPDX license identifiers have been added for all files (#553).

🐞 Bug Fixes

  • The version command was fixed to account for cases where the version has a v before it (#534).
  • The functionality to check for current data and prompt its deletion was centralized, and messages to the user were made clearer (#336).
  • If Wikidata queries can't be completed, Scribe-Data now includes dramatically better error messages and directs the user to leverage commands that use Wikidata dumps (#549).
  • General bug fixes for a more fluid developer experience.
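
To illustrate the version-prefix fix listed above, a version string can be normalized before comparison so that "v5.0.0" and "5.0.0" are treated alike. This is a minimal sketch; `normalize_version` is a hypothetical name, not the function used in Scribe-Data.

```python
def normalize_version(version: str) -> str:
    """Strip surrounding whitespace and one leading 'v' or 'V', if present."""
    version = version.strip()
    if version[:1] in ("v", "V"):
        return version[1:]
    return version


# Both spellings compare equal once normalized.
print(normalize_version("v5.0.0") == normalize_version("5.0.0"))
```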

✅ Tests

  • Tests have been written for all new functionalities (#570).
  • CI testing now includes a coverage check that breaks if coverage falls below a given percentage.

📝 Documentation

  • Documentation has been expanded for all functionalities of the CLI.

♻️ Code Refactoring

  • All numpydoc docstrings have been fixed and unneeded code has been removed (#547).

Scribe-Data 4.1.0

09 Dec 23:14

✨ Features

  • Queries for noun genders and other properties that require the Wikidata label service now return their English label rather than the auto label, which was returning just the Wikidata QID.
  • SPARQL queries for English and Portuguese prepositions were added to allow the CLI to query these types of data.
  • The convert functionality once again works for lists of languages and all data types for them.

🐞 Bug Fixes

  • SQLite conversion was fixed for all queries (#527).
  • The outputs of the data conversion process were improved: language names are now capitalized and repeat notices to the user were removed.
  • The CLI's get command now returns all data types if none is passed.
  • The Portuguese verbs query was fixed as it wasn't formatted correctly.
  • The emoji keyword functionality was fixed given the new lexeme ID based form of the data.
    • Arguments were fixed that were breaking the functionality.
    • Languages for the user were capitalized.
  • case has been renamed grammaticalCase in preposition queries to ensure that SQLite reserved keywords are not used.
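
The reserved-keyword issue behind the rename above can be reproduced with Python's built-in sqlite3 module. The table name `prepositions` and the sample row are assumptions for illustration; only the column names `case` and `grammaticalCase` come from the note above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "case" is a reserved keyword in SQLite, so an unquoted column with that
# name is rejected by the SQL parser.
try:
    conn.execute("CREATE TABLE prepositions (preposition TEXT, case TEXT)")
    unquoted_case_allowed = True
except sqlite3.OperationalError:
    unquoted_case_allowed = False

# A non-reserved column name works without any quoting.
conn.execute("CREATE TABLE prepositions (preposition TEXT, grammaticalCase TEXT)")
conn.execute("INSERT INTO prepositions VALUES (?, ?)", ("mit", "dative"))
rows = conn.execute(
    "SELECT preposition, grammaticalCase FROM prepositions"
).fetchall()
conn.close()

print(unquoted_case_allowed, rows)
```

Quoting the identifier as "case" would also work, but a non-reserved name avoids having to quote it in every downstream query.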

Scribe-Data 4.0.0

28 Nov 18:27

✨ Features

  • Queries for countless data types for countless languages were expanded and added ❤️
  • Scribe-Data is now a fully functional CLI.
    • Querying Wikidata lexicographical data can be done via the get command (#159).
    • The output type of queries can be JSON, CSV, TSV or SQLite, with converting between output types also being possible (#145, #146).
    • Output paths can be set for query results (#144).
    • The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself (#186, #157).
    • Total Wikidata lexemes for languages and data types can be derived with the total command (#147).
    • Interactive and total commands can be used via an interactive mode with the --interactive argument (#158, #203).
    • Outputs were standardized to ensure that the CLI experience is consistent.
  • The machine translation process has been removed to make way for the Wiktionary based implementation (#292).
  • Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
  • CLI commands have an argument check that can suggest correct languages and data types (#341).
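
The argument check mentioned in the last item above suggests corrections for mistyped languages and data types. One way such a suggestion could be produced is with fuzzy matching from the standard library; this is a sketch under that assumption, and neither the use of difflib, the language list, nor the name `suggest_language` is confirmed by the release notes.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical list of supported languages, lowercased for matching.
SUPPORTED_LANGUAGES = ["english", "french", "german", "portuguese", "spanish"]


def suggest_language(user_input: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest supported language, or None if nothing is close."""
    matches = get_close_matches(
        user_input.lower(), SUPPORTED_LANGUAGES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


suggestion = suggest_language("Germn")
if suggestion is not None:
    print(f"Did you mean '{suggestion.capitalize()}'?")
```

The `cutoff` parameter controls how similar an input must be before a suggestion is offered; a higher value gives fewer but safer suggestions.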

🐞 Bug Fixes

  • Wikidata query process stages no longer trigger the tqdm progress bar when they're unsuccessful (#155).

✅ Tests

  • Tests have been written for the CLI to ensure that its functionality remains consistent.
  • Workflows were created to check that the Wikidata queries and project structure are consistent, ensuring package functionality (#339, #357).
    • The project's queries and structure have been updated to match the rules developed for the checks.

📝 Documentation

  • The CLI's functionality has been fully documented (#152, #208).
  • Documentation was created to show how to write Scribe-Data queries (#395).

♻️ Code Refactoring

  • word_type has been switched to data_type throughout the codebase (#160).
  • Case, gender and annotation utility functions were removed as the formatting process that used them has changed.
  • The SPARQLWrapper access method has been extracted to the Wikidata utils and is imported into the files that need it (#164).
  • Export data paths have been converted to centrally saved variables to reduce hard coded string repetition.
  • Many files were renamed, including update_data.py being renamed to query_data.py.
  • Paths within the package have been updated to work for all operating systems via pathlib (#125).
  • The language formatting scripts have been dramatically simplified given changes to export paths all being the same.
  • The update_files directory was removed in preparation for other means of showing data totals.
  • The language_data_extraction directory was moved under the Wikidata directory as it's only used for those processes now (#446).
  • The emoji keyword process was centralized to simplify project maintenance (#359).
  • PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user (#196).
  • The data formatting step was centralized such that we only have one for all languages (#142).
  • Sub-query processes are no longer hard coded, so the total possible sub-queries no longer need to be maintained within the query_data.py process.

Scribe-Data 3.3.0

09 Jun 12:51

✨ Features

  • The translation process has been updated to allow for translations from non-English languages (#72, #73, #74, #75, #76, #77, #78, #79).

📝 Documentation

  • The documentation has been given a new layout with the logo in the top left (#90).
  • The documentation now has links to the code at the top of each page (#91).

🐞 Bug Fixes

  • Annotation bugs such as repeat or empty values were removed.
  • Perfect tenses of Portuguese verbs were fixed via finding the appropriate PID (#68).
    • Note that the most common past perfect property is not the standard one, so this will need to be fixed.

♻️ Code Refactoring

  • pre-commit has been added to the repo to improve the development experience (#137).
  • Code formatting was shifted from black to Ruff.
  • A Ruff based GitHub workflow was added to check the code formatting and lint the codebase on each pull request (#109).
  • The _update_files directory was renamed update_files as these files are used in non-internal manners now (#57).
  • A common function has been created to map Wikidata ids to noun genders (#69).
  • The project is now installed locally for development and command line usage, so usages of sys.path have been removed from files (#122).
  • The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from other sources like Wiktionary (#139).
    • Translation files are moved to their own directory.
    • The extract_transform directory has been removed and all files within it have been moved one level up.
    • The languages directory has been renamed language_data_extraction.
    • All files within wikidata/_resources have been moved to the resources directory.
    • The gender and case annotations for data formatting have now been commonly defined.
    • All language directory formatted_data files have now been moved to the scribe_data_json_export directory to prepare for outputs being required to be directed to a directory outside of the package.
    • Path computing has been refactored throughout the codebase, and unneeded functions for data transfers have been removed.