Releases: keras-team/keras-hub
v0.4.0.dev0
The KerasNLP 0.4 release adds support for pretrained models via the `keras_nlp.models` API. If you encounter any problems or have questions, please open an issue or start a thread on the discussions tab!
Breaking Changes
- Renamed `keras_nlp.layers.MLMHead` -> `keras_nlp.layers.MaskedLMHead`.
- Renamed `keras_nlp.layers.MLMMaskGenerator` -> `keras_nlp.layers.MaskedLMMaskGenerator`.
- Renamed `keras_nlp.layers.UnicodeCharacterTokenizer` -> `keras_nlp.layers.UnicodeCodepointTokenizer`.
- Switched the default of `lowercase` in `keras_nlp.tokenizers.WordPieceTokenizer` from `True` to `False`.
- Renamed the token id output of `MaskedLMMaskGenerator` from `"tokens"` to `"tokens_ids"`.
Summary
- Added the `keras_nlp.models` API.
  - Adds support for BERT, DistilBERT, RoBERTa, and XLM-RoBERTa models and pretrained checkpoints.
- Added new metrics: `keras_nlp.metrics.Bleu` and `keras_nlp.metrics.EditDistance`.
- Added new vocabulary training utilities: `keras_nlp.tokenizers.compute_word_piece_vocabulary` and `keras_nlp.tokenizers.compute_sentence_piece_proto`.
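The new `keras_nlp.metrics.EditDistance` metric is based on the edit (Levenshtein) distance between token sequences. As a rough illustration of the underlying computation, here is a plain-Python dynamic-programming sketch; this is not the library's implementation, just the classic algorithm it is named after:

```python
def edit_distance(a, b):
    # Levenshtein distance between two token sequences, computed
    # row by row: prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]
```

The same recurrence works whether the elements are characters or word tokens, which is why edit distance is a natural metric for comparing tokenized model output against references.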
What's Changed
- Add Edit Distance Metric by @abheesht17 in #231
- Minor fix to simplify and test handling of max_length prompts by @jbischof in #258
- Remove split regex args for WordPieceTokenizer by @mattdangerw in #255
- Add instructions on installing the latest changes by @mattdangerw in #261
- Add warning when k > vocab_size in top_k_search by @jbischof in #260
- Fix keras library imports and usage by @jbischof in #262
- Add BLEU Score by @abheesht17 in #222
- Configure GKE-based accelerator testing by @chenmoneygithub in #265
- Added WordPieceTokenizer training function by @jessechancy in #256
- Add requirements.txt for cloud build by @chenmoneygithub in #267
- Global Seed Bug Fix by @jessechancy in #269
- Update accelerator testing to use the new GCP project by @chenmoneygithub in #272
- Fixed typo: "recieved" by @ehrencrona in #273
- Reuse dense pooled output for fine tuning by @mattdangerw in #251
- Simplify BERT modeling, use keras embeddings by @mattdangerw in #253
- Rename UnicodeCharacterTokenizer -> UnicodeCodepointTokenizer by @mattdangerw in #254
- Add README for accelerator testing config folder by @chenmoneygithub in #276
- Random Deletion Layer by @aflah02 in #214
- Made trainer more efficient. Loading full files instead of using TextLineDataset. by @jessechancy in #280
- Use KerasNLP for BERT preprocessing for GLUE by @mattdangerw in #252
- Minor fixes to the Random Deletion Layer by @aflah02 in #286
- Fixes for WordPieceTrainer by @aflah02 in #293
- Update default to strip_accents=False by @jessechancy in #289
- Move Bert to models folder by @jbischof in #288
- Make Decoding Functions Graph-compatible (with XLA Support!) by @abheesht17 in #271
- SentencePieceTrainer by @aflah02 in #281
- Rename `models.Bert()` to `models.BertCustom()` by @jbischof in #310
- Add a test for variable sequence length inputs by @mattdangerw in #313
- Support checkpoint loading for `BertBase` by @jbischof in #299
- RoBERTa pretrained model forward pass by @jessechancy in #304
- Register objects as serializable by @mattdangerw in #292
- Style merging for Bert and Roberta by @jbischof in #315
- Streamline and speed up tests by @jbischof in #324
- Add Support for CJK Char Splitting for WordPiece Tokenizer by @abheesht17 in #318
- Clean up model input names for consistency by @mattdangerw in #327
- Return a single tensor from roberta by @mattdangerw in #328
- BERT, RoBERTa: Add `model.compile` UTs by @abheesht17 in #330
- Continue rename of bert model inputs by @mattdangerw in #329
- Text Generation Utilities: Add Support for Ragged Inputs by @abheesht17 in #300
- `bert_base_zh`, `bert_base_multi_cased`: Add BERT Base Variants by @abheesht17 in #319
- WordPiece vocabularies trainer on Wikipedia dataset by @jessechancy in #316
- Use the exported ragged ops for RandomDeletion by @mattdangerw in #332
- Random Swap Layer by @aflah02 in #224
- Fixes for Random Deletion Layer by @aflah02 in #339
- Move cloudbuild to a hidden directory by @mattdangerw in #345
- Fix the build by @mattdangerw in #349
- Migrating from Datasets to TFDS for GLUE Example by @aflah02 in #340
- Move network_tests into keras_nlp/ by @mattdangerw in #344
- Stop hardcoding 2.9 by @mattdangerw in #351
- Add BERT Large by @abheesht17 in #331
- Add normalize_first arg to Transformer Layers by @abheesht17 in #350
- Add Small BERT Variants by @abheesht17 in #338
- Beam Search: Add Ragged and XLA Support by @abheesht17 in #341
- Fix download paths for bert weights by @mattdangerw in #356
- Add a BertPreprocessor class by @mattdangerw in #343
- Text Generation Functions: Add Benchmark Script by @abheesht17 in #342
- Improve readability for encoder/decoder blocks by @mattdangerw in #353
- Add GPT-2 Model and its Variants by @abheesht17 in #354
- Clean up BERT, RoBERTa doc-strings by @abheesht17 in #359
- Create unique string id for each BERT backbone by @jbischof in #361
- Use model.fit() for BERT Example by @abheesht17 in #360
- Minor Fixes in BertPreprocessor Layer by @abheesht17 in #373
- Clone user passed initializers called multiple times by @mattdangerw in #371
- Update BERT model file structure by @mattdangerw in #376
- Move gpt model code into a directory by @mattdangerw in #379
- Move roberta model code into a directory by @mattdangerw in #380
- Reorg test directories by @mattdangerw in #384
- Add XLM-RoBERTa by @abheesht17 in #372
- Add DistilBERT by @abheesht17 in #382
- Stop running CI on Windows by @mattdangerw in #386
- Fix Bert serialization by @mattdangerw in #385
- Improve MacOS support and pin tensorflow version during testing by @mattdangerw in #383
- Unify BERT model API in one class by @jbischof in #387
- Add `from_preset` constructor to `BertPreprocessor` by @jbischof in #390
- More robustly test BERT preprocessing by @mattdangerw in #394
- Move `name` and `trainable` to `kwargs` by @jbischof in #399
- Add `backbone` as property for task models by @jbischof in #398
- Set default name of `Bert` instance to `"backbone"` by @jbischof in #397
- Fix gpt2 serialization by @mattdangerw in #391
- Fix distilbert serialization by @mattdangerw in #392
- Fix roberta and xlm-roberta serialization by @mattdangerw in https:...
v0.3.1
Summary
- Add `keras_nlp.tokenizers.BytePairTokenizer` with `tf.data` friendly support for the tokenization used by GPT-2, RoBERTa, and other models.
- Remove the hard dependency on `tensorflow` and `tensorflow-text` when pip installing on MacOS, to accommodate M1 chips. See this section of our contributor guide for more information on MacOS development.
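Byte-pair tokenizers like the new `BytePairTokenizer` rely on a vocabulary built by repeatedly merging the most frequent adjacent symbol pair in a corpus. As a conceptual sketch only (plain Python, not the library's `tf.data`-compatible implementation), one merge step looks like this:

```python
from collections import Counter

def bpe_merge_step(words):
    # words: dict mapping a tuple of symbols to its corpus frequency.
    # Count all adjacent symbol pairs, then merge the most frequent one.
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best
```

Running this step repeatedly yields the merge table that a BPE tokenizer applies at inference time; GPT-2 and RoBERTa operate on bytes rather than characters, but the merging logic is the same.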
What's Changed
- Cherry picks 0.3 by @mattdangerw in #454
- Bump version for 0.3.1 pre release by @mattdangerw in #456
- Remove dev prefix for 0.3.1 release by @mattdangerw in #457
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Summary
- Added `keras_nlp.tokenizers.SentencePieceTokenizer`.
- Added two token packing layers: `keras_nlp.layers.StartEndPacker` and `keras_nlp.layers.MultiSegmentPacker`.
- Added two metrics, `keras_nlp.metrics.RougeL` and `keras_nlp.metrics.RougeN`, based on the `rouge-score` package.
- Added five utilities for generating sequences: `keras_nlp.utils.greedy_search`, `keras_nlp.utils.random_search`, `keras_nlp.utils.top_k_search`, `keras_nlp.utils.top_p_search`, and `keras_nlp.utils.beam_search`.
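The generation utilities all share the same shape of loop: feed the sequence so far to the model, choose the next token by some rule, and stop at an end token or length limit. A minimal plain-Python sketch of the greedy variant (the `toy_model` and its probability table below are made-up stand-ins for a real model call, not part of the library):

```python
def greedy_search(next_token_probs, prompt, end_token, max_length):
    # Repeatedly pick the highest-probability next token until the
    # end token appears or the length limit is reached.
    # next_token_probs maps a token list to a {token: probability} dict.
    tokens = list(prompt)
    while len(tokens) < max_length:
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        tokens.append(best)
        if best == end_token:
            break
    return tokens

# A toy "model": after "a" predict "b", after "b" predict "[end]".
table = {"a": {"b": 0.9, "c": 0.1}, "b": {"[end]": 0.8, "c": 0.2}}
def toy_model(tokens):
    return table.get(tokens[-1], {"[end]": 1.0})
```

The other utilities vary only the selection rule: `random_search` samples from the full distribution, `top_k_search` and `top_p_search` sample from a truncated distribution, and `beam_search` tracks several candidate sequences at once.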
What's Changed
- Greedy text generation util by @chenmoneygithub in #154
- Remove incorrect embedding size limit by @mattdangerw in #195
- Fix inits for bert heads by @mattdangerw in #192
- Add keras.io links to README by @mattdangerw in #196
- Minor Corrections In ROADMAP.md by @saiteja13427 in #200
- Fix Loose Dependency Imports by @abheesht17 in #199
- Reorganize examples by @mattdangerw in #179
- Remove bert config arguments from README by @mattdangerw in #205
- Add checkpoints to BERT training by @chenmoneygithub in #184
- Run keras tuner from a temp directory by @mattdangerw in #202
- Token and position embedding minor fixes by @mattdangerw in #203
- Correct typo in WordPieceTokenizer by @abheesht17 in #208
- Add TPU support to BERT example by @chenmoneygithub in #207
- Remove type annotations for complex types by @mattdangerw in #194
- Issue 182: Modified TransformerDecoder with optional parameter by @jessechancy in #217
- Add StartEndPacker layer by @abheesht17 in #221
- Add a layer for packing inputs for BERT-likes by @mattdangerw in #88
- Ignore UserWarning to fix nightly testing breakage by @chenmoneygithub in #227
- Add ROUGE Metric by @abheesht17 in #122
- Allow long lines for links in docstrings by @mattdangerw in #229
- Random Sampling Util for Text Generation by @jessechancy in #228
- added top k search util by @jessechancy in #232
- top p search and testing by @jessechancy in #233
- Add a SentencePiece tokenizer by @mattdangerw in #218
- Add cloud training support for BERT example by @chenmoneygithub in #226
- Bump version to 0.3.0 for upcoming release by @mattdangerw in #239
- Add support for StartEndPacker packing 2D tensor by @jessechancy in #240
- Fixed Bug with Unicode Tokenizer Vocab Size by @aflah02 in #243
- Fixed Import for top_p_search util by @aflah02 in #245
- MultiSegmentPacker support for 2D dense tensor by @jessechancy in #244
- Minor fixes for multi-segment packer by @mattdangerw in #246
- Add beam search decoding util by @jessechancy in #237
New Contributors
- @saiteja13427 made their first contribution in #200
- @jessechancy made their first contribution in #217
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Summary
- Documentation live on keras.io.
- Added two tokenizers: `ByteTokenizer` and `UnicodeCharacterTokenizer`.
- Added a `Perplexity` metric.
- Added three layers: `TokenAndPositionEmbedding`, `MLMMaskGenerator`, and `MLMHead`.
- Contributing guides and roadmap.
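The `Perplexity` metric is defined as the exponential of the mean negative log-likelihood the model assigns to the target tokens. A self-contained sketch of that formula (plain Python over per-token probabilities, not the library's batched Keras metric):

```python
import math

def perplexity(token_probs):
    # perplexity = exp(-1/N * sum(log p_i)) over the probabilities
    # the model assigned to the N target tokens.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

Intuitively, a model that assigns uniform probability 1/k to every target token has perplexity exactly k, so lower perplexity means the model is less "surprised" by the data.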
What's Changed
- Add Byte Tokenizer by @abheesht17 in #80
- Fixing rank 1 outputs for WordPieceTokenizer by @aflah02 in #92
- Add tokenizer accessors to the base class by @mattdangerw in #89
- Fix word piece attributes by @mattdangerw in #97
- Small fix: change assertEquals to assertEqual by @chenmoneygithub in #103
- Added a Learning Rate Schedule for the BERT Example by @Stealth-py in #96
- Add Perplexity Metric by @abheesht17 in #68
- Use the black profile for isort by @mattdangerw in #117
- Update README with release information by @mattdangerw in #118
- Add a class to generate LM masks by @chenmoneygithub in #61
- Add docstring testing by @mattdangerw in #116
- Fix broken docstring in MLMMaskGenerator by @chenmoneygithub in #121
- Adding a UnicodeCharacterTokenizer by @aflah02 in #100
- Added TokenAndPositionEmbedding Class by @adhadse in #91
- Fix bert example so it is runnable by @mattdangerw in #123
- Fix the issue that MLMMaskGenerator does not work in graph mode by @chenmoneygithub in #131
- Actually use layer norm epsilon in encoder/decoder by @mattdangerw in #133
- Whitelisted formatting and lint check targets by @adhadse in #126
- Updated CONTRIBUTING.md for setup of venv and standard pip install by @adhadse in #127
- Fix mask propagation of transformer layers by @chenmoneygithub in #139
- Fix masking for TokenAndPositionEmbedding by @mattdangerw in #140
- Fixed no oov token error in vocab for WordPieceTokenizer by @adhadse in #136
- Add a MLMHead layer by @mattdangerw in #132
- Bump version for 0.2.0 dev release by @mattdangerw in #142
- Added WSL setup text to CONTRIBUTING.md by @adhadse in #144
- Add attribution for the BERT modeling code by @mattdangerw in #151
- Remove preprocessing subdir by @mattdangerw in #150
- Word piece arg change by @mattdangerw in #148
- Rename max_length to sequence_length by @mattdangerw in #149
- Don't accept a string dtype for unicode tokenizer by @mattdangerw in #147
- Adding Utility to Detokenize as list of Strings to Tokenizer Base Class by @aflah02 in #124
- Fixed Import Error by @aflah02 in #161
- Added KerasTuner Hyper-Parameter Search for the BERT fine-tuning script. by @Stealth-py in #143
- Docstring updates for upcoming doc publish by @mattdangerw in #146
- version bump for 0.2.0.dev2 pre-release by @mattdangerw in #165
- Added a vocabulary_size argument to UnicodeCharacterTokenizer by @aflah02 in #163
- Simplified utility to preview a tfrecord by @mattdangerw in #168
- Update BERT example's README with data downloading instructions by @chenmoneygithub in #169
- Add a call to repeat during pretraining by @mattdangerw in #172
- Add an integration test matching our quick start by @mattdangerw in #162
- Modify README of bert example by @chenmoneygithub in #174
- Fix the finetuning script's loss and metric config by @chenmoneygithub in #176
- Minor improvements to the position embedding docs by @mattdangerw in #180
- Update docs for upcoming 0.2.0 release by @mattdangerw in #158
- Restore accidentally deleted line from README by @mattdangerw in #185
- Bump version for 0.2.0 release by @mattdangerw in #186
- Pre release fix by @mattdangerw in #187
New Contributors
- @Stealth-py made their first contribution in #96
- @adhadse made their first contribution in #91
Full Changelog: v0.1.1...v0.2.0
v0.2.0.dev2
What's Changed
- Added WSL setup text to CONTRIBUTING.md by @adhadse in #144
- Add attribution for the BERT modeling code by @mattdangerw in #151
- Remove preprocessing subdir by @mattdangerw in #150
- Word piece arg change by @mattdangerw in #148
- Rename max_length to sequence_length by @mattdangerw in #149
- Don't accept a string dtype for unicode tokenizer by @mattdangerw in #147
- Adding Utility to Detokenize as list of Strings to Tokenizer Base Class by @aflah02 in #124
- Fixed Import Error by @aflah02 in #161
- Added KerasTuner Hyper-Parameter Search for the BERT fine-tuning script. by @Stealth-py in #143
- Docstring updates for upcoming doc publish by @mattdangerw in #146
- version bump for 0.2.0.dev2 pre-release by @mattdangerw in #165
Full Changelog: v0.2.0-dev.1...v0.2.0.dev2
v0.2.0-dev.1
What's Changed
- Add Byte Tokenizer by @abheesht17 in #80
- Fixing rank 1 outputs for WordPieceTokenizer by @aflah02 in #92
- Add tokenizer accessors to the base class by @mattdangerw in #89
- Fix word piece attributes by @mattdangerw in #97
- Small fix: change assertEquals to assertEqual by @chenmoneygithub in #103
- Added a Learning Rate Schedule for the BERT Example by @Stealth-py in #96
- Add Perplexity Metric by @abheesht17 in #68
- Use the black profile for isort by @mattdangerw in #117
- Update README with release information by @mattdangerw in #118
- Add a class to generate LM masks by @chenmoneygithub in #61
- Add docstring testing by @mattdangerw in #116
- Fix broken docstring in MLMMaskGenerator by @chenmoneygithub in #121
- Adding a UnicodeCharacterTokenizer by @aflah02 in #100
- Added TokenAndPositionEmbedding Class by @adhadse in #91
- Fix bert example so it is runnable by @mattdangerw in #123
- Fix the issue that MLMMaskGenerator does not work in graph mode by @chenmoneygithub in #131
- Actually use layer norm epsilon in encoder/decoder by @mattdangerw in #133
- Whitelisted formatting and lint check targets by @adhadse in #126
- Updated CONTRIBUTING.md for setup of venv and standard pip install by @adhadse in #127
- Fix mask propagation of transformer layers by @chenmoneygithub in #139
- Fix masking for TokenAndPositionEmbedding by @mattdangerw in #140
- Fixed no oov token error in vocab for WordPieceTokenizer by @adhadse in #136
- Add a MLMHead layer by @mattdangerw in #132
- Bump version for 0.2.0 dev release by @mattdangerw in #142
New Contributors
- @Stealth-py made their first contribution in #96
- @adhadse made their first contribution in #91
Full Changelog: v0.1.1...v0.2.0-dev.1
v0.1.1
What's Changed
- Add tokenizer helper to convert tokens to ids by @mattdangerw in #75
- Add a sinusoidal embedding layer by @amantayal44 in #59
- Add a learned positional embedding layer by @hertschuh in #47
- Fix typo in position embedding docstring by @mattdangerw in #86
- Bump version number to 0.1.1 by @mattdangerw in #90
New Contributors
- @amantayal44 made their first contribution in #59
- @hertschuh made their first contribution in #47
Full Changelog: v0.1.0...v0.1.1
v0.1.0
Initial release of keras-nlp with word piece tokenizer and transformer encoder/decoder blocks.
This is a v0 release, with no API compatibility guarantees.