Introducing spaCy v2.2


by Matthew Honnibal and Ines Montani

Version 2.2 of the spaCy Natural Language Processing library is leaner, cleaner and even more user-friendly. In addition to new model packages and features for training, evaluation and serialization, we’ve made lots of bug fixes, improved debugging and error handling, and greatly reduced the size of the library on disk.

While we’re grateful to the whole spaCy community for their patches and support, Explosion has been lucky to welcome two new team members who deserve special credit for the recent rapid improvements: Sofie Van Landeghem and Adriane Boyd have been working on spaCy full-time. This brings the core team up to four developers – so you can look forward to a lot more to come.

New models and data augmentation

spaCy v2.2 comes with retrained statistical models that include bug fixes and improved performance over lower-cased texts. Like other statistical models, spaCy’s models can be sensitive to differences between the training data and the data you’re working with. One type of difference we’ve had a lot of trouble with is casing and formality: most of the training data we have is text that is fairly well edited, which has meant lower accuracy on texts which have inconsistent casing and punctuation.

To address this, we’ve begun developing a new data augmentation system. The first feature we’ve introduced in the v2.2 models is a word replacement system that also supports paired punctuation marks, such as quote characters. During training, replacement dictionaries can be provided, with replacements made in a random subset of sentences each epoch. Here’s an example of the type of problem this change can help with. The German NER model is trained on a treebank that uses `` as its open-quote symbol. When Wolfgang Seeker developed spaCy’s German support, he used a preprocessing script that replaced some of those quotes with unicode or ASCII quotation marks. However, one-off preprocessing steps like that are easy to lose track of – eventually leading to a bug in the v2.1 German model. It’s much better to make those replacements during training, which is just what the new system allows you to do.

If you’re using the spacy train command, the new data augmentation strategy can be enabled with the new --orth-variant-level parameter. We’ve set it to 0.3 by default, which means that 30% of the occurrences of some tokens are subject to replacement during training. Additionally, if an input is randomly selected for orthographic replacement, it has a 50% chance of also being forced to lower-case. We’re still experimenting with this policy, but we’re hoping it leads to models that are more robust to case variation. Let us know how you find it! More APIs for data augmentation will be developed in future, especially as we get more evaluation metrics for these strategies into place.
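For example, a training command along these lines should enable the augmentation explicitly at the default level (the language code and paths below are placeholders for your own setup):

python -m spacy train de /output /train /dev --orth-variant-level 0.3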

We’re also pleased to introduce pretrained models for two additional languages: Norwegian and Lithuanian. Accuracy on both of these languages should improve in subsequent releases, as the current models make use of neither pretrained word vectors nor the spacy pretrain command. The addition of these languages has been made possible by the awesome work of the spaCy community, especially TokenMill for the Lithuanian model, and the University of Oslo Language Technology Group for Norwegian. We’ve been adopting a cautious approach to adding new language models, as we want to make sure that once a model is added, we can continue to support it in each subsequent version of spaCy. That means we have to be able to train all of the language models ourselves, because subsequent versions of spaCy won’t necessarily be compatible with the previous suite of models. With steady improvements to our automation systems and new team members joining spaCy, we look forward to adding more models.

Why not more models? The Universal Dependencies corpora make it reasonably easy to distribute models for a much wider variety of languages. However, most UD-trained models aren’t that useful for practical work. The UD corpora tend to be small, CC BY-NC licensed, and they tend not to provide NER annotations. To avoid breaking backwards compatibility, we’re trying to only roll out new languages once we have models that are a bit more ready for use.
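To try the new languages, the usual download-and-load workflow applies; the sketch below assumes the small Norwegian package name nb_core_news_sm (the Lithuanian lt_core_news_sm works the same way):

python -m spacy download nb_core_news_sm

import spacy

# Load the small Norwegian model and inspect its part-of-speech tags
nlp = spacy.load("nb_core_news_sm")
doc = nlp("Dette er en setning.")
print([(token.text, token.pos_) for token in doc])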

Better Dutch NER with 20 categories

Our friends at NLP Town have been making some great contributions to spaCy’s Dutch support. For v2.2, they’ve gone even further, and annotated a new dataset that should make the pretrained Dutch NER model much more useful. The new dataset provides OntoNotes 5 annotations over the LaSSy corpus. This allows us to replace the semi-automatic Wikipedia NER model with one trained on gold-standard entities of 20 categories. You can see the updated results in our new and improved models directory, which now shows more detail about the different models, including the label scheme. At first glance the new model might look worse, if you only look at the evaluation figures. However, the previous evaluation was conducted on the semi-automatically created Wikipedia data, which makes it much easier for the model to achieve high scores. The accuracy of the model should improve further when we add pretrained word vectors and when we wire in support for the spacy pretrain command into our model training pipeline.
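As a quick sketch, assuming the small Dutch package nl_core_news_sm is installed, you can inspect the NER label scheme and the predicted entities directly:

import spacy

nlp = spacy.load("nl_core_news_sm")
ner = nlp.get_pipe("ner")
print(ner.labels)  # the entity types the Dutch NER model predicts

doc = nlp("Google werd opgericht in september 1998 door Larry Page en Sergey Brin.")
print([(ent.text, ent.label_) for ent in doc.ents])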

spaCy models directory screenshots
The spaCy models directory and an example of the label scheme shown for the English models

New CLI features for training

spaCy v2.2 includes several usability improvements to the training and data development workflow, especially for text categorization. We’ve improved error messages, updated the documentation, and made the evaluation metrics more detailed – for example, the evaluation now provides per-entity-type and per-text-category accuracy statistics by default. One of the most useful improvements is integrated support for the text categorizer in the spacy train command line interface. You can now write commands like the following, just as you would when training the parser, entity recognizer or tagger:

python -m spacy train en /output /train /dev --pipeline textcat --textcat-arch simple_cnn --textcat-multilabel

You can read more about the data format required in the API docs. To make training even easier, we’ve also introduced a new debug-data command, to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more. Checking your data before training should be a huge time-saver, as it’s never fun to hit an error after hours of training.
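A minimal invocation looks something like this (the language code and paths are placeholders for your own converted training and development data):

python -m spacy debug-data en /train /dev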

Screenshot of debug-data output
Example output of the debug-data command

Smaller disk footprint, better language resource handling

As spaCy has supported more languages, the disk footprint has crept steadily upwards, especially when support was added for lookup-based lemmatization tables. These tables were stored as Python files, and in some cases became quite large. We’ve switched these lookup tables over to gzipped JSON and moved them out to a separate package, spacy-lookups-data, that can be installed alongside spaCy if needed. Depending on your system, your spaCy installation should now be 5–10× smaller.

pip install -U spacy[lookups]

When do I need the lookups package? Pretrained models already include their data files, so you only need to install the lookups data if you want to use lemmatization for languages that don’t yet come with a pretrained model and aren’t powered by third-party libraries, or if you want to create blank models using spacy.blank and want them to include lemmatization rules and lookup tables.
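As a small sketch of the second case: with spacy-lookups-data installed, a blank pipeline created via spacy.blank picks up the lookup tables, so lookup-based lemmas should be available even without a part-of-speech tagger in the pipeline:

import spacy

# Assumes spacy-lookups-data is installed alongside spaCy
nlp = spacy.blank("en")
doc = nlp("the cats were running")
print([token.lemma_ for token in doc])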

Under the hood, large language resources are now powered by a consistent Lookups API that you can also take advantage of when writing custom components. Custom components often need lookup tables that are available to the Doc, Token or Span objects. The natural place for this is in the shared Vocab – that’s exactly the sort of thing the Vocab object is for. Now custom components can place data there too, using the new lookups API.
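Here’s a rough sketch of what that can look like in a custom component. The table name, data and extension attribute are made up for the example; the relevant part is storing a table on the shared vocab with add_table and fetching it again with get_table:

import spacy
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")

# Store a custom table in the shared vocab (name and data are hypothetical)
nlp.vocab.lookups.add_table("my_sentiment", {"great": 1.0, "awful": -1.0})

Token.set_extension("sentiment_score", default=0.0)

def sentiment_component(doc):
    # Retrieve the shared table from the vocab inside the component
    table = doc.vocab.lookups.get_table("my_sentiment")
    for token in doc:
        token._.sentiment_score = table.get(token.lower_, 0.0)
    return doc

nlp.add_pipe(sentiment_component, last=True)
doc = nlp("This library is great")
print([(token.text, token._.sentiment_score) for token in doc])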

DocBin for efficient serialization

Efficient serialization is very important for large-scale text processing. For many use cases, a good approach is to serialize a spaCy Doc object as a numpy array, using the Doc.to_array method. This lets you select the subset of attributes you care about, making serialization very quick. However, this approach does lose some information. Notably, all of the strings are represented as 64-bit hash values, so you’ll need to make sure that the strings are available in your other process when you go to deserialize the Doc.
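A minimal sketch of that approach (the attribute selection here is arbitrary): export the attributes you need with Doc.to_array, then rebuild the Doc later from the words plus the array:

import spacy
from spacy.attrs import LEMMA, POS, ENT_IOB, ENT_TYPE
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup")

attrs = [LEMMA, POS, ENT_IOB, ENT_TYPE]
array = doc.to_array(attrs)

# To rebuild the Doc, the words (and any other strings) must be
# available again, e.g. via a shared vocab in the receiving process
words = [token.text for token in doc]
new_doc = Doc(nlp.vocab, words=words)
new_doc.from_array(attrs, array)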

The new DocBin class helps you efficiently serialize and deserialize a collection of Doc objects, taking care of lots of details for you automatically. The class should be especially helpful if you’re working with a multiprocessing library like Dask. Here’s a basic usage example:

import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))

Internally, the DocBin converts each Doc object to a numpy array, and maintains the set of strings needed for all of the Doc objects it’s managing. This means the storage will be more efficient per document the more documents you add – because you get to share the strings more efficiently. The serialization format itself is gzipped msgpack, which should make it easy to extend the format in future without breaking backwards compatibility.

10× faster phrase matching

spaCy’s PhraseMatcher class gives you an efficient way to perform exact-match search with a potentially huge number of queries. It was designed for use cases like finding all mentions of entities in Wikipedia, or all drug or protein names from a large terminology list. The algorithm the PhraseMatcher used was a bit quirky: it exploited the fact that spaCy’s Token objects point to Lexeme structs that are shared across all instances. Words were marked as possibly beginning, within, or ending at least one query, and then the Matcher object was used to search over these abstract tags, with a final step filtering out the potential mismatches.
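For reference, basic usage is unchanged in v2.2 – something along these lines (the queries are made up for the example) simply runs faster:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# Queries are just Doc objects; make_doc avoids running the full pipeline
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("ENTITIES", None, *patterns)

doc = nlp("Angela Merkel met Barack Obama in Washington, D.C. last week.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)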

The key benefit of the previous PhraseMatcher algorithm is how well it scales to large query sets. However, it wasn’t necessarily that fast when fewer queries were used – making its performance characteristics a bit unintuitive – especially since the algorithm is non-standard, and relies on spaCy implementation details. Finally, its reliance on these details has introduced a number of maintenance problems as the library has evolved, leading to some subtle bugs that caused some queries to fail to match. To fix these problems, v2.2 replaces the PhraseMatcher with a more straightforward trie-based algorithm. Because the search is performed over tokens instead of characters, matching is very fast – even before the implementation was optimized using Cython data structures. Here’s a quick benchmark searching over 10,000 Wikipedia articles.

# queries   # matches   v2.1.8 (seconds)   v2.2.0 (seconds)
10          –           –                  –
100         795         –                  0.028
1,000       11,376      0.512              0.043
10,000      105,688     –                  0.114

When few queries are used, the new implementation is almost 20× faster – and it’s still almost 5× faster when 10,000 queries are used. The runtime of the new implementation roughly doubles for every order of magnitude increase in the number of queries, suggesting that the runtimes will be about even at around 1 million queries. However, the previous algorithm’s runtime was mostly sensitive to the number of matches (both full and partial), rather than the number of query phrases – so it really depends on how many matches are being found. You might have some query sets that produce a high volume of partial matches, due to queries that begin with common words such as “the”. The new implementation should perform much more consistently, and we expect it to be faster in almost every situation. If you do have a use-case where the previous implementation was performing better, please let us know.

New video series

In case you missed it, you might also be interested in the new beginner-oriented video tutorial series we’re producing, in collaboration with data science instructor Vincent Warmerdam. Vincent is building a system to automatically detect programming languages in large volumes of text. You can follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.

We’re excited about this series because we’re trying to avoid a common problem with tutorials. Most tutorials only ever show you the “happy path”, of everything working out exactly as the authors intended it. The problem’s much bigger and more fundamental than technology: there’s a reason the “draw the rest of the owl” meme resonates so widely. The best way to avoid this problem is to turn the pencil over to someone else, so you can really see the process. Two episodes have already been released, and there’s a lot more to come!
