Comparisons and Benchmarks
Peer-reviewed Evaluations
spaCy is committed to rigorous evaluation under standard methodology. Two papers in 2015 confirm that:
- spaCy is the fastest syntactic parser in the world;
- Its accuracy is within 1% of the best available;
- The few systems that are more accurate are 20× slower or more.
spaCy v0.84 was evaluated by researchers at Yahoo! Labs and Emory University, as part of a survey paper benchmarking the current state-of-the-art dependency parsers (Choi et al., 2015).
System | Language | Accuracy | Speed (words per second) |
---|---|---|---|
spaCy v0.84 | Cython | 90.6 | 13,963 |
spaCy v0.89 | Cython | 91.8 | 13,000 (est.) |
ClearNLP | Java | 91.7 | 10,271 |
CoreNLP | Java | 89.6 | 8,602 |
MATE | Java | 92.5 | 550 |
Turbo | C++ | 92.4 | 349 |
Yara | Java | 92.3 | 340 |
Discussion with the authors led to accuracy improvements in spaCy, which have been accepted for publication in EMNLP, in joint work with Macquarie University (Honnibal and Johnson, 2015).
How does spaCy compare to NLTK?
spaCy
- Over 400 times faster
- State-of-the-art accuracy
- Tokenizer maintains alignment
- Powerful, concise API
- Integrated word vectors
- English only (at present)
NLTK
- Slow
- Low accuracy
- Tokens do not align to original string
- Models return lists of strings
- No word vector support
- Multiple languages
How does spaCy compare to CoreNLP?
spaCy
- 50% faster
- More accurate parser
- Word vector integration
- Minimalist design
- Great documentation
- English only
- Python
CoreNLP
- More accurate NER
- Coreference resolution
- Sentiment analysis
- Little documentation
- Multiple languages
- Java
How does spaCy compare to ClearNLP?
spaCy
- 30% faster
- Well documented
- English only
- Equivalent accuracy
- Python
ClearNLP
- Semantic Role Labelling
- Model for biology/life-science
- Multiple Languages
- Equivalent accuracy
- Java
Online Demo
The best parse-tree visualizer and annotation tool in all the land.
displaCy lets you peek inside spaCy's syntactic parser, as it reads a sentence word-by-word. By repeatedly choosing from a small set of actions, it links the words together according to their syntactic structure. This type of representation powers a wide range of technologies, from translation and summarization, to sentiment analysis and algorithmic trading. Read more.
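The arcs displaCy draws are the same ones the library exposes: after parsing, each token carries a dependency label and a reference to its head. Here's a minimal sketch (the sentence is just an illustration, and the exact labels you see depend on the model):
from spacy.en import English

nlp = English()
doc = nlp('displaCy reads the sentence word by word.')
for token in doc:
    # Each word is linked to its syntactic head by a labelled arc.
    print(token.orth_, token.dep_, token.head.orth_)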
Usage by Example
Load resources and process text
from __future__ import unicode_literals, print_function
from spacy.en import English
nlp = English()
doc = nlp('Hello, world. Here are two sentences.')
Get tokens and sentences
token = doc[0]
sentence = next(doc.sents)
assert token is sentence[0]
Use integer IDs for any string
hello_id = nlp.vocab.strings['Hello']
hello_str = nlp.vocab.strings[hello_id]
assert token.orth == hello_id == 469755
assert token.orth_ == hello_str == 'Hello'
Get and set string views and flags
Coming soon, in v0.90.
assert token.shape_ == 'Xxxxx'
for lexeme in nlp.vocab:
    if lexeme.is_alpha:
        lexeme.shape_ = 'W'
    elif lexeme.is_digit:
        lexeme.shape_ = 'D'
    elif lexeme.is_punct:
        lexeme.shape_ = 'P'
    else:
        lexeme.shape_ = 'M'
assert token.shape_ == 'W'
Export to numpy arrays
from spacy.en.attrs import ORTH, LIKE_URL, IS_OOV
attr_ids = [ORTH, LIKE_URL, IS_OOV]
doc_array = doc.to_array(attr_ids)
assert doc_array.shape == (len(doc), len(attr_ids))
assert doc[0].orth == doc_array[0, 0]
assert doc[1].orth == doc_array[1, 0]
assert doc[0].like_url == doc_array[0, 1]
assert list(doc_array[:, 1]) == [t.like_url for t in doc]
Word vectors
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
assert apples.similarity(oranges) > boots.similarity(hippos)
Part-of-speech tags
from spacy.parts_of_speech import ADV

def is_adverb(token):
    return token.pos == ADV

# These are data-specific, so no constants are provided. You have to look
# up the IDs from the StringStore.
NNS = nlp.vocab.strings['NNS']
NNPS = nlp.vocab.strings['NNPS']

def is_plural_noun(token):
    return token.tag == NNS or token.tag == NNPS

def print_coarse_pos(token):
    print(token.pos_)

def print_fine_pos(token):
    print(token.tag_)
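A quick usage sketch for the helpers above (the sentence is hypothetical, and the exact tags assigned depend on the statistical model):
doc = nlp('The cats quickly chased several mice.')
for token in doc:
    if is_plural_noun(token):
        print_fine_pos(token)    # e.g. NNS for 'cats' and 'mice'
    elif is_adverb(token):
        print_coarse_pos(token)  # e.g. ADV for 'quickly'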
Syntactic dependencies
def dependency_labels_to_root(token):
    '''Walk up the syntactic tree, collecting the arc labels.'''
    dep_labels = []
    while token.head is not token:
        dep_labels.append(token.dep_)
        token = token.head
    return dep_labels
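And a quick usage sketch (again, the sentence and labels are illustrative):
doc = nlp('Autonomous cars shift insurance liability toward manufacturers.')
print(dependency_labels_to_root(doc[0]))  # e.g. ['amod', 'nsubj'], walking up to the root verb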
Named entities
from collections import defaultdict
from spacy.parts_of_speech import VERB

def iter_products(docs):
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == 'PRODUCT':
                yield ent

def word_is_in_entity(word):
    return word.ent_type != 0

def count_parent_verb_by_person(docs):
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == 'PERSON' and ent.root.head.pos == VERB:
                counts[ent.orth_][ent.root.head.lemma_] += 1
    return counts
Calculate inline mark-up on original string
def put_spans_around_tokens(doc, get_classes):
    '''Given some function to compute class names, put each token in a
    span element, with the appropriate classes computed.

    All whitespace is preserved, outside of the spans. (Yes, I know HTML
    won't display it. But the point is no information is lost, so you can
    calculate what you need, e.g. <br /> tags, <p> tags, etc.)
    '''
    output = []
    template = '<span class="{classes}">{word}</span>{space}'
    for token in doc:
        if token.is_space:
            output.append(token.orth_)
        else:
            output.append(
                template.format(
                    classes=' '.join(get_classes(token)),
                    word=token.orth_,
                    space=token.whitespace_))
    string = ''.join(output)
    string = string.replace('\n', '<br />')
    string = string.replace('\t', ' ')
    return string
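For example, a hypothetical get_classes callback that just uses each token's coarse part-of-speech tag as its CSS class could be plugged in like this:
def pos_classes(token):
    # Hypothetical helper: one CSS class per token, named after its coarse POS tag.
    return [token.pos_]

html = put_spans_around_tokens(doc, pos_classes)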
Efficient binary serialization
import spacy.en
from spacy.tokens import Doc

byte_string = doc.as_bytes()
open('/tmp/moby_dick.bin', 'wb').write(byte_string)

nlp = spacy.en.English()
for byte_string in Doc.read(open('/tmp/moby_dick.bin', 'rb')):
    doc = Doc(nlp.vocab)
    doc.from_bytes(byte_string)
Full documentation
Install v0.89
Updating your installation
To update your installation:
$ pip install --upgrade spacy
$ python -m spacy.en.download all
Most updates ship a new model, so you will usually have to redownload the data.
conda
$ conda install spacy
$ python -m spacy.en.download
pip and virtualenv
With Python 2.7 or Python 3, using Linux or OSX, run:
$ pip install spacy
$ python -m spacy.en.download
The download command fetches about 300 MB of data, for the parser model and word vectors, and installs it within the spacy.en package directory.
Workaround for obsolete system Python
If you're stuck using a server with an old version of Python, and you don't have root access, I've prepared a bootstrap script to help you compile a local Python install. Run:
$ curl https://raw.githubusercontent.com/honnibal/spaCy/master/bootstrap_python_env.sh | bash && source .env/bin/activate
Compile from source
The other way to install the package is to clone the GitHub repository and build it from source. This installs an additional dependency, Cython. If you're using Python 2, I also recommend installing fabric and fabtools – this is how I build the project.
$ git clone https://github.com/honnibal/spaCy.git
$ cd spaCy
$ virtualenv .env && source .env/bin/activate
$ export PYTHONPATH=`pwd`
$ pip install -r requirements.txt
$ python setup.py build_ext --inplace
$ python -m spacy.en.download
$ pip install pytest
$ py.test tests/
Python packaging is awkward at the best of times, and it's particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them.
PyPy (Unsupported)
If PyPy support is a priority for you, please get in touch. We could likely
fix the remaining issues, if necessary. However, the library is likely to
be much slower on PyPy, as it's written in Cython, which produces code tuned
for the performance of CPython.
Windows (Unsupported)
Unfortunately we don't currently support Windows.
What's New?
2015-07-29 v0.89: Fix Spans, efficiency
- Fix regression in parse times on very long texts. Recent versions were calculating parse features in a way that was polynomial in input length.
- Add tag SP (coarse tag SPACE) for whitespace tokens. Ensure entity recogniser does not assign entities to whitespace.
- Rename Span.head to Span.root, fix its documentation, and make it more efficient.
2015-07-08 v0.88: Refactoring release.
- If you have the data for v0.87, you don't need to redownload the data for this release.
- You can now set tag=False, parse=False or entity=False when creating the pipeline, to disable some of the models. See the documentation for details, and the short sketch after this list.
- Models are no longer lazy-loaded.
- Warning emitted when parse=True or entity=True but model not loaded.
- Rename the tokens.Tokens class to tokens.Doc. An alias has been made to assist backwards compatibility, but you should update your code to refer to the new class name.
- Various bits of internal refactoring
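As a minimal sketch of the new flags (assuming, as in the usage examples above, that they're passed straight to the English constructor; the text is illustrative):
from spacy.en import English

nlp = English(parse=False, entity=False)  # keep the tagger, skip the parser and entity recogniser
doc = nlp('The parser and entity recogniser are skipped for this text.')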
2015-07-01 v0.87: Memory use
- Changed weights data structure. Memory use should be reduced 30-40%. Fixed speed regressions introduced in the last few versions.
- Models should now be slightly more robust to noise in the input text, as I'm now training on data with a small amount of noise added, e.g. I randomly corrupt capitalization, swap spaces for newlines, etc. This is bringing a small benefit on out-of-domain data. I think this strategy could yield better results with a better noise-generation function. If you think you have a good way to make clean text resemble the kind of noisy input you're seeing in your domain, get in touch.
2015-06-24 v0.86: Parser accuracy
Parser now more accurate, using a novel non-monotonic transition system that's currently under review.
2015-05-12 v0.85: More diverse training data
- Parser produces richer dependency labels, following the ClearNLP scheme.
- Training data now includes text from a variety of genres.
- Parser now uses more memory and the data is slightly larger, due to the additional labels. Impact on efficiency is minimal: entire process still takes <10ms per document.
Most users should see a substantial increase in accuracy from the new model.
2015-05-12 v0.84: Bug fixes
- Bug fixes for parsing
- Bug fixes for named entity recognition
2015-04-13 v0.80
Preliminary support for named-entity recognition. Its accuracy is substantially behind the state-of-the-art. I'm working on improvements.
- Better sentence boundary detection, drawn from the syntactic structure.
- Lots of bug fixes.
2015-03-05: v0.70
- Improved parse navigation API
- Bug fixes to labelled parsing
2015-01-30: v0.4
- Train statistical models on running text
2015-01-25: v0.33
- Alpha release