spaCy.io

Build Tomorrow's Language Technologies

spaCy is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you're a small company doing NLP, we want spaCy to seem like a minor miracle.

Comparisons and Benchmarks

Peer-reviewed Evaluations

spaCy is committed to rigorous evaluation under standard methodology. Two papers in 2015 confirm that:

  1. spaCy is the fastest syntactic parser in the world;
  2. Its accuracy is within 1% of the best available;
  3. The few systems that are more accurate are 20× slower or more.

spaCy v0.84 was evaluated by researchers at Yahoo! Labs and Emory University, as part of a survey paper benchmarking the current state-of-the-art dependency parsers (Choi et al., 2015).

System        Language   Accuracy   Speed (words/sec)
spaCy v0.84   Cython     90.6       13,963
spaCy v0.89   Cython     91.8       13,000 (est.)
ClearNLP      Java       91.7       10,271
CoreNLP       Java       89.6       8,602
MATE          Java       92.5       550
Turbo         C++        92.4       349
Yara          Java       92.3       340

Discussion with the authors led to accuracy improvements in spaCy, which have been accepted for publication in EMNLP, in joint work with Macquarie University (Honnibal and Johnson, 2015).

How does spaCy compare to NLTK?

spaCy
  • Over 400 times faster
  • State-of-the-art accuracy
  • Tokenizer maintains alignment with the original string (see the sketch after this list)
  • Powerful, concise API
  • Integrated word vectors
  • English only (at present)
NLTK
  • Slow
  • Low accuracy
  • Tokens do not align to original string
  • Models return lists of strings
  • No word vector support
  • Multiple languages
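
For example, here is a minimal sketch of what "maintains alignment" means in practice, using the token.idx character offsets (the example sentence is arbitrary):

from spacy.en import English
 
nlp = English()
text = 'Hello, world. Here are two sentences.'
doc = nlp(text)
for token in doc:
    # token.idx is the character offset into the original string, so the
    # exact input text, including whitespace, can be recovered from the tokens.
    assert text[token.idx : token.idx + len(token.orth_)] == token.orth_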

How does spaCy compare to CoreNLP?

spaCy
  • 50% faster
  • More accurate parser
  • Word vectors integration
  • Minimalist design
  • Great documentation
  • English only
  • Python
CoreNLP
  • More accurate NER
  • Coreference resolution
  • Sentiment analysis
  • Little documentation
  • Multiple languages
  • Java

How does spaCy compare to ClearNLP?

spaCy
  • 30% faster
  • Well documented
  • English only
  • Equivalent accuracy
  • Python
ClearNLP
  • Semantic Role Labelling
  • Model for biology/life-science
  • Multiple Languages
  • Equivalent accuracy
  • Java

Online Demo

Interactive Visualizer

The best parse-tree visualizer and annotation tool in all the land.

displaCy lets you peek inside spaCy's syntactic parser, as it reads a sentence word-by-word. By repeatedly choosing from a small set of actions, it links the words together according to their syntactic structure. This type of representation powers a wide range of technologies, from translation and summarization, to sentiment analysis and algorithmic trading.
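
If you'd rather inspect the same structure programmatically, here is a minimal sketch using only attributes shown in the examples below; each word is linked to its syntactic head by a labelled arc, which is exactly what displaCy draws:

from __future__ import print_function
from spacy.en import English
 
nlp = English()
doc = nlp('displaCy reads a sentence word-by-word.')
for token in doc:
    # Each token records its dependency label and its syntactic head.
    print(token.orth_, token.dep_, token.head.orth_)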

Usage by Example

Load resources and process text

from __future__ import unicode_literals, print_function
from spacy.en import English
nlp = English()
doc = nlp('Hello, world. Here are two sentences.')

Get tokens and sentences

token = doc[0]
sentence = next(doc.sents)
assert token is sentence[0]

Use integer IDs for any string

hello_id = nlp.vocab.strings['Hello']
hello_str = nlp.vocab.strings[hello_id]
 
assert token.orth  == hello_id  == 469755
assert token.orth_ == hello_str == 'Hello'

Get and set string views and flags

Coming soon, in v0.90.
assert token.shape_ == 'Xxxxx'
for lexeme in nlp.vocab:
    if lexeme.is_alpha:
        lexeme.shape_ = 'W'
    elif lexeme.is_digit:
        lexeme.shape_ = 'D'
    elif lexeme.is_punct:
        lexeme.shape_ = 'P'
    else:
        lexeme.shape_ = 'M'
assert token.shape_ == 'W'

Export to numpy arrays

from spacy.en.attrs import ORTH, LIKE_URL, IS_OOV
 
attr_ids = [ORTH, LIKE_URL, IS_OOV]
doc_array = doc.to_array(attr_ids)
assert doc_array.shape == (len(doc), len(attr_ids))
assert doc[0].orth == doc_array[0, 0]
assert doc[1].orth == doc_array[1, 0]
assert doc[0].like_url == doc_array[0, 1]
assert list(doc_array[:, 1]) == [t.like_url for t in doc]

Word vectors

doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
 
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
 
assert apples.similarity(oranges) > boots.similarity(hippos)

Part-of-speech tags

from spacy.parts_of_speech import ADV
 
def is_adverb(token):
    return token.pos == ADV
 
# These are data-specific, so no constants are provided. You have to look
# up the IDs from the StringStore.
NNS = nlp.vocab.strings['NNS']
NNPS = nlp.vocab.strings['NNPS']
def is_plural_noun(token):
    return token.tag == NNS or token.tag == NNPS
 
def print_coarse_pos(token):
    print(token.pos_)
 
def print_fine_pos(token):
    print(token.tag_)

Syntactic dependencies

def dependency_labels_to_root(token):
    '''Walk up the syntactic tree, collecting the arc labels.'''
    dep_labels = []
    while token.head is not token:
        dep_labels.append(token.dep)
        token = token.head
    return dep_labels
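
For example, a hypothetical call using the nlp object loaded above (token.dep holds the integer label ID; use token.dep_ if you want the string labels instead):

doc = nlp('Autonomous cars shift insurance liability toward manufacturers')
# Walk up from "liability" to the root of the sentence, collecting label IDs.
print(dependency_labels_to_root(doc[4]))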

Named entities

from collections import defaultdict
from spacy.parts_of_speech import VERB
 
def iter_products(docs):
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == 'PRODUCT':
                yield ent
 
def word_is_in_entity(word):
    return word.ent_type != 0
 
def count_parent_verb_by_person(docs):
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == 'PERSON' and ent.root.head.pos == VERB:
                counts[ent.orth_][ent.root.head.lemma_] += 1
    return counts

Calculate inline mark-up on original string

def put_spans_around_tokens(doc, get_classes):
    '''Given some function to compute class names, put each token in a
    span element, with the appropriate classes computed.
 
    All whitespace is preserved, outside of the spans. (Yes, I know HTML
    won't display it. But the point is no information is lost, so you can
    calculate what you need, e.g. <br /> tags, <p> tags, etc.)
    '''
    output = []
    template = '<span class="{classes}">{word}</span>{space}'
    for token in doc:
        if token.is_space:
            output.append(token.orth_)
        else:
            output.append(
              template.format(
                classes=' '.join(get_classes(token)),
                word=token.orth_,
                space=token.whitespace_))
    string = ''.join(output)
    string = string.replace('\n', '<br />')
    string = string.replace('\t', '    ')
    return string
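
A hypothetical usage, reusing the doc from the examples above and tagging each token with its coarse part-of-speech as the CSS class:

html = put_spans_around_tokens(doc, lambda token: ['pos-' + token.pos_])
print(html)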

Efficient binary serialization

import spacy.en
from spacy.tokens.doc import Doc
 
byte_string = doc.to_bytes()
open('/tmp/moby_dick.bin', 'wb').write(byte_string)
 
nlp = spacy.en.English()
for byte_string in Doc.read_bytes(open('/tmp/moby_dick.bin', 'rb')):
    doc = Doc(nlp.vocab)
    doc.from_bytes(byte_string)

Full documentation

Install v0.89

Updating your installation

To update your installation:
$ pip install --upgrade spacy
$ python -m spacy.en.download all

Most updates ship a new model, so you will usually have to redownload the data.

conda

$ conda install spacy
$ python -m spacy.en.download

pip and virtualenv

With Python 2.7 or Python 3, using Linux or OSX, run:

$ pip install spacy
$ python -m spacy.en.download

The download command fetches about 300 MB of data for the parser model and word vectors, and installs it within the spacy.en package directory.

Workaround for obsolete system Python

If you're stuck using a server with an old version of Python, and you don't have root access, I've prepared a bootstrap script to help you compile a local Python install. Run:

$ curl https://raw.githubusercontent.com/honnibal/spaCy/master/bootstrap_python_env.sh | bash && source .env/bin/activate

Compile from source

The other way to install the package is to clone the GitHub repository and build it from source. This requires an additional dependency, Cython. If you're using Python 2, I also recommend installing fabric and fabtools – this is how I build the project.

$ git clone https://github.com/honnibal/spaCy.git
$ cd spaCy
$ virtualenv .env && source .env/bin/activate
$ export PYTHONPATH=`pwd`
$ pip install -r requirements.txt
$ python setup.py build_ext --inplace
$ python -m spacy.en.download
$ pip install pytest
$ py.test tests/

Python packaging is awkward at the best of times, and it's particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them.

pypy (Unsupported)

If PyPy support is a priority for you, please get in touch. We could likely fix the remaining issues, if necessary. However, the library is likely to be much slower on PyPy, as it's written in Cython, which produces code tuned for the performance of CPython.

Windows (Unsupported)

Unfortunately we don't currently support Windows.

What's New?

2015-07-29 v0.89: Fix Spans, efficiency

  • Fix regression in parse times on very long texts. Recent versions were calculating parse features in a way that was polynomial in input length.
  • Add tag SP (coarse tag SPACE) for whitespace tokens. Ensure entity recogniser does not assign entities to whitespace.
  • Rename Span.head to Span.root, fix its documentation, and make it more efficient.
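
A minimal sketch of the renamed attribute, using the nlp object from the usage examples above (the exact entities depend on the model):

doc = nlp('Mr. Best flew to New York on Saturday morning.')
for ent in doc.ents:
    # ent is a Span. Span.root is the token whose head lies outside the span,
    # e.g. "York" for the span "New York".
    print(ent.label_, ent.root.orth_)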

2015-07-08 v0.88: Refactoring release.

  • If you have the data for v0.87, you don't need to redownload the data for this release.
  • You can now set tag=False, parse=False or entity=False when creating the pipeline, to disable some of the models (see the sketch after this list). See the documentation for details.
  • Models are no longer lazy-loaded.
  • A warning is now emitted when parse=True or entity=True but the corresponding model is not loaded.
  • Rename the tokens.Tokens class to tokens.Doc. An alias has been made to assist backwards compatibility, but you should update your code to refer to the new class name.
  • Various bits of internal refactoring
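
A sketch of the new keyword arguments described above; the exact signature is assumed from the description rather than verified:

from spacy.en import English
 
# Load only the vocabulary, tokenizer and tagger; skip the parser and
# entity recogniser.
nlp = English(parse=False, entity=False)
doc = nlp('This text is tagged, but not parsed.')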

2015-07-01 v0.87: Memory use

  • Changed weights data structure. Memory use should be reduced 30-40%. Fixed speed regressions introduced in the last few versions.
  • Models should now be slightly more robust to noise in the input text, as I'm now training on data with a small amount of noise added, e.g. I randomly corrupt capitalization, swap spaces for newlines, etc. This is bringing a small benefit on out-of-domain data. I think this strategy could yield better results with a better noise-generation function. If you think you have a good way to make clean text resemble the kind of noisy input you're seeing in your domain, get in touch.
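
To make that concrete, here is a hypothetical noise function along the lines described; the real training code may differ:

import random
 
def add_noise(text, prob=0.05):
    # Randomly corrupt capitalization and swap spaces for newlines,
    # roughly as described above.
    chars = []
    for ch in text:
        if ch == ' ' and random.random() < prob:
            chars.append('\n')
        elif ch.isalpha() and random.random() < prob:
            chars.append(ch.swapcase())
        else:
            chars.append(ch)
    return ''.join(chars)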

2015-06-24 v0.86: Parser accuracy

The parser is now more accurate, using a novel non-monotonic transition system that's currently under review.

2015-05-12 v0.85: More diverse training data

  • Parser produces richer dependency labels, following the ClearNLP scheme.
  • Training data now includes text from a variety of genres.
  • Parser now uses more memory and the data is slightly larger, due to the additional labels. Impact on efficiency is minimal: entire process still takes <10ms per document.

Most users should see a substantial increase in accuracy from the new model.

2015-05-12 v0.84: Bug fixes

  • Bug fixes for parsing
  • Bug fixes for named entity recognition

2015-04-13 v0.80

Preliminary support for named-entity recognition. Its accuracy is substantially behind the state-of-the-art. I'm working on improvements.

  • Better sentence boundary detection, drawn from the syntactic structure.
  • Lots of bug fixes.

2015-03-05: v0.70

  • Improved parse navigation API
  • Bug fixes to labelled parsing

2015-01-30: v0.4

  • Train statistical models on running text

2015-01-25: v0.33

  • Alpha release