Open-Source LLM Document Extraction Using Mistral 7B
Introduction
OCR is a mastered science, and it has been for a while now. The problem nowadays is the extraction and processing of the data, and this is where LLMs shine. JP Morgan recently launched DocLLM [1] (you can read more here), an LLM designed just for this. The model is not available yet and has some limitations in terms of context window size. So I decided to do the opposite: a fully open-source stack with a much bigger context window.
You can see the repo here:
https://github.com/enoch3712/Open-DocLLM
The project is divided into two parts, the OCR layer and the LLM layer. In a production-ready project, these would be separate microservices or even separate services. But the division is clear: first the reading of all the content (OCR layer), then the extraction of the specific content (LLM layer).
OCR Layer
Convert pages to images
First, it’s essential to convert any type of file into an image. This approach ensures that all content within the document can be accessed. By splitting the pages into images, they can be individually processed in subsequent steps.
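As a sketch of this step (assuming pdf2image and a local Poppler install; the exact function names and the page-dictionary layout in the repo may differ), each PDF page can be rendered and kept in memory as image bytes:
from io import BytesIO

from pdf2image import convert_from_bytes


def convert_pdf_to_images(file_bytes):
    # Render every PDF page to a PIL image (pdf2image relies on Poppler)
    images = convert_from_bytes(file_bytes)

    list_dict_final_images = []
    for index, image in enumerate(images):
        # Keep each page in memory as PNG bytes, keyed by its page index
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        list_dict_final_images.append({index: buffer.getvalue()})

    return list_dict_final_images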
Preprocess image for OCR
Often, images are of poor quality and require adjustments to improve readability. Enhancements can include contrast changes, among other corrections available in various libraries and frameworks. While this article will not delve into those details, implementing such adjustments is crucial for production use cases.
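As an illustration (not the project's actual pipeline), a minimal cleanup with Pillow could look like the sketch below; real production pipelines often add deskewing, denoising, and binarization on top of this:
from io import BytesIO

from PIL import Image, ImageOps


def preprocess_image_for_ocr(image_bytes):
    image = Image.open(BytesIO(image_bytes))
    # Grayscale removes color noise that can confuse the OCR engine
    image = ImageOps.grayscale(image)
    # Autocontrast stretches the histogram so faint text stands out
    image = ImageOps.autocontrast(image)

    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()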
Tesseract OCR
It's the most famous and popular open-source OCR engine in the world. It's quite old too, but it has been tweaked over time to support more language structures and tables. This example supports tables by separating columns with “|”.
The project uses pytesseract to connect to Tesseract from Python. Here is the code:
import concurrent.futures
from io import BytesIO

from PIL import Image
from pytesseract import image_to_string


def extract_text_with_pytesseract(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for index, image_bytes in enumerate(image_list):
            future = executor.submit(process_image, index, image_bytes)
            futures.append(future)

        for future in concurrent.futures.as_completed(futures):
            try:
                raw_text = future.result()
                image_content.append(raw_text)
            except Exception as e:
                raise Exception(f"Error processing image: {e}")

    return image_content


# only this one matters, the rest is concurrency logic
def process_image(index, image_bytes):
    try:
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        return raw_text
    except Exception as e:
        raise Exception(f"Error processing image {index}: {e}")
You can ignore most of the code in extract_text_with_pytesseract, since it is mostly concurrency logic so the pages can be processed in parallel where possible.
LLM Layer
Extraction contract definition
Once we have the content from the document, we need to extract the information in a structured way. The image below shows the complete prompt:
The contract section can be defined in several ways, but you should use something similar to, or even the same as, a type definition in a programming language. The protocol used here is a pseudo-code that includes the name, type, and description of each field.
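For example, an invoice contract could look like this (the field names are illustrative, not taken from the repo):
invoice_number:string => the unique identifier of the invoice
invoice_date:date => the date the invoice was issued, or null if not present
total_amount:float => the total amount due, taxes included
vendor_name:string => the legal name of the company issuing the invoice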
You can also ask for the classification of documents:
documentType:string => can only be “invoice”, “bill of sale”, “LLC creation document”, “Eviction Document”
Why choose JSON over other formats like YAML, especially when YAML often results in smaller payloads? The answer lies in the availability of training data. JSON is far more prevalent in the datasets used for training since it serves as the primary data exchange format on the web.
Proper JSON extraction
With OpenAI you can already request a JSON return type (JSON mode) in the chat completions API:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Extract the fields and return them as a JSON object."}],
    response_format={"type": "json_object"},
)
Since this functionality is not available in the Mistral API, you’ll need to manually add and then trim the JSON markers. Models are typically trained to return results in this format:
```json
[Content]
```
With this understanding, we can proceed as follows with the input content:
{
    "model": "mistral-tiny",
    "messages": [
        {
            "role": "user",
            "content": "[Content]```json\n{"
        }
    ],
    ...
}
This results in the following output content:
{
    ...
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "[Content]```[Content]"
            },
            "finish_reason": "stop"
        }
    ],
    ...
}
You can use this function, which receives both contents (the prompt suffix and the model’s reply) and gives you the extracted JSON:
def extractJsonSubstring(str1, str2):
    # Concatenate the two strings
    combined_str = str1 + str2

    # Define the start and end markers
    start_marker = "```json"
    end_marker = "```"

    # Find the start and end positions of the substring
    start_pos = combined_str.find(start_marker)
    # Jump start_pos to the end of the start_marker
    start_pos = start_pos + len(start_marker)
    end_pos = combined_str.find(end_marker, start_pos)

    # Extract the substring
    json_substring = combined_str[start_pos:end_pos]
    return json_substring
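Putting it together, a call could look roughly like this (response_content stands for the content field returned by the Mistral API above; the variable names are illustrative):
import json

prompt_suffix = "```json\n{"  # the marker we appended to the user message
json_text = extractJsonSubstring(prompt_suffix, response_content)
data = json.loads(json_text)  # the extracted fields as a Python dict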
Run locally — OLLAMA
There are several ways to run models on-premises nowadays, like LM Studio or Ollama. In this case, given the project, you can use LlamaIndex and Ollama. This article covers everything you need, so you can remove the API call and have the same experience with an on-premise, local solution.
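As a rough sketch (assuming Ollama is running locally and a Mistral model has been pulled with ollama pull mistral), the hosted Mistral API call could be swapped for a request to Ollama's local endpoint:
import requests


def call_local_mistral(prompt):
    # Ollama exposes a local HTTP API on port 11434 by default
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    return response.json()["response"]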
Take a look at the code and test it
This repo contains a FastAPI app with one endpoint, just to test all these components. It is a simple API that you should expand for your particular use case.
First, you should point to the proper Tesseract executable.
If you are using Docker, use the second option and comment out the hardcoded path.
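With pytesseract this is a single assignment; the Windows path below is only an example and will differ on your machine:
import pytesseract

# Local install (example path); comment this out when running in Docker,
# where the tesseract binary is already on the PATH
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"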
Also, this example makes a direct request to the Mistral API. Make sure you change the key in the config.py file.
This is a FastAPI project, so make sure you install the dependencies and go to localhost:8000/docs.
Once there, go to the /extract endpoint with the file and the contract.
Make the request and you should get the proper result. Remember, this is just a template; some things might not work with tricky contracts.
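For reference, the same request can be made from code; the form field names below (file, extraction_contract) are assumptions, so check the /docs page for the exact schema the endpoint exposes:
import requests

# Hypothetical field names; adjust them to match the schema shown in /docs
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"extraction_contract": "invoice_number:string => the unique identifier of the invoice"},
    )

print(response.json())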
Advanced cases: 1 Million token context
LLMLingua
LLMLingua [2] is one of my favorite pieces of software I have seen in a while. It's clever and simple: a much smaller model is fine-tuned to compress the prompt content so it can be passed on to a much bigger, more expensive model.
According to the paper, there is only around a 1.5% drop in quality at 20x compression, though this will depend on the nature of the data.
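A minimal sketch with the llmlingua package (the parameters are illustrative; check the documentation of the version you install for the exact API):
from llmlingua import PromptCompressor

# A small scoring model decides which tokens carry little information and drops them
compressor = PromptCompressor()
result = compressor.compress_prompt(
    ocr_text,  # the raw text produced by the OCR layer
    instruction="Extract the fields defined in the contract as JSON.",
    target_token=2000,  # rough token budget for the compressed prompt
)
compressed_prompt = result["compressed_prompt"]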
Mistral Yarn 128k context window
YaRN (Yet another RoPE extensioN method) [3] builds on Rotary Position Embeddings (RoPE) to extend the context window. You can read more in the paper, but in short it lets you increase the window size with almost zero loss. You can test it here.
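As a sketch, a 128k YaRN variant of Mistral 7B can be loaded through Hugging Face transformers; the checkpoint name below (NousResearch/Yarn-Mistral-7b-128k) is an assumption about which variant you want to try:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Yarn-Mistral-7b-128k"  # 128k-context YaRN fine-tune of Mistral 7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # the checkpoint ships custom RoPE-scaling code
    device_map="auto",
)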
Using both techniques, a 10x compression rate on top of a 128k context window gives you an effective 1M+ token window. You could fit an entire Bible!
Use cases for bigger contexts
Sometimes a single page doesn't contain all the values to extract; sometimes they are spread over multiple pages. Extracting per page and then aggregating can be a problem, since the model may extract a wrong value instead of null (as discussed in the contract section), so one solution could simply be to increase the context content so nothing gets left out.
This would be an extreme use case, so be cautious about how you intend to use it.
Conclusion
The integration of OCR and LLM technologies, as showcased in this open-source document extraction project, marks a significant advancement in analyzing unstructured data. The combination of open-source projects like Tesseract and Mistral makes for a solid implementation that can run entirely on-premises.
References & Documents
[1] DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding:
https://arxiv.org/pdf/2401.00908.pdf
[2] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models:
https://arxiv.org/pdf/2310.05736.pdf
[3] YaRN: Efficient Context Window Extension of Large Language Models: https://arxiv.org/pdf/2309.00071.pdf