Open-Source LLM Document Extraction Using Mistral 7B
Introduction
OCR is a mastered science, and it has been for a while now. The problem nowadays is the extraction and processing of the data, and this is where LLMs shine. JP Morgan recently launched DocLLM [1] (you can read more here), an LLM designed just for this. The model is not available yet and has some limitations in terms of context window size. So I decided to do the opposite: a fully open-source stack with a much bigger context window.
You can see the repo here:
https://github.com/enoch3712/Open-DocLLM
The project is divided into two parts, the OCR layer and the LLM layer. In a production-ready project, these would be separate microservices or even separate services. But the division is clear: first the reading of all the content (OCR layer), then the extraction of the specific content (LLM layer).
OCR Layer
Convert pages to images
First, it’s essential to convert any type of file into an image. This approach ensures that all content within the document can be accessed. By splitting the pages into images, they can be individually processed in subsequent steps.
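As a sketch of this step (assuming pdf2image and a local Poppler install; the exact function names and the page-dictionary layout in the repo may differ), each PDF page can be rendered and kept in memory as image bytes:
from io import BytesIO

from pdf2image import convert_from_bytes


def convert_pdf_to_images(file_bytes):
    # Render every PDF page to a PIL image (pdf2image relies on Poppler)
    images = convert_from_bytes(file_bytes)

    list_dict_final_images = []
    for index, image in enumerate(images):
        # Keep each page in memory as PNG bytes, keyed by its page index
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        list_dict_final_images.append({index: buffer.getvalue()})

    return list_dict_final_images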
Preprocess image for OCR
Often, images are of poor quality and require adjustments to improve readability. Enhancements can include contrast changes, among other corrections available in various libraries and frameworks. While this article will not delve into those details, implementing such adjustments is crucial for production use cases.
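As an illustration (not the project's actual pipeline), a minimal cleanup with Pillow could look like the sketch below; real production pipelines often add deskewing, denoising, and binarization on top of this:
from io import BytesIO

from PIL import Image, ImageOps


def preprocess_image_for_ocr(image_bytes):
    image = Image.open(BytesIO(image_bytes))
    # Grayscale removes color noise that can confuse the OCR engine
    image = ImageOps.grayscale(image)
    # Autocontrast stretches the histogram so faint text stands out
    image = ImageOps.autocontrast(image)

    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()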
Tesseract OCR
It's the most famous and popular open-source OCR engine in the world. It's quite old too, but it has been tweaked over time to support more language structures and tables. This example supports tables by separating columns with “|”.
The project uses pytesseract to connect to Tesseract from Python. Here is the code:
import concurrent.futures
from io import BytesIO

from PIL import Image
from pytesseract import image_to_string


def extract_text_with_pytesseract(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for index, image_bytes in enumerate(image_list):
            future = executor.submit(process_image, index, image_bytes)
            futures.append(future)

        for future in concurrent.futures.as_completed(futures):
            try:
                raw_text = future.result()
                image_content.append(raw_text)
            except Exception as e:
                raise Exception(f"Error processing image: {e}")

    return image_content


# only this one matters, the rest is concurrency logic
def process_image(index, image_bytes):
    try:
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        return raw_text
    except Exception as e:
        raise Exception(f"Error processing image {index}: {e}")
You can ignore most of the code in extract_text_with_pytesseract, since it is mostly concurrency logic so the pages can be processed in parallel where possible.
LLM Layer
Extraction contract definition
Once we have the content from the document, we need to extract the information in a structured way. The image below shows the complete prompt:
The contract section can be defined in several ways, but you should use something similar to, or even the same as, a type definition in a programming language. The protocol used here is a pseudo-code that includes the name, type, and description of each field.
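For example, an invoice contract could look like this (the field names are illustrative, not taken from the repo):
invoice_number:string => the unique identifier of the invoice
invoice_date:date => the date the invoice was issued, or null if not present
total_amount:float => the total amount due, taxes included
vendor_name:string => the legal name of the company issuing the invoice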
You can also ask for the classification of documents:
documentType:string => can only be “invoice”, “bill of sale”, “LLC creation document”, “Eviction Document”
Why choose JSON over other formats like YAML, especially when YAML often results in smaller payloads? The answer lies in the availability of training data. JSON is far more prevalent in the datasets used for training since it serves as the primary data exchange format on the web.
Proper JSON extraction
With OpenAI you can already request a JSON return type (JSON mode) in the chat completions API:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Extract the fields and return them as a JSON object."}],
    response_format={"type": "json_object"},
)
Since this functionality is not available in the Mistral API, you’ll need to manually add and then trim the JSON markers. Models are typically trained to return results in this format:
```json
[Content]
```
With this understanding, we can proceed as follows with the input content:
{
    "model": "mistral-tiny",
    "messages": [
        {
            "role": "user",
            "content": "[Content]```json\n{"
        }
    ],
    ...
}
This results in the following output content:
{
    ...
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "[Content]```[Content]"
            },
            "finish_reason": "stop"
        }
    ],
    ...
}
You can use this function, which receives both contents (the prompt suffix and the model’s reply) and gives you the extracted JSON:
def extractJsonSubstring(str1, str2):
    # Concatenate the two strings
    combined_str = str1 + str2

    # Define the start and end markers
    start_marker = "```json"
    end_marker = "```"

    # Find the start and end positions of the substring
    start_pos = combined_str.find(start_marker)
    # Jump start_pos to the end of the start_marker
    start_pos = start_pos + len(start_marker)
    end_pos = combined_str.find(end_marker, start_pos)

    # Extract the substring
    json_substring = combined_str[start_pos:end_pos]
    return json_substring
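Putting it together, a call could look roughly like this (response_content stands for the content field returned by the Mistral API above; the variable names are illustrative):
import json

prompt_suffix = "```json\n{"  # the marker we appended to the user message
json_text = extractJsonSubstring(prompt_suffix, response_content)
data = json.loads(json_text)  # the extracted fields as a Python dict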
Run locally — OLLAMA
There are several ways to run models on-premises nowadays, like LM Studio or Ollama. In this case, given the project, you can use LlamaIndex and Ollama. This article covers everything you need, so you can remove the API call and have the same experience with an on-premise, local solution.
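As a rough sketch (assuming Ollama is running locally and a Mistral model has been pulled with ollama pull mistral), the hosted Mistral API call could be swapped for a request to Ollama's local endpoint:
import requests


def call_local_mistral(prompt):
    # Ollama exposes a local HTTP API on port 11434 by default
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    return response.json()["response"]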
Take a look at the code and test it
This repo contains a FastAPI app with one endpoint, just to test all these components. It is a simple API that you should expand for your particular use case.
First, you should point to the proper Tesseract executable.
If you are using Docker, use the second option and comment out the hardcoded path.
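With pytesseract this is a single assignment; the Windows path below is only an example and will differ on your machine:
import pytesseract

# Local install (example path); comment this out when running in Docker,
# where the tesseract binary is already on the PATH
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"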
Also, this example makes a direct request to the Mistral API. Make sure you change the key in the config.py file.
This is a FastAPI project, so make sure you install the dependencies and go to localhost:8000/docs.
Once there, go to the /extract endpoint with the file and the contract.
Make the request and you should get the proper result. Remember, this is just a template; some things might not work with tricky contracts.
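For reference, the same request can be made from code; the form field names below (file, extraction_contract) are assumptions, so check the /docs page for the exact schema the endpoint exposes:
import requests

# Hypothetical field names; adjust them to match the schema shown in /docs
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"extraction_contract": "invoice_number:string => the unique identifier of the invoice"},
    )

print(response.json())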
Advanced cases: 1 Million token context
LLMLingua
LLMLingua [2] is one of my favorite pieces of software I have seen in a while. It's clever and simple: a much smaller model is fine-tuned to compress the prompt content so it can be passed on to a much bigger, more expensive model.
According to the paper, there is only around a 1.5% drop in quality at 20x compression, though this will depend on the nature of the data.
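A minimal sketch with the llmlingua package (the parameters are illustrative; check the documentation of the version you install for the exact API):
from llmlingua import PromptCompressor

# A small scoring model decides which tokens carry little information and drops them
compressor = PromptCompressor()
result = compressor.compress_prompt(
    ocr_text,  # the raw text produced by the OCR layer
    instruction="Extract the fields defined in the contract as JSON.",
    target_token=2000,  # rough token budget for the compressed prompt
)
compressed_prompt = result["compressed_prompt"]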
Mistral Yarn 128k context window
YaRN (Yet another RoPE extensioN method) [3] builds on Rotary Position Embeddings (RoPE) to extend the context window. You can read more in the paper, but in short it lets you increase the window size with almost zero loss. You can test it here.
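As a sketch, a 128k YaRN variant of Mistral 7B can be loaded through Hugging Face transformers; the checkpoint name below (NousResearch/Yarn-Mistral-7b-128k) is an assumption about which variant you want to try:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Yarn-Mistral-7b-128k"  # 128k-context YaRN fine-tune of Mistral 7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # the checkpoint ships custom RoPE-scaling code
    device_map="auto",
)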
Using both techniques, a 10x compression rate on top of a 128k context window gives you an effective 1M+ token window. You could fit an entire Bible!
Use cases for bigger contexts
Sometimes a single page doesn't contain all the values to extract; sometimes they are spread over multiple pages. Extracting per page and then aggregating can be a problem, since the model may extract a wrong value instead of null (as discussed in the contract section), so one solution could simply be to increase the context content so nothing gets left out.
This would be an extreme use case, so be cautious about how you intend to use it.
Conclusion
The integration of OCR and LLM technologies, as showcased in this open-source document extraction project, marks a significant advancement in analyzing unstructured data. The combination of open-source projects like Tesseract and Mistral makes for a solid implementation that can run entirely on-premises.
References & Documents
[1] DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding:
https://arxiv.org/pdf/2401.00908.pdf
[2] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models:
https://arxiv.org/pdf/2310.05736.pdf
[3] YaRN: Efficient Context Window Extension of Large Language Models: https://arxiv.org/pdf/2309.00071.pdf