AIx2 Occipital: A Multi-Modal Small Language Model (SLM) using LlaVa for Finance Documents Data Extraction.
At AIx2, we’ve built an in-house multi-modal Small Language Model (SLM), codenamed AIx2 Occipital, to address the challenge of extracting data from multimodal finance documents at low cost, low latency and high accuracy. Occipital is powered by a LLaVa back-end (short for Large Language and Vision Assistant, an open-source multi-modal framework) and fine-tuned on finance-specific data. Below, we unpack the key components of Occipital, highlighting how it extracts data from text, graphs, tables, and infographics with low latency, cost efficiency, and high accuracy (See Colab here).
Extracting actionable insights from complex financial documents—riddled with tables, charts, infographics, and unstructured text—poses significant challenges. Large Language Models (LLMs) can be effective at parsing textual data, but they often run into performance and cost hurdles when handling multi-modal finance documents in real time
1. The Challenge of Multi-Modal Finance Data
1.1 Complex Data Structures
Finance documents can combine:
Dense text: Analyst reports, executive summaries, disclaimers
Tables: Balance sheets, revenue breakdowns, and historical performance tables
Charts & Graphs: Stock price evolution, macroeconomic indicators
Infographics: Illustrations highlighting M&A flows, organizational structures
Bringing all these disparate elements into a cohesive representation is non-trivial for generic LLMs. Traditional LLMs focus on pure text; extracting data from images, PDFs, or scanned reports can require costly and slow optical character recognition (OCR) pipelines. Moreover, table parsing can be error-prone unless carefully tuned.
1.2 Existing Solutions: Slow and Expensive
Off-the-shelf multi-modal solutions typically employ massive LLMs (GPT-like architectures or commercial APIs) to process documents. While effective, they come with:
High Compute Costs: Large models demand extensive GPU resources.
Latency Constraints: Summaries or insights can take seconds or minutes to generate, unacceptable for real-time financial workflows.
Limited Fine-Tuning: Adapting these mega-models for finance data is both expensive and complicated.
2. Why an In-House Small Language Model (SLM)?
2.1 Low Latency, High Throughput
By downscaling the base model size and optimizing the architecture for multi-modal finance tasks, AIx2 Occipital can run efficiently on our own GPU clusters. This keeps inference times low (sub-second in many cases), which is critical when analysts need real-time insight during market events or due diligence processes.
2.2 Cost Efficiency
Large commercial LLMs can be prohibitively expensive to query repeatedly for high-volume data extraction tasks. Occipital’s compact size and on-premise deployment model significantly reduce operational costs—especially for large-scale or continuous data ingestion scenarios.
2.3 Finance-Focused Fine-Tuning
Generic large models might understand everyday text well, but finance-specific nuances (e.g., IFRS vs. GAAP, specialized metrics like IRR, EV/EBITDA, etc.) demand a specialized approach. By training Occipital on domain-specific corpora and curated multi-modal finance documents, we achieve high accuracy and low hallucination on sector-specific tasks.
3. Introducing AIx2 Occipital
FalconEye is our next-generation SLM built atop LLaVa (Large Language and Vision Assistant). It comprises:
Lightweight Transformer Backbone
We prune and optimize open-source LLM checkpoints to create a slimmer but finance-focused language model.
Vision Encoder for Finance Media
A specialized visual encoder that recognizes charts, infographics, and tables typical in financial docs.
Pre-trained on large-scale visual datasets, then fine-tuned on finance images (annual reports, pitch decks, etc.).
Adaptive Multi-Modal Fusion
Uses cross-attention layers to merge textual and visual embeddings, enabling the model to interpret text in tandem with accompanying charts or tables.
Learns to align an image region (like a bar in a bar chart) with numerical or categorical text labels.
4. Core Capabilities
4.1 Multi-Modal Data Extraction
Occipital can handle documents containing both text and visuals. It accurately extracts:
Tabular Data: Reads row/column headers, aggregates numerical values, and maps them to the correct field (e.g., “Total Revenue” or “Q2 2025 Earnings”).
Graph Annotations: Identifies legends, axes, and peaks in line/bar charts, extracting numerical values or highlight trends.
Infographic Insights: Interprets diagrams illustrating capital flows or organizational hierarchies, distilling them into structured relationships.
4.2 Question-Answering & Summarization
A user can ask natural language questions about a multimodal finance document:
“What is the year-over-year growth percentage in the bar chart for the Healthcare segment?”
Occipital will visually parse the chart, cross-reference text commentary, and return a precise numeric answer (e.g., “15.2% growth from 2024 to 2025”). For broader context, FalconEye provides succinct summaries of risk factors, major announcements, or key insights from multi-page filings.
4.3 Real-Time Inference & Integration
Because Occipital runs on our in-house GPU clusters:
Latency is minimal, enabling sub-second queries.
Scalability is straightforward, with load balancing for large-scale batch extractions.
Privacy & Security are maintained, as sensitive user documents never leave AIx2’s environment.
5. Implementation Insights
5.1 Fine-Tuning Open-Source LLMs
Occipital’s pipeline starts with an open-source LLM (e.g., LLaVa-based or a smaller GPT variant). We applied domain-specific fine-tuning:
Finance Corpus: Millions of tokens from financial statements, earnings calls, macro reports.
Visual Documents: Labeled chart images, tables, and infographics, focusing on standard finance conventions.
Real-World QA Pairs: Annotated question-answer sets to train the model’s multi-modal comprehension.
5.2 Hardware and Training Setup
Model Size: A fraction of GPT-4’s parameter count, making it feasible to train on mid-range GPU clusters.
Mixed Precision: To manage memory footprint and speed up training.
Inference Pipeline: Deployed on AIx2’s private GPU servers with automatic scaling based on query volume.
5.3 Open-Source Collaboration
We provide select components of Occipital (like data loaders and prompt templates) in a [public Colab notebook] and [GitHub repository], enabling community members to explore or replicate smaller variants. While certain finance-specific modules remain proprietary, we strongly believe in open research and encouraging broader multi-modal AI adoption.
6. Real-World Applications
Fast Due Diligence
Summarize investment decks or decode pitch documents (slides containing charts, revenue tables, bullet points).
Extract critical metrics (EBITDA, IRR, market share) and highlight anomalies or red flags in real time.
Financial News & Reports Scraping
Monitor daily public filings or earnings call slide decks.
Provide updates on key metrics and performance indicators to analysts.
Enterprise Knowledge Management
Ingest large volumes of PDFs, annotated by Occipital for quick retrieval.
Build an internal knowledge graph linking relevant insights across investor presentations, risk disclosures, and more.
7. Conclusion
AIx2 Occipital demonstrates how multi-modal small language models—fine-tuned for finance—can combine speed, accuracy, and cost-efficiency. By leveraging LLaVa and open-source LLM foundations, we’ve crafted a lightweight system capable of parsing text, charts, tables, and infographics under a single pipeline, maintaining low latency and high fidelity in data extraction.