Synthetic Data for Improving Small Language Models (SLMs) on Multi-Modal Financial Document Data Extraction
The AIx2 team has developed a framework that leverages synthetic data to train small, domain-specific language models (SLMs) for data extraction from multi-modal financial documents. Unlike large-scale, general-purpose models such as GPT-4, our approach focuses on smaller, cost-efficient models fine-tuned for financial document analysis. AIx2 uses this technique to improve the quality of its proprietary SLM for parsing multi-modal financial documents, AIx2 Occipital.
We explore how synthetic data enables small LMs to excel in extracting insights from financial documents, highlighting the methodology, applications, and benefits.
Why Financial Documents Demand Multi-Modal Learning
In the evolving field of natural language processing (NLP), the integration of multi-modal data—combining text, images, infographics, and tables—has emerged as a key challenge. Financial documents, such as 10-Ks, investment memos, 10-Qs, due diligence reports, and market analyses, are particularly demanding. These documents often mix dense textual narratives with intricate numerical data and complex visualizations, requiring models to interpret and synthesize multi-modal inputs.
Financial documents are unique in their complexity. They combine:
Narratives: Descriptions of business operations, market opportunities, and risk factors.
Numerical Data: Financial statements, ratios, and forecasts.
Visuals: Charts, graphs, and infographics that represent trends and KPIs.
Existing NLP systems, trained primarily on text-based corpora, struggle to handle the interplay between these elements. For instance, extracting revenue growth trends may involve parsing a table, correlating it with textual explanations, and interpreting a bar chart—all within the same document.
The Case for Small Language Models
Large language models (LLMs) are powerful but have significant drawbacks in this domain:
Computational Costs: Training and deploying LLMs require enormous computational resources, often beyond the reach of small- to mid-sized enterprises.
Fine-Tuning Challenges: Adapting LLMs for finance-specific tasks demands significant domain data, which is often proprietary and scarce.
Privacy Concerns: Sending sensitive financial data to large-scale cloud-hosted LLMs poses security and compliance risks.
In contrast, small language models (SLMs) offer a more practical solution:
Efficiency: They are faster to train and deploy, making them suitable for real-time or on-premise applications.
Adaptability: With focused training, SLMs can achieve near state-of-the-art performance in niche domains like finance.
Privacy: SLMs can be deployed in secure environments, ensuring sensitive data remains protected.
The Power of Synthetic Data
Training small LMs effectively for financial tasks requires large amounts of labeled data. However, collecting such data is fraught with challenges: proprietary restrictions, privacy concerns, and domain-specific nuances. This is where synthetic data comes into play.
What is Synthetic Data?
Synthetic data is artificially generated data that mimics the characteristics of real-world datasets while ensuring privacy and scalability. For financial document analysis, synthetic data includes:
Text: Simulated narratives about market conditions, revenue trends, or business strategies.
Tables: Financial statements and KPIs with realistic numerical values.
Infographics: Bar charts, pie charts, and line graphs representing business metrics.
Annotations: Labeled data for tasks like entity recognition, risk extraction, and KPI analysis.
By training SLMs on synthetic data, we can overcome real-data limitations while preserving the statistical and contextual richness needed for financial insights.
The AIx2 Approach to Synthetic Data
Our pipeline for generating synthetic data involves several key steps:
1. Domain-Specific Corpus Development
We start by analyzing publicly available financial documents, such as redacted sections of 10-Ks or anonymized investment memos, to construct a template library. These templates guide the structure and content of synthetic documents.
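For illustration, such a template library can be as simple as a collection of section skeletons with named placeholders. The section names and placeholder fields below are hypothetical examples, not the actual AIx2 template schema.

```python
# Hypothetical template library: section skeletons with named placeholders.
# Section keys and field names (company, growth_pct, ...) are illustrative only.
TEMPLATES = {
    "management_discussion": (
        "{company} reported revenue growth of {growth_pct}% year-over-year, "
        "driven primarily by {driver} in the {region} market."
    ),
    "risk_factors": (
        "The company faces {risk_type} risk related to {risk_source}, "
        "which could materially affect {impacted_metric}."
    ),
}

def render(section: str, **fields) -> str:
    """Fill a template with sampled or hand-picked field values."""
    return TEMPLATES[section].format(**fields)

print(render("management_discussion",
             company="Acme Analytics", growth_pct=15,
             driver="increased adoption", region="North American"))
```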
2. Text Generation
Using pre-trained language models fine-tuned on financial texts, we generate realistic narratives that emulate the tone and style of actual financial documents. For example, we might create a "Management Discussion" section discussing hypothetical market trends or risk factors.
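A minimal sketch of this step using the Hugging Face transformers pipeline is shown below. The checkpoint name is a stand-in for a finance-tuned model, and the prompt and sampling settings are illustrative.

```python
# Sketch: generate a synthetic "Management Discussion" narrative with a
# causal language model. "gpt2" is a placeholder for a finance-tuned checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder checkpoint

prompt = (
    "Management Discussion and Analysis: During the fiscal year, revenue "
    "grew 15% year-over-year, driven by"
)
draft = generator(prompt, max_new_tokens=120, do_sample=True, temperature=0.8)

print(draft[0]["generated_text"])
```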
3. Tabular Data Simulation
Tabular data is generated using statistical distributions that reflect real-world financial metrics. For example, we simulate revenue growth, profit margins, and expense breakdowns using conditional probabilities derived from historical data.
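The sketch below simulates a small quarterly revenue table with numpy and pandas. The growth and margin parameters are illustrative placeholders, not values actually estimated from historical data.

```python
# Sketch: simulate quarterly revenue and margins by region.
# Distribution parameters here are illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
regions = ["North America", "EMEA", "APAC"]
quarters = pd.period_range("2022Q1", periods=8, freq="Q")

rows = []
for region in regions:
    revenue = rng.uniform(40, 120)                 # starting revenue ($M)
    for q in quarters:
        growth = rng.normal(loc=0.03, scale=0.02)  # ~3% average quarterly growth
        revenue *= 1 + growth
        margin = np.clip(rng.normal(0.18, 0.04), 0, 1)
        rows.append({"region": region, "quarter": str(q),
                     "revenue_musd": round(revenue, 1),
                     "profit_margin": round(margin, 3)})

table = pd.DataFrame(rows)
print(table.head())
```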
4. Infographic Generation
Graphs and charts are created using visualization libraries like Matplotlib. These infographics are paired with textual descriptions to ensure alignment between visual and textual information.
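As a minimal example, the sketch below renders a revenue line chart with Matplotlib and generates a matching caption so the visual and its textual description stay in sync. The numbers are synthetic placeholders.

```python
# Sketch: render a line chart of simulated revenue and pair it with a caption.
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022, 2023]
revenue = [82.0, 88.5, 97.2, 109.4, 125.8]  # $M, synthetic values

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(years, revenue, marker="o")
ax.set_title("Revenue by Year")
ax.set_xlabel("Year")
ax.set_ylabel("Revenue ($M)")
fig.savefig("revenue_trend.png", dpi=150, bbox_inches="tight")

caption = (
    f"Revenue grew from ${revenue[0]:.1f}M in {years[0]} to "
    f"${revenue[-1]:.1f}M in {years[-1]}, a {revenue[-1]/revenue[0]-1:.0%} increase."
)
print(caption)
```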
5. Annotation
Each synthetic document is automatically annotated with labels for entities, numerical values, relationships, and tasks like summarization or Q&A. This labeled data becomes the foundation for supervised learning.
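Because the generator controls the values it injects, labels can be emitted programmatically alongside the text. The sketch below shows one possible simplified label schema; the field names are illustrative.

```python
# Sketch: emit labels alongside generated text. The generator already knows
# which entities and values it injected, so annotation is deterministic.
import json

def annotate(company: str, growth_pct: float, region: str) -> dict:
    text = (f"{company}'s revenue grew by {growth_pct}% year-over-year, "
            f"driven by increased adoption in the {region} market.")
    labels = {
        "entities": [
            {"text": company, "label": "ORG", "start": 0, "end": len(company)},
            {"text": region, "label": "REGION"},
        ],
        "kpis": [{"name": "revenue_growth_yoy", "value": growth_pct, "unit": "%"}],
        "tasks": {"summarization_target": f"Revenue up {growth_pct}% YoY."},
    }
    return {"text": text, "labels": labels}

print(json.dumps(annotate("Acme Analytics", 15, "North American"), indent=2))
```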
Example
A synthetic financial document might include the following elements, which are assembled into a single training record in the sketch after this list:
Text: "The company’s revenue grew by 15% year-over-year, driven by increased adoption in the North American market."
Table: A detailed breakdown of revenues by region, year, and product line.
Chart: A line graph illustrating revenue growth trends from 2019 to 2023.
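Putting the pieces above together, a single multi-modal training record might look like the following sketch; the field names and values are hypothetical.

```python
# Sketch: one synthetic multi-modal training record combining the text,
# table, and chart from the example above. Field names are hypothetical.
import json

record = {
    "doc_id": "synthetic-000123",
    "text": ("The company's revenue grew by 15% year-over-year, driven by "
             "increased adoption in the North American market."),
    "table": {
        "columns": ["region", "year", "product_line", "revenue_musd"],
        "rows": [
            ["North America", 2023, "Platform", 74.2],
            ["EMEA", 2023, "Platform", 31.6],
        ],
    },
    "chart_image": "charts/revenue_trend_2019_2023.png",
    "labels": {"kpis": [{"name": "revenue_growth_yoy", "value": 15, "unit": "%"}]},
}

print(json.dumps(record, indent=2))
```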
Multi-Modal Model Architecture
To process financial documents effectively, we employ a multi-modal architecture designed to handle both textual and visual inputs; a minimal code sketch follows the component descriptions below:
Text Encoder
A small Transformer-based encoder processes textual content, extracting semantic embeddings.
Visual Encoder
A convolutional neural network (CNN) or Vision Transformer processes infographics and table images, extracting visual features.
Fusion Layer
A cross-attention mechanism combines text and visual embeddings, enabling the model to understand correlations between narrative explanations and visualized data.
Task-Specific Heads
Depending on the use case, the model includes heads for tasks like:
Entity Recognition: Extracting company names, KPIs, or risk factors.
KPI Computation: Deriving insights from numerical tables.
Document Summarization: Producing concise summaries of financial reports.
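A minimal PyTorch sketch of the components just described is shown below. The encoder sizes, cross-attention configuration, and head set are illustrative choices, not the actual AIx2 Occipital design.

```python
# Sketch of the multi-modal architecture: a small text encoder, a visual
# encoder, cross-attention fusion, and task-specific heads. Dimensions and
# layer counts are illustrative, not the actual AIx2 Occipital configuration.
import torch
import torch.nn as nn

class MultiModalExtractor(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, n_entity_labels=9):
        super().__init__()
        # Text encoder: token embeddings + a small Transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               dim_feedforward=512,
                                               batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)

        # Visual encoder: a small CNN over chart/table images -> patch features.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1),
        )

        # Fusion layer: text tokens attend over visual features.
        self.fusion = nn.MultiheadAttention(d_model, num_heads=4,
                                            batch_first=True)

        # Task-specific heads.
        self.entity_head = nn.Linear(d_model, n_entity_labels)  # per-token tags
        self.kpi_head = nn.Linear(d_model, 1)                   # pooled regression
        self.summary_head = nn.Linear(d_model, vocab_size)      # toy LM head

    def forward(self, token_ids, image):
        text = self.text_encoder(self.embed(token_ids))         # (B, T, D)
        vis = self.visual_encoder(image)                        # (B, D, H, W)
        vis = vis.flatten(2).transpose(1, 2)                    # (B, H*W, D)
        fused, _ = self.fusion(query=text, key=vis, value=vis)  # cross-attention
        pooled = fused.mean(dim=1)
        return {
            "entity_logits": self.entity_head(fused),
            "kpi_value": self.kpi_head(pooled),
            "summary_logits": self.summary_head(fused),
        }

model = MultiModalExtractor()
tokens = torch.randint(0, 30000, (2, 64))   # dummy batch of token ids
images = torch.randn(2, 3, 128, 128)        # dummy chart/table images
out = model(tokens, images)
print({k: tuple(v.shape) for k, v in out.items()})
```

In this sketch, using text tokens as queries over visual features lets each sentence attend to the chart or table regions that support it, which is one straightforward way to realize the fusion layer described above.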
Evaluation and Performance
Our approach has been evaluated on tasks critical to financial analysis, such as:
Competitor Analysis: Identifying direct and indirect competitors from due diligence documents.
Risk Extraction: Parsing risk factors from 10-K filings.
Revenue Insights: Extracting revenue growth trends and correlating them with textual explanations.
Results show that models trained on synthetic data achieve near state-of-the-art performance, often matching or surpassing models trained on smaller real-world datasets. Additionally, the inclusion of multi-modal inputs significantly improves accuracy for tasks involving numerical or visual data.
Advantages of the AIx2 Framework
Cost-Effective Training
Synthetic data reduces the dependency on expensive labeled datasets, making it feasible to train specialized models on a budget.
Privacy-Preserving
By simulating proprietary data, we sidestep privacy concerns while ensuring models are exposed to realistic financial scenarios.
Scalability
Synthetic data can be generated at scale, covering edge cases and rare scenarios that might be missing in real-world data.
Domain-Specific Mastery
Fine-tuning on synthetic financial data ensures models deeply understand sector-specific nuances, outperforming generic LMs in finance-specific tasks.
Next Steps
As financial data becomes increasingly complex and multi-modal, the need for domain-specific AI solutions will grow. At AIx2, we’re committed to advancing this frontier by refining our synthetic data generation techniques and expanding the capabilities of small language models. By combining cutting-edge research with practical applications, we aim to empower financial analysts, fund managers, and other stakeholders with tools that deliver actionable insights.
The use of synthetic data to train small, multi-modal language models represents a paradigm shift in financial document analysis. By addressing the unique challenges posed by financial documents—combining dense narratives, numerical data, and visualizations—this approach sets a new standard for cost-effective, privacy-preserving, and domain-specific AI solutions. Whether you're analyzing 10-Ks or performing due diligence on potential investments, AIx2’s framework provides a powerful toolkit for extracting insights and making data-driven decisions.