
Structuring Posologies with LLMs

At Posos' Research and Development Department, our main goal is to enable scalable, multilingual structuration without starting from scratch in every language. Before reaching for multilingual generalization, we first built a robust, reliable system capable of handling the complexities of real-world French prescriptions: NERL (short for Named Entity Recognition and Linking), a hybrid engine that has been at the core of our prescription structuring process. In the sections that follow, we’ll briefly outline the design of this historical system, its strengths and limitations, as well as our recent experiments with LLMs and hybrid approaches. A more comprehensive account is available in this scientific article¹, published in June 2025.


Structuring Posologies in French

Here, structuring posologies means translating a raw textual posology instruction into a standardized object in which each piece of information is correctly labelled. This standardization should produce the same output for all the different ways one can write a given dosage instruction. The resulting structured posology can then be used to query a medical database, such as the one Posos has developed, containing the recommended dosages for all commercialized drugs and for diverse patient criteria. Posos’ prescription widget relies on this result, checks that the prescribed dosage follows the official recommendations for the given patient, and raises relevant alerts.

Free-text posology structuration by Posos
Example of dosage instruction and its structuration shown in a pre-filled form
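To make the target representation concrete, here is a minimal sketch of what such a standardized object could look like, written as a Python dictionary. The field names and values are illustrative assumptions, not Posos’ actual schema.

```python
# Illustrative only: a hypothetical structured form of the instruction
# "1 sachet 3x/day for 1 week, then 2x/day for 3 days".
# Field names are assumptions, not Posos' actual schema.
structured_posology = {
    "drug": {"label": "<drug name as written>", "database_id": None},  # filled by entity linking
    "sequences": [
        {
            "dose": {"value": 1, "unit": "sachet"},
            "frequency": {"times": 3, "per": "day"},
            "duration": {"value": 1, "unit": "week"},
        },
        {
            "dose": {"value": 1, "unit": "sachet"},  # implicit dose carried over
            "frequency": {"times": 2, "per": "day"},
            "duration": {"value": 3, "unit": "day"},
        },
    ],
    "conditions": [],  # e.g. "if pain", "if needed"
}
```

Whatever the exact schema, the key property is that every equivalent phrasing of the same instruction maps to the same object, which is what makes the downstream database lookup and dosage checks possible.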

From PDF to Structured Data: The Workflow

The process starts with a prescription uploaded as a PDF or an image. Our OCR engine (Optical Character Recognition) transcribes it line by line, after which we strip any personal data to focus solely on medical content. This cleaned input is then processed by NERL.

NERL integrates various techniques, including NER (Named Entity Recognition) to identify medical entities, and NEL (Named Entity Linking), which ensures alignment with international data standards by mapping terms like drug names to entries in the Posos medical database. While NER and NEL are key components, the techniques used in NERL are not limited to these alone; we dive deeper into this approach in a dedicated Posos blog post.

The result is a fully structured prescription, ready for manipulation, storage, or integration into other healthcare systems. To ensure the reliability of prescription structuration, each output is paired with a confidence score generated by another internally developed model. For a deeper look into how this works, check out our detailed explanation in our blog post on developing a hybrid automatic prescription structuring system.
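As a rough illustration, the workflow above can be summarized by the following sketch, in which every stage function is a hypothetical placeholder rather than one of Posos’ actual components.

```python
from typing import Callable

# Minimal sketch of the document-to-structured-data flow described above.
# All stage functions are hypothetical placeholders, injected as parameters.

def structure_prescription(
    document: bytes,
    ocr: Callable[[bytes], list[str]],             # transcribes the PDF/image line by line
    deidentify: Callable[[list[str]], list[str]],  # strips personal data, keeps medical content
    nerl: Callable[[list[str]], dict],             # NER + NEL structuring
    score: Callable[[dict], float],                # separate confidence model
) -> dict:
    lines = deidentify(ocr(document))
    structured = nerl(lines)
    structured["confidence"] = score(structured)
    return structured
```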

Evaluating on Complex Prescription Cases

To validate the robustness of our system, we tested it on a challenging dataset that reflects the messy reality of medical prescriptions. Unlike simple cases (e.g., “Doliprane, 3 times a day”), our test set included:

  • OCR errors (e.g., “ampoule” misread as “ampoute”)
  • Typographical mistakes
  • Sequential dosage instructions (e.g., "1 sachet 3x/day for 1 week, then 2x/day for 3 days")
  • Implicit structures (e.g., a dosage amount left implicit and inferred from context)
  • Compound doses (e.g., AMLODIPINE 10 mg + PERINDOPRIL 10 mg tablet (COVERAM 10 mg/10 mg))

NERL currently achieves 85% accuracy on this evaluation set, a strong result given the inherent ambiguity of natural medical language and the complexity of the task.

Pushing the Boundaries: Exploring LLMs for Posology Structuration

While our NERL-based pipeline is robust and performs well, we want to go further and explore how Large Language Models (LLMs) could contribute, both in structuring complex posology instructions and in enabling rapid multilingual generalization. The first question is: can an LLM, on its own, handle the tasks we currently solve and outperform our existing solution?

Phase 1: Prompt Engineering & First Evaluations

Before exploring multilingual capabilities, we first needed to assess how LLMs perform on the posology structuration itself. We began with a series of preliminary tests across multiple LLMs, both open-source and proprietary. After this exploration phase, we chose to move forward with Gemini, integrated through Vertex AI, due to its performance, flexibility, and integration capabilities.

Once we started refining our prompts, it became clear that we had to stick with a single model, as optimized prompts tend to be highly model-specific.

To evaluate performance, we used the same complex prescription dataset that we used for NERL. For simplicity, we’ll focus here on the accuracy metric, although our evaluation includes other dimensions as well.

Our first step was prompt engineering, and we explored several strategies to improve results:

  • Chain-of-Thought prompting, where the model is prompted to reason step-by-step,
  • Few-shot examples, providing concrete structured examples within the prompt,
  • Reformulation instructions, where the model was explicitly asked to rewrite dosage texts to clarify ambiguities before structuring.

We also set the temperature to 0 to enforce deterministic, consistent outputs and used Vertex AI’s native structured output capabilities. This setup yielded a first accuracy score of 73%.
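For illustration, this kind of configuration looks roughly like the following with the Vertex AI Python SDK; the project, model name, and prompt are placeholders, and the exact setup may differ from the one used in our experiments.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

# Placeholders: project, region, model name and prompt are illustrative only.
vertexai.init(project="my-gcp-project", location="europe-west1")
model = GenerativeModel("gemini-1.5-pro")

prompt = (
    "You are a medical assistant. Rewrite the dosage instruction to resolve any "
    "ambiguity, reason step by step, then return the structured posology as JSON.\n\n"
    "Instruction: 1 sachet 3x/day for 1 week, then 2x/day for 3 days"
)

response = model.generate_content(
    prompt,
    generation_config=GenerationConfig(
        temperature=0.0,                        # deterministic, consistent outputs
        response_mime_type="application/json",  # native structured (JSON) output
    ),
)
print(response.text)
```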

Phase 2: Fine-Tuning with Synthetic Data

To push further, we explored lightweight fine-tuning techniques, such as adapters. The challenge here was data availability: hand-annotated real-world prescriptions are scarce, so we turned to synthetic data, leveraging over 1,000 posology instructions that were automatically structured using NERL.

To ensure quality, we only kept those with high confidence scores, resulting in a curated training set of about 1,000 samples.
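The curation step itself is simple in principle; a minimal sketch, assuming hypothetical field names and an arbitrary threshold, could look like this:

```python
# Sketch of the synthetic-data curation step: keep only NERL outputs whose
# confidence score clears a threshold. Field names and the threshold value
# are assumptions for illustration.

CONFIDENCE_THRESHOLD = 0.9  # assumed value, not the one used at Posos

def build_finetuning_set(nerl_outputs: list[dict]) -> list[dict]:
    return [
        {"input": example["raw_text"], "output": example["structured"]}
        for example in nerl_outputs
        if example["confidence"] >= CONFIDENCE_THRESHOLD
    ]
```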

Fine-tuning on this dataset helped us improve the model’s accuracy to 83%, nearly reaching the 85% achieved by our NERL system.

What LLMs Do Better (and What They Don’t)

The error analysis highlighted clear complementarities between NERL and LLMs.

NERL excels in interpreting complex yet explicit dosage structures, effectively managing temporal expressions such as "1 hour before meals," and accurately identifying start and end dates.

In contrast, LLMs demonstrate a better understanding of conditional instructions like "if pain" or "if needed," show greater resilience to OCR errors and typos, and are more skilled at handling compound doses and sequential or implicit dosage patterns.

These distinct strengths point toward a hybrid approach as the most promising path forward.

Toward a Confidence-Based Hybrid Workflow

Rather than replacing our current system, we propose an intelligent orchestration layer combining both NERL and LLMs:

  1. NERL processes the document first.
  2. If the confidence score is below a certain threshold, the document is passed to the LLM for additional analysis.
  3. The final output is selected based on the highest confidence.
  4. If confidence remains low or the result comes from the LLM, the user is alerted to verify manually.

With this hybrid setup, we achieve up to 90% accuracy, while keeping latency and resource usage low since LLM calls are limited to ambiguous cases.
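As a sketch, the orchestration logic described above boils down to something like the following, where the threshold value and the result fields are illustrative assumptions.

```python
# Sketch of the confidence-based orchestration: NERL first, LLM fallback on
# low-confidence cases, manual review when doubt remains. Threshold and field
# names are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.8  # assumed value

def structure_with_fallback(lines, nerl, llm):
    """Run NERL first; fall back to the LLM when NERL is not confident enough."""
    nerl_result = nerl(lines)  # dict containing a "confidence" entry
    if nerl_result["confidence"] >= CONFIDENCE_THRESHOLD:
        return nerl_result, False  # no manual review needed

    llm_result = llm(lines)  # scored with the same confidence model
    # Keep whichever output the confidence model trusts more.
    best = max(nerl_result, llm_result, key=lambda r: r["confidence"])
    # Alert the user if confidence remains low or the answer came from the LLM.
    needs_review = best["confidence"] < CONFIDENCE_THRESHOLD or best is llm_result
    return best, needs_review
```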

Current Limitations

Though promising, this approach still presents several limitations. LLMs currently lack direct access to medical databases, which constrains their ability to verify or normalize drug names. There is also a risk of hallucination, and while initial results in French are encouraging, generalizing reliably across languages remains an ongoing challenge.

To address these issues, we're working on integrating our NEL component to enrich LLM outputs with verified drug data from the Posos medical database. We're also reinforcing our evaluation datasets with edge cases and safety-critical scenarios, while actively expanding our language coverage beyond French to build a more globally robust solution.

Multilingual Extension: First Steps Toward English (and Beyond)

We were able to achieve strong results quickly with LLMs in French. Naturally, the next question is: how well, and how fast, can we generalize this to other languages, starting with English?

Using LLMs, we translated our complex French test set into English, ensuring not just a direct translation, but an adaptation to UK medical language conventions. With the original base model and the same French-optimized prompt, we reached an accuracy of 62%, which is clearly below expectations. We also tried running our French fine-tuned model on the translated English dataset, which improved accuracy slightly to 69%.

To go further, we translated our synthetic training data into English, again adapting dosage phrasing to local conventions, and fine-tuned a model specifically for English. This boosted our result: 79% accuracy on the English test set, a clear improvement over simply transferring models used for French.
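As an illustration, the translation step can be driven by a prompt along these lines; the wording below is an assumption, not the actual prompt used in our experiments.

```python
# Illustrative prompt for adapting French dosage instructions to UK
# conventions rather than translating them literally.
TRANSLATION_PROMPT = (
    "Translate the following French dosage instruction into English. "
    "Do not translate literally: adapt units, drug form names and phrasing to "
    "UK prescription conventions, and keep all quantities unchanged.\n\n"
    "French instruction: {instruction}"
)

print(TRANSLATION_PROMPT.format(instruction="1 sachet 3 fois par jour pendant 1 semaine"))
```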

However, this approach has a major limitation: even with adaptation, automatically translated data still lacks the linguistic nuance and domain-specific authenticity of real-world English prescriptions. As a result, it may not generalize well to actual data encountered in production.

That said, the speed at which we’ve been able to reach near-80% accuracy in English, starting from French resources, demonstrates the strong potential of our multilingual strategy. As we begin integrating real-world English data, we're confident that we can bridge the remaining gap and replicate this progress across other languages.

Conclusion

Our experiments demonstrate that LLMs do not replace our existing pipeline but rather complement it. When combined with NERL in a confidence-based hybrid model, they can improve its accuracy in structuring posologies. This design not only enhances performance but also optimizes resource usage and reduces latency by activating LLMs only when needed. Multilingual expansion is another promising area. LLMs offer a practical path to structuring posologies in languages where traditional, fully annotated NLP pipelines don’t yet exist.

That said, key challenges remain: LLMs still need to be connected to our NEL module to ensure accurate drug normalization, and there’s a need for annotated real-world data, particularly in English and other target languages. We're also refining prompt strategies and implementing additional safety layers to mitigate hallucinations and ensure more reliable outputs in LLM-driven scenarios.

Moving forward, our efforts are focused on collecting real multilingual data, beginning with English, and strengthening both prompts and validation mechanisms. With ongoing research and real-world data integration, we firmly believe our path leads to more intelligent and reliable medical text structuration.

To explore the challenges and strategies of extending our posology structuration capabilities beyond French, we encourage you to read the previous article in our R&D series: Toward multi-lingual posology structuration.

Extra sources:

¹ https://arxiv.org/abs/2506.19525

Félix Gaschi
Data Scientist

