V2 alpha #33

Merged
merged 24 commits into from
May 31, 2024
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -2,4 +2,4 @@ node_modules
.env
.DS_Store
.icloud
*.icloud
*.icloud
42 changes: 26 additions & 16 deletions README.md
@@ -1,8 +1,10 @@
# Form Extractor Prototype

This tool extracts the structure from an image of a form.
This tool extracts the structure from a PDF or image of a form.

It uses the [Claude 3 LLM](https://claude.ai) model by Anthropic.
By default it uses the [Claude 3 LLM](https://claude.ai) model by Anthropic.

But it can also use the OpenAI LLM.

A single extraction of an A4 form page costs about 10p.

@@ -18,34 +20,40 @@ You'll notice that it doesn't try to faithfully replicate every field in a quest
Instead, it uses the relevant components and patterns from the [GOV.UK Design System](https://design-system.service.gov.uk/).
This is a feature not a bug ;-)

## Run locally
## Install

You'll need an [Anthropic API key](https://www.anthropic.com/api).
You'll need either an [Anthropic API key](https://www.anthropic.com/api), or an [Open AI one](https://openai.com/index/openai-api/).

Add the key as a local environment variable called `ANTHROPIC_API_KEY`.
Add the key as a local environment variable called `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`.

Install the app locally with `npm install`.

Start the app with `npm start`.
You'll also need to install GraphicsMagick. It's used to convert PDF pages into images.

[There's a guide for doing that here](https://github.com/yakovmeister/pdf2image/blob/HEAD/docs/gm-installation.md).

## Run

Start the app locally with `npm start dev`.

It'll be available at http://localhost:3000/

## Current capabilities

- processing PDF forms or images of forms
- breaking a form down into questions
- distinguishing between question, hint and field text
- distinguishing between single-choice and multiple-choice questions
- recognising common question types like 'name', 'address', 'date' etc.
- recognising when an image isn't a form
- recognising when a question has conditional routing
- processing hand drawn forms
- browsing previously processed forms

## Current limitations

- it can only process jpg images of forms, not documents
- it only knows about certain kinds of question types
- you can't provide your own API key via the UI
- you can't browse previous form extractions
- like a lot of Gen AI, it can be unpredictable

## How it works
@@ -56,21 +64,23 @@ The main UI is in [app/views/index.html](https://github.com/timpaul/form-extract

Other Nunjucks page templates and macros are in [app/views](https://github.com/timpaul/form-extractor-prototype/tree/main/app/views).

Additional CSS styles are in [public/assets/style.css](https://github.com/timpaul/form-extractor-prototype/blob/main/assets/style.scss).
Additional CSS styles are in [assets/style.scss](https://github.com/timpaul/form-extractor-prototype/blob/main/assets/style.scss).

Generate updates to the CSS with `sass assets/style.scss public/assets/style.css`.

The script in [public/assets/scripts.js](https://github.com/timpaul/form-extractor-prototype/blob/main/assets/scripts.js) handles the image preview and loading spinner.
The script in [public/assets/scripts.js](https://github.com/timpaul/form-extractor-prototype/blob/main/assets/scripts.js) enhances file upload and adds loading spinners.

The form in [index.html](https://github.com/timpaul/form-extractor-prototype/blob/main/app/views/index.html) sends the image at the URL provided by the user to the Claude API.
The form in [index.html](https://github.com/timpaul/form-extractor-prototype/blob/main/app/views/index.html) uploads the file to the server.

It does this via the 'SendToClaude' function in [server.js](https://github.com/timpaul/form-extractor-prototype/blob/main/server.js).
If it's a PDF it uses GraphicsMagick to convert the pages into image files.

The function makes use of the 'tools' feature of Claude.
Form files are stored in subfolders in [public/results](https://github.com/timpaul/form-extractor-prototype/blob/main/public/results).

That allows you to specify a JSON schema that you'd like its response to conform to.
The images are sent to an LLM, along with a prompt and JSON schema, via the 'SendToLLM' function in [server.js](https://github.com/timpaul/form-extractor-prototype/blob/main/server.js).

The JSON schema is specified in [data/extract-form-questions.json](https://github.com/timpaul/form-extractor-prototype/blob/main/data/extract-form-questions.json).
The JSON schema for each LLM is specified in [data/](https://github.com/timpaul/form-extractor-prototype/blob/main/data/).

The results are saved as JSON files in [app/data/](https://github.com/timpaul/form-extractor-prototype/tree/main/app/data).
The results are saved as JSON files in the subfolders in [public/results](https://github.com/timpaul/form-extractor-prototype/blob/main/public/results).

Those files are used to generate the pages that are loaded into iframes in [app/views/index.html](https://github.com/timpaul/form-extractor-prototype/blob/main/app/views/index.html).
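
As a rough illustration of sending a form image to Claude along with a prompt and a JSON schema, the sketch below builds a request body for Anthropic's Messages API and reads the structured reply. It is not the repository's 'SendToLLM' function: the helper names `buildExtractionRequest` and `readToolResult`, the tool name `extract_form_questions`, the tool description and the choice of model are all assumptions made for the example.

```javascript
// Hypothetical helper (not from the repository): builds a Messages API
// request body that asks Claude to reply through a single tool.
function buildExtractionRequest(imageBase64, schema, prompt) {
  return {
    model: "claude-3-opus-20240229", // assumption: any Claude 3 model with vision
    max_tokens: 4096,
    tools: [
      {
        name: "extract_form_questions", // hypothetical tool name
        description: "Record the questions found in an image of a form.",
        input_schema: schema, // JSON Schema describing the shape of the tool input
      },
    ],
    // Forcing the tool means the reply arrives as a tool_use block
    tool_choice: { type: "tool", name: "extract_form_questions" },
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: imageBase64 },
          },
          { type: "text", text: prompt },
        ],
      },
    ],
  };
}

// Hypothetical helper: returns the tool input from a Messages API response,
// or null when the model did not call the tool.
function readToolResult(response) {
  const block = response.content.find((part) => part.type === "tool_use");
  return block ? block.input : null;
}
```

The body would be posted to the Messages API with an SDK or HTTP client. The returned tool input is the object that could then be saved as a JSON file.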

25 changes: 25 additions & 0 deletions app/views/breadcrumbs.njk
@@ -0,0 +1,25 @@

<div class="govuk-breadcrumbs">
<ol class="govuk-breadcrumbs__list">
{% if fileData.filename %}

<li class="govuk-breadcrumbs__list-item">
<a class="govuk-breadcrumbs__link" href="/">Home</a>
</li>

<li class="govuk-breadcrumbs__list-item" aria-current="page">{{fileData.filename}} &nbsp;&nbsp;
<a class="govuk-link govuk-link--no-visited-state" href="/delete/{{ formId }}">Delete file</a>
</li>

{% else %}

<li class="govuk-breadcrumbs__list-item">Home</li>

{% endif %}
</ol>
</div>





79 changes: 79 additions & 0 deletions app/views/check-answers-popup.njk
@@ -0,0 +1,79 @@
{% extends "govuk/template.njk" %}
{% import "answer-types.njk" as answerType %}

{% block head %}
<link href="/assets/style.css" rel="stylesheet">
{% endblock %}

{% block header %}
{{ govukHeader({
useTudorCrown: true
}) }}
{% endblock %}

{% from "govuk/components/back-link/macro.njk" import govukBackLink %}
{% from "govuk/components/button/macro.njk" import govukButton %}
{% from "govuk/components/summary-list/macro.njk" import govukSummaryList %}



{% block beforeContent %}
{{ govukBackLink({
text: "Back",
href: "/form-popup/" + formId + "/" + question
}) }}
{% endblock %}

{% block content %}

<div class="govuk-grid-row">
<div class="govuk-grid-column-two-thirds-from-desktop">

<h1 class="govuk-heading-l">Check your answers before sending your application</h1>

<dl class="govuk-summary-list govuk-!-margin-bottom-9">
{% for question in fileData.pages %}
<div class="govuk-summary-list__row">
<dt class="govuk-summary-list__key">
{{question.question_text}}
</dt>
<dd class="govuk-summary-list__value">

</dd>
<dd class="govuk-summary-list__actions">
<a class="govuk-link" href="#">Change</a>
</dd>
</div>
{% endfor %}
</dl>

<h2 class="govuk-heading-m">Now send your application</h2>

<p class="govuk-body">By submitting this application you are confirming that, to the best of your knowledge, the details you are providing are correct.</p>

<form action="/form-handler" method="post" novalidate>

<input type="hidden" name="answers-checked" value="true">

{{ govukButton({
text: "Accept and send"
}) }}

</form>

</div>
</div>

{% endblock %}

{% block footer %}
{% endblock %}

{% block bodyEnd %}
{# Run JavaScript at end of the <body>, to avoid blocking the initial render. #}
<script type="module" src="/assets/govuk-frontend.min.js"></script>
<script type="module">
import { initAll } from '/assets/govuk-frontend.min.js'
initAll()
</script>
{% endblock %}
7 changes: 3 additions & 4 deletions app/views/check-answers.njk
@@ -1,8 +1,6 @@
{% extends "govuk/template.njk" %}
{% import "answer-types.njk" as answerType %}

{% set lastQuestion = formData.pages[question-1] %}

{% block head %}
<link href="/assets/style.css" rel="stylesheet">
{% endblock %}
@@ -18,10 +16,11 @@
{% from "govuk/components/summary-list/macro.njk" import govukSummaryList %}



{% block beforeContent %}
{{ govukBackLink({
text: "Back",
href: "/forms/" + formId + "/" + formData.pages | length
href: "/form-popup/" + formId + "/" + question
}) }}
{% endblock %}

@@ -33,7 +32,7 @@
<h1 class="govuk-heading-l">Check your answers before sending your application</h1>

<dl class="govuk-summary-list govuk-!-margin-bottom-9">
{% for question in formData.pages %}
{% for question in fileData.pages %}
<div class="govuk-summary-list__row">
<dt class="govuk-summary-list__key">
{{question.question_text}}
37 changes: 37 additions & 0 deletions app/views/doc-pagination.njk
@@ -0,0 +1,37 @@
<nav class="govuk-pagination form-pagination govuk-!-margin-bottom-2" aria-label="Pagination">

<div class="govuk-pagination__prev">
{% if pageNum | float > 1 %}
<a class="govuk-link govuk-pagination__link govuk-link--no-visited-state" href="/results/form-{{formId}}/{{ pageNum | float - 1 }}" rel="prev">
<svg class="govuk-pagination__icon govuk-pagination__icon--prev" xmlns="http://www.w3.org/2000/svg" height="13" width="15" aria-hidden="true" focusable="false" viewBox="0 0 15 13">
<path d="m6.5938-0.0078125-6.7266 6.7266 6.7441 6.4062 1.377-1.449-4.1856-3.9768h12.896v-2h-12.984l4.2931-4.293-1.414-1.414z"></path>
</svg>
<span class="govuk-pagination__link-title">
Previous<span class="govuk-visually-hidden"> page</span>
</span>
</a>
{% endif %}
</div>

<ul class="govuk-pagination__list">
<li class="govuk-pagination__item govuk-pagination__item--current">
Page {{ pageNum }}
</li>
</ul>

<div class="govuk-pagination__next">
{% if pageNum | float < filePages %}
<a class="govuk-link govuk-pagination__link govuk-link--no-visited-state" href="/results/form-{{formId}}/{{ pageNum | float + 1 }}" rel="next">
<span class="govuk-pagination__link-title">
Next<span class="govuk-visually-hidden"> page</span>
</span>
<svg class="govuk-pagination__icon govuk-pagination__icon--next" xmlns="http://www.w3.org/2000/svg" height="13" width="15" aria-hidden="true" focusable="false" viewBox="0 0 15 13">
<path d="m8.107-0.0078125-1.4136 1.414 4.2926 4.293h-12.986v2h12.896l-4.1855 3.9766 1.377 1.4492 6.7441-6.4062-6.7246-6.7266z"></path>
</svg>
</a>
{% endif %}
</div>

</nav>
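
The previous/next logic in the template above can be sketched in plain JavaScript. This is only an illustration: `paginationLinks` is a hypothetical helper, not part of the PR. It assumes `pageNum` may arrive as a string (for example from a URL parameter), which would explain why the template applies the `float` filter before doing arithmetic.

```javascript
// Hypothetical helper mirroring the template's conditions.
function paginationLinks(formId, pageNum, filePages) {
  // Convert first, as the `float` filter does: in JavaScript "2" + 1 is "21"
  const page = Number(pageNum);
  const base = `/results/form-${formId}/`;
  return {
    // No "previous" link on the first page
    prev: page > 1 ? base + (page - 1) : null,
    // No "next" link on the last page
    next: page < filePages ? base + (page + 1) : null,
  };
}

console.log(paginationLinks("abc", "2", 3));
```

A `null` value corresponds to the template leaving the `govuk-pagination__prev` or `govuk-pagination__next` container empty.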


117 changes: 117 additions & 0 deletions app/views/form-popup.njk
@@ -0,0 +1,117 @@
{% extends "govuk/template.njk" %}
{% import "answer-types.njk" as answerType %}

{% set questionJson = fileData.pages[question-1] %}

{% block head %}
<link href="/assets/style.css" rel="stylesheet">
{% endblock %}

{% block header %}

{{ govukHeader({
useTudorCrown: true
}) }}

{% endblock %}

{% from "govuk/components/back-link/macro.njk" import govukBackLink %}
{% from "govuk/components/button/macro.njk" import govukButton %}


{% block beforeContent %}

{% if question > 1 %}

{{ govukBackLink({
text: "Back",
href: question-1
}) }}

{% endif %}

{% endblock %}

{% block content %}

<div class="govuk-grid-row">
<div class="govuk-grid-column-two-thirds">

<span class="govuk-caption-m">Question {{question}}</span>


{% if questionJson.answer_type == "date" and questionJson.answer_settings.input_type == "date_of_birth" %}
{{ answerType.date_of_birth(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "date" %}
{{ answerType.other_date(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "name" %}
{{ answerType.name(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "number" %}
{{ answerType.number(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "email" %}
{{ answerType.email(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "text" %}
{{ answerType.text(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "national_insurance_number" %}
{{ answerType.national_insurance_number(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "phone_number" %}
{{ answerType.phone_number(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "organisation_name" %}
{{ answerType.organisation_name(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "address" %}
{{ answerType.address(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "yes_no_question" %}
{{ answerType.yes_no(questionJson.question_text, questionJson.hint_text) }}

{% elif questionJson.answer_type == "single_choice" %}
{{ answerType.single_choice(questionJson.question_text, questionJson.hint_text, questionJson.options) }}

{% elif questionJson.answer_type == "multiple_choice" %}
{{ answerType.multiple_choice(questionJson.question_text, questionJson.hint_text, questionJson.options) }}

{% else %}
<h1 class="govuk-heading-l">
{{questionJson.question_text}}
</h1>
<p class="govuk-body">{{questionJson.hint_text}}</p>
{% endif %}


{% if fileData.pages.length == question %}
{{ govukButton({
text: "Continue",
href: "/form-popup/" + formId + "/" + question + "/check-answers"
}) }}
{% else %}
{{ govukButton({
text: "Continue",
href: question + 1
}) }}
{% endif %}

</div>
</div>



{% endblock %}

{% block bodyEnd %}
{# Run JavaScript at end of the <body>, to avoid blocking the initial render. #}
<script type="module" src="/assets/govuk-frontend.min.js"></script>
<script type="module">
import { initAll } from '/assets/govuk-frontend.min.js'
initAll()
</script>
<script src="/assets/scripts.js"></script>
{% endblock %}
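
The `if`/`elif` chain in this template amounts to a lookup from `answer_type` to a macro in `answer-types.njk`, with a special case for dates and a plain-heading fallback. A minimal JavaScript sketch of that mapping follows. `pickMacro` and `MACROS` are hypothetical names used for illustration only. The type and macro names are copied from the template.

```javascript
// Answer types with a dedicated macro, taken from the template's elif chain.
// Keys are answer_type values; values are macro names in answer-types.njk.
const MACROS = new Map([
  ["name", "name"],
  ["number", "number"],
  ["email", "email"],
  ["text", "text"],
  ["national_insurance_number", "national_insurance_number"],
  ["phone_number", "phone_number"],
  ["organisation_name", "organisation_name"],
  ["address", "address"],
  ["yes_no_question", "yes_no"],
  ["single_choice", "single_choice"],
  ["multiple_choice", "multiple_choice"],
]);

// Hypothetical helper: returns the macro name the template would call,
// or null when it falls back to a plain heading and hint paragraph.
function pickMacro(questionJson) {
  const type = questionJson.answer_type;
  if (type === "date") {
    // Dates of birth get their own macro; every other date uses other_date
    const inputType = questionJson.answer_settings?.input_type;
    return inputType === "date_of_birth" ? "date_of_birth" : "other_date";
  }
  return MACROS.get(type) ?? null;
}

console.log(pickMacro({ answer_type: "yes_no_question" }));
```

In the template only `single_choice` and `multiple_choice` also pass `questionJson.options` to their macro. Every other macro receives just the question and hint text.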