Make AI document extraction easy for everyone

Problem

FormExtractor has to stand out in the crowded no-code document extractor market. Our customers want to extract complex data from many more types of documents.

Solution

By empowering everyone to build their custom extractors, we have scaled our product to address any extraction use case, and put ourselves into a unique position in the market.

Outcome

Customers building custom extractors showed 3.7x higher lifetime value, and customer support time dropped by 70%.

 

My Role

Led the lean UX process with our PM to validate three of our biggest assumptions. Designed and shipped the first release of this feature.

Project Team

Me, Lead Designer
Frank, Design Manager
Fung, Product Manager
Jason, Senior ML Developer
Ben, Junior Developer

Create Versatile Extractor

Users start creating their custom extractor by uploading samples for our system to learn from. We enhance the uploaded samples on our backend and nudge users to upload more for better results.

Make Labelling Easy

We empower users to set up the extractor by letting them label the samples. When they instruct the system how to extract data just as they would input manually, we foster trust in our system.

Get Data From Document

Once the extractor is created, users can get data out of thousands of documents in minutes, or integrate our API into their workflow applications to extract data automatically.
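To give a feel for the API flow described above, here is a minimal sketch of parsing an extraction response. The response schema and field names are hypothetical illustrations, not FormExtractor's actual API.

```python
import json

# Hypothetical response from a custom-extractor API call.
# The schema below is illustrative, not FormExtractor's real format.
SAMPLE_RESPONSE = json.dumps({
    "status": "ok",
    "fields": [
        {"name": "invoice_number", "value": "INV-1042", "confidence": 0.97},
        {"name": "total", "value": "1,250.00", "confidence": 0.91},
    ],
})

def extracted_fields(raw: str, min_confidence: float = 0.8) -> dict:
    """Keep only the fields the extractor is reasonably confident about."""
    payload = json.loads(raw)
    return {
        f["name"]: f["value"]
        for f in payload.get("fields", [])
        if f.get("confidence", 0.0) >= min_confidence
    }

print(extracted_fields(SAMPLE_RESPONSE))
```

A workflow application would call the extraction endpoint per document and feed a dictionary like this straight into its own records.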

 

Challenge and Research

How to create product differentiation & stickiness

FormExtractor falls on the no-code side of the document extractor market, which is crowded with receipt and invoice scanners.

While we support more types of documents, we heard repeatedly from our large enterprise customers and prospects that they wanted to extract data from documents in their own formats. Because of this, we lost a HK$3M opportunity to automate the application and claims system of the largest insurance company in Hong Kong.

 

Customers want to extract from more document types than we provide

Early Explorations

HMW help customers extract from any document?

We looked at a few existing text recognition methods. Based on our engineering resources and after some discussion, our product team narrowed down to two diverging approaches: provide templates, or let customers create their own custom extractors.

 

We were excited that there was indeed a need for creating custom extractors, given the diverse range of data and document formats that our users would like to extract.

 

Create templates

Can immediately respond to customer needs

Easy to build templates for each document type

No refactoring the backend is needed

Custom extractor builder

Build it once and can scale infinitely to customers’ needs

New processes and refactoring are needed

Can become one of our UVPs

 
 

Build Proof of Concept to validate the direction

I understood that our engineers did not want to undertake the huge engineering challenge of creating a new tool. But a custom extractor builder would help us scale our offering and free up engineers' time from customizing models for customers, so they could focus on product innovation.

To prove that a custom extractor builder was the more scalable and time-saving direction, I asked our engineers to help me build a simple proof of concept to see how many different types of documents our customers wanted to extract from.

 

The proof of concept only has a file uploader and a form asking for information to extract from the samples

We shipped it to some existing customers in 2 weeks. Within 1 week, they had requested extractions from many document and data types. That was enough to prove to the team that creating templates for every document was simply not practical, and that a builder would help us scale our service to more customers.

PoC Result After 1 Week

34

Custom Requests

10

Document Types

234

Data Types To Extract

 

Research and Deeper Insight

Customers need a sense of control in the extraction

Once the direction was set, I went back to the customers who had tried our PoC and asked about their experiences. Surprisingly, some customers didn't want to use it anymore.

Some of them thought that we were manually inputting data for them, and did not believe our system could handle their daily document extraction load if it were manual. Others simply did not get very accurate extractions.

Customer Interview Stats

5

PoC Customers

8

Interview Sessions

 
 

Customers did not trust our system

To me, this revealed a deeper issue: our customers did not trust our system to extract information reliably because 1) they were confused about how the extractor extracted data, and 2) the extracted data was not accurate.

I identified uploading and labeling as the two key steps that would foster trust in our system. It is about getting users to upload more high-quality samples for our system to learn from, and creating a sense for users that they are instructing the system to extract data.

 

Nudging users to upload more for better results, and empowering users to set up the extractor themselves, will foster their trust in our extractor

Improve Uploading & Labelling

Nudging users to upload more high-quality samples

In our MVP, we saw that users often uploaded only 1-2 samples, which is not enough to train the ML model for a custom extractor. The quality of some samples was also low, making the extractor inaccurate.

The number of samples is critical to the accuracy of the extractor because more samples give the system more examples to learn from. I designed a number of nudges and decided to require users to upload a few samples to start. I also made uploading many samples at once easy with drag-to-upload. An indicator showed how many more samples the user should upload to get a more accurate extractor.
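The indicator logic can be sketched as a simple threshold check. The sample counts below are illustrative guesses, not our actual product thresholds.

```python
# Illustrative thresholds, not the product's real numbers.
MIN_SAMPLES = 3
RECOMMENDED_SAMPLES = 10

def upload_nudge(uploaded: int) -> str:
    """Message for the indicator nudging users toward more samples."""
    if uploaded < MIN_SAMPLES:
        return f"Upload at least {MIN_SAMPLES - uploaded} more sample(s) to start."
    if uploaded < RECOMMENDED_SAMPLES:
        return f"Add {RECOMMENDED_SAMPLES - uploaded} more for a more accurate extractor."
    return "Great! You have enough samples for a solid extractor."

print(upload_nudge(1))
print(upload_nudge(5))
```

The two thresholds separate a hard gate (the extractor cannot start training) from a soft nudge (accuracy would still improve), which is what lets the copy stay encouraging rather than blocking.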

 

Users have to upload a few samples to start creating their extractor

Nudging users to upload more samples to create a better extractor

After we implemented those nudges, users on average uploaded twice as many samples.

 
 

Improving quality of samples uploaded

It is also important that the samples uploaded by users are of high quality. At first, we gave simple instructions on what to look out for when uploading samples. But in our proof of concept, some samples uploaded by users were still not up to standard.

 

Examples of low quality samples users uploaded that will affect the extractor’s accuracy

 

After discussing with our engineers, it turned out they were already processing uploaded samples afterwards to correct for skew, contrast, and accidental crops.

I suggested processing the samples as users upload them. Coupled with a few simple tests for contrast and legibility, we could give users feedback about sample quality in real time.
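One of the simple quality tests can be sketched as a contrast check on grayscale pixel values: a scan whose pixels barely vary is likely washed out. The spread threshold here is an illustrative guess, not our production value.

```python
from statistics import pstdev

def contrast_ok(gray_pixels, min_spread=40.0):
    """Flag low-contrast samples: grayscale values (0-255) that barely
    vary are likely washed-out scans the model can't learn from.
    The threshold is illustrative, not a production value."""
    return pstdev(gray_pixels) >= min_spread

# A washed-out sample: all pixels cluster around mid-gray.
washed_out = [120, 125, 122, 128, 124, 126] * 100
# A legible sample: dark text on a light background.
legible = [20] * 100 + [235] * 500

print(contrast_ok(washed_out))  # low spread, so flag for re-upload
print(contrast_ok(legible))
```

Running checks like this client-side, while the backend does the heavier deskew and crop correction, is what makes the real-time quality alert possible.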

 

Design iterations on the alert when the sample may be low quality

Animation while an uploaded sample is processing

Alert when the sample is low quality

Improve Uploading & Labelling

Making labeling easy for everyone

Labeling involves telling the system where the data is located and what it looks like on the sample, usually by drawing a rectangle around the data. The precision of labeling determines the accuracy of the extractor.

Iteration 1: using an open-source labeler

We first embedded an open-source labeling tool onto our product for users to label data. But many of the functions were irrelevant, which confused our users.

 

Iteration 1: Embedding the open-source annotation tool right on our labeling page

 

Iteration 2: Add tutorial to open-source labeler

I then designed tutorial steps overlaid on the embedded labeling tool to walk the user step by step through labeling one data field on one of the samples.

 

Iteration 2: Placing tutorial steps to guide users to use the annotation tool

However, that also proved too complicated for most users: the open-source labeling tool was built for more sophisticated labeling tasks, and its plethora of features and granular control were irrelevant to our users.

 

Iteration 3: building our own labeler

I proposed building our own labeling tool that stripped away most of the complex functions of the open-source one. This was the biggest engineering commitment we would make, but I argued that a simplified design would reduce the time our account managers and engineers spent tutoring users or labeling data for them.

 

Deciding what tools to include in our in-house annotation tool

Our engineers figured out a way to build our own labeling front-end and feed the labeling data, such as the positions of the rectangles, back to the open-source labeling tool, so we didn't need to build our own backend processor. This was a win-win for both design and engineering.
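The handoff above amounts to a small format conversion. Here is a sketch of what translating our simplified front-end label into an annotation-backend structure could look like; all field names and the corner-point convention are assumptions for illustration, not the actual schemas.

```python
def to_backend_format(label):
    """Convert our simplified front-end label (a named rectangle) into
    an annotation structure a labeling backend could consume.
    Field names and shapes here are illustrative, not the real schema."""
    x, y, w, h = label["rect"]
    return {
        "label": label["field_name"],
        "type": label["data_type"],
        # Many annotation backends expect corner points rather than
        # x/y/width/height, so convert the rectangle to four corners.
        "points": [[x, y], [x + w, y], [x + w, y + h], [x, y + h]],
    }

front_end_label = {
    "field_name": "invoice_date",
    "data_type": "date",
    "rect": (120, 80, 200, 24),  # x, y, width, height in pixels
}
print(to_backend_format(front_end_label))
```

Keeping the front-end format minimal and converting at the boundary is what let us simplify the UI without rewriting the backend.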

 

Step 1: Select the Detection Region tool

Step 2: Draw the Detection Region around the data to extract

Step 3: Define the field name and type of the data

Building our first release

Designing lean and agile

With few dedicated engineering resources for developing this feature, it was my duty to figure out our assumptions, build incremental MVPs to test them, and progressively polish our product for release.

Our PM worked with me to scope the product, and through this lean design process we tested three MVP iterations to validate various assumptions before building our release version.

 

We developed three MVPs, each validating assumptions and refining key features to define our first release

 

By the end of May, we had a clear idea of what our first release would look like. While most of the features in the first release were designed and tested, there was still much polishing to be done. Since our development team ran an agile process, I had to polish my designs in sprints over the two months of development.

It was challenging at first because it was hard to maintain consistency when designs were not done sequentially. There were also many changes to the design during the development sprints.

 

Screen designs organized in development sprints, with wireframes for features that are in future sprints

So I drew wireframes for all features yet to be developed, and only created the final design before each sprint, after the requirements were finalized, to limit major changes. Changes were still unavoidable, but at least major flows were easier to change on wireframes.

 

Outcome and Impact

Design to increase bottom line

FormExtractor became one of the few products in the market that empower users to create their own custom document extractor. Custom extractors became the building block for solving more nuanced and complex extraction tasks without our engineers building custom features on top of our current suite.

 

3.7x

higher LTV for customers with access to custom extractors

65%

of data extraction done using custom extractors

 

Thanks for reading!
See all my work
