FormExtractor
Make AI document extraction easy for everyone
FormX turns physical documents into structured data automatically. To accommodate the ever-growing use cases of our customers and differentiate our product, we empowered users to customize and train their own extractors with no code.
My Role
Led the lean UX process with our PM to validate three of our biggest assumptions. Designed and shipped the first release of this feature.
Project Team
Me, Lead Designer
Frank, Design Manager
Fung, Product Manager
Jason, Senior ML Developer
Ben, Junior Developer
Main Challenge
Product Differentiation and Stickiness
Document extractors automate data entry by using AI to pull data from scanned copies of documents. As AI becomes mainstream, more document extractors are entering the market. FormExtractor, a late entrant, faced the question:
How might FormExtractor differentiate itself from the competition while also increasing existing customers' lifetime value?
Discovery
Few no-code products let users create custom extractors
FormExtractor falls on the no-code side of the document extractor market. Most no-code products, including FormExtractor, only provide templates of documents to extract (e.g. IDs, receipts).
Process
Using Lean Methodology to Test Our Assumptions
We had a small product team with three engineers, who were occupied with a large backlog of custom features for our existing enterprise customers, so we had little bandwidth to explore this new feature.
Our PM decided to use the lean methodology to explore the feasibility and product/market fit of this feature. It was up to me to decide how to implement this strategy.
I started by listing all the assumptions behind our hypothesis to identify the fundamental one it hinged on. After discussing with our PM and account managers, we landed on this: customers have many types of data and many formats of documents to extract.
Testing our first assumption
Customers need to extract various data from many document types
To test this assumption, I focused on understanding the document formats users would upload and the types of data they wanted to extract. We therefore built only a document uploader and a form to collect the types of data each customer wanted to extract.
We were excited to find that there was indeed a need for custom extractors, given the diverse range of data and document formats our users wanted to extract.
Deeper Insight
Uploading & labeling will make or break the experience
Once the value hypothesis was validated, we focused on making custom extractor creation as self-serve as possible, to reduce the workload on our account managers and engineers.
From interviews with customers who tried our MVP and with the account managers serving our enterprise customers, we found that the upload and labeling steps were the most challenging:
Customers could seldom set up labels and label the samples by themselves
Customers suspected we were manually inputting data because they did not know how data was labeled and extracted
Customers uploaded very few high-quality samples for the ML model to learn from
Therefore, after validating our fundamental value hypothesis, I turned my attention to making uploading and labeling simple yet sufficient to create a custom extractor.
Improving Uploading & Labeling
Nudging users to upload more high-quality samples
In our MVP, we saw that users often uploaded only 1-2 samples, which was not enough to train the ML model for a custom extractor. The quality of some samples was also low, making the extractor inaccurate.
The number of samples is critical to the extractor's accuracy because more examples give the model more to learn from. I designed a number of nudges: requiring a minimum number of samples before training could start, making batch uploads easy with drag-to-upload, and adding an indicator that showed how many more samples the user should upload to get a more accurate extractor.
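To illustrate, the indicator boiled down to simple threshold logic like the sketch below; the thresholds here are hypothetical, not the numbers we shipped.

```python
# Minimal sketch of the upload-nudge logic. MIN_SAMPLES and
# RECOMMENDED_SAMPLES are illustrative thresholds, not FormX's real values.
MIN_SAMPLES = 3
RECOMMENDED_SAMPLES = 10

def upload_hint(uploaded: int) -> str:
    """Return the nudge text shown next to the sample uploader."""
    if uploaded < MIN_SAMPLES:
        return f"Upload at least {MIN_SAMPLES - uploaded} more sample(s) to start training."
    if uploaded < RECOMMENDED_SAMPLES:
        return f"Add {RECOMMENDED_SAMPLES - uploaded} more sample(s) for a more accurate extractor."
    return "You have enough samples; more will keep improving accuracy."
```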
After we implemented these nudges, users uploaded twice as many samples on average.
Improving the quality of uploaded samples
It is also important that the uploaded samples are high quality. At first, we gave users simple instructions on what to look out for when uploading samples, but in our proof-of-concept some uploads were still not up to standard.
After discussing with our engineers, I learned that they already post-processed uploaded samples to correct for skew, contrast, and accidental crops.
I suggested running that processing as users upload their samples. Coupled with a few simple checks for contrast and legibility, this let us give users feedback on sample quality in real time.
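To make the idea concrete, here is a minimal sketch of what such upload-time checks might look like, assuming Pillow is available; the thresholds and messages are illustrative rather than our production values, and the real pipeline also corrected skew and crops.

```python
# Rough sketch of upload-time quality checks (illustrative thresholds).
from PIL import Image, ImageStat

MIN_LONG_EDGE = 1000      # px: below this, small print is usually illegible
MIN_CONTRAST_STDDEV = 40  # grayscale std. dev.: low values suggest a washed-out scan

def check_sample_quality(path: str) -> list[str]:
    """Return a list of warnings to show the user right after upload."""
    warnings = []
    with Image.open(path) as img:
        gray = img.convert("L")  # grayscale copy for the contrast check
        if max(img.size) < MIN_LONG_EDGE:
            warnings.append("Image resolution is low; text may be hard to read.")
        if ImageStat.Stat(gray).stddev[0] < MIN_CONTRAST_STDDEV:
            warnings.append("Contrast is low; try rescanning with better lighting.")
    return warnings
```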
Improving Uploading & Labeling
Making labeling easy for everyone
Labeling means telling the system where each piece of data is located on a sample and what it looks like, usually by drawing a rectangle around it. The precision of the labels determines the accuracy of the extractor.
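Conceptually, each label boils down to a small record: which sample it was drawn on, which field it marks, and where the rectangle sits. A hypothetical sketch (the field names are assumptions for illustration):

```python
# Hypothetical shape of a single label; not FormX's actual schema.
from dataclasses import dataclass

@dataclass
class LabelBox:
    sample_id: str  # which uploaded sample the rectangle was drawn on
    field: str      # what the data is, e.g. "invoice_number" or "total_amount"
    x: float        # left edge of the rectangle, in page pixels
    y: float        # top edge of the rectangle, in page pixels
    width: float    # width of the rectangle
    height: float   # height of the rectangle
```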
Iteration 1: using an open-source labeler
We first embedded an open-source labeling tool into our product for users to label data, but many of its functions were irrelevant and confused our users.
Iteration 2: adding a tutorial to the open-source labeler
I then designed tutorial steps overlaid on the embedded labeling tool to walk the user step by step through labeling one piece of data on one sample.
However, that also proved too complicated for most users: the open-source labeling tool was built for more sophisticated labeling tasks, and its plethora of features and granular controls were irrelevant to our users.
Iteration 3: building our own labeler
I proposed building our own labeling tool that stripped away most of the complex functions of the open-source one. This was the biggest engineering commitment we would make, but I argued that a simplified design would reduce the time our account managers and engineers spent tutoring users or labeling data for them.
Our engineers figured out a way to build our own labeling front-end and feed the labeling data, such as the positions of the rectangles, back to the open-source labeling tool, so we did not need to build our own backend processor. This was a win-win for both design and engineering.
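The hand-off is easiest to picture as a small translation step: rectangles drawn in our own front-end get converted into whatever annotation format the open-source tool's backend already understands. The sketch below assumes a pixel-coordinate rectangle in and a normalized bounding box out; the target schema is a stand-in, not the tool's actual format.

```python
# Hedged sketch of the front-end-to-backend hand-off; the output schema is
# hypothetical, standing in for the open-source tool's annotation format.
def to_backend_annotation(field: str, rect: dict, page_w: int, page_h: int) -> dict:
    """Convert a front-end rectangle (pixel coords) into a normalized annotation."""
    return {
        "label": field,
        "bbox": [
            rect["x"] / page_w,                # normalizing makes the annotation
            rect["y"] / page_h,                # independent of the display size
            (rect["x"] + rect["w"]) / page_w,  # right edge
            (rect["y"] + rect["h"]) / page_h,  # bottom edge
        ],
    }
```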
Building our first release
Designing lean and agile
With few dedicated engineering resources for this feature, it was my responsibility to identify our assumptions, build incremental MVPs to test them, and progressively polish the product for release.
Thanks to our PM, who worked with me to scope the product, this lean process let us test our assumptions with three MVP iterations before we built the release version.
By the end of May, we had a clear idea of what our first release would look like. While most of its features had been designed and tested, much polishing remained. Since our development team ran an agile process, I polished my designs in sprints over the two months of development.
It was challenging at first: it was hard to maintain consistency when designs were not done sequentially, and the design changed often during the development sprints.
So I drew wireframes for all features yet to be developed and created the final design only right before each sprint, once requirements were finalized, to limit major changes. Changes were still unavoidable, but at least major flows were easier to change on wireframes.
Outcome and Impact
Design to increase the bottom line
FormExtractor became one of the few products in the market that empower users to create their own custom document extractors. Custom extractors became the building block for solving more nuanced and complex extraction tasks without our engineers building custom features on top of our current suite.
3.7x
higher LTV for customers who use custom extractors
45%
of data extractions are done with a custom extractor