HLA gives the California Cancer Registry a 70% Efficiency Boost using Language Engineering


The California Cancer Registry (CCR) is a state-wide, population-based cancer surveillance system serving nearly 40 million inhabitants and handling more than 175,000 new cases annually. Currently, the CCR receives upwards of 600,000 pathology reports per annum, of which approximately 50% are irrelevant to the Registry’s activities. This puts a large, non-productive resource drain on CCR staff.

Further complicating the task, the CCR faces a looming increase in document volume, perhaps to 1 million documents per annum, driven by new State legislation that makes electronic reporting compulsory for all organisations providing services to cancer patients.

CCR Solution

At CCR, HLA’s HORIZON production line solves two of their biggest problems:

  • Case Identification – Separating Reportables and Non-Reportables;
  • Coding of 5 cancer attributes.

With the support of HLA, the CCR has installed a solution that automatically routes all incoming reports into Reportable or Non-Reportable workflows with an accuracy of 97%.

After determining that a document is a reportable cancer case, the CCR needs to identify five attributes: Tumour Site, Histology, Grade, Behaviour and Laterality. Once correctly identified, these must be coded according to ICD-O-3 and the MP/H rules, as well as a range of other less formal rules.

The HORIZON Coding engine outperforms all previous efforts. It attained 97.5% accuracy in the laboratory and 94% in client tests, significantly better than manual coding.

This is the most broad-based analysis yet produced for pathology reports, covering the great majority of the CCR’s Site codes (130+), Histology codes (130+) and most frequently reporting Laboratories (50+).

HLA automates the coding of seven out of every ten reportable documents, leaving only three out of ten to be managed manually. This is expected to improve further with ongoing development efforts.

Together, the two processes reduce the number of documents processed manually by 100% for cancer case recognition and by more than 70% for coding, with the added benefit of a significant improvement in accuracy.
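To make the scale of that reduction concrete, the short sketch below applies the percentages quoted in this article to the annual report volumes. It is only back-of-the-envelope arithmetic: the function name `manual_workload` is hypothetical, the quoted figures are treated as exact, and applying the same shares to the projected 1 million documents is an assumption rather than a CCR forecast.

```python
# Figures quoted in the article; treating them as exact values is an assumption.
REPORTS_PER_YEAR = 600_000   # "upwards of 600,000 pathology reports per annum"
IRRELEVANT_PCT = 50          # "approximately 50% are irrelevant"
AUTO_CODED_PCT = 70          # "seven out of every ten reportable documents"


def manual_workload(total_reports, irrelevant_pct, auto_coded_pct):
    """Return (reportable documents, documents still coded by hand)."""
    reportable = total_reports * (100 - irrelevant_pct) // 100
    manual_coding = reportable * (100 - auto_coded_pct) // 100
    return reportable, manual_coding


# Current volume, using the article's figures.
current = manual_workload(REPORTS_PER_YEAR, IRRELEVANT_PCT, AUTO_CODED_PCT)
print("current volume:", current)

# Projected volume; assumes the same shares still hold at 1 million documents.
projected = manual_workload(1_000_000, IRRELEVANT_PCT, AUTO_CODED_PCT)
print("projected volume:", projected)
```

On these assumptions, staff no longer triage any of the incoming reports by hand, and manual coding is needed for roughly three in ten of the reportable ones.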

The CCR’s own progress account of the project at mid-2017 can be found here.

The Technology

The HORIZON ORCS (Oncology Reportability and Coding System) offers not only massively improved document processing efficiency but also separation of reports into groups that are much easier to audit and to use for training staff.

The computed document groups are:

  • Reportable
  • Non-Reportable
  • Unusable
  • No Final Diagnosis section
  • Manual Processing
  • IHC and Genomics

The Reportable group is further subdivided by: Document complexity, Surgical Procedure, and Tumour Stream.

The Non-Reportable group includes reports with: all cancer excised, no recurrence, no cancer identified, non-reportable cancers, other diseases,…

Many organisations that do text mining claim to use Natural Language Processing (NLP). However, their methods are limited to various string-matching techniques. These cannot find anything that has not been defined in their target strings, and as the range of target strings is expanded to cover more scope, they gather in more and more false positives.

The HLA Advantage

Historically, NLP is the research field of Computational Linguistics devoted to creating computational methods for knowledge about language; it combined the expertise of linguists and computer scientists.

Statistical NLP (SNLP) is a research method of Computational Linguistics that has emerged in the past 15-20 years in document analysis. SNLP builds a language model of the semantics of the text by statistically representing the variation of contexts in which words are used. In this way it can identify appropriate content that it has never seen before by understanding the statistical characteristics of a word or phrase’s usage.
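The idea of representing a word by the contexts it appears in can be illustrated with a very small sketch. This is a generic distributional-similarity example and is not a description of HLA's HORIZON implementation: the three-sentence corpus is invented for illustration, and the helper names `context_vectors` and `cosine` are hypothetical. Each word is described by counts of its neighbouring words, and two words are compared by the cosine of those count vectors.

```python
from collections import Counter
from math import sqrt

# Toy corpus invented for illustration only; not real registry data.
CORPUS = [
    "biopsy shows invasive carcinoma of the breast",
    "biopsy shows invasive adenocarcinoma of the colon",
    "no tumour seen in the breast tissue",
]


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = context_vectors(CORPUS)
same_role = cosine(vectors["carcinoma"], vectors["adenocarcinoma"])
other_role = cosine(vectors["carcinoma"], vectors["breast"])
print(f"carcinoma vs adenocarcinoma: {same_role:.2f}")
print(f"carcinoma vs breast:         {other_role:.2f}")
```

In this toy corpus the two histology terms occur in the same surroundings, so they score as more similar to each other than either does to an anatomical site, even though no list of target strings links them. Production systems rely on far larger corpora and richer models than raw neighbour counts.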

HLA’s Language Engineering harnesses SNLP into a formal Development Methodology that enables the rapid construction of a production line comprising document processing, semantic concept recognition, content extraction and a coding engine.

With leading-edge commercialised research, HLA has engineered a practical solution to two of the Registry’s biggest problems.
