Big Text mining expert Professor Jon Patrick speaks at Big Data and Language Analytics in Healthcare conference

Press Release

20 October 2015

One of Australia’s foremost experts in health language analytics and Big Text mining, Professor Jon Patrick, will speak at the HISA conference Big data analytics: Leveraging capability in healthcare (Sydney, October 20-21, 2015).

Prof. Patrick is CEO of Health Language Analytics (HLA), which delivers advanced language processing for health texts, including information extraction and text analytics, known as Big Text Analytics.

The HLA technology focuses on the 80% of “dark text” or unstructured text that existing big data analytics can’t read and analyse.

HLA delivers customised language processing services for some of Australia’s largest hospitals and cancer registries with 97% accuracy and at one-tenth the cost of existing methods.

HLA’s expertise includes:

  • Natural Language Processing (NLP) of clinical texts
  • Clinical Data Analytics
  • Language Engineering Infrastructure

The Big data analytics: Leveraging capability in healthcare conference introduces the emerging trend for personalised healthcare and the groundswell of consumer-collected health data flowing into the health system, offering both challenges and opportunities for improved – evidence based – patient care.

Speakers include Prof Isaac Kohane, Director of the Centre for Biomedical Informatics, Harvard Medical School, and Dr Zoran Bolevich, Acting Chief Executive, Chief Information Officer, eHealth NSW.

“The volume of content in the medical record is 80% text and so far no-one has been able to mine it for useful purposes. Big Data has actually only focused on 20% of the medical record. Our work in Big Text will enable the effectiveness of Big Data to be effectively quadrupled overnight and enable a vastly greater range of research projects along with massively increased scale,” said Prof. Patrick.

The conference attracts leaders in healthcare, including healthcare practitioners, information specialists in healthcare environments, and university and research institute scientists who are working to harness the power of big data for patient outcomes and business.

Prof. Patrick will speak on “Patient analytics and clinical decision making”.


HLA, founded in 2012 in Sydney by Professor Jon Patrick, who was Chair of Language Technology at the University of Sydney, is leading the race to mine unstructured text in the health sector, where it’s estimated 80% of text cannot currently be read or analysed accurately.

The HLA technology directly impacts patient health, treatment discovery and company and institution finances, delivering results at one-tenth the cost of existing methods.

Prof. Patrick, a Eureka Science Prize Winner 2005 who has more than 100 publications and 7 patents pending, has built a technology team in Sydney that delivers text-mining services for some of Australia’s largest hospitals and cancer registries with 97% accuracy.


Big data analytics: Leveraging capability in healthcare

Paper Abstract

Big Text Analytics – A Services Model

Jon Patrick, Min Li, Pooyan Asgari

Health Language Analytics, Sydney, Australia


In the health setting, up to 80% of patient information is stored in text. If Big Data methods are to make significant advances in exploring real world evidence, it will be through Clinical Natural Language Processing (CNLP), that is, Big Text Analytics. Many text analytics services exist to feed the Big Data Behemoth, but they also bring their own benefits independently.

Technology Brief

A closer look at the services that can be delivered by CNLP gives an overview of the range of benefits an organisation can achieve. The services are shown in the inner wheel and the desired functions in the outer wheel (see Figure 1).

A search engine for clinical records requires a more sophisticated understanding of medical language than Google, as it needs to understand concepts rather than exactly match search strings. However, a technology that can perform concept searches is built on an extensive infrastructure that can be exploited for many other functions.
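The difference between exact string matching and concept matching can be illustrated with a minimal sketch. It is not the HLA search engine: the `CONCEPTS` dictionary, the concept label and the sample documents are hypothetical, and a real system would draw on a clinical terminology rather than a hand-written table.

```python
import re

# Hypothetical concept dictionary: surface phrase -> concept label.
# Illustrative only; a real system would use a clinical terminology.
CONCEPTS = {
    "heart attack": "myocardial_infarction",
    "myocardial infarction": "myocardial_infarction",
}

def string_search(query, documents):
    """Return documents containing the exact query string (case-insensitive)."""
    q = query.lower()
    return [d for d in documents if q in d.lower()]

def concepts_in(text):
    """Return the concept labels whose phrases occur as whole words in text."""
    lowered = text.lower()
    found = set()
    for phrase, concept in CONCEPTS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(concept)
    return found

def concept_search(query, documents):
    """Return documents that share at least one concept with the query."""
    wanted = concepts_in(query)
    return [d for d in documents if wanted & concepts_in(d)]

# Made-up sample documents.
docs = [
    "Patient admitted with heart attack.",
    "History of myocardial infarction in 2010.",
    "Fractured left wrist.",
]

exact_hits = string_search("heart attack", docs)
concept_hits = concept_search("heart attack", docs)
```

Here `exact_hits` contains only the first document, while `concept_hits` also contains the second, because both phrases map to the same concept label.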

Implementation Processes

Big Text Analytics is most successful when the language model it uses is tuned to suit specific objectives.

Content Searching, that is, searching for documents by their content, is important when the identity of the patient is unknown, such as when identifying a cohort group. One Sydney pathology group wanted a systematic method for retrieving cancer reports. A search engine was built and tuned from a sample of 4000 pathology reports. Testing of the search consisted of formulating 57 cancer queries. Analysis of the results showed a false positive rate of 12% and a false negative rate of 4%, the latter restricted to only a few queries. Subsequently the search engine was used in a research project to find a cohort of prostate cancer patients in 20 minutes, compared with an estimated 2 person-months for a manual search.
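The abstract does not state how the false positive and false negative rates were calculated. As a sketch under that caveat, the standard definitions (false positives over all non-relevant reports, false negatives over all relevant reports) could be computed for a single query as follows. The function name `error_rates` and the report ids are made up for illustration.

```python
def error_rates(retrieved, relevant, collection):
    """Return (false positive rate, false negative rate) for one query.

    Assumes the standard definitions:
      FPR = FP / (FP + TN), FNR = FN / (FN + TP).
    """
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    tp = len(retrieved & relevant)               # relevant and retrieved
    fp = len(retrieved - relevant)               # retrieved but not relevant
    fn = len(relevant - retrieved)               # relevant but missed
    tn = len(collection - retrieved - relevant)  # correctly left out
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

# Made-up example: ten report ids, four of them relevant to the query.
collection = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 3, 5}

fpr, fnr = error_rates(retrieved, relevant, collection)
```

Averaging these per-query rates over all test queries would give summary figures, although whether the reported 12% and 4% were obtained that way is not stated.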

A project that combined report classification, content extraction and inferencing was completed with the cancer registries of Victoria and NSW. The objective was to determine the staging of cancer at diagnosis from radiology reports. The processing pipeline consisted of:

  1. Build a classifier to separate out reportable cancer reports;
  2. Transfer these reports to the registries;
  3. Build classifiers for Report Purpose and Tumour Stream;
  4. Extract 15 cancer attributes for primary and secondary sites;
  5. Infer from the extracted content the Tumour Staging.
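The flow of reports through these five steps can be sketched as below. Every rule in the sketch is a placeholder keyword test standing in for a trained model, and the field names, category labels and stage labels are hypothetical; none of them come from the project itself.

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    text: str
    purpose: str = ""
    tumour_stream: str = ""
    attributes: dict = field(default_factory=dict)
    stage: str = ""

# Placeholder rules: each function stands in for a trained language model.
def is_reportable(text):
    return "carcinoma" in text.lower()

def classify_purpose(text):
    return "follow-up" if "follow-up" in text.lower() else "initial"

def classify_stream(text):
    return "prostate" if "prostate" in text.lower() else "unknown"

def extract_attributes(text):
    # The real pipeline extracts 15 attributes; one is shown here.
    attrs = {}
    if "metast" in text.lower():
        attrs["secondary_site_present"] = True
    return attrs

def infer_stage(attrs):
    # Hypothetical stage labels, for illustration only.
    return "advanced" if attrs.get("secondary_site_present") else "undetermined"

def run_pipeline(texts):
    registry = []
    for text in texts:
        if not is_reportable(text):                 # step 1: reportability
            continue
        report = Report(text=text)
        registry.append(report)                     # step 2: transfer to registry
        report.purpose = classify_purpose(text)     # step 3: purpose and stream
        report.tumour_stream = classify_stream(text)
        report.attributes = extract_attributes(text)    # step 4: extraction
        report.stage = infer_stage(report.attributes)   # step 5: inference
    return registry

# Made-up sample reports.
results = run_pipeline([
    "CT shows carcinoma of the prostate with metastases to bone.",
    "Chest X-ray normal.",
])
```

Only the first sample report passes the reportability step, so `results` holds a single `Report` with the later fields filled in.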

Gold standards were assembled to compute the 4 language models. Over 60,000 reports were classified and 8,000 reports annotated for the extraction process. This amounted to over 1.4 million manually validated annotations, arguably the largest set ever assembled for clinical natural language processing (CNLP). The reportability classifier reached an accuracy of 98.6%. Accuracies for extraction varied between 75% and 96%.

An example of document format conversion was conducted with the Royal College of Pathologists of Australasia to demonstrate that free text notes could be analysed and converted to populate structured reports.

Report Completion involves using a real-time CLP process to check that an entry of information is complete, e.g. that a diagnosis of asthma indicates whether it is chronic or acute, as these attract different billing rates.
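A completeness check of this kind can be sketched as a lookup of required qualifiers. The rule table `REQUIRED_QUALIFIERS` and the function name are hypothetical, and simple word matching stands in for the language processing a real system would apply.

```python
import re

# Hypothetical rule table: diagnosis term -> qualifiers, one of which
# must accompany it for the entry to count as complete.
REQUIRED_QUALIFIERS = {
    "asthma": ("acute", "chronic"),
}

def missing_qualifiers(entry):
    """Return the diagnoses in entry that lack every required qualifier."""
    words = set(re.findall(r"[a-z]+", entry.lower()))
    return [
        diagnosis
        for diagnosis, qualifiers in REQUIRED_QUALIFIERS.items()
        if diagnosis in words and not words.intersection(qualifiers)
    ]

incomplete = missing_qualifiers("Diagnosis: asthma")
complete = missing_qualifiers("Diagnosis: chronic asthma")
```

In a real-time setting, a non-empty result such as `incomplete` would prompt the author to add the missing qualifier before the entry is saved.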

Report Consistency Validation checks that information in different parts of a report is consistent, e.g. that the histology reported in the microscopic section of a pathology report matches the histology given in the conclusions.
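A consistency check between two sections can be sketched as follows, assuming (hypothetically) that sections begin with an upper-case heading followed by a colon and that histologies can be recognised from a short term list. Both assumptions are for illustration and are not taken from the HLA system.

```python
import re

# Hypothetical, non-overlapping histology terms for illustration.
HISTOLOGY_TERMS = ("adenocarcinoma", "squamous cell carcinoma", "melanoma")

def split_sections(report):
    """Split a report into {heading: body}, assuming 'HEADING: text' lines."""
    sections = {}
    current = None
    for line in report.splitlines():
        match = re.match(r"^([A-Z ]+):\s*(.*)$", line)
        if match:
            current = match.group(1).strip()
            sections[current] = match.group(2)
        elif current is not None:
            sections[current] += " " + line.strip()
    return sections

def histologies(text):
    """Return the histology terms that occur as whole words in text."""
    lowered = text.lower()
    return {
        term for term in HISTOLOGY_TERMS
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }

def is_consistent(report):
    """True when both sections mention the same set of histology terms.

    Two sections with no recognised terms also count as consistent.
    """
    sections = split_sections(report)
    return (histologies(sections.get("MICROSCOPIC", ""))
            == histologies(sections.get("CONCLUSION", "")))

# Made-up sample reports.
report_ok = ("MICROSCOPIC: Sections show adenocarcinoma.\n"
             "CONCLUSION: Adenocarcinoma of the colon.")
report_bad = ("MICROSCOPIC: Sections show adenocarcinoma.\n"
              "CONCLUSION: Melanoma.")
```

With these samples, `is_consistent(report_ok)` is true and `is_consistent(report_bad)` is false.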

Hot Key Coding and Classification is the process of identifying critical text and coding it with standard codes such as ICD-10 or SNOMED CT for subsequent reuse, e.g. billing, work scheduling or searching.


Big Text Analytics is currently an embryonic field of software development. Nevertheless, its value can be assessed from the core services it can provide. It is best used for targeted objectives, as it can be time consuming to develop the language models for the particular tasks. The Wheel of CLP Services demonstrates that different types of text processing are required for different outcomes.
