TumourTExtract has been built using modern language-processing technologies chosen to produce the highest quality results. The classifiers use Support Vector Machines as the supervised learning algorithm. The annotator is a state-of-the-art Conditional Random Field model trained on a gold standard (GS) of 1.1 million examples of semantic tags, all manually assigned and verified. HLA has its own proprietary annotation and verification software, the Visual Annotator (VA), which lets its linguists iteratively train and review models, validate the GS, and identify model weaknesses. The annotation models for the different imaging services achieve accuracies of 97.8-99.7%. The annotator has two roles: first, to tag semantic phrases for the classifiers, increasing their accuracy; second, to identify the specific disease-progression content to be extracted and used as the data from which the clinical coding is inferred. This ensures a much higher level of accuracy in the final content output by the pipeline.
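As a minimal illustration of the SVM-based report classification described above, the sketch below trains a linear Support Vector Machine on TF-IDF features using scikit-learn. The library choice, toy report texts, and labels are all assumptions for illustration; the production classifiers are trained on far larger, site-specific corpora.

```python
# Sketch of a cancer / non-cancer report classifier (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the thousands of reports used per pilot site.
reports = [
    "CT abdomen: 3 cm mass in the liver, suspicious for metastasis",
    "Chest X-ray: clear lung fields, no acute abnormality",
    "MRI brain: enhancing lesion consistent with glioblastoma",
    "Ultrasound: normal gallbladder, no stones",
]
labels = ["cancer", "non-cancer", "cancer", "non-cancer"]

# TF-IDF features feeding a linear-kernel Support Vector Machine.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(reports, labels)

prediction = classifier.predict(["PET scan shows avid lesion in left lung"])[0]
```

In practice a separate model of this shape would be trained per site, since the classifier weights reflect each site's reporting language.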
The stability of the pipeline outputs is part of the experimentation required to produce an effective solution. As the linguists discover more about the underlying language usage, they can refine the rules by which the correct content is recognised, so the accuracy of the system increases continually over time.
Broadly, TumourTExtract provides solutions to these processing tasks:
- Separating cancer reports from non-cancer reports at the source point in every imaging service in the State. A classifier is trained for each pilot site using typically 15,000-20,000 reports;
- Transmitting the cancer reports, once identified, from the imaging service to the Registry. The reports are encrypted and a Dropbox mechanism is used to move the documents from the pilot site to the Registry;
- Classifying each report, once at the Registry, by the purpose for which the imaging was conducted and by tumour stream. These two classifiers are trained separately for each pilot site, so the document's source is used to route it to the correct processing model;
- Using this classification to define the type of information that needs to be extracted; the task then remains to find the pertinent content. The annotator identifies the different sections of the document and then tags all the semantically relevant content. The tagged material is used by a rule system to identify the details of disease status;
- Inferring the tumour-progression classifications from the pertinent information for the tumour type. The TNM and Staging values are inferred from the content recognised by the rules;
- The results of the extractions and inferences are delivered into the Registry’s data stores as XML encoded files.
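The extraction, inference, and output steps above can be sketched as follows. The tag patterns, TNM thresholds, and XML element names are illustrative assumptions only; in the real pipeline the semantic tags come from the CRF annotator and the rules belong to the Registry's rule system.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative stand-ins for the annotator's semantic tags (assumed patterns).
TAG_PATTERNS = {
    "tumour_size": re.compile(r"(\d+(?:\.\d+)?)\s*cm"),
    "node_status": re.compile(r"(no\s+nodal|mediastinal\s+nodes)", re.I),
    "metastasis": re.compile(r"(no\s+distant|liver\s+metastas\w+)", re.I),
}

def infer_tnm(report_text):
    """Toy rules mapping tagged content to T, N and M values."""
    tnm = {"T": "TX", "N": "NX", "M": "MX"}
    size = TAG_PATTERNS["tumour_size"].search(report_text)
    if size:
        cm = float(size.group(1))
        tnm["T"] = "T1" if cm <= 3 else "T2" if cm <= 5 else "T3"
    node = TAG_PATTERNS["node_status"].search(report_text)
    if node:
        tnm["N"] = "N0" if "no" in node.group(1).lower() else "N1"
    met = TAG_PATTERNS["metastasis"].search(report_text)
    if met:
        tnm["M"] = "M0" if "no" in met.group(1).lower() else "M1"
    return tnm

def to_xml(report_id, tnm):
    """Serialise the inferred values as an XML fragment for the Registry."""
    root = ET.Element("report", id=report_id)
    for axis, value in tnm.items():
        ET.SubElement(root, axis.lower()).text = value
    return ET.tostring(root, encoding="unicode")

text = "CT chest: 4.2 cm mass, no nodal involvement, no distant metastases."
tnm = infer_tnm(text)
xml_out = to_xml("rpt-001", tnm)
```

The separation between rule-driven extraction (`infer_tnm`) and serialisation (`to_xml`) mirrors the pipeline's division between inference and delivery of XML-encoded files.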
While the tasks are clearly defined, there was a major variable in the language-processing functions: it was unknown whether classifiers, language recognisers, and extractors would work across all imaging services, or whether different processing engines would need to be built for each imaging service owing to differences in the way language is used by different radiologists. Extensive testing showed that material from two sites could be combined, but a third site was different enough to require its own statistical models.
A further engineering task was to ensure the whole process is fully automated, so no manual activity is needed at any point in the pipeline. A web service was developed to let users monitor batches of files as they are processed at the imaging services, moved to the Registry, and subsequently passed through the pipeline.
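The batch monitoring described above can be sketched as a simple state tracker of the kind such a web service might expose. The stage names, class, and methods here are hypothetical illustrations, not the production API.

```python
from datetime import datetime, timezone

# Pipeline stages a batch passes through, per the description above
# (assumed names, for illustration only).
STAGES = ("at_imaging_service", "in_transit", "at_registry", "processed")

class BatchTracker:
    """In-memory record of where each batch of reports sits in the pipeline.

    A monitoring web service could serve this state as JSON to its users.
    """

    def __init__(self):
        self._batches = {}

    def register(self, batch_id):
        """Record a new batch at the imaging service."""
        self._batches[batch_id] = {"stage": STAGES[0], "updated": self._now()}

    def advance(self, batch_id):
        """Move a batch to the next pipeline stage, if any remain."""
        batch = self._batches[batch_id]
        idx = STAGES.index(batch["stage"])
        if idx + 1 < len(STAGES):
            batch["stage"] = STAGES[idx + 1]
            batch["updated"] = self._now()

    def status(self, batch_id):
        return self._batches[batch_id]["stage"]

    @staticmethod
    def _now():
        return datetime.now(timezone.utc).isoformat()

tracker = BatchTracker()
tracker.register("site-A-batch-001")
tracker.advance("site-A-batch-001")      # files encrypted and in transit
stage = tracker.status("site-A-batch-001")
```

Keeping the stage transitions in one place makes it straightforward for a web front end to report progress without any manual intervention in the pipeline itself.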