TumourTExtract consists of three classifiers, a semantic annotator, and a combined extraction and inference engine. The first classifier runs at each imaging service and separates cancer from non-cancer reports, identifying cancer reports at better than 98% sensitivity. Cancer reports are automatically forwarded to the Registry, where a second classifier determines the purpose of the report and a third determines the tumour stream. The report is then automatically annotated by a semantic tagger that identifies, on average, 150 semantic, structural and grammatical features per report. The results from the second and third classifiers are used to route the report to the correct extraction process.
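As an illustration, the cascaded classification and routing stage could be sketched as follows. The keyword rules here are placeholders for the statistical models actually used, and every function name and label is hypothetical:

```python
# Hypothetical sketch of TumourTExtract's cascaded classification stage.
# Simple keyword rules stand in for the trained statistical classifiers.

def classify_cancer(report: str) -> bool:
    """First classifier (at the imaging service): cancer vs. non-cancer."""
    return any(term in report.lower()
               for term in ("carcinoma", "tumour", "metastasis"))

def classify_purpose(report: str) -> str:
    """Second classifier (at the Registry): purpose of the report."""
    return "staging" if "stage" in report.lower() else "diagnosis"

def classify_stream(report: str) -> str:
    """Third classifier: tumour stream (illustrative labels only)."""
    return "lung" if "lung" in report.lower() else "other"

def route(report: str):
    """Route a report: non-cancer reports stop at the imaging service;
    cancer reports are forwarded and routed by purpose and stream."""
    if not classify_cancer(report):
        return None  # not forwarded to the Registry
    return (classify_purpose(report), classify_stream(report))
```

The cascade mirrors the deployment described above: only the first classifier needs to run at the imaging service, so non-cancer reports never leave the local site.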
The extraction process interprets the annotations, looking for combinations of annotated semantic concepts from which to infer the various cancer progression characteristics. The annotations are used to recognise the tumour location and size, the number of involved nodes, the metastatic sites affected by the disease, and any recurrence of disease. The extracted data are then used to infer the severity of the disease on the TNM staging scale. The results are passed to the Registry as the output of the pipeline.
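A minimal sketch of the inference step, assuming the extracted annotations are reduced to feature-value pairs. The thresholds and categories below are purely illustrative and do not reproduce the real TNM rules, which vary by tumour stream:

```python
# Hypothetical TNM inference over extracted annotations.
# Keys and cut-offs are invented for illustration; the actual system
# applies stream-specific staging rules.

def infer_tnm(annotations: dict) -> str:
    """Map extracted tumour characteristics to a TNM stage string."""
    size_mm = annotations.get("tumour_size_mm", 0)
    t = "T2" if size_mm > 30 else "T1"          # tumour extent from size
    n = "N1" if annotations.get("positive_nodes", 0) > 0 else "N0"
    m = "M1" if annotations.get("metastatic_sites") else "M0"
    return f"{t} {n} {m}"
```

The point of the sketch is the shape of the computation: each TNM component is inferred independently from a different combination of annotated concepts, then assembled into the overall stage.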
The pipeline is fully specified by the user and automates the entire production line, from the point at which a report is written to the delivery of its pertinent content into the user's repository. This closed pipeline requires no manual intervention, has no user interfaces requiring action, and can only be interrupted by switching off the delivery machines. Users have access to web monitoring screens that show current and historical processing at each remote imaging service.
The system is scalable, as all communications are carried over the Internet. It recovers automatically from network or local failures and is maintained remotely. The software and the statistical models of the classifiers and annotators can be automatically deployed to each imaging service and to the Registry. Each report travelling through the pipeline is tagged with its source and passed to the appropriate source-based processors. This enables us to develop individual language processors for each source while embedding them all in a single architecture.
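The source-based dispatch could be organised as a simple registry mapping source identifiers to language processors; the identifiers and processors below are invented for illustration:

```python
# Hypothetical registry of source-specific language processors.
# Each imaging service registers its own processor under its source id,
# so new sources can be added without changing the shared pipeline code.

PROCESSORS = {}

def register(source_id):
    """Decorator: register a processor function for one report source."""
    def wrap(fn):
        PROCESSORS[source_id] = fn
        return fn
    return wrap

@register("imaging_service_a")
def process_a(report):
    return f"A-processed: {report}"

@register("imaging_service_b")
def process_b(report):
    return f"B-processed: {report}"

def dispatch(source_id, report):
    """Route a report to the processor for its recognised source."""
    return PROCESSORS[source_id](report)
```

The single shared `dispatch` entry point is what lets individually developed processors sit inside one architecture, as the paragraph above describes.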