Automatic patient symptom assessment is a rapidly growing field, designed to support medics in their demanding everyday work. This kind of support is crucial because we are facing an ever-growing global shortage of physicians, nurses and other medical personnel, which already affects whether people can access proper health care and how long they wait to be seen.
So, how can we make a positive impact and provide support to healthcare professionals? Being the futurists that we are, we expect that pretty soon your customers will be able to speak with our symptom checker in the most convenient way possible, describing even their most complex issues in plain language without needing to worry about phrasing.
At a glance:
The goal of our symptom checker is to learn the user's initial symptoms, match them with the most probable causes, and then ask a series of questions that mimic the process of differential diagnosis, used by doctors on a daily basis. After that, the system narrows down the possible conditions to those that appear most likely given all the gathered evidence. The user is then presented with their results, along with the appropriate triage level.
The problem is that computers are terrible at understanding people. This is because human languages are incredibly complex. One person can say the same thing in a myriad of ways, or even break language rules, and still be understood. That’s where natural language processing (NLP) comes in handy. Our job, as the NLP team, is to help computers understand patients’ symptoms using the most natural mode of communication—their language.
It’s easy to see why this is an important requirement for preliminary health assessment and triage. For the millions of patients and physicians who use our solutions, accurately understanding patients’ needs is essential to providing safe and reliable medical guidance.
In this post, we’ll be addressing the issue of evaluating our NLP engine’s performance.
The problem: recognizing thousands of medical concepts
Infermedica’s NLP engine is tasked with an immense challenge: recognizing upwards of 2000 possible medical concepts in user messages. It deals with complaints, which can be both symptoms and risk factors. The difficulty is that complaints that are crucial for an accurate analysis of the patient’s health are sometimes hard to describe precisely. They can be delivered either by speech, via call center bots or intelligent assistants, or by written chat messages. Each message can contain multiple reported symptoms, phrased in various ways, e.g., one can say that they have “limb weakness”, “impaired limb movement” or “problem moving limbs”, and all of these phrases correspond to the same symptom: Paresis, limbs. The NLP engine must also be able to recognize whether reported issues are negated, e.g., in the sentence:
“I have a strong headache but no fever”
we expect the system to find the following information:
- present [“severe headache”] and
- absent [“fever”].
Due to this complexity, it’s not always easy to efficiently assess the effectiveness of our NLP engine, and determine if the changes we make in the processing pipeline will result in improvement or regression of the NLP engine’s results. Luckily, we’ve found a way to overcome this issue through building NLP test cases.
The solution: building NLP test cases
Besides unit and integration tests, which are meant to keep bugs out of our codebase, we also maintain a large set of NLP test messages. It consists of around 2500 unique messages of varying complexity (see examples below). Each NLP test case consists of an example user message and the medical concepts expected to be found in it.
We don’t expect every example message to be recognized correctly. Rather, with each improvement we would like to see a gain in overall system accuracy. In other words, we use this test set to develop the NLP engine in a data-driven manner: we assess the quality of new features based on changes in well-defined metrics (described later in the article) rather than on intuition, while continuously improving and expanding our testing capabilities. In the spirit of transparency, we decided to share a random sample of our tests with results produced by our system as of 31.08.2021. Please bear in mind that, in the future, our system might produce slightly different results. As described in the article, we want to improve accuracy on the whole test set, not only on the presented sample.
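To make the shape of a test case concrete, here is a minimal sketch in Python. The names `Mention` and `NlpTestCase` are hypothetical and chosen only for illustration; they are not Infermedica's actual data model. The concept names are taken from the example sentence earlier in this article.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Mention:
    """One medical concept together with its assertion status (hypothetical type)."""
    concept: str   # concept name, e.g. "Fever"
    present: bool  # True if asserted by the user, False if negated


@dataclass(frozen=True)
class NlpTestCase:
    """An example user message plus the mentions expected to be found in it."""
    message: str
    expected: FrozenSet[Mention]


case = NlpTestCase(
    message="I have a strong headache but no fever",
    expected=frozenset({
        Mention("Headache, severe", present=True),
        Mention("Fever", present=False),
    }),
)
```

Storing the expected mentions as a set reflects the fact that the order in which concepts are found does not matter, only which ones are found and whether each is present or absent.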
A random sample of NLP tests
Results of 100 randomly selected NLP test cases
Below you will find hand-picked examples from this list:
Measuring NLP System Accuracy
Before we describe in detail what metrics we use in Infermedica’s NLP Engine, let’s take a step back and think more generally about what the essence of the problem is. In fact, the whole existence of our NLP system core can be summarized as a solution to the following task:
Fit the user’s message to a set of mentioned (asserted or negated) medical concepts from Infermedica’s knowledge base.
More generally, we would call the above-mentioned medical concepts a set of labels. These labels are sometimes called classes, and assigning predefined labels to data is, in machine learning jargon, called a classification task or simply classification.
However, not all classification tasks are the same. The simplest possible categorization is a binary classification, where there are only two possible classes (which can be understood as yes and no, or positive and negative). An example of such a task would be to answer a question like “Is a given sentence written in English?”, which is a very interesting problem on its own. That said, at Infermedica we are dealing with something much more complex.
Cases which have more than two possible answers, e.g., “What is the language in a given sentence?” where the answer could be one of English, Chinese or Spanish, are called multiclass classification problems. The problem we are dealing with is more complicated still, because a single message can have more than one correct answer. It resembles answering the question “What languages are spoken in a specific country?”, and this type of task is called multilabel classification.
Why all this theory? Because different problems require different, and sometimes very sophisticated, methods for measuring classification quality. For example, we could count an NLP test case as correct only if our system generates exactly the expected output, and count every other case as incorrect. Then we could calculate the percentage of correct cases over the whole test set, and we would end up with a single number describing the system quality. That’s a good start, but what about cases where we understand some medical concepts but not all of them? For instance, the user message could contain information about fever and diarrhea, but we only managed to label fever. If we expect a perfect match, then it is as wrong as not finding anything at all!
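The strictness of exact matching can be illustrated with a short sketch. The function `exact_match_accuracy` is a hypothetical helper written for this example, and the second test message (a severe headache) is made up for illustration.

```python
def exact_match_accuracy(expected, predicted):
    """Fraction of test cases whose predicted label set equals the expected set."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for e, p in zip(expected, predicted) if set(e) == set(p))
    return hits / len(expected)


expected = [{"Fever", "Diarrhea"}, {"Headache, severe"}]

# Case 1 is only partially recognized (Diarrhea is missed).
partially_found = exact_match_accuracy(
    expected, [{"Fever"}, {"Headache, severe"}]
)

# Case 1 is not recognized at all.
nothing_found = exact_match_accuracy(
    expected, [set(), {"Headache, severe"}]
)
```

Both calls return 0.5: under exact matching, finding one of two concepts earns no more credit than finding none.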
Now, maybe you’re thinking that we shouldn’t be so strict and, instead, look at the output to check how many of the expected concepts it contains. As great as that idea is, though, it can pose more problems. What if the system output contains more medical concepts than we expected? For example, we expect fever and diarrhea, but our system finds fever, diarrhea, and nausea. This is definitely not desired, especially for the health assessment, because we’d be adding something that was not the user's intention. This might result in incorrect evidence assessment and wrong triage level.
The point is that there are, in fact, many distinct ways in which we can be right or wrong. Fortunately, they are nothing new in the fields of statistics or data science, and they even have their own names. So, what’s left for us to do is to adapt the more general theory to our case. We have:
- True positives—correctly recognized medical concepts, e.g., understanding Headache, severe in “I have a strong headache”.
- False positives—incorrectly recognized medical concepts, e.g., understanding Nausea in “I have a strong headache”.
- False negatives—missed medical concepts, e.g., not finding Headache, severe in “I have a strong headache”.
- True negatives—correctly unrecognized medical concepts, e.g., not finding Nausea in “I have a strong headache”.
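For a single message, these four counts can be derived with plain set operations. The sketch below is illustrative only: `confusion_counts` is a hypothetical helper, and the three-concept vocabulary stands in for the real knowledge base.

```python
def confusion_counts(expected, predicted, all_concepts):
    """Return (TP, FP, FN, TN) for one message, treating labels as sets."""
    expected, predicted = set(expected), set(predicted)
    all_concepts = set(all_concepts)
    tp = len(expected & predicted)                  # correctly recognized
    fp = len(predicted - expected)                  # recognized but not expected
    fn = len(expected - predicted)                  # expected but missed
    tn = len(all_concepts - expected - predicted)   # correctly left out
    return tp, fp, fn, tn


# Toy vocabulary; the real engine knows upwards of 2000 concepts.
vocabulary = {"Headache, severe", "Nausea", "Fever"}

counts = confusion_counts(
    expected={"Headache, severe"},
    predicted={"Headache, severe", "Nausea"},
    all_concepts=vocabulary,
)
```

Here `counts` is `(1, 1, 0, 1)`: the headache is a true positive, Nausea is a false positive, nothing is missed, and Fever is a true negative. With a vocabulary of thousands of concepts, the true-negative count is large for almost every message.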
You might feel that the more true positives and true negatives, and the fewer false positives and false negatives, the better. However, keeping track of four numbers can be complex. Some of them might increase, and some might decrease, making it hard to decide if we are proceeding in the right direction. To solve that issue, there are two intermediate metrics (precision and recall) and one final metric (F-score), each of which projects these values into a handy number between 0 and 1:
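The definitions are not reproduced in this text, so the formulas below are the standard textbook ones, written with TP, FP and FN for the numbers of true positives, false positives and false negatives. We assume these are the definitions meant here, with F-score being the balanced F1 variant (the harmonic mean of precision and recall).

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```

Note that true negatives do not appear in any of the three formulas.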
Intuitively, high precision tells us that our system is not generating too many additional labels which should not be there (false positives), and high recall shows that we are not missing expected concepts (false negatives). Tracking the precision and recall metrics separately can sometimes be difficult. When both precision and recall rise, we can be sure that the changes we made will lead to an overall boost in performance. But what if, for example, precision rises but recall drops? It would be really handy to combine precision and recall into a single value. We could just use the arithmetic mean of these two values, but there is a problem with this solution: for precision = 0 and recall = 1 our metric would yield 0.5, when in reality this kind of system is useless. Thankfully, F-score uses the harmonic mean, which produces a value of 0 whenever either of the supplied values is 0. Please refer to the charts below for a visual aid.
In the 3D charts below, we can see how the arithmetic mean differs from the harmonic mean (F-score). Precision and recall values lie on the horizontal axes, with our mean values on the vertical axis. It can be seen that the harmonic mean (F-score) is far more sensitive to the lowest value, which is what we’re after.
Here are the performance metrics for the 100 randomly selected NLP test cases, which you can find at the end of this article (0 being the worst possible outcome and 1 the absolute best):
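The difference between the two means can also be checked numerically. The following sketch uses two small helper functions written for this example; returning 0 when both inputs are 0 is our own convention to avoid division by zero.

```python
def arithmetic_mean(precision, recall):
    """Plain average of precision and recall."""
    return (precision + recall) / 2


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A system with precision 0 and recall 1 is useless in practice.
useless_arithmetic = arithmetic_mean(0.0, 1.0)
useless_f = f_score(0.0, 1.0)

# Precision and recall reported for the 100-case sample in this article.
sample_f = round(f_score(0.837, 0.788), 3)
```

For the useless system, the arithmetic mean is 0.5 while the F-score is 0.0. For the sample, `sample_f` is 0.812, which matches the F-score reported for the 100 test cases.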
- Precision = 0.837
- Recall = 0.788
- F-score = 0.812
On our large test set (2500 cases), the results are as follows:
- Precision = 0.852
- Recall = 0.816
- F-score = 0.833
NLP engine suitable for conversational healthcare solutions
We’ve come a long way in improving our NLP engine by utilizing the latest findings in the area. Our system achieves very good results in capturing medical concepts reported by patients. In many cases it is able to deal with rather complex messages shared through chatbots, intelligent assistants, or even bots implemented within call centers. For instance, the Microsoft Healthcare Bot depends on the quality of our NLP engine to capture initial symptoms properly. The accuracy of that understanding will impact the reliability of the bot’s final recommendations.
A random sample of NLP tests
Results of 100 randomly selected NLP test cases