The Coronavirus pandemic can be seen as a milestone in information transparency, or more precisely scientific transparency, within the research community. Although a brief lapse in transparency occurred at the very beginning of the outbreak in China, access to information about the outbreak has since experienced a renaissance worldwide. Colossal datasets comprising essential raw data on various aspects of the disease and the outbreak are released on the internet daily for public access by credible organizations such as the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). Because the data are now more open than usual, the scientific community at large has been granted an unprecedented liberty in data analysis.
Essentially, basic parameters such as the numbers of confirmed cases, recoveries, and deaths are the fundamental raw data from which disease statistics such as indices, rates, ratios, and ranges are derived. Only a few of these statistics are comprehensible to the general public, while the others require a fair grounding in mathematics and statistics to understand the underlying concepts and methods. Thus, statistics that employ only one or two evident disease parameters are often used for public broadcasting. The caveat is that almost all disease statistics are limited in their ability to give a complete analysis of a situation: they are constructed on certain predefined conditions and definitions, and they fail to give an exact image of a situation if those conditions are not met. Consequently, graphical methods that employ such statistics could distort the real picture. This article analyzes the statistical fallacies that arise when using and presenting disease statistics.
The Three Basic Parameters
The three basic parameters, viz. confirmed cases, recoveries, and deaths, can be regarded as absolute figures, by convention or by nature, as each represents one distinct status. Basic statistics such as total counts and daily counts are derived from a single disease parameter and a time parameter. For the time parameter, the basic interval is widely accepted to be one day. These measurements are straightforward and return either a segregated statistic (non-cumulative counts) or an aggregated statistic (cumulative counts) along the timeline. Cumulative measurements are seen as more comprehensive in a graphical representation, as they trace a path that shows the overall pattern of the disease up to a given date. Daily measurements alone return only isolated results that cannot give an aggregated observation, as they represent the total scenario only vaguely. The dissimilar results produced by these two methods are presented in Figure – 1 and Figure – 2.
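The relationship between the two views can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the sample counts are hypothetical and do not come from any published dataset.

```python
from itertools import accumulate


def to_cumulative(daily_counts):
    """Aggregate daily counts into a running total (the cumulative view)."""
    return list(accumulate(daily_counts))


def to_daily(cumulative_counts):
    """Recover daily counts from a cumulative series by differencing."""
    if not cumulative_counts:
        return []
    daily = [cumulative_counts[0]]
    for previous, current in zip(cumulative_counts, cumulative_counts[1:]):
        daily.append(current - previous)
    return daily


# Hypothetical daily confirmed-case counts, for illustration only.
daily_example = [2, 0, 5, 3, 10]
cumulative_example = to_cumulative(daily_example)
print(cumulative_example)            # running totals, one per day
print(to_daily(cumulative_example))  # round-trips back to the daily counts
```

The cumulative series never decreases, which is why it traces an overall path, whereas the daily series can jump up and down from one day to the next.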
The diagnostic basis of the term ‘confirmed case’ has been a point of debate throughout the pandemic. Different countries have adopted varying approaches in declaring a suspected individual a ‘confirmed case’ of COVID-19. Currently, a confirmed case is widely considered to be an individual who has tested positive for COVID-19 by a laboratory method. The tests mainly include the more accurate but less convenient PCR test and the less accurate but more convenient antigen/antibody test. In the infancy of the pandemic, various other clinical and laboratory diagnostic methods with varying success rates were also tried in China until an effective method was identified. The changes in diagnostic methods produced a steep and unexpected spike in confirmed cases in China on 12th February 2020. Unless the data are completely revised, or a clear log noting the changes is kept, an erroneous statistical result is likely to be generated in such a situation. A ‘recovered case’ is by convention understood as a patient who has been discharged from the hospital. Death, however, is absolute and is not subject to multiple definitions.
Apart from the three basic parameters, a few other parameters are derived by combining two or more basic parameters. The most well-known derived parameters are Active Cases, Case Fatality Rate, and Doubling Time. The term ‘Active Cases’ is defined as,
Active Cases = Confirmed Cases − (Recoveries + Deaths)    (1)
This parameter is the resultant of the three basic parameters. Casually, Active Cases corresponds to “the number of people who are still in the hospital at the end of the day”. One of the advantages of using Active Cases is that its cumulative measurement reaches a maximum when the outbreak is at its peak and eventually converges to zero, provided the disease is not terminal. This helps in generating a graphical representation that shows more effectively the direction in which the disease is progressing. Studying the number of confirmed cases alone would be counterproductive, as it gives no image of the aftermath of the disease. Figure – 3 is a graphical representation of the epidemic in South Korea using the cumulative Active Cases count. It is apparent that South Korea has shown a steady decline in the accumulation of cases since mid-March. Another advantage is that the Active Cases count is more intuitive to compare against the carrying capacity (for instance, ICU beds) than the Confirmed Cases count. Recoveries and Deaths are often combined into the parameter ‘Removed’, especially in epidemiology and epidemic simulation (e.g. the Susceptible – Infected – Removed model),
Removed = Recoveries + Deaths    (2)
When equations (1) and (2) are combined to form equation (3), it becomes clear that Active Cases is the difference between the accumulation and the removal of cases.
Active Cases = Confirmed Cases − Removed    (3)
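As a rough sketch of how Active Cases behaves over an outbreak, the relationship described above (confirmed cases minus removed cases) can be applied to three cumulative series. The function names and all figures below are hypothetical and chosen only to show the rise, peak, and return to zero.

```python
def removed_cases(recoveries, deaths):
    """Removed = Recoveries + Deaths (the 'R' of an SIR-style model)."""
    return recoveries + deaths


def active_cases(confirmed, recoveries, deaths):
    """Active Cases = Confirmed Cases - Removed."""
    return confirmed - removed_cases(recoveries, deaths)


# Hypothetical cumulative series over five reporting dates.
confirmed = [10, 50, 100, 120, 120]
recoveries = [0, 5, 40, 90, 115]
deaths = [0, 1, 3, 5, 5]

active_series = [
    active_cases(c, r, d) for c, r, d in zip(confirmed, recoveries, deaths)
]
print(active_series)  # rises to a peak, then falls back to zero
```

In this made-up series the confirmed count only ever grows, while the active count peaks on the third date and reaches zero once every case has been removed.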
Although Active Cases can act as a useful measurement because it aggregates three parameters, its very definition could conceal the true nature of a scenario if the preferable conditions are not met. Once parameters are aggregated, they tend to lose their individual identities. According to equation (2), a case can be removed in either of two ways: through recovery or through death. Therefore, when a decline in Active Cases is observed, the figure does not reveal how the cases are being removed or in what proportions. It is thus wise to revisit the casual definition and amend it to “the number of people who could not make it back home at the end of the day”, which instinctively raises the question of why an ill individual could not return home. For instance, in Sweden, although the number of Active Cases is still rising, more deaths than recoveries have occurred. Thus, the removal of a case necessitates a discriminator that indicates the mode of removal, by recovery or by death, which can be defined as,
Difference = Recoveries − Deaths    (4)
This equation yields the arithmetic difference between Recoveries and Deaths, together with a mathematical sign that follows from the two figures (a + sign indicates more Recoveries than Deaths, a – sign the reverse). It should be noted that whenever the two parameters are equal, including when both are zero, the equation yields zero, indicating no difference. Nonetheless, this equation is sufficient to show the drastic differences in how cases are being removed in various countries. Apart from the arithmetic difference, a proportional method can also be used, with the following equation.
Proportion of Recoveries (%) = Recoveries / (Recoveries + Deaths) × 100    (5)
Figure – 4 is mainly intended to emphasize the difference in how cases are being removed. It can be seen that the gap between the numbers of recoveries and deaths has widened in favor of recoveries (+976) in Japan and in favor of deaths (–1215) in Sweden. Stated proportionally, as of 22nd April, 82% of the removed patients in Japan had recovered (and are hence alive), compared to only 24% in Sweden. In the case of Bangladesh, the difference has shifted from a positive to a negative value. Sri Lanka shows a promising trend with a steady accumulation of recovered patients, and by 22nd April, 94% of all removed cases were recoveries. The arithmetic method (Equation – 4) expresses only the absolute difference and thus gives no sense of the proportions involved. The proportional method (Equation – 5) overcomes this issue, but when the total counts of all parameters are low, the calculated value tends to be exaggerated because of its very high sensitivity. For example, a difference of one recovery in a total of 5 removed cases would account for a change of 20% in the proportion, but only 1% in a total of 100 removed cases.
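The contrast between the arithmetic and the proportional method, and the sensitivity of the latter at low counts, can be checked with a short sketch. It assumes that the proportional method expresses recoveries as a percentage of all removed cases, which is how the percentages above are worded. The function names and the input figures are hypothetical.

```python
def removal_difference(recoveries, deaths):
    """Arithmetic method: positive means more recoveries than deaths."""
    return recoveries - deaths


def recovered_share(recoveries, deaths):
    """Proportional method (assumed form): recoveries as a percentage
    of all removed cases. Returns None when nothing has been removed."""
    removed = recoveries + deaths
    if removed == 0:
        return None
    return 100.0 * recoveries / removed


# Hypothetical counts: same mode of removal, very different totals.
print(removal_difference(80, 20), recovered_share(80, 20))
print(removal_difference(8, 2), recovered_share(8, 2))

# Sensitivity: one case switching from death to recovery.
small_total_shift = recovered_share(3, 2) - recovered_share(2, 3)      # 5 removed
large_total_shift = recovered_share(51, 49) - recovered_share(50, 50)  # 100 removed
print(small_total_shift, large_total_shift)
```

The first two lines print different arithmetic differences but the same share, while the last line shows that the same single case moves the share far more when only 5 cases have been removed than when 100 have.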
The conventional Active Cases parameter can be optimized to resolve the ambiguity caused by the loss of identity that occurs when recoveries and deaths are aggregated. For cumulative counts, the mathematical sign of equation (4) can be affixed to the Active Cases value to denote the dominant parameter. A dynamic parameter can thus be defined by prefixing the Active Cases value with that sign.
The mathematical sign sufficiently indicates the parameter that currently governs the removal of cases: + for more recoveries and – for more deaths. For instance, the cumulative Active Cases number of Japan on 22nd April could be noted as +.9633, which means that 9633 cases are currently active and the majority of the removed cases are recoveries, and that of Sweden as –.13,007, where 13,007 cases are active while the majority of the removed cases are deaths. For daily counts, optimizing the parameter in the same manner produces something much more complex and potentially less clear than the original. The daily figure already carries a mathematical sign for the increase or decrease of cases relative to the previous day, so the result is a value with two mathematical signs that have different meanings (e.g. ++500, +–500). This method could be refined further to produce a much clearer Active Cases number.
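One possible way to produce such a signed figure for cumulative counts is sketched below. The function name, the plain "+500" / "-500" output format, and the handling of a tie (no sign at all) are assumptions made for this illustration, and the input figures are hypothetical.

```python
def signed_active_label(active, recoveries, deaths):
    """Prefix the active-case count with the sign of (recoveries - deaths).

    '+' means recoveries dominate removals, '-' means deaths dominate.
    A tie is shown without a sign; that choice is an assumption made here.
    """
    difference = recoveries - deaths
    if difference > 0:
        sign = "+"
    elif difference < 0:
        sign = "-"
    else:
        sign = ""
    return f"{sign}{active:,}"


# Hypothetical cumulative figures for three regions.
print(signed_active_label(500, 300, 100))  # recoveries dominate
print(signed_active_label(500, 100, 300))  # deaths dominate
print(signed_active_label(500, 200, 200))  # tie, no sign
```

Note that the sign says nothing about the size of the active count itself: all three hypothetical regions have the same number of active cases and differ only in how their cases are being removed.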
Case Fatality Rate (CFR%)
CFR is a staple measurement in the field of epidemiology. Briefly, it can be defined as,
CFR (%) = (Deaths / Confirmed Cases) × 100
Although it is called a rate, it is in fact a proportion of incidence. Theoretically, this proportion lies between 0 and 1, but for practicality it is presented as a percentage, that is, the number of casualties per 100 infected individuals. This measurement should not be confused with the Mortality Rate. The problem with this measurement is that its value changes with the duration of the disease and the status of the remaining patients. The CFR is best defined when an outbreak takes place acutely over a limited course of time. The final CFR is measured when the number of active cases reaches zero. Thus, a rolling CFR value reflects the true value only under the ideal condition that none of the remaining active cases results in death.
Figure – 5 depicts the trends of the CFR (%) of selected countries. Owing to various external factors, the CFR (%) has fluctuated to varying degrees in each country. At the end of the epidemic, the final CFR value can be calculated from the total infected population, and the rolling CFR values will generally have differed from it. A CFR value calculated for an ongoing outbreak therefore provides only a rough estimate of the rate of casualty. Despite its limited validity, it can be used as an effective indicator of the severity of the disease. Currently, the global CFR of COVID-19 is estimated to be about 3 deaths for every 100 infected individuals.
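The gap between a rolling CFR and the final CFR can be illustrated with a small sketch that uses the usual definition of CFR as deaths per 100 confirmed cases. The function names and the cumulative figures are hypothetical; they describe an imaginary outbreak in which deaths keep being recorded after new cases have stopped.

```python
def case_fatality_rate(deaths, confirmed):
    """CFR as a percentage of confirmed cases; None if there are no cases."""
    if confirmed == 0:
        return None
    return 100.0 * deaths / confirmed


def rolling_cfr(cumulative_deaths, cumulative_confirmed):
    """CFR recomputed at every reporting date of an ongoing outbreak."""
    return [
        case_fatality_rate(d, c)
        for d, c in zip(cumulative_deaths, cumulative_confirmed)
    ]


# Hypothetical cumulative counts; the last date is the end of the outbreak.
cumulative_confirmed = [100, 400, 800, 1000, 1000]
cumulative_deaths = [1, 6, 20, 35, 40]

cfr_series = rolling_cfr(cumulative_deaths, cumulative_confirmed)
print(cfr_series)      # rolling values during the outbreak
print(cfr_series[-1])  # final CFR once no active cases remain
```

In this made-up series every rolling value is lower than the final one, because some of the patients counted as active at each date die later.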
Doubling time is a mathematical concept that gives the amount of time a particular population takes to double in size, i.e. “How long would it take N to become 2N?” Conversely, the size of a population at any given time can be obtained for a predetermined doubling time, i.e. “How large would N be on day x if N doubled every y days?” Both questions rest on the same quantity, the doubling time (Td). It is defined for a starting day d1 and an ending day d2, with respective population values x1 and x2, as,
Td = (d2 − d1) × ln(2) / ln(x2 / x1)
Td is often calculated over intervals of different lengths, as necessary. Because it is highly sensitive to minute fluctuations, Td is a useful measurement of how fast the epidemic is growing and how fast that growth is changing, especially when applied to Cumulative Confirmed Cases. This sensitivity becomes problematic, however, when a sudden anomaly is recorded in the figures. This can happen because of inconsistent data collection or because of natural factors, such as a surge of cases followed by a long period of very low accumulation. The sensitivity can be varied by adjusting the interval. A one-day interval gives the most acute reading, but at the cost of the highest sensitivity. For instance, a sudden halt or an unexpected drop in the rate of increase between two adjacent days would return a very high Td value. If the figures of the two days are equal, the denominator of the equation becomes zero and Td tends to positive infinity. Mathematically and logically this makes sense: the population has stopped growing compared to the previous day, so at that pace it would never double in size again. In a real situation, however, this would not be the case until the epidemic reaches its end.
Figure – 6 examines the effect of using different Td intervals. To avoid clutter, only two intervals are shown on the graph. It can be seen that the 1-day interval line is broken in several places (e.g. 24 – 28 March). These voids correspond to Td values that reached infinity, on days when the Cumulative Confirmed Cases count remained equal to that of the previous day. Because of its very high sensitivity, its values are scattered more widely than those of the 5-day interval line. The 5-day interval method shows a continuous line, as the Td for each day is calculated against the Cumulative Confirmed Cases count recorded five days earlier. This method resolves most lapses or repetitions in the data, especially during the early stages of an outbreak. A 3-day interval is commonly used when a steady daily accumulation of new cases is seen. Increasing the interval dampens the Td values and creates a less scattered series. Although Td is less commonly used in the public media, with an appropriate interval it can be a very useful indicator of the rate of growth of the infected population.
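The effect of the interval can be reproduced with a short sketch based on the standard doubling-time formula, Td = (d2 − d1) × ln 2 / ln(x2 / x1). The function names and the cumulative series are hypothetical, and returning infinity for two equal counts is a choice made here to mirror the voids described above.

```python
import math


def doubling_time(d1, d2, x1, x2):
    """Doubling time between day d1 (count x1) and day d2 (count x2)."""
    if x1 <= 0 or x2 <= 0:
        raise ValueError("counts must be positive")
    if x1 == x2:
        return math.inf  # no growth over the interval: never doubles
    return (d2 - d1) * math.log(2) / math.log(x2 / x1)


def doubling_time_series(cumulative, interval):
    """Td for each day, measured against the count `interval` days earlier."""
    return [
        doubling_time(day - interval, day, cumulative[day - interval], cumulative[day])
        for day in range(interval, len(cumulative))
    ]


# Hypothetical cumulative confirmed cases with one repeated value.
cumulative = [10, 20, 20, 40, 80]
print(doubling_time_series(cumulative, 1))  # contains inf at the repeated day
print(doubling_time_series(cumulative, 2))  # every value is finite
```

With the 1-day interval the repeated count produces an infinite Td, whereas the 2-day interval spans the repetition and yields a continuous series, at the cost of reacting more slowly to real changes.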
Graphical representations can be produced in several ways. Briefly, the choices fall into two categories: the type of graph and the type of scales used. The type of graph refers to the visual representation: bar charts, pie charts, line charts, scatter plots, and so on. The type of scale refers to the scale used on each axis: linear, semi-logarithmic, or logarithmic (log-log).
In the general case of a two-dimensional graph with axes X and Y, linear graphs use a linear scale on both axes. Semi-log graphs use a logarithmic scale on one axis. Commonly, the variable that shows exponential behavior is assigned to the logarithmic axis. Logarithmic (log-log) graphs use a logarithmic scale on both axes. Applying a logarithmic scale to the axis of the exponentially behaving variable returns a linearized representation and helps to compress values that are scattered over a great range. While any logarithmic base can be used, base 10 and base 2 are the most common. It should be noted that these graphs behave very differently from one another, so one should be cautious when interpreting a graph.
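The linearizing effect of a logarithmic axis can be seen numerically without drawing anything. The sketch below compares the step between consecutive values of a hypothetical series that doubles every day, first on a linear scale and then on a base-2 logarithmic scale. The function names and the series are illustrative assumptions.

```python
import math


def linear_steps(values):
    """Differences between consecutive values on a linear scale."""
    return [b - a for a, b in zip(values, values[1:])]


def log2_steps(values):
    """Differences between consecutive values on a base-2 log scale."""
    logs = [math.log2(v) for v in values]
    return [b - a for a, b in zip(logs, logs[1:])]


# Hypothetical series that doubles at every step.
doubling_series = [1, 2, 4, 8, 16]
print(linear_steps(doubling_series))  # steps keep growing
print(log2_steps(doubling_series))    # steps are constant
```

The constant step on the base-2 scale is what appears as a straight line on a semi-log graph, and with base 2 each unit on the axis corresponds to one doubling.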
The linear scale is naturally the most intuitive mode of representation. From a casual standpoint, it can be understood as the ‘as-is’ method. The interval between two adjacent points is equal at every portion of the axis. As an analogy, the interval between any two neighboring major ticks of a ruler is always 1 cm, irrespective of the location along the ruler. Because of its simplicity, it is the staple scale for representing data in most fields. In the field of disease statistics, Figure – 2 is an example of a linear representation. Although it is highly intuitive, its limitation is that the data points are in an uncompressed state, so representing a dataset that spans several orders of magnitude is problematic. Figure – 7 illustrates this point.
In Figure – 7, it seems as if the USA is reporting a continuous increase of cases at a near-constant rate, South Korea is reporting about 20,000 cases at a very low rate of increase, and Sri Lanka is showing almost no cases and negligible growth. When the range of a dataset spans several orders of magnitude, as in Figure – 7, minute changes are heavily masked unless the graph is drawn at a very large scale. Changing the vertical axis to a logarithmic scale results in a semi-log graph.
With the semi-log approach, the dataset now shows sharp and well-expressed behavior along the timeline. The analogy of the ruler no longer gives an accurate image, as the step between two adjacent major units on the Y-axis now corresponds not to a fixed difference but to a factor of 10. For instance, the difference between the major units 1,000 and 100 is 900, and that between 10,000 and 1,000 is 9,000, but in both cases the succeeding unit is ten times greater than the preceding unit. This concept should be understood thoroughly, as it often causes confusion in interpretation. It is now apparent that, despite the massive number of cases recorded in the USA, the rate of increase has started to decline, and that a steady increase of cases is observed in Sri Lanka. The aftermath of Patient No. 31 and the effectiveness of the preventive measures taken in South Korea are well expressed in this semi-log representation, which succeeds in revealing and amplifying subtle changes. When the logarithmic base is changed to 2, the doubling time can be approximated by basic visual methods. One drawback of semi-log (and log-log) representation is that zero and negative values cannot be plotted. Thus, linear representations such as Figure – 4 cannot be reproduced with this method, which applies only to datasets that contain strictly positive values. Another drawback is that a lack of understanding of the logarithmic principle, and attempts to apply the linear principle instead, create confusion and exaggerate the apparent behavior of a graph. As a specific example, without a precise understanding one might conclude that South Korea has around 500,000 cases on noticing that its line falls approximately midway between the lines of the USA and Sri Lanka.
Misapplying the linear principle to a logarithmic axis gives the impression that the case count of South Korea can be calculated by roughly halving the difference between the case counts of the USA and Sri Lanka. This is an incorrect observation and a gross exaggeration of the situation.
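The size of this misreading can be quantified. On a logarithmic axis, the point visually midway between two values is their geometric mean, not their arithmetic mean. The sketch below uses round, hypothetical figures (1,000,000 and 100) rather than reported case counts, and the function names are illustrative.

```python
import math


def linear_midpoint(low, high):
    """Value halfway between two points on a linear axis (arithmetic mean)."""
    return (low + high) / 2


def log_midpoint(low, high):
    """Value halfway between two points on a log axis (geometric mean)."""
    return math.sqrt(low * high)


# Round hypothetical values for the upper and lower lines on the graph.
upper, lower = 1_000_000, 100
print(linear_midpoint(lower, upper))  # what the linear reading suggests
print(log_midpoint(lower, upper))     # what the log axis actually shows
```

For these round figures the linear reading overstates the value at the visual midpoint by a factor of about fifty, which is the kind of exaggeration described above.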
Statistical manipulation of raw data often yields measurements and values that are convenient to present and effective in supporting a justification. It should always be noted that caveats exist in every statistical treatment, and that it is fairly easy to mislead the public, and sometimes oneself as well. Fairly basic parameters that have not been statistically overworked are sufficient to give an undistorted image of a real incident. The quote “The closer you look, the less you see”, which circulates primarily in the field of wizardry, is relevant to the field of statistics as well.