A little contact tracing data forensics

2021-04-04

Corona, Covid, Contact Tracing, Luca, Corona Warn App

Bianca Kastl

(DRAFT: Just a DeepL translation right now, don't take this too seriously)

Or the complicated question: What actually needs to be done for digital tracing information to reveal infections, or at least indications of them?

What follows: an attempt to explain why data from contact tracing apps will not automatically end this pandemic. It is a tragedy in three acts, and this is the first part: A Hope in Data.

It has only been a few weeks since my last article on the subject, Interfaces with Corona Relevance, but the topic keeps moving.

In the meantime I received more positive feedback than I expected. It also became clear, however, that there are many assumptions and hopes around contact tracing, data and interfaces, but often little nuanced consideration of how far these apps or data can actually help us. This article will not offer a simple answer either. In fact, it will probably raise more questions than it answers.
Answering that question, which is anything but easy, is not my goal here anyway. Rather, I want to give an insight into how we would have to handle such data for it to become useful.

My background on the subject

As mentioned in the last article on the subject, I have been working with a health department for a few months on contact management and contact tracing. According to Time, I count as an insider on the subject.
The questions and approaches I mention in this article come from practice and not from purely theoretical considerations. However, I must also make clear that I primarily provide technical support on the topic, and the considerations here do not claim to be medically or epidemiologically exact. This is about the general approach; completely precise, scientifically supported parameters do not exist here. The subject matter is anything but trivial; it keeps experts constantly occupied and changes with new findings. There will be no solution here, but the many problems and challenges will be pointed out.

Contact Tracing

In contact tracing related to SARS-CoV-2 or infectious diseases in general, there are really two primary questions:

How did the infection happen?
How was the infection passed on?
Depending on the direction of the search, this is called backward or forward tracing.
Hypothetically, if these two questions could be answered for every single infection in every COVID-19 case, it would be much easier to contain this pandemic.

Key temporal stages in contact tracing

To begin, we consider the typical course of a COVID illness, taken from the Science Media Center. Parameters may change due to viral mutations, but the course itself hardly changes.

Typical course of a COVID illness: after infection, infectiousness develops following a latency period; symptoms appear only with some delay and may then lead to a test

Now, unfortunately, very few people can determine precisely that they became infected, for example, at 09:13 on July 30, 2020, by indirect aerosol transmission at latitude 54.020809, longitude 9.387842, and that, after a latency period of 3 days, they had contact with another 13 people during an infectious period of 3 days before developing symptoms on August 4 and being tested and isolated on August 5.
In this very clearly defined case, the 13 contacts would be monitored and/or isolated and the situation would be under control.
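
The timeline of this idealized case can be written down as a small calculation. The following Python sketch uses the figures from the example above; the function names and the three listed contacts are purely illustrative assumptions, and it assumes that infectiousness begins exactly when the latency period ends, which real cases rarely do so neatly.

```python
from datetime import date, timedelta

def infectious_window(infection_day, latency_days, infectious_days):
    """Return the first and last day on which the person is assumed infectious.

    Assumption: infectiousness starts exactly `latency_days` after infection
    and lasts `infectious_days` full days.
    """
    first = infection_day + timedelta(days=latency_days)
    last = first + timedelta(days=infectious_days - 1)
    return first, last

def contacts_in_window(contacts, first, last):
    """Keep only the contacts whose date falls inside the infectious window."""
    return [name for name, day in contacts if first <= day <= last]

# Figures from the example: infected on July 30, 3 days latency, 3 days infectious.
first, last = infectious_window(date(2020, 7, 30), latency_days=3, infectious_days=3)

# Hypothetical contacts, for illustration only.
contacts = [
    ("colleague", date(2020, 7, 31)),  # before the infectious period
    ("neighbor", date(2020, 8, 2)),
    ("friend", date(2020, 8, 4)),
]
to_trace = contacts_in_window(contacts, first, last)
print(first, last, to_trace)
```

With perfectly known dates, forward tracing reduces to a simple filter. The rest of this article is about why those dates are almost never known this precisely.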

Often, however, it is simply not known exactly how the infection occurred and how many possible contagions resulted from it. Not all COVID cases become symptomatic; some have no or barely noticeable symptoms. These people are nevertheless contagious to others but may never be detected. It is precisely such cases that make corona so difficult to control.
Contact tracing, then, is often a matter of poking around in a half fog with many unknowns.

Modes of transmission

If the exact origin and destination of an infection are unknown, we must instead consider how exactly the viral infection might have occurred.

In the case of SARS-CoV-2, two primary modes of transmission are relevant (or more, but we'll start with those two): droplet infection on the one hand, and contact transmission on the other. That sounds simple at first. It should somehow be possible to find this out with data if we know that a person is infected and then had possible contacts involving such transmission routes.

Droplet infection / respiratory uptake

Illustration of a droplet infection in the scheme where one person hurls droplets at another

A direct droplet infection is something like "one person coughs in another person's face". This sounds easy to detect at first. For this scenario it would be sufficient to assume that a droplet infection could have occurred if the distance between the two persons was small enough at a certain time. After all, sometime last year we settled on a distance of at least 1.5 meters as reasonable. In this scenario, the exact location of the close contact is secondary, provided the condition of proximity is met.

That sounds simple in principle, but the problem is in the details. In a very small detail, to be precise, because we are now talking about the nanometer range. Unfortunately, because the SARS-CoV-2 virion (the single particle) is so small, we have a second, different route of transmission in the field of droplet infection: let's call it aerosol infection. We think of aerosols as clouds of viruses that waft through the air but eventually fall to the ground, and that is where, unfortunately, they become more difficult to detect.

Illustration of an aerosol infection in schematic, where one person scatters aerosols that pass on to another person

How long such infectious aerosols can remain airborne depends on various factors, among them the air circulation at a location, the weather and other parameters that influence air flow. In a data-based evaluation it therefore becomes very fuzzy to deduce directly whether an aerosol infection could have taken place. It is certain that place and time must approximately coincide, but the necessary overlap of these values can vary widely. Aerosols can persist in the air for minutes to hours and may well move.

Smear infection / contact transmission

Illustration of a smear infection in the scheme in which one person transmits viruses to another by handshake

Now suppose an infectious person sneezes not directly into another person's face, but into their own hands or the like. Then there is a certain viral load on the hands. So it wasn't entirely wrong that we developed creative new greeting rituals without handshakes last year. Otherwise, another person would pick up the viral load by smear transmission and possibly take it in through the mucous membranes or conjunctivae. With better hand hygiene, the risk of smear infections therefore decreases in principle, but they can never be completely ruled out. If we were to try to identify the transmission path of a smear infection from data, however, things would unfortunately become complicated again: apart from the shared location of two persons, the other parameters would again be highly variable, such as the hygiene behavior of the respective persons, the basic hygiene measures on site, etc.

Illustration of a smear infection in the scheme, where a person picks up viruses from a cardboard box

Then there is the exceedingly complex further scenario in which a viral load applied by hand sits on an item that is then moved from one location to another. In this case, there would not even be a discernible connection between locations. However, the risk of this is generally rather low, assuming basic hand hygiene, because the majority of COVID infections are respiratory.

For a somewhat more scientific evaluation of these various scenarios, please refer to the list at RKI.

The thing with the viral load

Just because one of these transmission routes occurred, however, does not mean that an infection will result. This is because a certain viral load is required to infect the person who is exposed to this transmission route.

Whether this viral load is sufficient for an infection in the respective situation depends on various parameters:

Duration of contact
Intensity of the contact
Own protective measures
External protective measures
Distance
Conditions of the location
and, by now, also quite decisively on the respective virus variant.

The more infectious a virus variant is, the lower the viral load that suffices for infection. The official RKI definitions of category 1 contact persons (higher risk of infection) always mention contact of at least 15 minutes. However, this can change with virus variants. So let's rather assume that even a few minutes can be critical. The reality is probably even worse, depending on the virus variant, but for now this is about the model.

Excursus: Mitigations against droplet infections and smear infections

Ways to prevent infection: keep a distance of 2 meters, wear a mouth-nose-protector, wash hands

Yes, keeping your distance, wearing a mask and washing your hands is not as hip as an app now - but it at least helps to reduce the risk. Unfortunately, the effect of an app can only be seen retrospectively anyway. So wash your hands, keep your distance and cover your mouth and nose properly. Yes, even if no one is watching when you're around people.

An attempt at data forensics for individual possible infection routes

After all the epidemiological preliminaries, let's get to the data.

We summarize how we might begin to identify possible infections at the data level, assuming we can narrow down the route of transmission appropriately. Note: this is a rather clumsy simplification and is only meant to create an understanding of the topic.

Direct droplet infection

Category	Value
Precise data condition	Distance below a small threshold (e.g. 2 meters) for a certain time (e.g. 5 minutes)
Substitute data condition	Location and time equal for a certain time (e.g. 5 minutes)
Precision requirements	Distance and time precise
Risk of infection	Very high
Risk factors	Masks, viral mutations, infectivity of the index person, amount of aerosol emission (speaking, singing), duration of encounter, distance
Mitigation	Mask, distance

Detection of this category therefore works best via distance measurement, as implemented by the Corona Warn App (CWA) since its first public release in 2020. The technical implementation is based on the Exposure Notification Framework, which uses Bluetooth LE for distance estimation. By now this works relatively well, within the limits of what is technically possible. Those limits exist because a small distance alone does not reveal whether two people really had direct contact or whether they were in separate rooms, or whether people were in the immediate vicinity but safely separated spatially. Still, this detection is more precise than, for example, detection via GPS, which can only take the location into account but not, for example, the different floors of a building. Technically, this is still a makeshift solution, but one that has become quite acceptable.

For information on how the Corona Warn App performs these risk calculations, please refer to the corresponding posts on GitHub.
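
To make the "precise data condition" for direct droplet infection more tangible, here is a minimal Python sketch. It assumes we had a time-sorted series of distance estimates between two people, which is a hypothetical input format; the thresholds are the example values from the table. This is explicitly not the Corona Warn App's actual risk calculation, only an illustration of the condition "distance below a threshold for a certain time".

```python
from datetime import datetime, timedelta

def close_contact(samples, max_distance_m=2.0, min_duration=timedelta(minutes=5)):
    """Check the 'distance below threshold for a certain time' condition.

    samples: list of (timestamp, distance in meters), sorted by time.
    Returns True if the distance stayed at or below max_distance_m for one
    continuous stretch of at least min_duration.
    """
    run_start = None
    for timestamp, distance in samples:
        if distance <= max_distance_m:
            if run_start is None:
                run_start = timestamp  # a new stretch of proximity begins
            if timestamp - run_start >= min_duration:
                return True
        else:
            run_start = None  # proximity interrupted, start over
    return False

# Hypothetical measurements, one per minute starting at 12:00.
t0 = datetime(2021, 4, 4, 12, 0)
long_encounter = [(t0 + timedelta(minutes=i), d)
                  for i, d in enumerate([3.0, 1.5, 1.2, 1.0, 1.4, 1.8, 1.9, 2.5])]
brief_encounters = [(t0 + timedelta(minutes=i), d)
                    for i, d in enumerate([1.0, 1.0, 3.0, 1.0, 1.0, 1.0])]

print(close_contact(long_encounter))    # six consecutive close samples
print(close_contact(brief_encounters))  # only short stretches of proximity
```

Even this toy version shows how much depends on the chosen thresholds: shortening `min_duration` to a few minutes, as a more infectious variant might require, would flag the second series as well.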

Indirect aerosol infection

Category	Value
Precise data condition	Location the same for a period of time long enough to trigger infection (e.g. 30 minutes)
Precision requirements	Location approximately the same; time may overlap but has a "lag"
Risk of infection	High
Risk factors	Masks, viral mutations, infectiousness of the index person, amount of aerosol output (talking, singing), ventilation of the site, volume of the site, weather, microclimate, duration of aerosol contact
Mitigation	Mask, ventilation

Deriving potential aerosol infections from data is much more difficult. Distance measurement in combination with time can be a surrogate indicator for this category of infections, but it is not that simple.
It can be taken as a given that at least one infectious person must have brought aerosols to a place and that the persons who were at that place at the same time are at risk of infection. The difference here is that infection can still occur after the infectious person has left the location. An exact temporal overlap with the infectious person's stay is therefore not necessarily required. How long after the infectious person's departure an indirect aerosol infection could still take place depends strongly on the climatic conditions of the place. So here we need more information about the place, beyond the encounter itself.
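
The "lag" can be illustrated with a small interval calculation. The following Python sketch assumes that both stays at the same location are known as (arrival, departure) pairs and that aerosols linger for a fixed time after the infectious person leaves. The `linger` parameter is a hypothetical placeholder; as described above, the real value depends on ventilation, room volume and climate and can range from minutes to hours.

```python
from datetime import datetime, timedelta

def aerosol_exposure(index_stay, contact_stay, linger=timedelta(minutes=30)):
    """Time a contact person spent in possibly aerosol-laden air.

    index_stay, contact_stay: (arrival, departure) at the same location.
    linger: assumed time aerosols remain after the index person has left
            (placeholder value, not a measured parameter).
    """
    index_in, index_out = index_stay
    contact_in, contact_out = contact_stay
    start = max(index_in, contact_in)
    end = min(index_out + linger, contact_out)
    return max(end - start, timedelta(0))

# Hypothetical stays: the contact arrives ten minutes after the index person left.
index_stay = (datetime(2021, 4, 4, 18, 0), datetime(2021, 4, 4, 19, 0))
contact_stay = (datetime(2021, 4, 4, 19, 10), datetime(2021, 4, 4, 20, 0))

with_lag = aerosol_exposure(index_stay, contact_stay)
without_lag = aerosol_exposure(index_stay, contact_stay, linger=timedelta(0))
print(with_lag, without_lag)
```

The two people never met, so a pure encounter-based check finds nothing, while the query with a lag reports twenty minutes of possible exposure. Which of the two answers is closer to reality depends entirely on information about the place that the check-in data does not contain.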

Zeit Online currently has a nice visualization, also with reference to B.1.1.7, of the parameters that influence the risk of aerosol infections.

Aerosol infections, however, describe exactly the problem the CWA had in version 1: risk detection in the CWA is not directly location-based, but only distance-based. This has been okay for the category of direct droplet infections in the near field, but it does not do justice to large aerosol clouds. So the fact that the CWA now also relies on QR codes in version 2 is a useful addition. Pure distance measurement alone no longer does justice to aerosols.

Contact transmission stationary

Category	Value
Precise data condition	Contact with the same object carrying a viral load
Substitute data condition	Location the same, so that a sufficient viral load could be picked up there by contact; small time interval
Precision requirements	Contact with the same object
Risk of infection	Low
Risk factors	Hand hygiene, viral mutations, infectivity of the index person, material of the object
Mitigation	Hand hygiene

A possible smear infection should be recognizable by the fact that there was a local overlap, so that both persons could have had contact with the same object carrying a sufficient viral load (e.g. door handles). How close the temporal relationship between infectious person and contact person must be for a possible infection depends strongly on parameters such as the surface (metals, fabrics, etc.) to which a viral load may have been transferred. Thus, a similar, rather open-ended query as for aerosol infections would be necessary here to obtain indications of possible smear infections based on data points.
What makes the situation rather thankless is that coronaviruses can persist on certain materials for a very long time, in some cases for days, e.g. on steel surfaces. Whether this viral load is then sufficient to cause an infection has not yet been conclusively clarified. But it is at least clear that if places are not cleaned, this problem will certainly not get better.

Contact transmission mobile

Category	Value
Precise data condition	Contact with the same object carrying a viral load
Substitute data condition	Short time interval
Precision requirements	Contact with the same object
Risk of infection	Low
Risk factors	Hand hygiene, viral mutations, infectivity of the index person, material of the object
Mitigation	Hand hygiene

We come to the scenario that can hardly be captured by data in a meaningful way. Suppose an object (e.g., an empty glass at a concert) that has been exposed to a certain viral load by an infectious person is moved or passed on while it may still be infectious. That scenario is possible, but very, very difficult to substantiate with data. However, this scenario is rather secondary.

Magic apps and their limitations

Now that we have painstakingly tried to find correlations for infection paths in data, we realize that it takes a great many data points to assess even one possible infection path for one possible infection opportunity.
The only problem is that not all of this data is necessarily available to us automatically from apps.
A further problem with a purely data-based assessment of each of these infection scenarios is that the person's degree of infectiousness (which could be derived, for example, from a PCR test and its Ct value) and the viral mutation present both dynamically change the risk of each individual scenario.
Regarding viral mutations, it must also be said that a clear statement of exactly which mutation is present is only ever available days after the first analysis of a case. A cautious data analysis would therefore have to assume the worst virus mutation and calculate scenarios on that basis, or recalculate them again and again with new parameters as needed.
For these apps to be able to assess an infection risk in a halfway valid way, we would also have to entrust them with very specific health data. Implemented cleanly, this requires a very high data security effort. For how the Corona Warn App, for example, tries to prevent inferences about a person's test status, please refer to the video from the rC3; it also goes into a bit more detail about the risk calculation.
And even if these magic apps always had all this data available in real time to derive risks automatically with magic algorithms, the result would cover exactly one possible risk encounter between a person and an infectious person, or vice versa. This approach might be feasible for a few index cases, but certainly not in a worldwide pandemic with thousands of new cases every day, each of which in turn has multiple such risk encounters to be evaluated and calculated. Oh, and health authorities do not have thousands upon thousands of data science experts at hand, let alone enough time and infrastructure "to simply calculate this on the side".
It will therefore not be possible to calculate the pandemic using data from possible risk encounters alone.

But the data situation is not completely hopeless, provided we don't rely solely on magic apps to save us.

The half fog

At the beginning, I spoke of a half fog, because many infection trajectories are actually quite well known. Even after a year of the pandemic, the transmission routes I described are still the same, but the occasions on which they occur are often rather well defined.
People currently become infected primarily

in the household or domestic environment
in day care centers or schools
at the workplace

A listing of outbreaks in Baden-Württemberg from the beginning of this month, for example, clearly shows that very many outbreaks, that is, infection events with more than one case, take place in places where the data situation is actually already very good. If we add up the proportion of active outbreaks at the above-mentioned locations, we arrive at more than 75% of active outbreaks.
Data on households are available, data on daycare centers and schools are available, data on workplaces are available. After all, in all of these outbreaks, the people present and potentially at risk are quite clearly known.
In practice, however, it is usually not so easy to simply quarantine an entire school class, daycare center or company, or to close schools and companies; that is often a political issue.

Diffuse infection fog and app support

Let's assume that in the future we want to open all restaurants and stores again and make cultural life possible. This will also result in a more diffuse infection fog, in addition to the already known main share at work and so on. It will lead to more contacts away from the known infection routes. This also increases the need for more data-based clues for tracing.
That's the added value that apps like the Luca app promise. They are a bet on a more open future despite the pandemic. But they will only partially deliver on these promises. That's because, as the description of infection pathways has hopefully made clear, the data these apps provide is insufficient on its own to simply detect potential infections. It must be evaluated in the context of other factors not present in these apps.

Rough assessment of the suitability of the Corona Warn App and Luca for detecting potential infections

Let's go through the infection pathways for the Corona Warn App in versions 1 and 2 (with QR codes) and for the Luca app (and comparable apps) in their current state, and consider what data these three app functionalities can collect when fully used. For the potential infection pathways, we get roughly the following suitability for providing clues.
Disclaimer: This only refers to the general possibility of deriving possible infections from the data in the respective apps; the further usability of the data and back channels are another topic. The listing is highly simplified.

Category	Luca principle	CWA 1	CWA 2
Direct droplet infection	basic (location, time)	extended (distance)	extended (distance)
Indirect aerosol infection	basic (location, time)	none	basic (location, time)
Contact transmission stationary	basic (location, time)	none	basic (location, time)
Contact transmission mobile	none	none	none

We note: in principle, Luca (or similar apps) provides parameters such as location and time that can be correlated with infection routes, but an upgraded Corona Warn App with a check-in function can provide more precise detection overall. After all, it still has distance estimation, which Luca (or similar) does not. The assumption, however, is always that all parties involved use the apps, all check in cleanly, etc.

Fuzziness

Check-in times

In both systems, check-in and check-out are technically fuzzy. In the CWA, check-in is performed by the person; in Luca, it can be done both by the location and by the person. Usually, however, people forget to check in or out, do so late, or don't do it at all.
In a typical restaurant situation, where you are checked in by the staff, there is a delay caused by normal restaurant operations, etc. In short, it makes little sense to treat attendance times as exact and to apply sharp temporal cut-offs here. In view of the possible infection routes, however, excessive temporal precision is not sensible in any case.
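
One way to account for this fuzziness in a query is to widen every recorded stay by a tolerance before comparing stays, and to assume a default length when no check-out was recorded. The following Python sketch illustrates the idea; the function names, the 15-minute tolerance and the 2-hour default stay are hypothetical values chosen for illustration, not parameters of the Corona Warn App or Luca.

```python
from datetime import datetime, timedelta

def padded(stay, tolerance, default_stay):
    """Widen a (check_in, check_out) pair; check_out may be None."""
    check_in, check_out = stay
    if check_out is None:
        # No check-out recorded: fall back to an assumed length of stay.
        check_out = check_in + default_stay
    return check_in - tolerance, check_out + tolerance

def stays_may_overlap(stay_a, stay_b,
                      tolerance=timedelta(minutes=15),
                      default_stay=timedelta(hours=2)):
    """True if two stays at the same location could have overlapped in time."""
    a_in, a_out = padded(stay_a, tolerance, default_stay)
    b_in, b_out = padded(stay_b, tolerance, default_stay)
    return a_in <= b_out and b_in <= a_out

# Hypothetical check-ins: guest B is checked in ten minutes after guest A
# checked out and never checks out.
guest_a = (datetime(2021, 4, 4, 18, 0), datetime(2021, 4, 4, 19, 0))
guest_b = (datetime(2021, 4, 4, 19, 10), None)

print(stays_may_overlap(guest_a, guest_b))
print(stays_may_overlap(guest_a, guest_b, tolerance=timedelta(0)))
```

Evaluated exactly, the two guests never met. With the tolerance applied they count as a possible overlap, which is the more cautious reading of imprecise check-in data.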

Distance detection Corona Warn App

Using Bluetooth to measure the distance between cell phones is not exactly the use case Bluetooth was built for. It's more or less a hack based on measurements of signal strength and values derived from it. So the Corona Warn App's distance measurement is never really accurate and is often prone to interference.
Measurements by the makers of the Corona Warn app put the accuracy at about 80%, which can be considered good.
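
Why signal strength is such a shaky basis for distance can be shown with the widely used log-distance path-loss model. The Python sketch below is a textbook illustration, not the method of the Exposure Notification Framework or the Corona Warn App, which may work differently. The reference signal strength at 1 meter and the path-loss exponent are assumed example values.

```python
def estimate_distance_m(rssi_dbm, rssi_at_1m_dbm=-59.0, path_loss_exponent=2.0):
    """Estimate distance from received signal strength (log-distance model).

    rssi_dbm:           measured signal strength in dBm
    rssi_at_1m_dbm:     assumed signal strength at 1 meter (device-dependent)
    path_loss_exponent: assumed environment factor, about 2 in open space,
                        higher where walls, bodies or furniture dampen the signal
    """
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10 * path_loss_exponent))

# The same measured signal strength, interpreted under two environment assumptions.
open_space = estimate_distance_m(-79.0, path_loss_exponent=2.0)
indoors = estimate_distance_m(-79.0, path_loss_exponent=3.0)
print(round(open_space, 1), round(indoors, 1))
```

One and the same measurement yields roughly 10 meters under one assumption and under 5 meters under the other. Since a phone cannot know whether it sits in a pocket, behind a wall or on a table, some imprecision is built into the approach.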

Data validity of personal reference or contact information

For the detection of infection pathways, it does not matter at first whether any personal data or contact data that may be required is correct. Infection risk detection is not about personal data at all. The necessity of certain personal data in this whole chain is a topic for a later part.

So do these apps actually help us?

Yes. Although the apps' data points alone may not be sufficient to fully establish whether an infection could have occurred, they do give us clues to contacts that an infectious person may not even be aware of herself or be able to name in a conversation with a health department.
These are, for example, people she does not know but who nevertheless pose a risk of infection or are at risk through the infected person. Corona apps, however, should only ever be seen as a second safeguard for chains of infection, never as the sole tool for uncovering them.
The explanation of the infection pathways here hopefully shows clearly that a whole lot of data can belong to just one possible infection and its detection.
To hope that complex correlations will magically reveal themselves from a whole lot of data is extremely optimistic and shows a lack of understanding of the underlying problems.
Simply overwhelming public health authorities with data sets, which in principle are not even sufficiently complete, in the hope that something useful can be gleaned from them will not work. Data without relevance creates effort in its assessment, and it creates data gaps that need to be filled before a final assessment can be made. Raw data from apps needs to be used wisely to deliver potential value.

Cluster detection

Now that we've gone into the details of a single possible infection pathway, the question is how we can use this, data-wise, to begin to detect multiple infection events at a single location at a similar time, that is, clusters.
Epidemiologically, a cluster is effectively "more than chance": an event that can no longer be a coincidence, for example multiple people being infected at once in a single infection event.
But that's the story for the next part. There will be a great many coincidences, and we will try to find the "more than coincidence" among them.

Stay safe until then.