This is a graph theory problem on a real dataset. I have a dataset of roughly 100M points, of camera sightings of cars over time in a city. They can be thought of as { camera ID, license plate ID, timestamp } tuples roughly. There are around 2000 cameras around the cities, we know their locations, and around 10M unique cars (license plates) throughout the one-month period for which we have data. I am trying to determine how many unique paths there are, empirically, between arbitrary cameras (nodes). A ‘path’ here is just consecutive sightings of the same license plate by different cameras close in time.

The problem is that 1/3 of all camera sightings are missing: most cameras have 80% identification accuracy, and a few have 0-10%. The main question is: how do we tell whether two paths are really different, and not just one underlying path with some missing sightings?

For instance, consider some high volume route A->C. If there is just one way to get from A->C, say via B, then you will still see a lot of cars at A then directly at C because camera B just happened not to pick them up (approx. 1/3 of the total traffic will be AC just because of this missing data points fact), and so AC will appear to be a popular path as well as ABC, when really ABC is the only real path people take from A->C. The goal is to tell these paths apart so that we can count the number of unique paths A->C at the end of the day to study robustness of the traffic network to closures of roads.

One approach is to compare average times taken through routes, but two different routes through space can take very similar amounts of time. Here’s how I solved the problem for a specific case: consider two routes from A to C: ABC and AB’C, where there is no edge B->B’. We know that this cannot in fact be one underlying route ABB’C because then we’d see an edge B->B’, which we don’t. I wonder if this heuristic can be generalized (but of course we still want to tell paths apart even if BB’ was an edge, since in that case we are not guaranteed that ABB’C is the only true path taken).

I’m also open to using external APIs like Google Maps for more information but not quite sure how that would make the problem easier. I am aiming to solve this problem for arbitrarily long paths between two arbitrarily far cameras, but I think most such questions can be reduced to a version of the ABC vs AC problem outlined above.

Here is a pictorial example of an instance of the problem. Bottom left can be thought of as node A and top right as C. Then it becomes an AC vs ABC vs ABDC problem. The data shows that AC was taken 19 times, ABC 46 times and ABDC 86 times. The question is to figure out how many of these three paths is a “real path” that people travel through. Of course, on a map it is easier to count by hand based on the road network, but the question is doing this algorithmically based on the data at hand for all points without plotting.

Here is an example of the data. We split the data into groups that represent real car trips. So sightings of the same license plate by different cameras sufficiently close in time (on the same day of course). Here is an example of one group (unrelated to picture above), like which there are about 10M others.

```
index camera_ID encoded_plate date time_seconds velocity
9200 480301111 660.0 2021-03-11 62000.0 54
9201 480321111 660.0 2021-03-11 62135.0 28
9202 480331111 660.0 2021-03-11 62235.0 5
9203 480341112 660.0 2021-03-11 62302.0 42
9204 450371112 660.0 2021-03-11 62648.0 32
```

2

## 2 Answers

if there is just one way to get from A->C, say via B, then you will

still see a lot of cars at A then directly at C because camera B just

happened not to pick them up, and so AC will appear to be a popular

path as well as ABC, when really it is just ABC

I do not see the problem. Why can you not just immediately replace all occurrences of AC with ABC?

8

The problem is in telling that AC and ABC are different in the first place. I am not sure how to do that algorithmically for paths of arbitrary lengths given the task at hand.

–You need to clarify your problem statement. Which is it? Telling AC and ABC are “different” or telling that they are the “same”?

Telling that they are different because we aim to count the number of unique paths A->B at the end of the day, though the two questions are of course equivalent because you know one iff you know the other.

–So, edit your question. Currently it says the problem is telling that they are the same.

Done; though if you can tell they are the same you can tell they are different. It is literally the same question.

–

Here’s a possible approach: sort the recorded journey types by number of camera points, descending (so e.g. ABCD has four points, it should appear before EFG which has three). Then for each journey type, estimate how many spurious observations of each subsequence you would expect to see, and subtract that number from the actual observations of each of those subsequences (e.g. if we think there were 100 jouneys for ABCD, then we’d expect 100 * p(1-p)^3 spurious observations of each of ABC, ABD, ACD and BCD, and so on for shorter subsequences).

Also keep track of the uncertainty in each count, which is initially zero, but each time you subtract an estimate from the count of some journey type, add on the variance of that estimate to the uncertainty for that journey type. The variance can then be used to estimate the likelihood that a measurement should really be zero; if so, just ignore that journey type when you come to it, and move onto the next one.

At the end, the “real” journey types are those where a count of zero isn’t likely based on the estimated count and variance. Don’t forget to also adjust the counts of “real” journeys upwards, e.g. a path ABCD has a (1-p)^4 chance of not being recorded correctly so you should divide the estimated count by that factor to compensate. This adjustment should be made before estimating how many spurious observations to deduct from the subsequences.

4

Interesting — how do you suggest calculating the variance? Are you saying the number of spurious observations of smaller subsequences is drawn from some binomial distribution?

–Yes, if there were 100 ABCD journeys each with a probability q = p(1-p)^3 of being recorded as ABD, then we can treat the number of spurious ABDs that were really ABCDs as a binomial distribution with parameters 100 and q, so it would have mean 100q and variance 100q(1-q). It gets more complicated when that 100 is also only an estimate (because it’s a measurement adjusted by a factor of 1/(1-p)^4, and possibly also adjusted by subtracting some estimated number of spurious observations), so if precision is important, you may want to consult a mathematician to get the details right.

– kaya3Another issue is that this algorithm could output a slightly different total number of journeys than were actually recorded, because when a count is estimated to be zero based on its variance but its mean is not zero, that non-zero mean will be “missing” from the total. E.g. if you think there were 100 ABCDs, perhaps you expect 10 spurious ABDs, and you observed 11 ABDs, so you’ll probably guess there were 0 real ABDs. Logically, the extra 1 should belong to the ABCDs – or the ABEDs or ABFDs… so there’s more details if you want to do it properly. But this is broadly the approach I’d use.

– kaya3I don’t think you can trust the probabilities that much (in terms of independence). Cameras might miss more during rush hours (hard to see when bumper to bumper or too many cars per second), which is also a time when people may change to alternate routes.

The question I have is how is this on topic for SO? You’re presenting a n bunch of problems and requirements here, but no apparent attempt to solve it yourself.

There is one, clearly defined problem: AB vs ABC. And I have an entire paragraph explaining how I have tried to solve it myself (One approach…here’s how I solved…).