Saturday 30 July 2022

The Statistics of Nessie Sightings



Is the study of the Loch Ness Monster a subjective or objective matter? I suppose the answer has to be both when one considers what the particular focus is. Photographs and films can be considered objective evidence, although the interpretation of each item is a mix of the subjective and the objective. Likewise, eyewitness accounts are objective records, but again the interpretation is often in the eye of the beholder.

However, in all of these there is usually a collection of quantifiable data points that can have statistical techniques applied to them. For the next few articles, we shall have a look at some of these in no particular order. When it comes to Loch Ness and datasets large enough for statistical analysis, we are mainly talking about eyewitness accounts, which can be subdivided into a variety of metrics relating to time, motion and dimensions.

The database I use is an augmented version of one based on the work of Charles Paxton and consists of over a thousand accounts divided into various sub-categories. By way of example, I will open with a dataset that one would expect to be Nessie-agnostic, in other words, one in which no correlations would be expected. In this case, I am using the day of the month on which individual eyewitness accounts occurred. Using the total dataset, it was ascertained that 639 accounts had a complete date associated with them. The number of reported sightings on each day from day 1 to day 31 is shown below.


The expected average over days 1 to 31 is about 20 sightings. However, there is a slight skewing of the data as four months have 30 days, seven have 31 and one has 28 or 29 days. So it is no surprise that the 31st has the lowest count of 10, but, more surprisingly, the 2nd comes in just behind at 11. However, the data varies quite a bit about that average of 20 reports, ranging from 10 on the 31st to 32 on the 27th. Does this mean Nessie is more disposed to putting in an appearance on the 27th of any given month, or that more eyes are on the loch on the 27th of the month? Or is it just a random effect of the data?
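
To make that calendar skew concrete, here is a minimal Python sketch of the expected count for each day of the month, assuming the 639 dated reports were spread evenly across calendar days and averaging the leap-year effect on the 29th. The weights are simple calendar facts; the evenness assumption is mine, purely for illustration.

```python
# Minimal sketch: expected reports per day of month if the 639 dated reports
# fell evenly across calendar days. The weight for each day is the average
# number of times that day number occurs per year (leap years taken as 1 in 4).
TOTAL_DATED_REPORTS = 639  # dated reports in the database discussed above

occurrences_per_year = {day: 12 for day in range(1, 29)}  # days 1-28 occur every month
occurrences_per_year[29] = 11.25   # 11 months, plus February in leap years
occurrences_per_year[30] = 11      # every month except February
occurrences_per_year[31] = 7       # Jan, Mar, May, Jul, Aug, Oct, Dec

days_per_year = sum(occurrences_per_year.values())  # 365.25

for day in (1, 29, 30, 31):
    expected = TOTAL_DATED_REPORTS * occurrences_per_year[day] / days_per_year
    print(f"day {day:2d}: expected about {expected:.1f} reports")

# Roughly 21 reports are expected for each of days 1 to 28, about 19 for the
# 29th and 30th, and only about 12 for the 31st -- so a dip on the 31st is
# expected, but the low count on the 2nd is not explained by the calendar.
```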

One way to approach this statistically is by calculating the normal (or Gaussian) distribution curve for this dataset. This curve basically shows the distribution of data points around a mean value. For example, in a set of student exam marks, we may find that most students score a C grade at the peak of the bell curve, while on one side of the curve the higher grades tail off to the small number who achieve an A+, and on the other side the lower grades tail off to the small number who get an F or worse.

Given enough data points, it is found that these bell curves occur in many areas of society and nature. This dataset of Nessie reports is no different, as it produces the reasonably normal distribution shown below, centred around the peak at the average of about 20 sightings per day of month. We see it also tails off towards the extremes of high or low incidence of reports for the days mentioned above. This suggests the distribution of sightings is within statistical norms.
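
For anyone wishing to reproduce this kind of plot, a minimal sketch follows. The counts array is a synthetic stand-in, since the full set of 31 per-day counts is not reproduced in this article, so only the shape of the exercise, not the exact curve, should be read into it.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the 31 per-day counts (mean around 20.6); the real
# per-day counts from the database are not reproduced in this article.
rng = np.random.default_rng(42)
counts = rng.poisson(20.6, size=31)

mu, sd = counts.mean(), counts.std()

# Histogram of the per-day counts with a normal (Gaussian) curve laid over it.
x = np.linspace(counts.min() - 5, counts.max() + 5, 200)
pdf = np.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

plt.hist(counts, bins=8, density=True, alpha=0.5, label="reports per day of month")
plt.plot(x, pdf, label=f"normal fit (mean={mu:.1f}, sd={sd:.1f})")
plt.xlabel("number of reports on a given day of the month")
plt.ylabel("density")
plt.legend()
plt.show()
```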



However, one further confirmation of this is the so-called "68-95-99.7 rule", a heuristic (or approximate rule of thumb) which states that:

68% of the data is within 1 standard deviation of the mean.

95% of the data is within 2 standard deviations of the mean.

99.7% of the data is within 3 standard deviations of the mean.

The standard deviation of a set of data points is a measure of how much the data varies from the mean or average value. The higher the variance, the higher the standard deviation. The value of the standard deviation for this population of data points is 5.35. Applying the 68-95-99.7 rule, vertical lines are usually drawn at the 1st, 2nd and 3rd standard deviations (or sigmas). These are calculated by adding or subtracting 1, 2 or 3 times the standard deviation of 5.35 to or from the mean of 20.6 on either side of the peak.

Thus we should have 68% of the data within the 1st deviation, or between the vertical line values of 15.3 and 26.0, and 95% of the data between the values of 9.5 and 31.3. In this case practically all the data is within the 2nd standard deviation. Statisticians tend only to regard data points as significant, or out of the normal, if they go beyond the 3rd standard deviation (here between 4.6 and 36.7). None of the Nessie data points get near the 3rd deviation, and so we conclude nothing of statistical significance is notable here. For example, if a third of all sightings had occurred on the 15th day of the month, then we would have had something to scratch our heads over.
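
For those who want to check the arithmetic, the band edges and the coverage fractions can be worked out in a few lines of Python; any small differences from the figures quoted above come down to rounding of the mean and standard deviation. The helper function assumes the real 31 per-day counts, which are not reproduced here, would be passed in.

```python
import numpy as np

# Band edges from the mean and standard deviation quoted above.
mean, sd = 20.6, 5.35
for k in (1, 2, 3):
    print(f"{k}-sigma band: {mean - k * sd:.2f} to {mean + k * sd:.2f}")

# Given the actual 31 per-day counts (not reproduced here), the 68-95-99.7
# rule can also be checked directly against the data.
def fraction_within_k_sigma(counts, k):
    counts = np.asarray(counts, dtype=float)
    m, s = counts.mean(), counts.std()
    return np.mean(np.abs(counts - m) <= k * s)  # fraction of days inside the band

# e.g. fraction_within_k_sigma(counts, 2) should come out near 0.95
# for data behaving normally.
```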

But one might venture to say that the dataset is corrupted by accounts which are not sightings at all but are misidentifications and hoaxes. Will this not skew any analysis of the data? After all, do not experts claim that 90% of all monster sightings are explicable by known and unremarkable phenomena, either natural or man-made? The quote below is from one of the leading Loch Ness Monster researchers, the late Roy Mackal, who wrote this in his conclusion at the end of "The Monsters of Loch Ness":

The realization that surface observations are rather rare has developed gradually over the years. The number of recorded reports during the 30-year period following 1933 was roughly 3,000. This figure, taken at face value, would mean that about 100 observations were made annually. From this it is clear why it was reasonable to expect photographic surveillance of the loch surface to produce evidence rapidly. However, as noted, a more careful examination of the reports tells us that a large proportion of these observations, perhaps 90%, can be identified as errors, mistakes, misinterpretations, and, in a few cases, conscious fraud.

In my opinion, this 90% statistic is as meaningful as saying nine out of ten cats prefer Whiskas. Firstly, it is based on the premise that about three thousand sightings were recorded between 1933 and 1963. The database I use has been thoroughly researched and it lists about 570 unique reports over that period, or roughly one fifth of Roy Mackal's figure. Where can these missing 2430 reports be found? I doubt they all exist and I cannot tell how this number was arrived at. Perhaps some were low grade LNIB eyewitness reports. So perhaps the 100 observations per year that struck Roy Mackal as a paradox, and led him to his 90% reduction, is not so paradoxical when it now reduces to about 20 observations per annum over that thirty-year span.

I would add the caveat at this point that the original author of the database I use may have found 3000 reports and culled them. I know he excluded such dubious items as Frank Searle's accounts, but 3000 sounds unlikely. Secondly, this ninety-odd percent assertion is something that occurs elsewhere and seems simply to be a number symbolic of human error and the triumph of scepticism. The number was repeated again by LNIB member Clem Skelton (see link):

Skelton figures that eighty to ninety percent of the people who think they have seen the monster have really seen something else.

It gets worse as the same numbers appear in Bigfoot research (see link):

... even those who research Bigfoot will admit that roughly ninety-five percent of Bigfoot sightings are either mistakes or purposeful hoaxes.

Imagine that, a completely different phenomenon with completely different explanations, yet they come out at about 90% as well! And let us not forget the UFO phenomenon where, you guessed it, nine out of ten cats, I mean debunkers, prefer swamp gas and the planet Venus (see link):

UFO reports -- 90 percent explained; scientists say rest should be investigated 

Perhaps if it is said often enough it will be believed. The truth of the matter is that as the observational qualities of a report deteriorate, the curve of natural explicability approaches the 90% range in an asymptotic manner. Deterioration can mean increasing distance, decreasing time to observe and assess, or poor weather conditions, light levels and so on. It would be naive to claim that 90% of all reports seen at 100 yards are as easily explained as 90% of all reports at one mile away. I hope I have made my point, but having said it, a certain percentage of the total claimed sightings will be in error. What is that percentage?

No one knows, and let us leave that guesstimation to personal opinion rather than being dogmatic about it. In that light, an attempt should be made to extract a subset of sightings which can be regarded as being as error-free as possible. So we go back to Roy Mackal, who, in the same book, tabulated a series of 251 reports he regarded as the best up to the year 1969. On page 84 of the paperback edition, he says he extracted them from the three thousand aforementioned reports. I have made my comment on that, but I will largely take these 251 sightings as being of a higher standard and graph them. I say "largely" as the St. Columba account is included, but, not surprisingly, the day of the month for this encounter is not given. The graph using the Mackal data is shown below.



We have 138 of the 251 reports supplying a date, or 55%. Does this graph bear any similarity to the first one now that the quality has been increased? One can see similarities, such as the expected dip on the 31st and again a very low count for the 2nd, but there are also divergences. Going back to statistical techniques, similarities in data sets can be quantified using a correlation function. The one I will use is the Pearson correlation coefficient. If the coefficient approaches a value of 1, the two data sets are increasingly similar. If the value approaches 0, there is less similarity between the two. If the value goes past zero towards -1, the two are increasingly negatively correlated, or mirror images of each other.
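
As a minimal sketch of that calculation, the coefficient for two day-of-month count series can be computed with numpy. The arrays below are synthetic stand-ins for the real counts, which are not tabulated here, so the printed value will not be the one reported for the actual data.

```python
import numpy as np

# Synthetic stand-ins for the two series of 31 per-day counts: the full
# database and the Mackal subset. The real counts are not tabulated here.
rng = np.random.default_rng(0)
full_counts = rng.poisson(20.6, size=31)    # ~639 dated reports over 31 days
mackal_counts = rng.poisson(4.5, size=31)   # ~138 dated reports over 31 days

# Pearson correlation coefficient: +1 means similar, 0 unrelated, -1 mirror image.
r = np.corrcoef(full_counts, mackal_counts)[0, 1]
print(f"Pearson correlation coefficient: {r:.2f}")
```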

The correlation coefficient for the two data sets turns out to be 0.32, which means they are not particularly correlated. Does this mean anything? Given that we are examining the Nessie-agnostic day of the month, it shouldn't, since we are effectively comparing one set of random data against another. The normal distribution graph for the Mackal data is shown below.



This distribution is not as smooth as that of the larger data set, which we can perhaps put down to the smaller number of data points. If a data set gets too small, statistical analysis becomes less reliable. However, the lines indicating the 1st and 2nd standard deviations once again show that the data is within normal bounds and there are no anomalies here.


CONCLUSIONS

We deliberately started this series of articles with a data set which should bear no relation to monster or people activity. There is no theory that either should be more disposed to one day of the month over another, apart from the slight dip expected on the 31st. In other words, it should be data which varies around its average within normal statistical bounds, and this has been confirmed. So we can go forward with some confidence in the tools tried out. In the next article, we shall look at data more relevant to the statistics of Loch Ness Monster sightings.


Comments can also be made at the Loch Ness Mystery Blog Facebook group.

The author can be contacted at lochnesskelpie@gmail.com