Guest Post by Davide Barbieri*
During the last decades, large organizations – multinational companies, governments, hospitals, public administrations, law enforcement agencies and the like – have accumulated huge quantities of data, often in the form of unstructured spreadsheets, emails and text documents. SCADA (Supervisory Control and Data Acquisition) systems have automatically collected production data from sensors and machines, while cheap storage devices, like terabyte-sized hard disks, have reduced the need to filter the acquired data in advance according to statistical criteria. The Internet has increased the order of magnitude of the phenomenon. The mesh-like topology of the network allows data to be communicated and spread quickly and efficiently, or stored in cloud-computing facilities. Opinions and comments can be collected from users on blogs and news sites. We may assume that structured data, stored in corporate databases, account for only a minor share of the big data explosion. As a consequence, it is far from trivial to extract the meaningful bits from those huge data repositories, in order to distinguish reliable information (or intelligence) from noise.
Information technology allows users to select and aggregate information from databases by means of query languages (such as SQL, the Structured Query Language). These languages implement the most common statistical functions, like mean, range and standard deviation. Where queries are not enough, scientific software packages allow skilled users to perform more advanced statistical analyses. Still, since the late 1990s a new set of technologies has emerged, collectively known as data mining, which allows the extraction of meaningful – but often unpredicted or counterintuitive – knowledge from large datasets. As technology progresses, engineers, statisticians and mathematicians must face a new epistemological challenge: Is the information processed by means of data mining reliable? Can it be considered scientific knowledge? Besides the fact that even the possibility of answering such questions is arguable, since the definition of science is not obvious, we can try to shed some light on them by sifting through the history of scientific thought.
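As a minimal sketch of this kind of query-based aggregation, the snippet below builds a hypothetical "measurements" table (the table and its columns are invented for illustration) and computes the mean, range and standard deviation mentioned above. SQLite has no built-in standard-deviation aggregate, so that value is computed on the Python side; many database systems offer a STDDEV() function directly.

```python
import sqlite3
import statistics

# Hypothetical sensor-readings table, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("s1", 10.0), ("s1", 12.0), ("s1", 14.0), ("s2", 9.0)],
)

# Mean and range via standard SQL aggregate functions.
mean, lo, hi = conn.execute(
    "SELECT AVG(value), MIN(value), MAX(value) FROM measurements"
).fetchone()
print(mean)      # 11.25
print(hi - lo)   # 5.0 (the range)

# SQLite lacks a STDEV aggregate, so compute it in Python instead.
values = [row[0] for row in conn.execute("SELECT value FROM measurements")]
print(statistics.pstdev(values))  # population standard deviation
```

Anything beyond such basic aggregates – regressions, clustering, rule mining – is where the dedicated statistical packages and data mining tools discussed above take over.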
Starting from the 16th and 17th centuries, the progress of modern science (during the Scientific Revolution) was supported by the collection of empirical data, that is, observable and measurable facts. This epistemic premise does not deny the existence of a non-observable reality, that is, metaphysics, but denies that it can be investigated scientifically. Even if it can be questioned whether empirical science appeared first and was then followed by epistemology or vice versa, the philosophers who first tried to formulate a scientific method in a formal (i.e. logically consistent) way were empiricists, like Francis Bacon and David Hume, who affirmed the primary role of the inductive method in the production of scientific knowledge and denied the validity of a priori knowledge (Descartes, more on the rationalist side, being one notable exception).
Even if it has become a wider concept in contemporary times, classic induction – as defined by Greek philosophers using the term epagoghè – essentially consists in the mental process of inferring a general conclusion from (possibly many) particular observations. Still, however large the amount of evidence, the conclusion can never be considered certain. In fact, a single observation is sufficient to reject it. For example, no matter how many black crows we observe, the conclusion that all crows are black can be proved wrong by a single non-black crow. Therefore conclusions obtained by means of induction are only probable. They cannot be considered universal (always true), but only contingent, leaving some room for inaccuracy and uncertainty.
For this and other reasons, during antiquity and the Middle Ages deduction – as formalized by Aristotle – was preferred. The Aristotelian method infers necessary conclusions from general premises, as in the famous syllogism: all men are mortal and Socrates is a man, therefore Socrates is mortal. Still, the fact that a deduction is formally or logically correct does not guarantee that the conclusion is true: all animals fly, donkeys are animals, therefore donkeys fly. The syllogism is correct, but the conclusion is false, since the main premise is false. A single non-flying animal (a donkey or a dog) proves the main premise (all animals fly) wrong. General premises should therefore be considered hypotheses, so that a syllogism assumes the following form: if A then B, as in computer logic.
Therefore, neither inference method can lead us to certain knowledge. Historically, it was induction which paved the way to modern scientific thought. However, the inductive method was countered by Karl Popper during the 20th century. In his opinion, the whole scientific method consists in stating hypotheses in order to solve problems. These hypotheses must then face the challenge of evidence, which can corroborate them or prove them wrong (“falsify” them, in Popper’s words), but never verify them (prove them true), as in the following purely deductive schema: Problem 1 → Hypothesis (tentative solution) → Error elimination (refutation) → Problem 2. Popper’s hypotheses are similar to Plato’s ideas, in the sense that they exist in the scientist’s mind a priori, even if they are often not universal (some are, like the hypotheses of mathematics, which can be demonstrated to be necessarily true or false in a deductive way). Most ideas are in fact assumptions, conjectures, like medical or biological theories. Any of these hypotheses can therefore be falsified, sooner or later – and partially or totally rejected – raising new, deeper problems, which will need new, possibly more inventive and courageous attempts at a solution (new hypotheses, to be tested against new evidence). According to Popper, the idea that scientific progress can be supported by means of induction is just an illusion, effectively challenged by different shades of grey crows, white flies, Australian black swans, “inductivist” turkeys (I have to thank Bertrand Russell for this one) and other statistical outliers. Hypotheses are simply triggered by the unexpected, when ideas do not adhere to reality (the observed facts). Apparently, classic Aristotelian logic has had the upper hand over Bacon’s induction.
Still, the birth of information technology was confronted with the following challenge: Can machines think? This question was put forward by the father of artificial intelligence, Alan Turing, in a paper published in Mind in 1950: “Computing Machinery and Intelligence”. Actually, the question did not concern the idea that machines could infer necessary conclusions in a deductive way, automatically – that is, “mechanically” – since that was taken for granted. This is the case, for example, of computation, where a computer obtains a result by applying mathematical rules, given a priori, to input numbers. Rather, it was the idea that computers could adopt inductive reasoning – that they could learn from experience, empirically – which was being questioned. We can therefore rephrase the question as follows: Can machines have the capacity for abstraction? This capacity is interestingly similar to the human faculty of imagination, the ability of “seeing” something which is not immediately perceived by the senses. It is needed, for example, to solve the strictest CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart). The question reminds us of the Scholastic problem of universals: What do similar objects have in common? What is the essence that they share?
Actually, statistical inference – the methodology by means of which scientific hypotheses are either rejected or accepted – is mainly inductive in nature. Samples, from which data are collected, must be as large as possible in order for the conclusions to be statistically significant, in which case they have the strength to generalize to the entire population from which the sample is drawn. Like any theory, Popper’s method now faces a new challenge. In an interesting article published in Wired in 2008, purposefully and prophetically titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Chris Anderson challenges the assumption of the deductive nature of the scientific method. He foresees the end of theory, that is, of Popper’s hypotheses, formulated before searching for empirical evidence which can only falsify or corroborate them. His opinion is that data can speak for themselves, without any pre-assumptions, such as those needed in classic statistical inference. The idea that any assumption can be eliminated has been effectively rejected by Massimo Pigliucci in “The end of theory in science?”, published in EMBO Reports in June 2009. In fact, he hits the mark when he states that collected data are always somehow selected, that observations are made according to pre-assumptions.
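The dependence of significance on sample size can be made concrete with a small worked example (a toy one-sided binomial test, not drawn from the article): the same observed success rate of 60% is inconclusive in a sample of 10 but significant at the 5% level in a sample of 100.

```python
from math import comb

def binom_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value
    for observing at least k successes under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 6 successes out of 10: too small a sample to reject the null.
print(binom_tail(10, 6))    # ~0.38, not significant

# 60 successes out of 100: same rate, but now significant at 5%.
print(binom_tail(100, 60))  # ~0.03, significant
```

The larger the sample, the stronger the license to generalize from it – which is exactly the inductive character of statistical inference noted above.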
A frequentist approach has led to many advances in science, for example in cryptanalysis. In this trade, analysts do not know the rules by means of which the ciphertext has been encoded. They may, however, know the frequency distribution of letters or of their combinations. Since the distribution of the symbols in the ciphertext must resemble that of letters in natural language, they can break the code provided that they are given a large enough portion of ciphertext. In fact, the probability that a symbol corresponds to a given letter is very high if the two have the same relative frequency. For example, since e is the most frequent letter in English texts, the symbol corresponding to it should be the most frequent in the ciphertext. This holds by the law of large numbers, which states that relative frequency tends to theoretical probability as the amount of evidence increases. Once the cipher has been broken by means of inductive methods, the unveiled rule can be applied deductively to break any other incoming ciphertext. Turing’s contribution to this endeavor is well known: he was part of the British intelligence team – based at Bletchley Park – which deciphered the German ENIGMA code during WWII.
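The inductive step described above can be sketched in a few lines for the simplest possible cipher, a Caesar shift (a deliberately toy example: real frequency analysis targets general substitution ciphers and uses digram statistics as well). The most frequent ciphertext symbol is assumed to stand for e, and the shift – the "unveiled rule" – is recovered from that single guess and can then be applied deductively to any further ciphertext.

```python
from collections import Counter

def shift(text: str, k: int) -> str:
    """Caesar-shift the letters of a lowercase text by k positions."""
    return "".join(
        chr((ord(c) - ord("a") + k) % 26 + ord("a")) if c.isalpha() else c
        for c in text.lower()
    )

def crack_caesar(ciphertext: str) -> int:
    """Inductive step: assume the most frequent symbol encodes 'e'."""
    counts = Counter(c for c in ciphertext.lower() if c.isalpha())
    top_symbol = counts.most_common(1)[0][0]
    return (ord(top_symbol) - ord("e")) % 26

plaintext = "the essence of empirical evidence emerges when we see enough of the message"
ciphertext = shift(plaintext, 3)

key = crack_caesar(ciphertext)
print(key)                        # 3: the shift is recovered
print(shift(ciphertext, -key))    # deductive step: decrypt with the rule
```

The guess succeeds only because the sample of ciphertext is large enough for e to dominate the frequency table – the law of large numbers at work.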
A similar approach is effectively used by Google Translate, allowing users to translate exotic languages without the need for the search engine to know their grammatical rules. It also underlies other challenging data mining tasks. In classification and association rule mining, for example, algorithms look for rules of the form if A then B, and can effectively unveil unpredicted patterns which may support marketing decisions or even medical diagnoses, provided that the relative frequency – or support – of the rule is high enough. Trivial rules, like if high temperature then flu, may have a strong support and few exceptions, but they are of little use (regardless of the fact that exceptions do not break the rule). Still, other rules may be found which would remain unspotted if investigations were made exclusively on the basis of pre-assumptions.
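The support of a rule like "if high temperature then flu" is easy to compute on a toy dataset (the patient records below are invented for illustration). Support is the fraction of records containing both antecedent and consequent; confidence, the companion measure in rule mining, is the fraction of antecedent records in which the consequent also holds – a rule with a few exceptions simply has confidence below 1.

```python
# Hypothetical patient records, each a set of observed attributes.
records = [
    {"high_temp", "flu"},
    {"high_temp", "flu"},
    {"high_temp"},               # an exception: fever without flu
    {"cough", "flu"},
    {"high_temp", "flu", "cough"},
]

def support(itemset: set, data: list) -> float:
    """Fraction of records containing every item in the itemset."""
    return sum(itemset <= record for record in data) / len(data)

def confidence(antecedent: set, consequent: set, data: list) -> float:
    """How often the consequent holds when the antecedent does."""
    return support(antecedent | consequent, data) / support(antecedent, data)

print(support({"high_temp", "flu"}, records))       # 3/5 = 0.6
print(confidence({"high_temp"}, {"flu"}, records))  # 3/4 = 0.75
```

Mining algorithms such as Apriori essentially search the space of itemsets for all rules whose support and confidence clear given thresholds – including the unpredicted ones the paragraph above alludes to.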
The big data phenomenon presents both risks and opportunities, including that of an epistemological upgrade of the way we do science today. As the match between induction and deduction goes on, I shall refrain from attempting to put an end to it in this article, leaving the conclusion to Juvenal: Rara avis in terris, nigroque simillima cygno. Corvo quoque rarior albo. (“A bird as rare upon the earth as a black swan. Rarer even than a white crow.”) A perfect epitome for the scientific method.
* Davide Barbieri, Dep. of Biomedical Sciences and Surgical Specialties, University of Ferrara (Italy), firstname.lastname@example.org