Big data: An epistemological revolution?

Guest Post by Davide Barbieri*

Over the last few decades, large organizations – multinational companies, governments, hospitals, public administrations, law enforcement agencies and the like – have accumulated huge quantities of data, often in the form of unstructured spreadsheets, emails and text documents. SCADA (Supervisory Control and Data Acquisition) systems have automatically collected production data from sensors and machines, while cheap storage devices, such as terabyte-scale hard disks, have reduced the need to filter the acquired data in advance according to statistical criteria. The Internet has increased the scale of the phenomenon by orders of magnitude. The mesh-like topology of the network allows data to be communicated and spread quickly and efficiently, or to be stored in cloud-computing facilities. Opinions and comments can be collected from users on blogs and news sites. We may assume that structured data, stored in corporate databases, account for only a minor part of the big data explosion. As a consequence, it is far from obvious how to extract the meaningful bits from those huge repositories, in order to distinguish reliable information (or intelligence) from noise.

Information technology allows users to select and aggregate information from databases by means of query languages (such as SQL, the Structured Query Language). Such languages implement the most common statistical functions, like mean, range and standard deviation. When queries are not enough, scientific software packages allow skilled users to perform more advanced statistical analyses. Still, since the end of the 1990s, a new set of technologies has emerged, collectively known as data mining, which allows meaningful – but often unpredicted or counterintuitive – knowledge to be extracted from large datasets. As technology progresses, engineers, statisticians and mathematicians must face a new epistemological challenge: Is the information processed by means of data mining reliable? Can it be considered scientific knowledge? Besides the fact that even the possibility of answering such questions is debatable, since the definition of science is not obvious, we can try to shed some light on them by sifting through the history of scientific thought.
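As a concrete illustration of the first step, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical measurements table (the table, columns and values are invented for the example); the aggregation is performed directly in SQL, with the range computed as MAX minus MIN since SQLite has no built-in standard deviation function.

```python
import sqlite3

# Hypothetical sensor readings, standing in for SCADA-style production data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("s1", 20.1), ("s1", 21.4), ("s1", 19.8), ("s2", 35.0), ("s2", 36.2)],
)

# AVG, MIN and MAX are standard SQL aggregate functions; the range is MAX - MIN.
query = """
    SELECT sensor, AVG(value) AS mean, MAX(value) - MIN(value) AS range_
    FROM measurements
    GROUP BY sensor
"""
for sensor, mean, value_range in conn.execute(query):
    print(sensor, round(mean, 2), round(value_range, 2))
```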

Starting from the XVI-XVII centuries, the progress of modern science (during the Scientific Revolution) was supported by the collection of empirical data, that is, observable and measurable facts. This epistemic premise does not deny the existence of a non-observable reality, that is, metaphysics, but it does deny that such a reality can be investigated scientifically. Even if it can be debated whether empirical science appeared first and was then followed by epistemology, or vice versa, the philosophers who first tried to formulate a scientific method in a formal (i.e. logically consistent) way were empiricists, such as Francis Bacon and David Hume, who affirmed the central role of the inductive method in the production of scientific knowledge and denied the validity of a priori knowledge (Descartes, more on the rationalist side, is one notable exception).

Even though it has become a broader concept in contemporary times, classic induction – as defined by the Greek philosophers with the term epagoghè – essentially consists in the mental process of inferring a general conclusion from (possibly many) particular observations. Still, however large the body of evidence, the conclusion can never be considered certain: a single contrary observation is sufficient to reject it. For example, no matter how many black crows we observe, the conclusion that all crows are black can be proved wrong by a single non-black crow. Conclusions obtained by means of induction are therefore only probable. They cannot be considered universal (always true), but only contingent, leaving some room for inaccuracy and uncertainty.

For this and other reasons, during antiquity and the Middle Ages deduction – as formalized by Aristotle – was preferred. The Aristotelian method infers necessary conclusions from general premises, as in the famous syllogism: All men are mortal and Socrates is a man, therefore Socrates is mortal. Still, the fact that a deduction is formally or logically correct does not guarantee that the conclusion is true: All animals fly, donkeys are animals, therefore donkeys fly. The syllogism is valid, but the conclusion is false, since the main premise is false. A single non-flying animal (a donkey or a dog) is enough to prove the main premise (all animals fly) wrong. General premises should therefore be considered hypotheses, so that a syllogism takes the following form: If A then B, as in computer logic.

Therefore, neither inference method can lead us to certain knowledge. Historically, it was induction that paved the way to modern scientific thought. However, the inductive method was countered by Karl Popper during the XX century. In his view, the whole scientific method consists in stating hypotheses in order to solve problems. These hypotheses must then face the challenge of evidence, which can corroborate them or prove them wrong (“falsify” them, in Popper’s words), but never verify them (prove them true), according to the following purely deductive schema: Problem 1 → Hypothesis (tentative solution) → Error elimination (refutation) → Problem 2. Popper’s hypotheses are similar to Plato’s ideas, in the sense that they are in the scientist’s mind a priori, even if they are often not universal (some are, like the hypotheses of mathematics, which can be demonstrated to be necessarily true or false in a deductive way). Most are in fact assumptions, conjectures, like medical or biological theories. Any of these hypotheses can therefore be falsified, sooner or later – and partially or totally rejected – raising new, deeper problems, which will require new, possibly more inventive and courageous attempts at a solution (new hypotheses, to be tested against new evidence). According to Popper, the idea that scientific progress can be supported by induction is just an illusion, effectively challenged by crows in different shades of grey, white flies, Australian black swans, “inductivist” turkeys (I have to thank Bertrand Russell for this one) and other statistical outliers. Hypotheses are simply triggered by the unexpected, when ideas do not adhere to reality (the observed facts). Apparently, classic Aristotelian logic has had the upper hand over Bacon’s induction.

Still, the birth of information technology raised the following challenge: Can machines think? This question was put forward by the father of artificial intelligence, Alan Turing, in a paper published in Mind in 1950: “Computing Machinery and Intelligence”. Actually, the question did not concern the idea that machines could infer necessary conclusions in a deductive way, automatically – that is, “mechanically” – since that was taken for granted. This is the case, for example, of computation, where a computer obtains a result by applying mathematical rules, given a priori, to input numbers. Rather, what was being questioned was whether computers could adopt inductive reasoning – whether they could learn from experience, empirically. We can therefore rephrase the question as follows: Can machines have the capacity for abstraction? This capacity is interestingly similar to the human faculty of imagination, the ability to “see” something which is not immediately perceived by the senses. It is needed, for example, to solve the strictest CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). The question recalls the Scholastic problem of universals: What do similar objects have in common? What is the essence that they share?

Actually, statistical inference – the methodology by means of which scientific hypotheses are either rejected or accepted – is mainly inductive in nature. Samples, from which data are collected, must be as large as possible in order for the conclusions to be statistically significant, in which case they have the strength to generalize to the entire population from which the sample is drawn. Like any theory, Popper’s method is now facing a new challenge. In an interesting article published by Wired in 2008, purposefully and prophetically titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Chris Anderson challenges the assumption of the deductive nature of the scientific method. He foresees the end of theory, that is, of Popper’s hypotheses, formulated before searching for the empirical evidence which can only falsify or corroborate them. In his opinion, data can speak for themselves, without any pre-assumptions such as those needed in classical statistical inference. The idea that all assumptions can be eliminated has been effectively rejected by Massimo Pigliucci in “The end of theory in science?”, published in EMBO Reports in June 2009. He is right to the point when he states that collected data are somehow selected, that observations are made according to pre-assumptions.
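To make the inductive character of statistical inference concrete, here is a minimal sketch in Python (the population mean and standard deviation below are invented for the example): as the sample grows, the approximate 95% confidence interval around the sample mean narrows, and the generalization to the population from which the sample is drawn becomes stronger.

```python
import random
import statistics

random.seed(42)

# Hypothetical population parameters, unknown to the "scientist" in practice.
population_mean, population_sd = 100.0, 15.0

for n in (10, 100, 1_000, 10_000):
    sample = [random.gauss(population_mean, population_sd) for _ in range(n)]
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / n ** 0.5  # standard error of the mean
    low, high = mean - 1.96 * sem, mean + 1.96 * sem  # approximate 95% CI
    print(f"n = {n:>6}: sample mean = {mean:6.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```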

A frequentist approach has led to many advances in science, for example in cryptanalysis. In this trade, analysts do not know the rules by which the ciphertext has been encoded, but they may know the frequency distribution of letters, or of their combinations, in the underlying language. Since the distribution of symbols in the ciphertext must resemble that of letters in natural language, they can break the code provided they are given a large enough portion of ciphertext. In fact, the probability that a symbol corresponds to a given letter is very high if the two have the same relative frequency. For example, since e is the most frequent letter in English texts, the symbol that stands for it should be the most frequent in the ciphertext. This holds by the law of large numbers, which states that relative frequency tends to theoretical probability as the amount of evidence increases. Once the cipher has been broken by inductive means, the unveiled rule can be applied deductively to break any other incoming ciphertext. Turing’s contribution to this endeavor is well known: he was part of the British intelligence team – based at Bletchley Park – that deciphered the Nazi-German ENIGMA code during WWII.
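Here is a minimal sketch of this inductive-then-deductive pattern, assuming a toy ciphertext produced by a simple Caesar (shift) cipher; the sentence and the shift are invented for the example, and real ciphers such as ENIGMA are of course far harder to break.

```python
from collections import Counter
import string

# Toy ciphertext: an English sentence enciphered with an (unknown) Caesar shift.
ciphertext = "ZH VHH WKH HQHPB YHVVHO QHDU WKH JUHHQ WUHHV"

# Inductive step: count symbol frequencies in the ciphertext.
frequencies = Counter(c for c in ciphertext if c.isalpha())
most_common_symbol = frequencies.most_common(1)[0][0]

# By the law of large numbers, the most frequent ciphertext symbol most likely
# stands for 'E', the most frequent letter in English text.
shift = (ord(most_common_symbol) - ord("E")) % 26

# Deductive step: once the rule (the shift) has been unveiled, it can be
# applied mechanically to this and any other incoming ciphertext.
def decrypt(text: str, shift: int) -> str:
    out = []
    for c in text:
        if c in string.ascii_uppercase:
            out.append(chr((ord(c) - ord("A") - shift) % 26 + ord("A")))
        else:
            out.append(c)
    return "".join(out)

print(decrypt(ciphertext, shift))  # WE SEE THE ENEMY VESSEL NEAR THE GREEN TREES
```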

A similar approach is effectively used by Google’s translator, allowing users to translate exotic languages without the search engine needing to know their grammatical rules. It is also implemented in other challenging data mining tasks. In classification, for example, algorithms look for rules of the form if A then B and can unveil unpredicted patterns, which may support marketing decisions or even medical diagnosis, provided that the relative frequency – or support – of the rule is high enough. Trivial rules, like if high temperature then flu, may have strong support and few exceptions, but they are of little use (regardless of the fact that exceptions do not break the rule). Still, other rules may be found which would remain unspotted if investigations were made exclusively on the basis of pre-assumptions.
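Here is a minimal sketch of the idea over a handful of hypothetical patient records (the items and counts are invented): the support of the rule “if high temperature then flu” is the fraction of records containing both items, while its confidence – a standard data mining measure not mentioned above – quantifies how often the rule holds when its antecedent is present, so that exceptions lower it without “breaking” the rule.

```python
# Hypothetical patient records, each a set of observed items.
records = [
    {"high_temperature", "flu"},
    {"high_temperature", "flu"},
    {"high_temperature"},                  # an exception: fever without flu
    {"cough", "flu"},
    {"high_temperature", "flu", "cough"},
]

antecedent = {"high_temperature"}
consequent = {"flu"}
both = antecedent | consequent

# Support: how often antecedent and consequent occur together.
support = sum(both <= r for r in records) / len(records)
# Confidence: how often the rule holds when the antecedent is present.
confidence = sum(both <= r for r in records) / sum(antecedent <= r for r in records)

print(f"if high temperature then flu: support = {support:.2f}, confidence = {confidence:.2f}")
```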

The big data phenomenon presents both risks and opportunities, including that of an epistemological upgrade of the way we do science today. As the match between induction and deduction goes on, I shall refrain from attempting to put an end to it in this article, leaving the conclusion to Juvenal: Rara avis in terris, nigroque simillima cygno. Corvo quoque rarior albo (a rare bird on earth, most like a black swan; even rarer than a white crow). A perfect epitome for the scientific method.

* Davide Barbieri, Dep. of Biomedical Sciences and Surgical Specialties, University of Ferrara (Italy), davide.barbieri@unife.it


Comments

  1.

    Dear prof Mayer,

    I totally agree on this one. The fact that the human mind is not a tabula rasa is well established.
    Only, it would be interesting to know whether those frames are innate and how they evolve during our lifetime thanks to experience. Any cognitivist here?

    Also, the main point was: There seems to be an induction revival, thanks to big data, which give machine learning statistical significance. Does it make any sense from an epistemological point of view?
    Thanks for your comments,

  2. lisa winter says

    I agree with both of you. Very good article! And with regard to the chances and risks of Big Data: this is absolutely going to be a huge topic in the future, and I’m pretty excited to see the developments.
