Big Data and Ubuntu

Big Data is a term that has morphed into a buzzword doing the rounds in today’s information technology circles. Managers speak of Big Data as a concept for somehow making use of the information that businesses collect from their customers. Computer system and software vendors see in it a commercial opportunity, and advocates of human rights and freedom are concerned about the social and personal implications - after all, Big Data rhymes well with Big Brother. But what connections are there between the concept of Big Data and the Ubuntu distribution? Let us start out by trying to define the concept. It should be stressed that no definition can be very definite or precise, since the boundaries keep shifting in computer science, and the concept of Big Data sits at about the limit of what is possible with today’s off-the-shelf hardware. Tomorrow’s technology may change the basic tenets.



Naturally, Big Data is, in the first place, about large amounts of data, as the name itself implies. “Large” may have different meanings depending on context, so let us settle on the idea that “large” in “Big Data” is functionally equivalent to a data set “larger than what can be handled (stored or processed) in a reasonable amount of time on a single computer.” As you can see, there is some imprecision here, since capacities in terms of processing power and disk space vary quite a bit between my laptop and a large mainframe computer. But it is clear from this definition that parallel processing is very much part of what Big Data is all about. This being said, parallel processing can very well take place in separate threads inside a single computer, so caveats apply.

Another way of clarifying the concept is by considering how Big Data is generated in the first place. In the initial stages of computer science, data was collected mostly as written documents (printed matter, forms), and then entered painstakingly by hand into computers. Large companies employed whole rooms of people whose task was to type data onto punched cards, paper tape, and later magnetic media.



This is no longer the case. Nowadays, large amounts of commercial data are actually entered by the users themselves. One of the effects of the growth of e-commerce, and its companions e-business and e-administration, is that ordinary people end up filling in more forms, and with larger amounts of information, than in the days of paper forms. The electronic nature of the input data also makes processing much easier and faster. Finally, large amounts of information are now available that have been generated not through the active participation of a human being, but simply through automatic means.

Some examples may help clarify this. Let us say Average Joe walks out of his house one fine morning, gets into his car and goes to the grocery two streets away for a loaf of bread. Naturally, he has his mobile phone in his pocket, so his local telephone utility company has automatic information on his whereabouts by tracking which cell base-stations his phone connects to. If he has left GPS enabled on his phone, the result may be both a depleted battery and the ability of any app running in the background, and authorized to consult GPS data, to track his physical movements. If his route has passed in front of a police automatic license plate reader, data has been generated on his car’s movements. Finally, if he has used a credit card to pay for his acquisition at the store, at least two different financial organizations (his bank and the store’s bank) now have data on the whereabouts of his credit card, and its usage patterns.


It should be stressed that all of this data has been collected purely by automatic means, through the use of machines that are never turned off. Some of the information may be considered private (a private transaction between Average Joe and his baker), but much of it will in fact be considered public information in many jurisdictions. A street is, by nature, a public place, and nobody can have valid expectations of privacy concerning his movements when exposed to public view.

Processing Big Data is also quite a handful. Although there is a wide variety of programming languages and data access APIs that can be used to “do” Big Data, the main paradigm is the MapReduce pattern introduced by Google, and nowadays often implemented by Hadoop. To produce an output from large amounts of data, two individual steps are taken - and both are parallelizable. In the first “Map” step, data goes through filtering and sorting. The output from this step is then piped into a second “Reduce” step, where global outputs can be calculated. For instance, consider a classical SQL query such as the largely self-explanatory:

SELECT SUM(salary) FROM employee WHERE division = 'Logistics'

In classical computing, this would require all employee files to be held within a single relational database table, which is then read sequentially - perhaps using an index - to select the rows in which the field “division” equals “Logistics”. The contents of the field “salary” in the chosen rows are then added up and returned.
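For readers who want to try this out, the classical single-machine version of such a query can be reproduced with Python’s built-in sqlite3 module. The employee names and salary figures below are invented purely for illustration:

```python
import sqlite3

# Build a throwaway in-memory database holding a hypothetical employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, division TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ann", "Logistics", 2500.0),
     ("Bob", "Sales", 2200.0),
     ("Carol", "Logistics", 2700.0)],
)

# The classical approach: scan the single table and sum the matching rows.
(total,) = conn.execute(
    "SELECT SUM(salary) FROM employee WHERE division = 'Logistics'"
).fetchone()
print(total)  # 5200.0
```

Here the whole table fits comfortably in one database on one machine - which is precisely the assumption that Big Data breaks.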


In Big Data, things are a bit different, since the employee records will probably be spread out over several different physical computers. Actually, this is seen as desirable, since each individual record can be replicated several times, thus giving us redundancy. This can be done automatically, for example using Hadoop’s HDFS (Hadoop Distributed File System). In the “Map” step, each individual worker node will analyse the records at its disposal, producing an intermediate output that consists of, for instance, the names and salaries of those employees that satisfy the criterion: division = “Logistics”. This data is then input into “Reduce”, where possible duplicates are eliminated and the final total is computed.

Two interesting considerations arise here. In the first place, each worker node that performs the initial selection and sorting is handling large amounts of data. Thus, it makes sense to try and keep the distance between the worker and the data it needs to work on as small as possible. In an ideal situation, the data would reside physically on the same computer system as the worker itself. A second point is that intermediate results are a digested version of the original data; for this reason they will, in most cases, occupy much lower volumes than the original complete set. They will then take up less space and network bandwidth when they need to be reshuffled and distributed to other workers in preparation for the “Reduce” phase.
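The Map and Reduce steps just described can be sketched in a few lines of Python. This is only a toy model of what a framework like Hadoop does: the partitions below stand in for records stored on separate worker nodes, and the names and salaries are made up for the example:

```python
from collections import defaultdict

# Hypothetical employee records, pre-partitioned across three "worker nodes".
partitions = [
    [("Ann", "Logistics", 2500.0), ("Bob", "Sales", 2200.0)],
    [("Carol", "Logistics", 2700.0)],
    [("Dave", "Accounting", 2100.0), ("Eve", "Logistics", 2300.0)],
]

def map_step(records):
    # Each worker filters its local records and emits (key, value) pairs.
    return [("Logistics", salary)
            for name, division, salary in records
            if division == "Logistics"]

def reduce_step(pairs):
    # Pairs are grouped by key, and each group's values are summed.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# "Shuffle": gather the intermediate output from every worker, then reduce.
intermediate = [pair for part in partitions for pair in map_step(part)]
result = reduce_step(intermediate)
print(result)  # {'Logistics': 7500.0}
```

Note how the intermediate (key, salary) pairs are much smaller than the full records - this is the “digested version of the original data” mentioned above, which keeps the shuffle cheap.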


Now, let us consider the role our favourite operating system has to play, regarding both the storage and processing of Big Data, and its original collection. It will come as no surprise to the reader that most servers on the Internet run some distribution of the GNU/Linux operating system. This is pertinent to the handling of Big Data, since the very concept could be said to have been invented by Google - which is in a privileged position to collect large amounts of data on its users, and which has been a market leader in trying to make commercial use of this information. Google is also known to be a large GNU/Linux user, at various levels (its own servers, in software development, and as a basis on which to build Android).

There are also quite a few large mainframes in service, running various operating systems. However, following a trend that started many years back, mainframe resources are often parcelled out into various virtual machines. System builders such as IBM are major players in this field, for example with the zSeries and “Linux on z”: many instances of virtual machines running Linux coexist within the mainframe’s processor and memory space. Others, such as Amazon, run large clouds of virtual machines, though on smaller servers with Intel x86_64 processors.

This is where Ubuntu comes in. There are several choices of Linux distribution for a virtual machine, but in actual practice a choice is often made either from the Red Hat family of distributions (Red Hat Enterprise Linux with a paid subscription, or CentOS without), or from Debian’s. In the latter case, we find either Debian itself, without much choice of paid support, or Ubuntu Server with or without commercial support.


The insistence on the availability of paid support options may seem strange to some users. However, it should be taken into account that operating systems used on servers are in a commercial environment. Information systems are mission-critical to businesses’ workflow. Computer department heads are under pressure to deliver, and to make sure they can continue to deliver in a timely manner. With these considerations taken into account, it makes sense to pay for quality service to make sure that, if and when problems occur, they can be dealt with using not only the company’s own resources, but also high-quality external expertise.

This explains why most large Linux distributions propose specific solutions to set up and configure virtual computers in the cloud, often prominently displayed on their web pages. CentOS proposes “a generic cloud-init enabled image” within its first paragraph (https://www.centos.org/). Both Red Hat (http://www.redhat.com/en/insights/openstack) and Ubuntu (http://www.ubuntu.com/cloud/openstack) are actively involved in building cloud-based farms of virtual servers using OpenStack, thus making convergence between the two Linux server distributions quite straightforward.

Virtual (cloud) server technology is often applied to Big Data processing. In the first place, as has already been pointed out, it makes sense to place the workers near the data they will be working on, thus reducing network overhead. Moreover, most data processed by large organizations is already in the cloud, having been collected through e-commerce servers that are themselves virtual machines. When collection, storage and processing take place within the same physical facility, data transmission costs are null to negligible, and transfers can take advantage of the server farm’s LAN infrastructure for speed.


In the second place, using virtualization as a basis for data processing means that organizations which need to process large amounts of data no longer need to acquire and maintain large server farms. The infrastructure costs are externalized to cloud computing providers such as Amazon, and capacity is leased only as needed. This introduces more flexibility, since smaller or larger numbers of servers may be provisioned according to the size or complexity of each specific problem or data set. Big Data processing today seems to be firmly in the domain of Linux-based virtual machines in the cloud, with Ubuntu as at least one of the main players in the field.

But what about data collection in the first place? Please take a moment to navigate to the Ubuntu project’s homepage, http://www.ubuntu.com/, and consider the main menu options. Besides “Cloud”, “Server” and “Desktop”, we find three further options that relate to devices that may be used for Big Data collection: “Phone”, “Tablet” and “Things”. This last category can be interpreted as Canonical’s interest in putting a version of Ubuntu (Core) on relatively lightweight and inexpensive computing devices, mostly based on versions of the same ARM platform that powers most phones and tablets.


Once these are used to make Average Joe’s consumer electronics smarter - televisions, car entertainment systems, heating systems, and more - and above all more connected to the Internet, the possibilities are endless for collecting data and forwarding it on to whatever service provider gets in on the act. Average Joe may even benefit from the innovation. Obtaining real-time access to traffic conditions while driving may be seen as a useful bit of progress for the large part of humanity that lives in congested urban areas. Being able to monitor and fine-tune home central heating at a distance, thus reducing heating bills and carbon emissions, cannot be seen as a bad proposition. Ubuntu Core is a perfect fit for this type of application, since its modular structure fits in well with providing “just enough operating system” for lightweight hardware, leaving the system integrator with just the task of building his own code module for the specific task required of the system.

So, basically, Ubuntu is well set to occupy large chunks of the Big Data ecosystem, from storage servers and processing workers in virtual machines, down to the very smart devices that populate data sets. There is just one niggling doubt at the back of my mind. What about individual freedom, including the freedom not to be tracked - not only in the digital world, but also in our real-life physical existence? When will the community - and Canonical itself - stand up and clearly state its position on the matter? Just saying…

issue106/monopinion.1456756191.txt.gz · Last modified: 2016/02/29 15:29 by auntiee