• Hadoop: Tutorial and BigData

    What’s Hadoop?

    Hadoop is a framework, or set of tools, that enables tasks to be partitioned and split across multiple servers and nodes on a network. Hadoop then provides the framework needed to MAP and REDUCE a process into multiple chunks or segments.

    Hadoop has multiple projects that include:

    Hive, HBase, Chukwa, Pig, Spark, Tez, and some others. Each is designed for a particular purpose: for instance, Hive is a data warehouse that provides data summarization and ad hoc querying, and HBase is a database that supports structured data storage for large tables.

    However, the core projects are: Common, HDFS, YARN (job scheduling and cluster management), and MapReduce.

    source: http://hadoop.apache.org  

    High-Level Architecture of Hadoop

    As shown in the figure from opensource.com, Hadoop includes a Master Node and Slave Node(s). The Master Node contains a TaskTracker and a JobTracker that interface with all the Slave Nodes.

    The MapReduce Layer is the set of applications used to split the process at hand across several Slave Nodes. Each Slave Node processes a piece of the problem and, once it completes, the result is passed from the “Mapping” step to the “Reducing” step.

    High Level Architecture of Hadoop

    MapReduce workflow

    The figure shows the MapReduce logic:

    • On the left side, BigData is a set of files or one huge file, such as a huge log file or a database.
    • HDFS refers to the “Hadoop Distributed Filesystem,” which copies parts of the data and splits them across the whole cluster so that the partial results can be merged later.
    • The generated output is then copied over to a destination node.

    MapReduce Workflow
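The workflow above can be sketched in plain Python, without Hadoop, to make the three phases visible. This is only an illustration under simplifying assumptions: everything runs in one process, each input line stands in for the chunk given to one node, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_word_count`) are made up for this sketch rather than taken from any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line."""
    for word in line.split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, occurrences):
    """Reducer: collapse all values for one key into a single count."""
    return word, sum(occurrences)


def run_word_count(lines):
    # Each line stands in for the chunk of input handed to one node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample_lines = ["big data big cluster", "data node"]
    print(run_word_count(sample_lines))
    # prints {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster the shuffle step is done by the framework, which is why a MapReduce job usually only defines the mapper and the reducer.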

    Example of MapReduce

    For example, let’s say we need to count the number of words in a file, assigning a line to each server in the Hadoop cluster. We can run the following code. MRWordCounter() splits each line into words and maps every word to a count of 1; the reducer then sums the counts for each word.

    from mrjob.job import MRJob
    
    class MRWordCounter(MRJob):
        def mapper(self, key, line):
            for word in line.split():
                yield word, 1
    
        def reducer(self, word, occurrences):
            yield word, sum(occurrences)
    
    if __name__ == '__main__':
        MRWordCounter.run()

    Using: mrjob

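    A music example, from [2], looks like this:

    from mrjob.job import MRJob
    import track  # helper module assumed to ship with the example in [2]; not part of mrjob

    class MRDensity(MRJob):
        """A map-reduce job that calculates the density of each track."""

        def mapper(self, _, line):
            """Load a track from one input line and yield its density."""
            t = track.load_track(line)
            # Skip lines that fail to load and tracks without a usable tempo.
            if t and t['tempo'] > 0:
                # Density is the number of segments divided by the track duration.
                density = len(t['segments']) / t['duration']
                yield (t['artist_name'], t['title'], t['song_id']), density

    if __name__ == '__main__':
        MRDensity.run()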

    As shown here, the mapper takes one line of the file and uses the “track.load_track()” function to obtain the “tempo”, the “segments”, the “duration”, and the rest of the metadata, then divides the number of segments by the duration to create a density value.

    In this particular case there is no need for a Reduce step: the work is simply split across all the Hadoop nodes, and each mapper emits its results directly.
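A map-only job like this one can be tried out locally with a few records. The sketch below is a stand-in, not the original code: it assumes the tracks are already parsed into dictionaries (so it does not call `track.load_track()`), and the two records are invented purely for illustration.

```python
def density_mapper(track):
    """Map-only step: yield (key, density) for one already-parsed track."""
    # Skip missing records and tracks without a usable tempo.
    if track and track["tempo"] > 0:
        density = len(track["segments"]) / track["duration"]
        yield (track["artist_name"], track["title"], track["song_id"]), density


# Hypothetical records, invented for illustration only.
tracks = [
    {"artist_name": "Artist A", "title": "Song 1", "song_id": "S1",
     "tempo": 120.0, "duration": 200.0, "segments": [0] * 800},
    {"artist_name": "Artist B", "title": "Song 2", "song_id": "S2",
     "tempo": 0.0, "duration": 180.0, "segments": [0] * 500},
]

# No reducer: the output is just the concatenation of every mapper's output.
results = [pair for t in tracks for pair in density_mapper(t)]

if __name__ == "__main__":
    for key, density in results:
        print(key, density)
```

The second record is dropped because its tempo is 0, which mirrors the filter in the mapper. Since every record is handled independently, there is nothing to merge afterwards.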

    Server Components

    As shown in the figure below from Cloudera, Hadoop uses HDFS as the lower-layer filesystem. HBase resides between HDFS and MapReduce (HBase can be used by MapReduce), and on top of MapReduce we have Pig, Hive, Sqoop, and many other systems. Above those sit tools such as an RDBMS connected through Sqoop, BI reporting on Hive, or any other tool.

    hadoop ecosystem

     

    Download Hadoop

    To download Hadoop, go to https://hadoop.apache.org/releases.html

    References

    [1] Hadoop Tutorial 1-3
    [2] http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
    [3] http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
    [4] https://hadoop.apache.org/docs/r1.2.1/cluster_setup.html#MapReduce

     

  • Cloud: Docker and XenServer

    February 6, 2016 eglacorp Cloud, Linux, software

    Docker and XenServer: What’s Virtualization

    According to a Google search, virtualization is defined as follows:

    “Virtualization is the creation of a virtual (rather than actual) version of something, such as an operating system, a server, a storage device or network resources.”

    In general, virtualization is nothing more than using the software of the original operating system, e.g. Linux on, let’s say, an 8-core Intel processor, to present a different machine, such as a 2-core machine whose ARM code runs on the x86 host. This process of generating a virtual machine that runs on top of another machine is called virtualization. Many of us became familiar with QEMU and similar projects for emulating mobile devices and other processors; back then they were called “emulators.” Virtualization achieves much the same effect, but it may differ in kernel and driver usage, which is handled by what is generally called a hypervisor. In simple words, a hypervisor is nothing more than an operating system designed to run virtual machines efficiently. Examples include VMware, Hyper-V, XenServer, and even VirtualBox (Hypervisor lists).

    We use XenServer in our cloud, and it has proven to be efficient with minimal overhead. We also use VMware, but because of its price we limit the number of instances, and we have used VirtualBox on our laptops.

    We can then run Windows over Mac, Mac over Windows, and many other combinations. With a good machine, this gives wide flexibility in the software packages we can use.

    Obviously each hypervisor has limitations, from scalability to which types of network cards are exposed to the virtual machines (VMs) and how those resources are virtualized. As an example, Hyper-V can handle 1024 VMs per host and 320 processors, whereas XenServer handles only 160 processors and 50 – 130 hosts. There are also limits on the amount of memory per VM that Hyper-V can handle. One paper with an interesting comparison, http://ijicse.in/wp-content/uploads/2015/07/v2i3-14.pdf, concludes:

    “Our results indicate that Xen Hypervisor, which uses Para-virtualization, was not able to outperform ESXi, which uses full-virtualization. VMware ESXi Server is far better to meet the demand of an enterprise than the Xen hypervisor.”

     

    Docker

    Docker changes many of the ways we do and see virtualization. Starting from a default Linux baseline, you can run a Docker image; a Docker image is nothing more than a file that contains all the components required to run a virtual server. The great thing about Docker is that this file, or image, runs as what is called a “Linux container,” and it includes:

    • scripts
    • configuration files
    • virtualenvs
    • jars
    • tarballs
    • etc

    A Docker image is nothing but a file, much like the image that a VMware or XenServer hypervisor runs. In this case, however, the operating system, let’s say Ubuntu, handles all the context switching and management, and the Docker container behaves as a process.

    XenServer Virtualization

    XenServer is a hypervisor that I am very familiar with. It can be installed on almost any hardware, and VMs can be moved and ported between XenServer instances. Once you connect to the XenServer box, you can launch or start a VM and access its console. The console connection uses the standard VNC protocol, usually on ports 5900 and above.
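The “5900 and above” range follows the usual VNC convention that display number N listens on TCP port 5900 + N. The sketch below shows that mapping plus a simple reachability check. How a given XenServer host assigns consoles to display numbers is an assumption here, and the helper names `vnc_port` and `is_port_open` are made up for this example.

```python
import socket

VNC_BASE_PORT = 5900  # display :0 listens here; display :N uses 5900 + N


def vnc_port(display):
    """Return the TCP port conventionally used by VNC display `display`."""
    if display < 0:
        raise ValueError("display number must be non-negative")
    return VNC_BASE_PORT + display


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    for display in range(3):
        print(display, vnc_port(display))
```

To probe a host, call something like `is_port_open("xenserver.example.com", vnc_port(0))`, where the hostname is a placeholder for your own server.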

    Docker over XenServer Virtualization

    This seems like overkill in many cases. You may have, for example, a XenServer machine running one or several virtual computers. Let’s say you decide to load Ubuntu 14 LTS on XenServer. After a while the Ubuntu machine is ready to go, and then you run a Docker container on top of this configuration.

    However, Citrix understands this situation and has created a supplementary pack for Docker.

    xe-install-supplemental-pack xscontainer-6.5.0-100205c.iso
    mount: xscontainer-6.5.0-100205c.iso is write-protected, mounting read-only
    Installing 'XenServer Container Management'...
    
    Preparing...                ########################################### [100%]
       1:guest-templates        ########################################### [ 50%]
    Waiting for xapi to signal init complete
    Removing any existing built-in templates
    Regenerating built-in templates
       2:xscontainer            ########################################### [100%]
    Pack installation successful.

    Source: http://xenserver.org/discuss-virtualization/virtualization-blog/entry/preview-of-xenserver-support-for-docker-and-container-management.html

    Once you install this supplemental pack, XenServer is aware of container-managed VMs. As shown in this capture, the machine hp-d385-1 has a virtual machine called CoreOS-817 that includes hadoop, tomcat, mysql, and apache containers, which can be launched from the XenServer user interface.

    xenserver docker container

    You may wonder: XenServer -> Ubuntu -> Docker? Will this be too much overhead? I have not done the benchmarking to compare XenServer with Docker. However, a performance comparison has been presented.

    Source: https://www.mirantis.com/blog/need-openstack-use-docker/ 

    The performance figures show similar results for KVM, Docker, and bare metal, but more studies are required to confirm this. Further analysis is needed to determine whether a Docker container running on a native machine really outperforms a KVM guest running a Docker instance.

    Performance Analysis: Docker vs. KVM

    The performance of KVM vs. Docker has been researched as well. An IBM team took the time to compare the two: http://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf

    As shown in the paper’s figures, Docker’s latency is lower than KVM’s. The same holds for storage: with an EXT4 filesystem, the CDF (%) for Docker is better than for KVM and as good as native. This is expected, since Docker just runs as a process on the native host. In fact, the researchers conclude what I initially stated, that layering containers on a hypervisor with a VM is not good practice:

    “We also question the practice of deploying containers inside VMs, since this imposes the performance overheads of VMs while giving no benefit compared to deploying containers directly on non-virtualized Linux. If one must use a VM, running it inside a container can create an extra layer of security since an attacker who can exploit QEMU would still be inside the container.”
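For readers unfamiliar with the CDF (%) plots mentioned above: a CDF point at latency x gives the percentage of measurements that finished at or below x, so a curve further to the left means lower latency. Here is a minimal sketch of how such a curve is built from raw samples. The numbers are invented for two unnamed configurations and are not measurements from the paper.

```python
def empirical_cdf(samples):
    """Return sorted (value, cumulative_percent) points for the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(value, 100.0 * (i + 1) / n) for i, value in enumerate(ordered)]


def percent_at_or_below(samples, threshold):
    """Percentage of samples <= threshold (one point read off the CDF)."""
    if not samples:
        raise ValueError("need at least one sample")
    return 100.0 * sum(1 for s in samples if s <= threshold) / len(samples)


# Invented latencies in microseconds; NOT data from the IBM report.
config_a = [100, 110, 120, 130, 400]
config_b = [150, 160, 170, 300, 900]

if __name__ == "__main__":
    for name, data in (("config_a", config_a), ("config_b", config_b)):
        print(name, percent_at_or_below(data, 130), empirical_cdf(data)[-1])
```

With these made-up samples, 80% of `config_a` completes within 130 microseconds while none of `config_b` does, which is the kind of gap a left-shifted CDF curve represents.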

     

    Conclusion

    Docker simply solves the problem in the native environment. A hypervisor is unnecessary unless you really need an image that was built, tested, and validated for a particular hypervisor, or you believe you have special hardware that the hypervisor can handle or arbitrate better than your version of Linux. One special case is running a Windows container on Linux. Apparently Microsoft showed in 2015 how to run a Docker container on Windows; the opposite, however, seems to be a problem.