• Introduction: Big Data in RF Analysis

    Big Data in RF Analysis

    Big Data provides tools and a framework to analyze data, in particular very large amounts of data. Radio Frequency (RF) systems produce large amounts of information that, depending on how it is modeled or captured, fits many statistical models and can in general be predicted using passive filtering techniques.

    The main tools for Big Data include statistical aggregation functions, machine learning algorithms, and a wide range of software packages. Many can be purchased, and many are free but may require a certain level of software engineering. I love Python, and the main modules used in Python are:

    • Pandas
    • SciPy
    • NumPy
    • SKLearn

    and there are many more used for the analysis and post-processing of RF captures.
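As a sketch of how these modules fit together, here is a minimal, hypothetical aggregation of RF capture samples with Pandas and NumPy (the column names and values are illustrative, not a standard capture format):

```python
import numpy as np
import pandas as pd

# Hypothetical RF capture: each row is one measurement sample.
# Column names (lat, lon, rsrp_dbm) are illustrative only.
df = pd.DataFrame({
    "lat":      [40.01, 40.01, 40.02, 40.02],
    "lon":      [-105.27, -105.26, -105.27, -105.26],
    "rsrp_dbm": [-95.0, -101.5, -88.0, -110.0],
})

# Basic statistical aggregation: mean/min/max signal level per location bin.
stats = df.groupby(["lat", "lon"])["rsrp_dbm"].agg(["mean", "min", "max"])
print(stats)

# NumPy for vectorized unit conversion: dBm -> milliwatts.
df["power_mw"] = np.power(10.0, df["rsrp_dbm"] / 10.0)
```

SciPy and SKLearn then operate on arrays like these for filtering and learning tasks.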

    Drive Test and Data Simulation

    In general, drive test tools are used to capture RF data from LTE/4G and many other systems. Among the vendors we can find Spirent and many others, and with these tools we can capture RF from multiple base stations and map the measurements to Lat/Long coordinates in a particular area covered by many base stations.  Obviously a drive test cannot cover the entire area, so extrapolation and statistical models are required to complete the picture.
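One common way to fill the gaps between drive-test points is spatial interpolation. The sketch below uses SciPy's `griddata` on hypothetical measurement points (the coordinates and RSRP values are made up for illustration):

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical drive-test samples: (lat, lon) points with measured RSRP in dBm.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
rsrp   = np.array([-80.0, -90.0, -90.0, -100.0])

# Build a regular grid covering the drive-test area and interpolate the
# sparse measurements onto it (piecewise-linear interpolation).
grid_lat, grid_lon = np.mgrid[0.0:1.0:5j, 0.0:1.0:5j]
estimated = griddata(points, rsrp, (grid_lat, grid_lon), method="linear")

print(estimated[2, 2])  # center of the square: -90.0 dBm
```

Real drive-test post-processing would add statistical models on top of the raw interpolation, but the grid-filling step looks like this.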

    In a simulator such as MobileCDS, especially those based on ray tracing, the simulator uses electromagnetic models to compute the RF power received by an antenna at each point.
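The simplest electromagnetic model of this kind is the free-space (Friis) link budget for a single direct ray; a ray-tracing simulator effectively sums many such ray contributions (reflections, diffractions) per receiver point. A minimal sketch, with illustrative parameter values:

```python
import math

def friis_received_power_dbm(tx_power_dbm, tx_gain_dbi, rx_gain_dbi,
                             freq_hz, distance_m):
    """Free-space (Friis) received power for a single direct ray.

    Ray-tracing simulators extend this idea by summing many reflected
    and diffracted ray contributions at each receiver location.
    """
    c = 299_792_458.0  # speed of light, m/s
    wavelength = c / freq_hz
    # Free-space path loss in dB: 20*log10(4*pi*d / lambda)
    fspl_db = 20.0 * math.log10(4.0 * math.pi * distance_m / wavelength)
    return tx_power_dbm + tx_gain_dbi + rx_gain_dbi - fspl_db

# Example: 2.1 GHz carrier, 43 dBm transmit power, 15 dBi antenna, 500 m away.
print(round(friis_received_power_dbm(43.0, 15.0, 0.0, 2.1e9, 500.0), 1))
```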


    Big Data Processing for a Massive Simulation

    Unstructured data models are loaded from KML and other 3D sources that include polygons and buildings situated on top of a Google Earth map or any other map vendor's data.  The intersection of the propagation model with the 3D database is a computation that needs massive data processing, using MapReduce and Hadoop to handle the simulation.
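A simplified 2-D version of this model/3D-database intersection is testing whether a receiver point falls inside a building footprint polygon. The ray-casting sketch below assumes a footprint given as a vertex list (the coordinates are hypothetical):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside a 2-D polygon?

    `polygon` is a list of (x, y) vertices, e.g. a building footprint
    extracted from a KML file. This is a simplified 2-D sketch of the
    model/3D-database intersection described above.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x coordinate where the polygon edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

building = [(0, 0), (10, 0), (10, 10), (0, 10)]  # hypothetical footprint
print(point_in_polygon(5, 5, building))   # True: inside
print(point_in_polygon(15, 5, building))  # False: outside
```

A real simulation repeats tests like this, in 3D, for millions of ray/building pairs, which is why the workload is distributed.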

    Hadoop and MapReduce for RF Processing

    The data is then stored in unstructured models with RF information that include the electromagnetic field, frequency, time, delay, error, and other parameters, each mapped to a Lat/Long or (x, y, z) coordinate in the plane being modeled.  The tools are usually written in Python; the work can be parallelized across multiple Hadoop nodes, which process CSV/TXT files containing the electromagnetic data and the 3D map being rendered.
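As a sketch, one of those per-node CSV chunks might be parsed like this (the field names are illustrative, not a fixed format):

```python
import csv
import io

# Hypothetical CSV layout for one node's chunk of the simulation output;
# the field names are illustrative only.
raw = io.StringIO(
    "lat,lon,freq_mhz,e_field_dbuvm,delay_ns\n"
    "40.01,-105.27,2100,45.2,310\n"
    "40.01,-105.26,2100,38.7,450\n"
)

records = list(csv.DictReader(raw))
for row in records:
    # Each row maps one set of RF parameters to a Lat/Long point.
    print(row["lat"], row["lon"], float(row["e_field_dbuvm"]))
```

Each Hadoop worker would read its own file split and emit records like these for aggregation.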


    Hadoop on GlusterFS is our choice: we did not see that much value in HDFS, the Hadoop Distributed File System that normally handles all the files for the worker nodes, for this workload.  As you can tell, we are fans of GlusterFS, and all the Hadoop cluster nodes are connected over a high-performance 10Gb fiber network.

    Big Data models: OLTP and OLAP  Processing

    The definitions of the OLTP and OLAP data models can be found online:

    “– OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second. In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually 3NF).


    – OLAP (On-line Analytical Processing) is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems a response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. In OLAP database there is aggregated, historical data, stored in multi-dimensional schemas (usually star schema). “
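The contrast can be sketched in a few lines with SQLite from the Python standard library: many short insert transactions (OLTP-style) versus one aggregating query (OLAP-style). The table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (cell_id TEXT, rsrp_dbm REAL)")

# OLTP-style workload: many short INSERT transactions.
for cell, rsrp in [("A", -90.0), ("A", -94.0), ("B", -100.0)]:
    with conn:  # each iteration commits one small transaction
        conn.execute("INSERT INTO measurements VALUES (?, ?)", (cell, rsrp))

# OLAP-style workload: one complex, aggregating query over the data.
rows = conn.execute(
    "SELECT cell_id, AVG(rsrp_dbm), COUNT(*) FROM measurements "
    "GROUP BY cell_id ORDER BY cell_id"
).fetchall()
print(rows)  # [('A', -92.0, 2), ('B', -100.0, 1)]
```

In practice the OLAP side would run against a star-schema warehouse rather than the transactional table itself, but the access patterns differ exactly as above.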


    We have different research areas:

    • Analysis of data for handover protocols,
    • Data mining for better antenna positioning,
    • Machine learning techniques for better PCRF policies, and more.
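As a flavor of the handover-analysis area, the hypothetical sketch below flags samples where a neighbor cell exceeds the serving cell's RSRP by a hysteresis margin, loosely modeled on an LTE A3-style event (all values are illustrative):

```python
# Hysteresis margin in dB before a neighbor is considered a handover candidate.
HYSTERESIS_DB = 3.0

samples = [
    # (time_s, serving_rsrp_dbm, neighbor_rsrp_dbm) -- illustrative values
    (0.0, -90.0, -96.0),
    (1.0, -95.0, -93.0),
    (2.0, -99.0, -92.0),
]

# Flag the times where the neighbor beats the serving cell by the margin.
handover_candidates = [
    t for t, serving, neighbor in samples
    if neighbor > serving + HYSTERESIS_DB
]
print(handover_candidates)  # [2.0]
```

A real study would add time-to-trigger filtering and run this over millions of drive-test samples on the cluster.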




  • Hadoop: Tutorial and Big Data

    What’s Hadoop?

    What’s Hadoop? Hadoop is a framework and set of tools that enable the partitioning and splitting of tasks across multiple servers and nodes on a network. Hadoop then provides the framework required to MAP and REDUCE a process into multiple chunks or segments.

    Hadoop has multiple projects that include:

    Hive, HBase, Chukwa, Pig, Spark, Tez, and some others, each designed for a purpose: for instance, Hive is a data warehouse that provides data summarization and ad hoc querying, and HBase is a database that supports structured data storage for large tables.

    However, the core modules are: Common, HDFS, YARN (job scheduling and cluster management), and MapReduce.

    Source: http://hadoop.apache.org

    High-Level Architecture of Hadoop

    As shown in the figure from opensource.com, Hadoop includes a Master Node and one or more Slave Nodes.  The Master Node contains the JobTracker, which interfaces with a TaskTracker on each of the Slave Nodes.

    The MapReduce layer is the set of applications used to split the problem at hand across several Slave Nodes. Each Slave Node processes a piece of the problem and, once complete, its output is handed over from the “Mapping” phase to the “Reducing” phase.


    High Level Architecture of Hadoop

    MapReduce workflow

    The figure below shows the MapReduce logic:

    • On the left side, the Big Data input is a set of files or one huge file, such as a huge log file or a database.
    • HDFS, the “Hadoop Distributed Filesystem,” is used to copy the data, split it across all the nodes of the cluster, and later merge the partial results.
    • The generated output is then copied over to a destination node.
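The three steps above can be illustrated in a single process with plain Python, standing in for what Hadoop does across many nodes:

```python
from collections import defaultdict

# A toy, single-process illustration of the workflow above:
# map each input line, shuffle (group by key), then reduce each group.
lines = ["big data rf", "rf analysis", "big rf"]

# Map: emit (word, 1) pairs, as a Hadoop mapper would per input split.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key, as HDFS/the framework does across nodes.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each key's values into the final output.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["rf"])  # 3
```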

    MapReduce Workflow

    Example of MapReduce

    For example, let’s say we need to count the number of words in a file, and we will assign a line to each server in the Hadoop cluster; we can run the following code.  MRWordCounter does the job of mapping each line to word counts and reducing them into totals.

    from mrjob.job import MRJob

    class MRWordCounter(MRJob):
        def mapper(self, key, line):
            # Emit (word, 1) for every word in the line.
            for word in line.split():
                yield word, 1

        def reducer(self, word, occurrences):
            # Sum all the 1s emitted for this word across the mappers.
            yield word, sum(occurrences)

    if __name__ == '__main__':
        MRWordCounter.run()

    Using: mrjob

    A music example, from the Music Machinery blog [2], can be found here:

    from mrjob.job import MRJob

    class MRDensity(MRJob):
        """A map-reduce job that calculates the density of each track."""
        def mapper(self, _, line):
            """The mapper loads a track and yields its density."""
            # `track` is a helper module from the example in [2].
            t = track.load_track(line)
            if t:
                if t['tempo'] > 0:
                    density = len(t['segments']) / t['duration']
                    yield (t['artist_name'], t['title'], t['song_id']), density

    As shown here, the mapper will process a line of the file and use the “track.load_track()” function to obtain the “tempo”, the number of “segments”, and the additional metadata needed to compute a density value.

    In this particular case there is no need for a reduce step; the work is simply split across all the Hadoop nodes.

    Server Components

    As shown in the figure below from Cloudera, Hadoop uses HDFS as the lowest-layer filesystem; HBase sits on top of HDFS and can be used by MapReduce, and on top of MapReduce we have Pig, Hive, Sqoop, and many other systems, including an RDBMS connected through Sqoop or BI reporting on Hive.

    Hadoop ecosystem


    Download Hadoop

    If you want to download Hadoop, do so at https://hadoop.apache.org/releases.html


    [1] Hadoop Tutorial 1-3
    [2] http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
    [3] http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
    [4] https://hadoop.apache.org/docs/r1.2.1/cluster_setup.html#MapReduce