
Tuesday 21 October 2014

XML Data Processing Using Talend Open Studio

For my Netflix Search Engine project, Flicksery, I get data from Netflix as XML files. The catalog file for instant titles is around 600 MB. Following is a sample XML entry for one title/movie.
As you can see, it is not very easy to read, both because of the formatting and because of the sheer number of elements. Maybe a better text editor or an XML editor would let us view it properly, but it would still be difficult to play around with the data and do any kind of transformation or analysis. Talend Open Studio is a tool that can be really useful for analyzing the data embedded within such a large XML file.
In this post we will try and analyze the XML file using some really neat features available in Talend Open Studio.
Let’s get started:
Open up Talend Open Studio and create a new project – NetflixDataAnalysis:
Right click on Job Designs and select Create job – InstantFeedTest
Right click on File xml and select Create file xml:
This brings up a wizard. Enter the details as shown below and click Next:
In step 2 of the wizard, select Input XML and click Next.
In step 3 of the wizard, select the XML file. For this test I took only part of the XML file, as loading the entire 600 MB file would cause Java heap issues and prevent it from loading correctly. Since we just want to see the different fields available in the XML, a sample is sufficient. Once the file is selected, you should see the schema of the XML file in the Schema Viewer.
Step 4 is where you start to see the real power of Talend Open Studio. The Source Schema tree on the left displays the schema of the XML file. The Target Schema section lets you define an output schema for the XML using XPath. Drag the element that repeats itself in the XML into the Xpath loop expression section; in this case, catalog_title is the element that wraps all the information for a single movie/title.
Next, traverse the nodes on the left and drag the required elements to the right, under the Fields to extract section. You can also provide custom column names under the Column name section. Once you are done dragging all the required fields, click Refresh Preview to see a preview of the data; this gives you a quick idea of how the data will be parsed. Click Finish.
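For reference, the mapping used in this example looks roughly like the following; the element names other than catalog_title are assumptions based on the Netflix feed and the output record shown further down, so check them against your own copy of the file:

Xpath loop expression:  /catalog_titles/catalog_title
Fields to extract:      id, title, box_art, synopsis, ... (relative XPath under catalog_title)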
Double click on the job InstantFeedTest to open it in the workspace. Drag in the newly created XML metadata, NetflixInstantFeed. Also drag a tFileOutputDelimited component from the Palette on the right; it will show up in the job as tFileOutputDelimited_1.
Right click on the NetflixInstantFeed node, select Row->Main and join it to the tFileOutputDelimited_1 node. Once joined, it should look like the image below:
Select the tFileOutputDelimited_1 node and go to the “Component” tab at the bottom of the workspace. Set the Field Separator to “,” and set the File Name to the desired path and file name.
We are now ready to test our job. Click the Run icon on the toolbar to execute it. Once it has run, you should see the processing status as shown below. The job reads the XML file, extracts the fields and generates a comma separated text file with the extracted data:
http://api-public.netflix.com/catalog/titles/movies/780726,The Mummy,http://cdn0.nflximg.net/images/0270/2990270.jpg,When British archaeologists uncover the ancient sarcophagus of a mummified Egyptian priest (Boris Karloff), they foolishly ignore its warning not to open the box. Now brought back to life, the mummy tries to resurrect the soul of his long-dead love.,,Top 100 Thrills nominee,,,1346482800,4102444800,NR,MPAA,4388,1.77:1,1932,Classic Movies,3.5,1387701023800,,,
As you can see, the big XML node is now much more readable as a simple comma separated record. This was a simple one-to-one mapping from XML to CSV, but Talend Open Studio is far more powerful than that: you can add new components to the job to apply transformations to the data coming in from the XML.
As you can see in the record above, the first column/value is a link. All I am interested in is the last 6 digits of that link, so I want my final output to contain only those 6 digits and not the entire link. To do this, delete the connection between the NetflixInstantFeed and tFileOutputDelimited_1 nodes. Next, drag a tMap component from the Palette to the job workspace. Right click on NetflixInstantFeed, select Row->Main and join it to tMap_1. Then right click on tMap_1, select Row->New Output (Main) and join it to the tFileOutputDelimited_1 node. You will be prompted to enter a name; the name used in this example is processed_op. Once done, the job should look as shown below:
Select the tMap_1 component and click the Map Editor button on the Component tab at the bottom of the workspace. The Map Editor opens with the metadata from the XML file on the left and the desired output on the right. As you can see below, I have dragged all the columns from the left to the right. The only modification is to the “id” column, where I have applied a function to keep only the last 6 digits.
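For reference, the expression applied to the id column in the Map Editor is ordinary Java. A minimal sketch, assuming the input row is named row1 and the column id (the post itself uses one of Talend’s built-in StringHandling routines, which achieves the same result):

row1.id.substring(row1.id.length() - 6)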
As you can see, we can easily apply functions to transform the input data. The function used here is one of Talend’s StringHandling functions; there are several other functions that can be applied using the Expression Builder, as shown below:
After you are done applying your function, click OK to close the screen. Now, you can re-run the job to see if the transformation has been applied correctly. After a successful run you should see the results of the job as shown below:
Let us look at the output file to see the effect of the transformation:
780726,The Mummy,http://cdn0.nflximg.net/images/0270/2990270.jpg,When British archaeologists uncover the ancient sarcophagus of a mummified Egyptian priest (Boris Karloff), they foolishly ignore its warning not to open the box. Now brought back to life, the mummy tries to resurrect the soul of his long-dead love.,,Top 100 Thrills nominee,,,1346482800,4102444800,NR,MPAA,4388,1.77:1,1932,Classic Movies,3.5,1387701023800,,,
We have successfully built a job that transforms an XML file into a comma separated file. This job can be exported and run as a standalone job in any environment running Java. We chose to output the data to a CSV file here, but Talend Open Studio can read from many data formats and databases, and can likewise write to different file formats or insert directly into databases.
This was just a quick introduction (the tip of the iceberg) to the usefulness of Talend Open Studio for data processing. The features of this tool are vast and cannot all be covered in a single blog post, but this should get you started with Talend. I hope you found this fast-paced tutorial useful.

Monday 20 October 2014

Apache Web Log Analysis using PIG


Enter the Pig shell in local mode using 'pig -x local'.

Load the log file into Pig using the LOAD command.

grunt>raw_logs = LOAD '/home/hadoop/work/input/apacheLog.log' 
           USING TextLoader AS (line:chararray);

Parse the log file and assign each field to a variable.

logs_base = FOREACH raw_logs GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"') )
AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray,  time: chararray, 
request: chararray, status: int, bytes_string: chararray, referrer: chararray, browser: chararray);
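
For reference, the regular expression above targets Apache's combined log format. The canonical example line from the Apache documentation looks like this:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

It maps, field by field, onto remoteAddr, remoteLogname, user, time, request, status, bytes_string, referrer and browser.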

We need only the time (time), IP address (remoteAddr) and user (remoteLogname), so we extract these three fields for each record into a new relation.

logs =  FOREACH logs_base GENERATE remoteAddr,remoteLogname, time;

Now we need to find the number of hits and the number of unique users for each time value.
We can achieve this in Pig by grouping all the records by some field or combination of fields;
in our case, that is the time field.

group_time = GROUP logs BY (time);

Within each group we need the number of hits and the number of unique users. To get the number
of hits, we simply COUNT the records (IP addresses) in the group; for the unique users we apply
DISTINCT before counting.

Putting it all together, we can find the number of hits and the number of unique users for each
time value using this statement (in our case the unique-user count will always be 1, because the user name in this log is just '-').

X = FOREACH group_time { 
            unique_users = DISTINCT logs.remoteLogname;
            GENERATE FLATTEN(group), COUNT(unique_users) AS UniqueUsers,
            COUNT(logs) as counts;
       }


(Results are in the form of Time, Unique Users, No. of Hits)


Sentiment Analysis using the SentiWordNet Dictionary


This is a sample program that calculates the semantic orientation of each word using the SentiWordNet dictionary.
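
The parser below expects the SentiWordNet 3.0 file layout, in which every non-comment line is tab-separated into POS, synset ID, PosScore, NegScore, SynsetTerms and Gloss. An illustrative line (the values are made up for this example) looks like:

a    00123456    0.25    0.125    brilliant#2 superb#1    of surpassing excellence

The per-synset score computed below is PosScore minus NegScore (0.25 - 0.125 = 0.125 here), and the number after '#' in each term is the sense rank, which the code later uses to weight the senses.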


package com.orienit.hadoop.training;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
import java.util.Vector;

    public class SentimentalWordNet {
        private String pathToSWN = "/home/hadoop/work/SentiWordNet_3.0.0.txt"; // path to the SentiWordNet dictionary file
        private HashMap<String, Double> _dict; // maps "word#pos" to its weighted sentiment score

        public SentimentalWordNet(){

            _dict = new HashMap<String, Double>();
            HashMap<String, Vector<Double>> _temp = new HashMap<String, Vector<Double>>(); // "word#pos" -> scores of its senses, indexed by sense rank
            try{
                BufferedReader csv =  new BufferedReader(new FileReader(pathToSWN));
                String line = "";
                while((line = csv.readLine()) != null)
                {
                    // SentiWordNet line format: POS, synset ID, PosScore, NegScore, SynsetTerms, Gloss (tab separated)
                    String[] data = line.split("\t");
                    // net sentiment of this synset: positive score minus negative score
                    Double score = Double.parseDouble(data[2])-Double.parseDouble(data[3]);
                    // the synset terms look like "word#senseNumber", separated by spaces
                    String[] words = data[4].split(" ");
                    for(String w:words)
                    {
                        String[] w_n = w.split("#");
                        w_n[0] += "#"+data[0]; // key becomes "word#pos"
                        int index = Integer.parseInt(w_n[1])-1; // 1-based sense number -> 0-based index
                        if(_temp.containsKey(w_n[0]))
                        {
                            Vector<Double> v = _temp.get(w_n[0]);
                            if(index>v.size())
                                for(int i = v.size();i<index; i++)
                                    v.add(0.0);
                            v.add(index, score);
                            _temp.put(w_n[0], v);
                        }
                        else
                        {
                            Vector<Double> v = new Vector<Double>();
                            for(int i = 0;i<index; i++)
                                v.add(0.0);
                            v.add(index, score);
                            _temp.put(w_n[0], v);
                        }
                    }
                }
                Set<String> temp = _temp.keySet();
                for (Iterator<String> iterator = temp.iterator(); iterator.hasNext();) {
                    String word = (String) iterator.next();
                    Vector<Double> v = _temp.get(word);
                    // weighted average of the sense scores: sense k is weighted by 1/k
                    double score = 0.0;
                    double sum = 0.0;
                    for(int i = 0; i < v.size(); i++)
                        score += ((double)1/(double)(i+1))*v.get(i);
                    for(int i = 1; i<=v.size(); i++)
                        sum += (double)1/(double)i;
                    score /= sum;
                    // map the averaged score to a sentiment label (informational only;
                    // the numeric score is what actually gets stored in the dictionary)
                    String sent = "";
                    if(score >= 0.75)
                        sent = "strong_positive";
                    else if(score > 0.25 && score <= 0.5)
                        sent = "positive";
                    else if(score > 0 && score <= 0.25)
                        sent = "weak_positive";
                    else if(score < 0 && score >= -0.25)
                        sent = "weak_negative";
                    else if(score < -0.25 && score >= -0.5)
                        sent = "negative";
                    else if(score <= -0.75)
                        sent = "strong_negative";
                    _dict.put(word, score);
                }
            }
            catch(Exception e){
                // ignore problems reading or parsing the dictionary; uncomment to debug
                //e.printStackTrace();
            }

        }

public Double extract(String word)
{
    // sum the word's scores across all parts of speech:
    // #n = noun, #a = adjective, #r = adverb, #v = verb
    Double total = new Double(0);
    boolean found = false;
    if(_dict.get(word+"#n") != null) { total += _dict.get(word+"#n"); found = true; }
    if(_dict.get(word+"#a") != null) { total += _dict.get(word+"#a"); found = true; }
    if(_dict.get(word+"#r") != null) { total += _dict.get(word+"#r"); found = true; }
    if(_dict.get(word+"#v") != null) { total += _dict.get(word+"#v"); found = true; }
    // return null for words that are not in the dictionary so callers can skip them
    return found ? total : null;
}

public static void main(String[] args) {
    SentimentalWordNet test = new SentimentalWordNet();
    String sentence="hey i had a wonderful an experience in barista";
    String[] words = sentence.split("\\s+");
    double totalScore = 0;
    for(String word : words) {
        word = word.replaceAll("([^a-zA-Z\\s])", ""); // strip punctuation
        Double wordScore = test.extract(word);
        if (wordScore == null) // word not found in the dictionary
            continue;
        totalScore += wordScore;
    }
    if(totalScore == 0)
    {
      System.out.println("Neutral Statement :" + totalScore);
    } else if(totalScore > 0) {
      System.out.println("Postive Statement :" + totalScore);
    } else {
      System.out.println("Negative Statement :" + totalScore);
    }
}

}
 
OUTPUT:-
Positive Statement :0.34798245187641463


Note: the SentiWordNet dictionary can be downloaded from the link below. Remove all the comment/hash-tag lines from the beginning and end of the file before using it.
http://sentiwordnet.isti.cnr.it/download.php

Friday 17 October 2014

Unlocking Insight – How to Extract User Experience by Complementing Splunk

Splunk is a great Operational Intelligence solution capable of processing, searching and analyzing masses of machine-generated data from a multitude of disparate sources. By complementing it with an APM solution you can deliver insights that provide value beyond the traditional log analytics Splunk is built upon:
True Operational Intelligence with dynaTrace and Splunk for Application Monitoring

Operational Intelligence: Let Your Data Drive Your Business

In a nutshell, the purpose behind Operational Intelligence is the ability to make well-informed decisions quickly based on insights gained from business activity data, with data sources ranging anywhere from applications to the infrastructure to social media platforms.
Analyze data from a multitude of disparate data sources with Splunk (courtesy of splunk.com)
Splunk’s capability to process and fuse large volumes of continuously streamed and discretely pushed event data with masses of historical data, reliably and with low latency, helps businesses continuously improve their processes, detect anomalies and deficiencies, and discover new opportunities.
Many industries are realizing the insight hidden in their log data. Financial services companies, for example, use Splunk to dashboard analytics on infrastructure log files over long periods of time; understanding these trends allows smarter decisions to be made. Such analysis has critical impact when the applications running on this infrastructure transmit billions of dollars a day.
Financial services companies are not the only ones taking advantage of this level of log analysis. SaaS companies are using Splunk to analyze log data from many siloed apps hosted for their customers, all with separate system profiles. Splunk allows them to set up custom views with insights and alerts on all their separate application infrastructures.
Why complement Splunk with dynaTrace?
“So, with a solution like Splunk, gaining insights from all our data will be a snap, right?” Unfortunately not. What if I told you that you are essentially building your insights on masses of machine-generated log data? Let’s discuss why this matters.
Machine-generated data in the “Big Data Pyramid” (courtesy of hadoopilluminated.com)
In Big Data parlance, machine-generated data, as opposed to human-generated data, is data generated by a computer process without human intervention, and it typically appears in large quantities. It originates from sources such as applications, application servers, web servers and firewalls, and thus often takes the form of traditional log data. However, unstructured log data is not exactly convenient for driving an analytics solution, because it requires you to:

1. Tell your Solution What Matters to You

Because log data is essentially unstructured, you cannot easily access the various bits and pieces of information encoded into a log message. You will need to teach your analytics solution the patterns by which any valuable information can be identified for later search and analyses:
Identify bits and pieces of valuable information inside log messages

2. Reconsider your Application Logging Strategy

While there is not much you can do about how your firewall logs data, you will need to put a lot of effort into designing and maintaining a thorough logging strategy that provides all the information you want monitored for your application. However, you may want to consider whether this effort is really worthwhile, for a variety of reasons:
  • Semantic Logging is, undoubtedly, a useful concept around writing log messages specifically for gathering analytics data that also emphasizes structure and readability. However, it can help to improve your logging only where you own the code, and thus leaves out code from any third-party libraries.
  • Operational Intelligence solutions rely on you to provide context for your log messages, as outlined in Splunk’s Logging Best Practices. Only then will you be able to correlate the events of a particular user transaction and understand the paths your users take through your application. Again, context cannot be retained easily once you leave your own code (see the sketch after this list).
  • Efforts to establish and maintain a robust logging strategy that delivers must be aligned with ongoing development activities. You would also need to make sure that what your strategy provides is kept in sync with the expectations of your Operational Intelligence solution. If in doubt (and you should be), you will want to enforce automated testing of your strategy to verify your assumptions.
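To make this effort concrete, here is a minimal sketch of what a single context-rich, machine-parseable log statement looks like when written by hand; the class, field names and values are hypothetical and are not taken from Splunk or dynaTrace:

import java.util.logging.Logger;

public class CheckoutService {
    private static final Logger LOG = Logger.getLogger(CheckoutService.class.getName());

    public void completeOrder(String userId, String transactionId, long durationMs) {
        // key=value pairs give the analytics tool structure to extract and carry the
        // transaction context needed to correlate the events of one user transaction
        LOG.info("event=order_completed user=" + userId
                + " txn=" + transactionId
                + " duration_ms=" + durationMs);
    }
}

Every statement like this has to be written by hand, kept in sync with what your Splunk searches expect, and repeated wherever a transaction touches code you own, which is exactly the maintenance burden described above.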
What would this mean for you? Establishing and maintaining an application logging strategy that delivers actionable insights to your analytics solution involves a lot of disciplined work from everyone involved:
  • Developers: need to maintain a logging strategy whose messages are scattered all over their code base: whenever functionality is added or changed, several parts of the application need to be augmented. This makes developing a thorough logging strategy a poorly maintainable, time-consuming and thus error-prone cross-cutting concern.
  • Test Automation Engineers: should enforce automated testing to assert that the assumptions of the Operational Intelligence solution on the setup of the input log data hold.
  • Product Owners and Software Architects: need to cope with a decrease in velocity when they buy into developing and maintaining a thorough logging strategy. They also need to accept that the visibility into user transactions ends where the ownership of their code ends.
  • Operations: continuously need to test and verify the correct functionality of the overall solution.
Why am I telling you all this? Because we have a lot of customers who were already using Splunk before they implemented dynaTrace. They had a really hard time correlating log messages due to the lack of context and were unable to answer one of the most important questions: “how many users were affected by this particular application failure?” We were able to address their concerns by delivering these features out-of-the-box:
  • They could keep their talent focused on critical development, testing and operations, since there is no need to change code and no extra logging, testing or verification is involved.
  • They could quickly get to the root cause of performance issues because they had full end-to-end context for all user interactions, including any third-party code, which gives full transaction visibility: method arguments, HTTP headers, HTTP query string parameters, etc.
  • They had analytics customized to their critical focus areas because they could decide which data needed to be captured.

Easy Steps to True Operational Intelligence with Splunk and dynaTrace

  1. Get and install Splunk
  2. Get and install the 15 Days Free Trial of dynaTrace
  3. Get and install the Compuware APM dynaTrace for Splunk App
  4. Enable the Real-Time Business Transactions Feed in dynaTrace:
    Enable the Real-Time Business Transaction Feed in dynaTrace
  5. Selectively export Business Transactions data to Splunk in dynaTrace:
    Configure a particular Business Transaction to export data
That’s it. You may refer to the documentation of our dynaTrace for Splunk App for additional information. Here is a selection of insights you could already get today:

Dashboard #1: Top Conversions by Country, Top Landing- and Exit Pages

Top Conversions by Country, Top Landing- and Exit Pages

Dashboard #2: Visits Across the Globe

Visits across the globe

Dashboard #3: KPIs

KPIs: Conversion Rates, Bounce Rates, Average Visit Duration, etc.

Dashboard #4: Transaction Timeline and Details

Transactions timeline and details
However, there is more to it: should you feel the need to drill down deeper on a particular transaction to understand the root cause of an issue or precisely who was affected, you can fire up the PurePath in dynaTrace from within Splunk:
Drill down to dynaTrace from raw transactions data in Splunk
…and see deep analysis happen:
Deeply analyzing a transaction in dynaTrace

Conclusion

The road to true Operational Intelligence can be a tough one, but it does not need to be that way. By integrating dynaTrace with Splunk you do not have to rely on application logging or make any code changes, so nothing slows you down. Instead, it helps accelerate your business by providing true visibility into your applications, regardless of whether the machine or the code is yours. This level of end-user visibility enables you to communicate in terms of what matters most to your organization: customer experience.
Should you want to know more about the inherent limitations of logging, you might want to refer to one of my recent articles “Software Quality Metrics for your Continuous Delivery Pipeline – Part III – Logging”.

Alternatives to Paid Log Analytics Tools

The following open source log analytics tools are available.

1. Scribe - Real time log aggregation used in Facebook
Scribe is a server for aggregating log data that's streamed in real time from clients. It is designed to be scalable and reliable. It is developed and maintained by Facebook. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.
https://github.com/fa...­
 

2. Logstash - Centralized log storage, indexing, and searching
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. Logstash comes with a web interface for searching and drilling into all of your logs.
http://logstash.net/...­

3. Octopussy - Perl/XML Logs Analyzer, Alerter & Reporter
Octopussy is a log analyzer tool. It analyzes logs, generates reports and alerts the admin. It has LDAP support to maintain a user list, exports reports by email, FTP and SCP, can generate scheduled reports, and uses RRDtool to generate graphs.
http://sourceforge.ne...­

4. Awstats - Advanced web, streaming, ftp and mail server statistics
AWStats is a powerful tool that graphically generates advanced web, streaming, FTP or mail server statistics. It can analyze log files from all major server tools such as Apache, WebStar, IIS and many other web, proxy, WAP, streaming, mail and some FTP servers. This log analyzer works as a CGI or from the command line and shows you all the information your log contains in a few graphical web pages.
http://awstats.source...­

5. nxlog - Multi platform Log management
nxlog is a modular, multi-threaded, high-performance log management solution with multi-platform support. In concept it is similar to syslog-ng or rsyslog but is not limited to unix/syslog only. It can collect logs from files in various formats, receive logs from the network remotely over UDP, TCP or TLS/SSL, and supports platform-specific sources such as the Windows EventLog, Linux kernel logs, Android logs, local syslog, etc.
http://nxlog.org/...­

6. Graylog2 - Open Source Log Management
Graylog2 is an open source log management solution that stores your logs in ElasticSearch. It consists of a server written in Java that accepts your syslog messages via TCP, UDP or AMQP and stores them in the database. The second part is a web interface that allows you to manage the log messages from your web browser. Take a look at the screenshots or the latest release info page to get a feel for what you can do with Graylog2.
http://graylog2.org/...­

7. Fluentd - Data collector, Log Everything in JSON
Fluentd is an event collector system. It is a generalized version of syslogd, which handles JSON objects for its log messages. It collects logs from various data sources and writes them to files, databases or other types of storage.
http://fluentd.org/...­

8. Meniscus - The Python Event Logging Service
Meniscus is a Python-based system for event collection, transit and processing in the large. Its primary use case is large-scale cloud logging, but it can be used in many other scenarios, including usage reporting and API tracing. Its components include Collection, Transport, Storage, Event Processing & Enhancement, Complex Event Processing and Analytics.
https://github.com/Pr...­

9. lucene-log4j - Log4j file rolling appender which indexes log with Lucene
lucene-log4j solves a recurrent problem that production support teams face whenever a live incident happens: filtering production log statements to match a session/transaction/user ID. It works by extending Log4j's RollingFileAppender with Lucene indexing routines. Then, with a LuceneLogSearchServlet, you get access to your log through a web front end.
https://code.google.c...­

10. Chainsaw - log viewer and analysis tool
Chainsaw is a companion application to Log4j written by members of the Log4j development community. Chainsaw can read log files formatted in Log4j's XMLLayout, receive events from remote locations, read events from a DB, and it can even work with JDK 1.4 logging events.
http://logging.apache...­

11. Logsandra - log management using Cassandra
Logsandra is a log management application written in Python that uses Cassandra as its back-end. It was written as a demo for Cassandra, but it is worth a look. It also lets you create your own parsers.
https://github.com/jb...­

12. Clarity - Web interface for the grep
Clarity is a Splunk-like web interface for your server log files. It supports searching (using grep) as well as tailing log files in real time. It is built on the event-based architecture of EventMachine and so allows real-time search of very large log files.
https://github.com/to...­

13. Webalizer - fast web server log file analysis
The Webalizer is a fast web server log file analysis program. It produces highly detailed, easily configurable usage reports in HTML format, for viewing with a standard web browser. It handles standard Common logfile format (CLF) server logs, several variations of the NCSA Combined logfile format, wu-ftpd/proftpd xferlog (FTP) format logs, Squid proxy server native format, and W3C Extended log formats.
http://www.webalizer....­

14. Zenoss - Open Source IT Management
Zenoss Core is an open source IT monitoring product that delivers the functionality to effectively manage the configuration, health and performance of networks, servers and applications through a single, integrated software package.
http://sourceforge.ne...­

15. OtrosLogViewer - Log parser and Viewer
OtrosLogViewer can read log files formatted in Log4j (pattern and XMLLayout) and java.util.logging. The source of events can be a local or remote file (FTP, SFTP, Samba, HTTP) or sockets. It has many powerful features like filtering, marking, formatting, adding notes, etc. It can also format SOAP messages in logs.
https://code.google.c...­

16. Kafka - A high-throughput distributed messaging system
Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements, and this kind of ad hoc solution is a viable way of providing logging data to Hadoop.
https://kafka.apache....­

17. Kibana - Web Interface for Logstash and ElasticSearch
Kibana is a highly scalable interface for Logstash and ElasticSearch that allows you to efficiently search, graph, analyze and otherwise make sense of a mountain of logs. Kibana will load balance against your Elasticsearch cluster. Logstash's daily rolling indices let you scale to huge datasets, while Kibana's sequential querying gets you the most relevant data quickly, with more as it becomes available.
https://github.com/ra...­

18. Pylogdb - A Python-powered, column-oriented database
Pylogdb is a Python-powered, column-oriented database suitable for web log analysis.
http://code.ohloh.net...­

19. Epylog - a Syslog parser
Epylog is a syslog parser which runs periodically, looks at your logs, processes some of the entries in order to present them in a more comprehensible format, and then mails you the output. It is written specifically for large network clusters where a lot of machines (around 50 and upwards) log to the same loghost using syslog or syslog-ng.
https://fedorahosted....­

20. Indihiang - IIS and Apache log analyzing tool
Indihiang is a web log analysis tool. It analyzes IIS and Apache web logs and generates real-time reports. It includes a web log viewer and analyzer and is capable of analyzing trends from the logs. The tool also integrates with Windows Explorer, so you can open a log file in Indihiang via the context menu.
http://help.eazyworks...­


