Friday 17 October 2014

Alternatives to Paid Log Analytics Tools

The following open source log analytics tools are available.

1. Scribe - Real-time log aggregation used at Facebook
Scribe is a server for aggregating log data that's streamed in real time from clients. Developed and maintained by Facebook, it is designed to be scalable and reliable, to handle a very large number of nodes, and to be robust to network and node failures. A Scribe server runs on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups.
https://github.com/fa...
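Clients write to the local Scribe server over Thrift. As a minimal sketch, logging one message from Python might look like the following, assuming a scribe module generated from Scribe's Thrift definitions and the conventional port 1463 (both assumptions; adjust for your deployment):

from thrift.protocol import TBinaryProtocol
from thrift.transport import TSocket, TTransport
from scribe import scribe  # assumed: Thrift-generated Scribe bindings

# Scribe speaks framed binary Thrift; 1463 is the conventional port.
sock = TSocket.TSocket(host="localhost", port=1463)
transport = TTransport.TFramedTransport(sock)
protocol = TBinaryProtocol.TBinaryProtocol(trans=transport,
                                           strictRead=False, strictWrite=False)
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
entry = scribe.LogEntry(category="app_events", message="user signed in\n")
result = client.Log(messages=[entry])  # scribe.ResultCode.OK on success
transport.close()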
 

2. Logstash - Centralized log storage, indexing, and searching
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. Logstash comes with a web interface for searching and drilling into all of your logs.
http://logstash.net/...
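As a hedged sketch of the collection side, the snippet below ships one JSON event to Logstash over TCP from Python, assuming a hypothetical tcp input with the json codec listening on port 5000:

import json
import socket
import time

# One log event as a JSON object; field names are illustrative.
event = {
    "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "host": socket.gethostname(),
    "level": "WARN",
    "message": "user login failed",
}

# Newline-delimited JSON over TCP to the assumed Logstash input.
sock = socket.create_connection(("localhost", 5000))
try:
    sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
finally:
    sock.close()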

3. Octopussy - Perl/XML log analyzer, alerter & reporter
Octopussy is a log analysis tool that parses logs, generates reports, and alerts the administrator. It has LDAP support for maintaining the user list, exports reports by email, FTP, and SCP, supports scheduled reports, and uses RRDtool to generate graphs.
http://sourceforge.ne...

4. AWStats - Advanced web, streaming, FTP and mail server statistics
AWStats is a powerful tool that graphically generates advanced web, streaming, FTP or mail server statistics. It can analyze log files from all major server tools, including Apache log files, WebStar, IIS, and many other web, proxy, WAP, streaming and mail servers, as well as some FTP servers. This log analyzer works as a CGI or from the command line and shows you all the information your logs contain in a few graphical web pages.
http://awstats.source...

5. nxlog - Multi-platform log management
nxlog is a modular, multi-threaded, high-performance log management solution with multi-platform support. In concept it is similar to syslog-ng or rsyslog, but it is not limited to Unix and syslog only. It can collect logs from files in various formats and receive logs remotely over the network via UDP, TCP or TLS/SSL. It supports platform-specific sources such as the Windows EventLog, Linux kernel logs, Android logs, local syslog, etc.
http://nxlog.org/...
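For example, here is a minimal Python sketch that sends one BSD-syslog (RFC 3164) style datagram to a collector such as nxlog, assuming a UDP input listening on the standard port 514:

import socket
import time

# PRI = facility * 8 + severity (user-level, informational here).
pri = 1 * 8 + 6
timestamp = time.strftime("%b %d %H:%M:%S")
message = "<%d>%s %s myapp: service started" % (pri, timestamp,
                                                socket.gethostname())

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(message.encode("utf-8"), ("localhost", 514))
sock.close()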

6. Graylog2 - Open Source Log Management
Graylog2 is an open source log management solution that stores your logs in ElasticSearch. It consists of a server written in Java that accepts your syslog messages via TCP, UDP or AMQP and stores them in the database, and a web interface that lets you manage the log messages from your browser.
http://graylog2.org/...
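Graylog2 also defines its own GELF message format. The sketch below sends one GELF message over UDP from Python, assuming a GELF input on the default port 12201; the custom field is illustrative:

import json
import socket
import time
import zlib

# A GELF message is a JSON object; UDP payloads may be zlib-compressed.
gelf = {
    "version": "1.1",
    "host": socket.gethostname(),
    "short_message": "disk usage above 80%",
    "timestamp": time.time(),
    "level": 4,             # syslog severity: warning
    "_service": "billing",  # custom fields are prefixed with "_"
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(zlib.compress(json.dumps(gelf).encode("utf-8")),
            ("localhost", 12201))
sock.close()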

7. Fluentd - Data collector, Log Everything in JSON
Fluentd is an event collector system. It is a generalized version of syslogd that handles JSON objects as its log messages. It collects logs from various data sources and writes them to files, databases or other types of storage.
http://fluentd.org/...
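A minimal sketch using the fluent-logger Python package (pip install fluent-logger), assuming a Fluentd forward input listening on the default port 24224:

from fluent import sender, event

# Events are tagged "app.<label>" and carried as JSON objects.
sender.setup("app", host="localhost", port=24224)
event.Event("login", {"user": "alice", "status": "ok"})
event.Event("purchase", {"user": "alice", "amount": 42})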

8. Meniscus - The Python Event Logging Service
Meniscus is a Python-based system for event collection, transit and processing at scale. Its primary use case is large-scale cloud logging, but it can be used in many other scenarios, including usage reporting and API tracing. Its components cover collection, transport, storage, event processing and enhancement, complex event processing, and analytics.
https://github.com/Pr...

9. lucene-log4j - Log4j file rolling appender which indexes log with Lucene
lucene-log4j solves a recurring problem that production support teams face whenever a live incident happens: filtering production log statements to match a session/transaction/user ID. It works by extending Log4j's RollingFileAppender with Lucene indexing routines. Then, with a LuceneLogSearchServlet, you get access to your logs through a web front end.
https://code.google.c...

10. Chainsaw - log viewer and analysis tool
Chainsaw is a companion application to Log4j written by members of the Log4j development community. Chainsaw can read log files formatted in Log4j's XMLLayout, receive events from remote locations, read events from a DB, and even work with JDK 1.4 logging events.
http://logging.apache...

11. Logsandra - log management using Cassandra
Logsandra is a log management application written in Python that uses Cassandra as its back end. It was written as a demo for Cassandra, but it is worth a look. It also lets you create your own parsers.
https://github.com/jb...

12. Clarity - Web interface for grep
Clarity is a Splunk-like web interface for your server log files. It supports searching (using grep) as well as tailing log files in real time. It is built on EventMachine's event-driven architecture, which allows real-time search of very large log files.
https://github.com/to...

13. Webalizer - fast web server log file analysis
The Webalizer is a fast web server log file analysis program. It produces highly detailed, easily configurable usage reports in HTML format for viewing with a standard web browser. It handles standard Common Logfile Format (CLF) server logs, several variations of the NCSA Combined logfile format, wu-ftpd/proftpd xferlog (FTP) format logs, Squid proxy server native format, and W3C Extended log formats.
http://www.webalizer....
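To make the formats concrete, here is a minimal Python sketch that parses one Common Logfile Format (CLF) line, the baseline format these analyzers consume:

import re

# CLF fields: host, identd, user, [timestamp], "request", status, size
CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
match = CLF.match(line)
if match:
    print("%(host)s %(status)s %(request)s" % match.groupdict())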

14. Zenoss - Open Source IT Management
Zenoss Core is an open source IT monitoring product that delivers the functionality to effectively manage the configuration, health, and performance of networks, servers and applications through a single, integrated software package.
http://sourceforge.ne...

15. OtrosLogViewer - Log parser and Viewer
OtrosLogViewer can read log files formatted in Log4j (pattern and XMLLayout) and java.util.logging. Events can come from local or remote files (FTP, SFTP, Samba, HTTP) or from sockets. It has many powerful features like filtering, marking, formatting, adding notes, etc. It can also format SOAP messages found in logs.
https://code.google.c...

16. Kafka - A high-throughput distributed messaging system
Kafka provides a publish-subscribe solution that can handle all the activity stream data and processing of a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features of the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements; such ad hoc solutions are a viable way of providing logging data to an offline system like Hadoop.
https://kafka.apache....
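As a minimal sketch of the publish-subscribe model, the snippet below uses the kafka-python package, assuming a broker on localhost:9092 and an illustrative topic named activity-logs:

from kafka import KafkaConsumer, KafkaProducer

# Publish a couple of activity events to the assumed topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("activity-logs", b'{"event": "page_view", "path": "/home"}')
producer.send("activity-logs", b'{"event": "search", "query": "logs"}')
producer.flush()

# Read them back from the beginning of the topic.
consumer = KafkaConsumer("activity-logs",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.value)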

17. Kibana - Web Interface for Logstash and ElasticSearch
Kibana is a highly scalable interface for Logstash and ElasticSearch that allows you to efficiently search, graph, analyze and otherwise make sense of a mountain of logs. Kibana load-balances against your ElasticSearch cluster. Logstash's daily rolling indices let you scale to huge datasets, while Kibana's sequential querying gets you the most relevant data quickly, with more as it becomes available.
https://github.com/ra...

18. Pylogdb - A Python-powered, column-oriented database
pylogdb is a database suitable for web log analysis.
http://code.ohloh.net...

19. Epylog - a Syslog parser
Epylog is a syslog parser which runs periodically, looks at your logs, processes some of the entries in order to present them in a more comprehensible format, and then mails you the output. It is written specifically for large network clusters where a lot of machines (around 50 and upwards) log to the same loghost using syslog or syslog-ng.
https://fedorahosted....

20. Indihiang - IIS and Apache log analyzing tool
Indihiang Project is a web log analyzing tool. It analyzes IIS and Apache web logs and generates real-time reports. It includes a web log viewer and analyzer and can analyze trends from the logs. The tool also integrates with Windows Explorer, so you can open a log file in Indihiang via the context menu.
http://help.eazyworks...



Hadoop Graphing with Cacti

What is Cacti?

Cacti is an RRDtool front end. You can learn more about it on the Cacti website.
Cacti differs from Ganglia in that Cacti polls using SNMP or shell scripts, while applications push data to Ganglia. The two have overlapping features, but for those with a large Cacti deployment, installing a second statistics system just for Hadoop may not be an option.
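For non-SNMP data, a Cacti data input method simply runs a script at each polling interval and parses space-separated name:value pairs from its stdout. A minimal Python sketch (the field names are illustrative):

#!/usr/bin/env python
# Cacti reads space-separated name:value pairs from stdout.
import os

load1, load5, load15 = os.getloadavg()
print("load1:%.2f load5:%.2f load15:%.2f" % (load1, load5, load15))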
I have had great success over the years graphing everything from user CPU and NetApp disk reads to environmental sensors with Cacti. When I saw the information in Hadoop JMX, I started working on a set of Hadoop templates, hadoop-cacti-jtg. My goal was to provide a visual representation of all pertinent Hadoop JMX information.
Administrators and developers can use these templates to better manage Hadoop and understand how it is working behind the scenes. Currently, the package has several predefined graphs covering the Hadoop NameNode and DataNode. Let’s walk through some of them.
Hadoop Capacity
Hadoop Capacity provides the same type of information you get from monitoring a standard disk. The top black line represents the maximum capacity. This is all the possible storage on all currently active DataNodes.
You also have the used and free capacity information stacked on top of each other. You can use these variables to trend your file system growth. In most cases your file system should grow steadily, assuming you have batch processes running on a schedule. You may want to set a Cacti Threshold alarm at 80%. If the alarm goes off, it's good practice to clean up unused files, or you can take the lazy way and order more DataNodes :)
[Figure: Hadoop NameNode capacity graph (hadoop_name_cap.png)]
If you are wondering why the sum of used plus free does not equal capacity, remember that Hadoop keeps a reserve on each DataNode. Your disks' file systems may also have a reserve of their own. If a disk is solely devoted to serving HDFS, you can tune that reserve down (on ext file systems) with the following command:
tune2fs -m <percent> <device>
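If you want to sanity-check the numbers behind this graph, you can query the NameNode's JSON JMX servlet directly. A minimal Python 2 sketch follows; the port, query string, and attribute names are assumptions that vary by Hadoop version:

import json
import urllib2

# The FSNamesystemState bean carries the capacity counters graphed above.
url = ("http://namenode:50070/jmx"
       "?qry=Hadoop:service=NameNode,name=FSNamesystemState")
bean = json.load(urllib2.urlopen(url))["beans"][0]

capacity, used = bean["CapacityTotal"], bean["CapacityUsed"]
print("used %d of %d bytes (%.1f%%)" % (used, capacity,
                                        100.0 * used / capacity))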
Live vs. Dead Nodes
The Hadoop live and dead node information is available on the NameNode's web interface. This stack-style graph shows both values together: blue represents the number of live DataNodes, while the red area shows the number of dead DataNodes. If you are using the Cacti Threshold system, you can have it set off a warning if the number of dead DataNodes exceeds 20%.

NameNode Stats
Hadoop JMX gives us a breakdown of file operations by type. This graph provides details about the requests the NameNode is responding to. I ran several teragens and terasorts from the examples.jar. Below, we can see the process both creating and reading files from the system as the MapReduce jobs run.

DataNode Blocks
The DataNode statistics are similar to the NameNode statistics. This graph template can be applied to each DataNode, allowing you to track BlocksRead, BlocksWritten, BlocksRemoved, and BlocksReplicated. You can use this to find "hot spots" in your data: a hot spot is a piece of data that is frequently accessed. Increasing the replication factor of those files would help by spreading the access across other DataNodes.

 

Cacti Extras

Cacti offers many excellent out-of-the-box features. The following add-on features are helpful for monitoring Hadoop deployments. You can find these on the Cacti site:
  • Linux Full CPU Graph – Adds IOWait and other kernel states. The default CPU graph only shows nice, user, and system.
  • Linux Full Memory Graph – The standard memory graph does not show swap usage.
  • Disk Utilization Graph – You can graph bytes written to physical devices from SNMP. This is helpful for underlying disk utilization and maximum possible disk performance.
  • RealTime Plugin – Used to graph data at 5-second intervals. By default, Cacti polls at 1-minute or 5-minute intervals. This is not especially helpful for Hadoop, since the JMX counters probably update at 5-minute intervals, but it is generally useful for real-time reporting of other SNMP information.
  • THold Plugin – The Threshold plugin creates some overlap between Nagios and Cacti, and sends alarms when data exceeds high or low values.
  • Aggregate Plugin – The aggregate plugin is ideal for graphing clusters into a single graph. You may want to graph the “Open File Count” across several nodes – this plugin makes the graphing process fast and easy.

Where to go from Here

If you want to see the Hadoop Cacti templates in action, check out the Live Sample (user: hadoop, password: hadoop). To get started, simply follow the Installation Instructions. The project is licensed under Apache v2. You can view the Source Repository. A Hudson system provides the latest build if you want to dig into the project source code.