Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Tuesday, 16 September 2014

Top 45 Big Data Tools for Developers

Big Data is everywhere. Even small to medium-sized businesses are seeking ways to gain more visibility into processes, monetize additional streams, and derive more actionable insights from their data. With data traditionally contained in information silos within applications or databases, taking advantage of Big Data was initially a tedious and complex process. But thanks to Big Data tools, Big Data management can now streamlined in a comprehensive dashboard.

Sophisticated platforms enable end-to-end data management and business intelligence with solutions for gathering, integrating, analyzing, and even predicting data in ways never before possible. Listed in no particular order of importance, the following Big Data tools for developers offer platforms for rapid deployment of apps, the ability to integrated data collection and analysis from multitudes of sources and applications, and even integrating offline and online data to put actions and events into context.

1. SpliceMachine

@SpliceMachine

A real-time SQL-on-Hadoop database, Splice Machine takes Big Data beyond analytics with the ability to derive real-time, actionable insights for rapid decision-making. Not only can Splice Machine process real-time updates, but it offers the ability to utilize standard SQL and is capable of scaling out on commodity hardware. Splice Machine can be used in circumstances where MySQL or Oracle can’t scale.

Key Features:

SQL-99 compliant, with standard ANSI SQL
Easily scales from gigabytes to petabytes using cost-effective, commodity hardware
Real-time updates with transactional integrity
Distributed computing architecture
Multiple Version Concurrency Control (MVCC)

Cost: Contact for a quote

2. Palantir

@PalantirTech

Palantir was founded in 2004 by a group of former PayPal employees and Stanford computer scientists. The company has doubled in size every year to date, but strives to maintain its startup culture. Offering a suite Big Data solutions for integrating, visualizing and analyzing information, Palantir’s product line emphasizes scalability, security, ease of use, and collaboration. Palantir’s solutions are most commonly used in intelligence, defense, financial and law enforcement applications, but it’s quickly growing in other verticals.

Key Features:

Solutions for integrating, visualizing and analyzing data
Serves a multitude of industries with custom solutions
Exploit and analyze data
Extract data from multiple sources
Privacy and data protection policies
Simplify workflows by integrating data into a single dashboard

Cost: Contact for a quote

3. Attivio

@Attivio

As enterprises are coping with a broader variety of information sources, eliminating information silos is critical to gaining comprehensive insights and identifying key relationships among data. Attivio’s Active Intelligence Engine combines Big Data and Big Content to analyze everything, including human-generated text through advanced text analytics. Combined with universal indexing and automatic ad-hoc JOIN, Attivio is a powerful solution for making valuable connections between all your data.

Key Features:

Combines Big Data and Big Content
Eliminates information silos
Adds context and signals from human-generated information sources
Supports BI/data visualization tools
In-engine analytics

Cost: Contact for a quote

4. Google Charts

Google Charts is a free tool with a wide range of capabilities for visualizing data from a website. From simple charts to complex hierarchical tree maps, Google Charts offers a gallery with a multitude of pre-configured chart types to choose from. Google Charts is easily implemented by embedding simple JavaScript code on a website, yet it offers complex functionality with the ability to connect to dashboards, sort, modify, and filter data, connect to a database or integrate with and pull data from a website. Take it a step further by implementing the Chart Tools Datasource protocol and allow other entities to source data from your website.

Key Features:

Charts exposed as JavaScript classes
Customize to match the look and feel of a website
Charts populated using DataTable class
Sort, modify, filter data
Populate data from a variety of sources

Cost: FREE

5. Mortar

@Mortardata

Mortar is a “general purpose platform for high-scale data science” designed to help data scientists spend more time analyzing their data and deriving actionable insights, instead of dedicating valuable time to building infrastructure and re-configuring systems. With Mortar, you can build a custom-built recommendation engine in days, not months.

Key Features:

Open-source tools for building a recommendation engine
Built on Hadoop and Apache Pig
Create, test and run jobs from in-browser IDE
Snapshots monitor changes and progress
Instant feedback on code for fewer bugs

Cost:

Public – FREE
Solo – $99/month
Team – $499/month

6. SAP

@SAPInMemory

SAP’s HANA platform can be combined with Apache Hadoop for the ability to integrate and analyze massive loads of data in real time. The platform makes it possible to derive actionable insights by making valuable connections between all types of information, from a multitude of sources. Combine SAP HANA with applications that leverage Big Data insights to quickly create additional revenue streams and improve operations.

Key Features:

Infinite storage
Flexible data management for all types of data
Discover insights with analytics solutions
Runs processes 1,000 to 100,000 times faster in-memory
SAP IQ analytics holds the Guinness World Record for data loading

Cost: Contact for a quote

7. Cambridge Semantics

@CamSemantics

Collecting, integrating, and analyzing Big Data doesn’t have to be a major effort. Cambridge Semantics makes it all possible with the Anzo Software Suite, an open platform for building Unified Information Access (UIA) solutions. That means replacing the information silos that leave data isolated and useless with a powerful, seamless data integration machine that streamlines data collection and enables sophisticated analysis for rapid decision making. And you can implement all this within hours or days — not the typical weeks or months required for an initiative at this level.

Key Features:

Combine data from a multitude of sources
Customized, interactive web dashboards for analysis
Share spreadsheets in sync automatically
Useful for CRM, billing, project management and more

Cost: Contact for a quote

8. Fusion Charts

@FusionCharts

Just because development is in your blood doesn’t mean you should embrace complexity when simpler solutions are readily available. Fusion Charts enables you to create sophisticated, cross-device compatible JavaScript charts with animation, rich interactivity, and impressive design with ease. Don’t spend your valuable time on child’s play; with Fusion Charts, you can spend more time on complex development tasks and deliver even better results.

Key Features:

Developer resources center
Interactive zooming and scrolling
Real-time charts and gauges
Multi-lingual charts
Visually editable charts and gauges
Linked charts and a variety of effects

Cost: Contact for a quote

9. MarkLogic

@MarkLogic

MarkLogic is built to support the world’s biggest data loads, bringing all types of relevant content back to users who can turn it into action. With real-time updates and alerts, connections between information make new opportunities immediately obvious. MarkLogic is ideal for enterprises that count on revenue through paid content search. With geographic data combined with content, location relevance is built in, and geographic boundaries make advanced data filtering possible.

Key Features:

Range of Big Data solutions
Speeds development
Flexible APIs
NoSQL
Real-time analysis and updates
Bring all types of content back to end users

Cost: Contact for a quote

10. Syncsort

@Syncsort

Syncsort offers a range of products and solutions to help you tap into Big Data. Hadoop, Linux, Unix and Windows, and Mainframe solutions, Syncsort’s product lineup offers a solution to meet practically any configuration needs. A GUI-based solution, Syncsort enables developers to create solutions for collecting, processing, and distributing more data in less time.

Key Features:

Solutions for Hadoop, Mainframe, Windows, Linux, Unix
Lowers the barriers to Hadoop adoption
Eliminates the need for custom code for Hadoop implementation
High-performance sorting
Improve efficiency

Cost: Contact for a quote

11. DataStax

@DataStax

DataStax helps companies like Netflix, Healthcare Anytime, eBay, and even Adobe harness the power of Big Data with less effort and at a lower cost than traditional solutions. Tapped as the first alternative to Oracle, DataStax provides the constant uptime and lightning speed required for modern customer-facing applications. When you need the capacity to handle massive data loads at maximum speed for real-time analysis, DataStax packs a major punch with a robust visual query tool for developers.

Key Features:

Visual query tool for developers
Create and run Cassandra Query Language (CQL) queries and commands
Visually navigate and interact with data clusters
Works with DataStax Community and Enterprise editions

Cost: Contact for a quote

12. Guavus

@Guavus

Need to create a rich, engaging and meaningful customer experience? Guavus drives better decision making with powerful analytics capabilities combined with advanced data science and the ability to distill data in real time to derive actionable insights at the precise moment of opportunity. Through continuous correlation and ongoing analysis, vast amounts of static and dynamic data are handled with ease and revealing opportunities to generate more revenue, reduce overhead costs, and monetize new streams.

Key Features:

Analyze-First Analytics Architecture
Analyze high-volume data streams in near real time
Handles multiple data sources with ease
Continual data analysis from moment of capture

Cost: Contact for a quote

13. mongoDB

@MongoDB

An open-source document database, mongoDB is ideal for developers who want precise control over the final results and processes for handling Big Data. With full index support, you have flexibility to index any attribute and scale horizontally without compromising functionality. Rich, document-based queries and GridFS for storing files of any size without the risk of compromising your stack, mongoDB is a scaleable, flexible, and powerful solution for Big Data.

Key Features:

Open-source platform
Document-oriented storage
Flexible aggregation and data processing
Full index support
GridFS

Cost: FREE

14. Infochimps Cloud

@Infochimps

A cloud service solution for Big Data, Infochimps Cloud makes it possible to deploy Big Data applications rapidly and without the typical time commitment. For applications requiring real-time analysis, multi-source streaming data, a NoSQL database, or a Hadoop cluster, Infochimps Cloud offers a solution that facilitates rapid implementation. Real-time analytics, ad hoc analytics, and batch analytics comprise Infochimps Cloud’s three essential cloud services.

Key Features:

Integrate with any data source – CRM solutions, etc.
Log analysis
Mobile data analytics
Fraud detection and risk analysis
Ad targeting
Customer insights via social media sources, website clickstreams and more

Cost: Contact for a quote

15. Pentaho

@Pentaho

Pentaho brings IT and business users together by joining data integration and business analytics for integrating, visualizing, analyzing and blending Big Data in ways never before possible for better business results. When you need the ability to put robust information at your users’ fingertips in real time and at a reasonable cost, Pentaho’s open, embeddable and extensible analytics platform makes it easy to visualize, explore, and predict — turning data into value.

Key Features:

High-volume data processing
Adaptive Big Data layer
Data mining and predictive analysis
Instaview – data to insights in 3 steps

Cost: Contact for a quote

16. Karmasphere

@Karmasphere

Designed for teams of analysts who need to explore and analyze Big Data on Hadoop, Karmasphere is a key solution for self-service analytics on Hadoop. With companies seeking more effective and efficient ways to make use of Big Data, users can turn data into real business value by discovering insights and influencing outcomes.

Key Features:

Organized dashboard
SQL data explorer
250-plus pre-packaged Hadoop algorithms
SAS, SPSS and R Analytic Models
Dynamic data lenses for self-service analytics

Cost: Contact for a quote

17. Placed

@Placed

Placed facilitates data collection from offline sources, enabling enterprises to derive actionable insights through a combined analysis of both offline and online behavior and data metrics. Placed targeting and placed attribution facilitates better results from mobile advertising by mapping the relationship between people and places by capitalizing on Big Data capabilities.

Key Features:

Measure visitation trends over time
Measures 100 million locations a day, across more than 100,000 opted-in US smartphones
Inference Pipeline references a place database with nearly 300 million features for the US alone
Largest repository of offline insights into the paths and behaviors of consumers
Audience segmentation by demographics and other data points
Affinity modeling for understanding relationships between data
Monitor and understand how consumer behavior changes over time

Cost: Contact for a quote

18. Upsight

@GetUpsight

Upsight, formerly Kontagent, provides actionable analytics for developers to understand what’s happening with your apps and derive actionable insights from data to impact acquisition, engagement, retention and revenue. The platform also enables the creation of targeted in-app and out-of-app metrics in line with KPIs.

Key Features:

Free, enterprise-grade analytics
Unlimited data storage
Data mining with Hadoop
Measure anything from social apps to games and mobile dating apps
Funnel analysis
Cohort explorer
Predictive LTV

Cost:

FREE – Analytics and unlimited data storage, 250k push
Core – $500/month - Custom Events up to 100k MAU, 500k push
Pro – $2,000/month - Custom Events up to 250K MAU, 1M push
Enterprise – Starting at $3,000/month - Unlimited Data Storage & Custom Events + Data Mine + Predictive LTV + A/B, unlimited push

19. Talend

@Talend

Talend Open Studio is “a powerful and versatile set of open source products for developing, testing, deploying and administrating data management and application integration projects.” Providing a unified environment for managing the full lifecycle, even across enterprise boundaries, Talend enables developers to reclaim their productivity with a fully integrated platform for joining data integration, data quality, MDM, application integration and big data.

Key Features:

Data integration at a cluster scale
No need to write or maintain code
Works with leading Hadoop distributions
Pull source data from anywhere, including NoSQL

Cost: FREE

20. Jaspersoft

@Jaspersoft

Connect and visualize data for Hadoop Analytics, MongoDB Analytics, Cassandra Analytics, and other platforms in one central repository. Using Big Data, developers can configure reports, analytics, dashboards, and more, without having to migrate data to multiple databases.

Key Features:

Real-time analytics
Integrate all your data
Blend data through innovative data virtualization metadata layer or raditional data warehouse using ETL
Present integrated visualizations and dashboards within your apps
Create intuitive design tools for non-designers to create visualizations

Cost:

Free
Jaspersoft for AWS – Less than $1/hour

21. Keen IO

@Keen_io

With powerful APIs for gathering all the data you need and deriving the actionable insights you need to drive your business forward, Keen IO is a powerful, flexible, and scaleable solution that’s literally Big Data, easy-to-implement and at your fingertips.

Key Features:

Send as much data as you want, from any source
Set up event data on any action, such as upgrades, impressions or purchases
Arbitrary JSON format
Custom properties

Cost:

Developer – FREE – 50,000 events/month
Startup – $20/month – 100,000 events/month
Growth – $125/month – 1M events/month
Premium – $300/month – 4M events/month
Professional – $600/month – 8M events/month
Business – $1,000/month – 15M events/month
Enterprise – $2,000/month – 50M events/month
Custom – Negotiable – 50M – 100B events/month

22. Skytree

@SkytreeHQ

High-performance machine learning on Big Data for advanced analytics, Skytree offers the ideal platform for fully exploiting the opportunities presented by Big Data. With a multitude of industry-focused solutions as well as solutions encompassing everything from predictive analytics to algorithmic pricing, Skytree is a comprehensive Machine Learning platform emphasizing the growing importance of Predictive Analytics in Big Data.

Key Features:

Business Analytics range from value analytics to fraud detection and what-if analytics
Marketing Analytics offer solutions ranging from ad optimization to lead scoring and recommender systems
Only general purpose scalable Machine Learning system on the market
Highest accuracy on the market; unprecedented speed and scale
Power Packs modules are plugged into the Skytree Server Foundation

Cost: Contact for a quote

23. Tableau Software

@tableau

Tableau was launched by a computer scientist, an Academy Award-winning professor, and a business leader with a passion for data. This perfect trio created a powerful suite of solutions designed to put more data at users’ fingertips — and help them understand it in more meaningful ways. With an advanced query language for powerful visualizations, the ability to natively query databases, cubes, warehouses, and more, a lightning-fast in-memory analytics database designed to eliminate silos and more, Tableau addresses every corner of Big Data demands.

Key Features:

In-memory analytics database eliminates memory silos
Leverages the complete memory hierarchy from disk to L1 cache
Tableau Public – free tool bringing data to life on the web
Touch, swipe and tab functionality for mobile
Easily layer in additional data sources
Access any data with a few clicks

Cost:

Tableau Public – FREE
Tableau Desktop, Tableau Server, and Tableau Online – Contact for a quote

24. Splunk

@Splunk

Splunk harnesses all the machine data created by websites, applications, servers, networks, sensors, mobile devices, and other sources to monitor actions, activities and events, analyzing those data sources to derive actionable insights. Splunk is a self-contained software package downloadable and functional on any device.

Key Features:

Derive insights from Big Data with speed and simplicity
Works on most major Hadoop distributions, including including first-generation MapReduce and YARN
Splunk Hadoop Connect enables bi-directional integration
Real-time collection, indexing, and analyzing

Cost:

Splunk Storm – FREE cloud service for developers
Splunk Enterprise – Perpetual License – Starts at $4,500 for 1 GB/day, plus annual support fees
Splunk Enterprise – Term License – Start at $1,800 per year, which include annual support fees
Hunk – One-year term license of Hunk starts at $2,500 per Hadoop TaskTracker or Compute Node with a minimum of ten TaskTrackers or Compute Nodes
Splunk Cloud – Annual subscription pricing, data volumes of 5GB/day to 1TB/day

25. Platfora

@Platfora

Platfora hides the complex nature of Hadoop, making it simpler for enterprises to discover and understand facts in their business across events, actions, behaviors and time. Built by Silicon Valley veterans who have built market-leading companies around big ideas, the Platfora team understands the power of Big Data and aims to change customers’ lives with Platfora as they’ve done with companies in the past.

Key Features:

Vizboards for self-service, interactive data visualization
Analytics Engine, In-Memory Accelerator, and Hadoop Processor
Entity-centric data catalog
Build interest-driven pipelines of facts
Analyze data iteratively with segmentation
Collaboration features
On-premise or cloud deployment

Cost: Contact for a quote

26. Continuuity

@Continunity

Continuuity enables developers to build Big Data applications quickly, easily, and seamlessly, deploying instantly on-premise or to the cloud. It’s all made possible through simple APIs that can be used with virtually any platforms.

Key Features:

User-implemented real-time stream processors (Flows)
Process a batch of data objects with the same transaction
More than one instance possible with each Flowlet
Programmatic control with REST interfaces
Three partitioning strategies to choose from
DataSets for higher-level abstractions

Cost: Contact for a quote

27. BitDeli

@Bitdeli

BitDeli is an analytics tool for GitHub, enabling developers to gather data on who is viewing their repositories, where and when. With a one-click install, you can easily add analytics to your repositories and start gathering valuable data, including aggregate statistics across all forks for a given repository.

Key Features:

One-click install
Automatically generated pull requests
Trending badge indicator shows repository popularity
Global rankings for comparison
Fork aggregation for a broad picture of project health

Cost: Based on GitHub Enterprise pricing

28. Flurry

@FlurryMobile

Flurry is an end-to-end solution for analyzing consumer behavior, advertising to the right audience, at the right time, and discovering new ways to monetize audiences. Flurry makes use of 3.5 billion app session reports per day totaling more than 3 terabytes to provide valuable insights for app developers, such as a deep understanding of the user base, engagement benchmarks, and other key metrics.

Key Features:

Demographic estimations
App engagement benchmarks
App category and consumer interests
World’s largest app-audience data set
Reach more than 250 million customers per month
Data-powered targeting

Cost:

Flurry Analytics – FREE
Flurry AppCircle, FlurryPersonas, Flurry AppSpot – Contact for a quote

29. Spring Data /Pivotal

@springcentral

Spring Data is a set of projects designed to make it easier to use new data access technologies and provide improved support for relational database technologies. An umbrella project with many sub-projects designed for specific databases, Spring Data is an excellent source of Big Data tools to streamline the use of Big Data in the modern enterprise environment.

Key Features:

Support for Hadoop, mongoDB, Data Rest and more
Also provides consulting services
Customized, all-in-one Eclipse-based distribution
Tool suites for ready-to-use solutions

Cost: FREE

30. Hortonworks

@hortonworks

Hortonworks offers a complete distribution for fully harnessing the power of Hadoop. With Hortonworks, you can process and analyze everything from Sentiment to Sensors with a 100% open-source, enterprise-grade distribution of Hadoop for every platform.

Key Features:

Interact with all data, in multiple ways, simultaneously
Stable, tested, complete package of all services required for the platform
Integrates with other tools
HDP is built and supported by original architects, builders and operators of Hadoop

Cost: FREE

31. StatsMix

@cooperio

A complete dashboard for all your business needs, StatsMix dashboards are customized to your specific requirements, with the data you need from sources like Salesforce, MySQL, Google Analytics, and other tools and services.

Key Features:

Chart and track anything
Measure KPIs
Track any metric with API
Automatic social monitoring
Share metrics and dashboards via email, embed them, or create guest accounts
Aggregate metrics to eliminate silos
Custom dashboards

Cost:

Basic – $24/month – 100k API requests
Standard – $49/month – 300k API requests
Pro – $99/month – 1M API requests
Premium – $199/month – 3M API requests
Enterprise – $499/month – 8M API requests

32. Pervasive

@ActianCorp

Pervasive offers a number of Big Data tools, including several solutions for Hadoop and a free RushLoader for Hadoop. From DataFlow Analytics to ParAccel Dataflow ETL/DQ for designing end-to-end ETL and quality data workflows, Pervasive is a Big Data power suite.

Key Features:

Partnership with Actian for powering Big Data 2.0
Predictive Analytics for Big Data
Simple interface for loading massive amounts of data at rapid speeds
Fastest data-crunching engine in the world

Cost:

RushLoader for Hadoop – FREE
ParAccel Dataflow Loader for Hadoop – FREE for 12 months
All other products – Contact for a quote

33. InfiniDB

@InfiniDB

Real-time enablement of Big Data insights is yours with InfiniDB. A 100% open-source platform, you can harness the power of Big Data without the typical cost.

Key Features:

Three open-source versions available
Completely MySQL-accessible
Familiar, MySQL interface for large-scale, ad hoc BI
Dimensional and predictive analytics
Integrates with the Hadoop ™ Distributed File System (HDFS)
Real-time, ad hoc analytics within an Apache Hadoop cluster

Cost: FREE

34. GridGain

@GridGain

GridGain reimagines in-memory computing for a competitive edge in the modern business environment. Nikita Ivanov and Dmitriy Setrakyan share a passion for high-performance computing, a shared vision on which they based the first release of GridGain in 2007. The list of features, functionality and capabilities of these solutions is astounding.

Key Features:

In-Memory Data Grid
Supports SQL, K/V, MongoDB, MPP, MapReduce
Hyper Clustering
Zero Deployment
Advanced Security
Fault Tolerance
Load Balancing
Customizable Event Workflow
Programmatic Querying
Minimal or no integration
No ETL required
Eliminate MapReduce overhead
Works with any Hadoop distribution

Cost: Contact for a quote

35. DeepDive

@HazyResearch

A new type of system to help developers analyze data on a deeper level, DeepDive is an open-source project with a simple four-step process for writing applications on the platform. With calibrated probabilities for every assertion it makes, DeepDive is designed to navigate around the problematic nature of human error in development.

Key Features:

Handles large amounts of data from multiple sources
Write simple rules and offer feedback on prediction accuracy
“Distantly” learns, rather than requiring a tedious machine-learning process for training predictions
Scaleable, high-performance inference and learning engine

Cost: FREE

36. Lavastorm Analytics

@Lavastorm_News

Business turn to analytics to develop faster, better, and cheaper methods and processes. Interestingly, that’s precisely what Lavastorm offers in its platform: faster, better, and cheaper analytics for achieving business goals. Enterprises are demanding more capabilities and faster speeds, and Lavastorm eliminates the need for a disjointed approach with visualization tools, spreadsheets, BI applications, databases and other information silos with a seamless solution delivering end-to-end analytics.

Key Features:

Reduce analytic development by 90% or more
Large volumes of data in short amounts of time
Reuse and share analytics knowledge across teams
Detect hard-to-find issues with 40% less false positives
Visibility control for management and executives

Cost: Contact for a quote

37. SpagoWorld

@SpagoWorld

From business intelligence to middleware, SpagoWorld offers a range of solutions for enterprises — all on an open-source platform. SpagoWorld’s Big Data BI solution enables the collection of massive quantities of data, in rapid timeframes, for use across SpagoWorld’s other platforms for further analysis and business intelligence derivatives.

Key Features:

Extract data from various platforms, from database and analytics platforms to NoSQL databases or enriched distributions
Supports real-time analysis of streaming data
Charts, reports, thematic maps, cockpits
Translate information to self-service BI
Reporting, multi-dimensional analysis
Ad hoc reporting
Location intelligence
Real-time dashboards and console

Cost: FREE

38. RapidMiner

@RapidMiner

A code-free zone, RapidMiner provides advanced analytics with no programming required for configuration. With all-new application wizards for churn reduction, sentiment analysis, predictive maintenance and direct marketing in RapidMiner 6.0, this tool is one of the fastest advanced analytics solutions available.

Key Features:

Hundreds of methods for data integration
Runs on every major platform
No programming required
Drag-and-drop interface

Cost: Contact for a quote

39. Orange

Orange is an open-source data visualization and analysis tool for both novices and experts. Data mining is conducted either through visual programming or Python scripting, with components for machine learning and ad-ons for bioinformatics and text mining.

Key Features:

Remembers choices and makes suggestions
Intelligently chooses communication channels between widgets
Packed with visualization options from bar charts to dendograms
Integration and data analytics
Combine widgets to design the framework of your choice
Toolbox with more than 100 widgets

Cost: FREE

40. OpenDataSoft

@opendatasoft

OpenDataSoft is a comprehensive discovery tool with maps, charts, and graphs to explore public data sets. A cloud-based platform, OpenDataSoft is designed for seamless and unlimited data publishing, sharing, and resuse.

Key Features:

Reuse data through APIs and apps models
Collect data from any source
Read and understand all formats
Make databases findable and reusable
Standard access formats
Interactive & shareable visualization
API factory
Web extensions and open source

Cost (pricing based on Euros):

FREE – Civic initiatives and academic projects
200/month – 100k records, 20K UI/API queries/day
700/month – 10M records, 100K UI/API queries/day
Contact for a quote – Unlimited records, UI/API queries/day

41. Angoss

@Angoss

A comprehensive marketing analytics solution, Angoss offers real-time Big Data insights for a variety of verticals and business sectors. From credit scoring to opportunity and lead scoring, fraud deterrence and claims management, Angoss is capable of capturing and analyzing data for a multitude of applications.

Key Features:

Automated workflows to develop scorecards
Select the most predictive variables
Advanced predictive modeling
Angoss Decision and Strategy Trees
Data preparation and profiling
Model validation and deployment

Cost: Contact for a quote

42. Mu Sigma

@MuSigmaInc

Mu Sigma is one of the world’s largest Decision Sciences and analytics firms, helping companies to institutionalize data-driven decision making by harnessing Big Data. With a set of proprietary platforms to enable rapid decision-making and comprehensive data collection and integration that eliminates information silos, Mu Sigma is a powerful tool for machine learning, operational research, artificial intelligence, and more.

Key Features:

Hosts Mu Sigma problem DNAs
Real-time analytics and event stream processing
Load models into an enterprise ecosystem for consumption
Embedded advanced analytics engine
Influence analysis and topic modeling
Sentiment evaluation
Easily scaled on commodity hardware

Cost: Contact for a quote

43. ERwin

@ERwinModeling

A collaborative data-monitoring environment, ERwin offers an intuitive, graphical interface with a centralized view of key definitions, enabling the leveraging of data as strategic business asset. The product is comprised of a number of editions designed for different stakeholders within an organization, providing a targeted level of information availability and display and configurations for better understanding and usability.

Key Features:

Achieve business agility through model-driven collaboration
Collaborate via web or desktop
Active model templates and naming standards
Display themes, custom data types, macro language and API
Custom reporting
Metadata integration tools

Cost:

CA ERwin Data Modeler Standard Edition r9.5 – Product plus 1 Year Enterprise Maintenance – $4,794
CA ERwin Data Modeler Standard Edition r9.5 – Product plus 3 Years Enterprise Maintenance – $6,392
CA ERwin Data Modeler Workgroup Edition r9.5 – Product plus 1 Year Enterprise Maintenance – $6,708
CA ERwin Data Modeler Workgroup Edition r9.5 – Product plus 3 Years Enterprise Maintenance – $8,944
CA ERwin r9.5 Data Modeler for Microsoft SQL Azure – Product plus 1 Year Enterprise Maintenance – $1,679.94
CA ERwin r9.5 Data Modeler for Microsoft SQL Azure – Product plus 3 Years Enterprise Maintenance – $2,239.92
CA ERwin r9.5 Web Portal Standard Edition 1-5 Users – Product plus 1 Year Enterprise Maintenance – $8,399.70
CA ERwin 9.5 Web Portal Standard Edition 1-5 Users – Product plus 3 Years Enterprise Maintenance – $11,199.60

44. HPCC Systems

@hpccsystems

A proven and battle-tested platform for manipulating, querying, transforming, and data warehousing Big Data, HPCC Systems solves Big Data problems facing modern enterprises in any vertical.

Key Features:

Processing clusters use off-the-shelf hardware
Clusters typically homogeneous, but not required
Distributed, Thor, Roxie file systems
Linux operating system
Build multi-key, multi-field (aka compound) indexes on DFS files
Data warehouse capabilities for structured queries and data analysis applications
Supports thousands of users with sub-second response time, depending on application

Cost: FREE

45. pmOne

@pmOneAG

Offering Big Data and Business Intelligence solutions, pmOne’s cMORE enables users to quickly build, flexibly grow and efficiently administer solutions. It leverages and extends SQL Server functionality, as well as that of Excel, SharePoint, and other components in the Microsoft BI stack.

Key Features:

Simplified standard and ad hoc reporting
Credible alternative to SAP-based data warehouse
Consistent reporting company-wide
Personalize reports; distribute books
Easy access to SAP data and other systems
Based on Microsoft BI

Cost: Contact for a quote

Are Big Data tools changing the way you develop apps? What Big Data tools are changing the way you architect and run your big data projects?

Friday, 15 August 2014

Building a Real-time, Personalized Recommendation System with Kiji

Today, recommendations are everywhere online. Major e-commerce websites like Amazon provide product recommendations in many different forms across their web properties. Financial planning sites like Mint.com provide recommendations for things like credit cards that a user might want to sign up for or banks that can offer better interest rates. Google augments search results based on its knowledge of the users’ past searches to find the most relevant results.

These brands use recommendations to provide contextual, relevant user experience in order to increase conversion rates and user satisfaction. Traditionally, these sorts of recommendations have been computed by batch processes that generate new recommendations on a nightly, weekly or even monthly basis.

However, for certain types of recommendations, it’s necessary to react in a much shorter timeframe than batch processing allows, such as offering a consumer a geo-location-based recommendation. Consider a movie recommendation system -- If a user historically watches action movies, but is currently searching for a comedy, batch recommendations will likely result in recommendations for more action movies instead of the most relevant comedy. In this article, you will learn how to use the Kiji framework, an open source framework for building Big Data Applications, to build a system that provides real-time recommendations.

Kiji, Entity-Centric Data, and the 360º View

To build a real-time recommendation system, we first need a system that can be used to store a 360º view of our customers. Moreover, we need to be able to retrieve data about a particular customer quickly in order to produce recommendations as they interact with our website or mobile app. Kiji is an open-source, modular framework for building real-time applications that collect, store and analyze this sort of data.

More generally, the data necessary for a 360º view can be termed entity-centric data. An entity could be any number of things such as a customer, user, account, or something more abstract like a point-of-sale system or a mobile device.

The goal of an entity-centric storage system is to be able to store everything about a particular entity in a single row. This is challenging with traditional, relational databases because the information may be both stateful data (like name, email address, etc.) and event streams (like clicks). A traditional system requires storing this data in multiple tables, which get joined together at processing time, which makes it harder to do real-time processing. To deal with this challenge, Kiji leverages Apache HBase, which stores data in four dimensions -- row, column family, column qualifier, and timestamp. By leveraging the timestamp dimension, and the ability of HBase to store multiple versions of a cell, Kiji is able to store event-stream data alongside the more stateful, slowly-changing data.

HBase is a key-value store built on top of HDFS and used by Apache Hadoop, which provides the scalability that is necessary for a Big Data solution. A large challenge with developing applications on HBase is that it requires that all the data going in and out of the system be byte arrays. To deal with this, the final core component of Kiji is Apache Avro, which is used by Kiji to store easily-processed data types like standard strings and integers, as well as more complex user-defined data types. Kiji handles any necessary serialization and deserialization for the application when reading or writing data.

Developing Models for Use in Real Time

Kiji provides two APIs for developing models, in Java or Scala, both of which have a batch and a real-time component. The purpose of this split is to break down a model into distinct phases of model execution. The batch phase is a training phase, which is typically a learning process, in which the model is trained over a dataset for the entire population. The output of this phase might be things like parameters for a linear classifier or locations of clusters for a clustering algorithm or a similarity matrix for relating items to one another in a collaborative filtering system. The real-time phase is known as the scoring phase, and takes the trained model and combines it with an entity’s data to produce derived information. Critically, this derived data is considered first-class, in that it can be stored back in the entity’s row for use in serving recommendations or for use as input in later computations.

The Java APIs are called KijiMR, and the Scala APIs form the core of a tool called KijiExpress. KijiExpress leverages the Scalding library to provide APIs for building complex MapReduce workflows, while avoiding a significant amount of boilerplate code typically associated with Java, as well as the job scheduling and coordination that is necessary for stringing together MapReduce jobs.

Individuals Versus Populations

The reason for the differentiation between batch training and real-time scoring is that Kiji makes the observation that population trends change slowly, while individual trends change quickly.

Consider a dataset for a user population that contains ten million purchases. One more purchase is not likely to dramatically affect trends for the population and their likes or dislikes. However, if a particular user has only ever made ten purchases, the eleventh purchase will have a huge affect on what a system can determine that the user is interested in. Given this assertion, an application will only need to retrain its model once enough data has been gathered to affect the population trends. However, we can improve recommendation relevancy for an individual user by reacting to their behavior in real time.

Scoring Against a Model in Real Time

In order to score in real time, the KijiScoring module provides a lazy computation system that allows an application to generate refreshed recommendations only for users that are actively interacting with the application. Through lazy computation, Kiji applications can avoid generating recommendations for users that don’t frequently or may never return for a second visit. This also has the added benefit that Kiji can take into account contextual information like the location of their mobile device at the time of the recommendation.

The primary component in KijiScoring is called a Freshener. Fresheners are really a combination of a couple of other Kiji components: ScoringFunctions and FreshnessPolicies. As mentioned earlier, a model will consist of both a training and a scoring phase. The ScoringFunction is the piece of code that describes how a trained model and a single entity’s data are combined to produce a score or recommendations. A FreshnessPolicy defines when data becomes stale or out-of-date. For example, a common FreshnessPolicy will say that data is out-of-date when it is older than an hour or so. A more complex policy might mark data as out-of-date once an entity has experienced some number of events, like clicks or product views. Finally, the ScoringFunction and FreshnessPolicy are attached to a particular column in a Kiji table which will trigger a refresh of the data, if necessary.

Applications that do real-time scoring will include a tier of servers called KijiScoring servers, which fill the role of an execution layer for refreshing stale data. When a user interacts with the application, the request will be passed to the KijiScoring server tier, which communicates directly with the HBase cluster. The KijiScoring server will request the data, and once retrieved, determine whether or not the data is up-to-date, according to the FreshnessPolicy. If the data is up-to-date, it can just be returned to the client. However, if the data comes back stale, the KijiScoring server will run the specified ScoringFunction for the user that made the request. The important piece to understand is that the data or recommendations that are being refreshed are only being refreshed for the user that is making the request, rather than a batch operation, which would refresh the data for all users. This is how Kiji avoids doing more work than is necessary. Once the data is refreshed, it’s returned to the user, and written back to HBase for use later on.

A typical Kiji application will include some number of KijiScoring servers, which are stateless Java processes that can be scaled out, and that are able to run a ScoringFunction using a single entity’s data as input. A Kiji application will funnel client requests through the KijiScoring server, which determines whether or not data is fresh. If necessary, it will run a ScoringFunction to refresh any recommendations before they are passed back to the client, and write the recomputed data back to HBase for later use.

Deploying Models to a Production System

A major goal in a real-time recommendation system is to be able to iterate on the underlying predictive models easily, and avoid application downtime to push new or improved models into production. To do that, Kiji provides the Kiji Model Repository, which combines metadata about how the models execute with the code that is used to train and score the models. The KijiScoring server needs to know what column accesses should trigger freshening, the FreshnessPolicy to be applied, and the ScoringFunction that will be executed against user data, as well as the locations of any trained models or external data necessary for scoring against the model. This metadata is stored in a Kiji system table, which is just another HBase table at the lowest level. Additionally, the Model Repository stores code artifacts for registered models in a managed Maven repository. The KijiScoring server periodically polls the Model Repository for newly-registered or -unregistered models, and loads or unloads code as necessary.

Putting It All Together

A very common way to provide recommendations is through the use of collaborative filtering. Collaborative filtering algorithms typically involve building a large similarity matrix to store information relating to a product to other products in the product catalog. Each row in the matrix represents a product p_i, and each column represents another product p_j. The value at (p_i, p_j) is the similarity between the two products.

In Kiji, the similarity matrix is computed via a batch training process, and then can be stored in a file or a Kiji table. Each row of the similarity matrix would be stored in a single row in the product table in Kiji in its own column. In practice, this column has the potential to be very large, since it would be a list of all the products in the catalog and similarities. Typically, the batch job will also do the work of picking only the most similar items to put into the table.

This similarity matrix is accessed at scoring time through the KeyValueStore API, which gives processes access to external data. For matrices that are too large to store in memory, storing the matrix in a distributed table enables the application to only request the data that is necessary for the computation, and dramatically reduce the memory requirements.

Since we’ve done a lot of the heavy lifting ahead of the scoring phase, scoring becomes a fairly simple operation. If we wanted to display recommendations based on an item that was viewed, a non-personalized scoring function would just look up the related products from the product table and display those.

It’s a relatively simple task to take this process a little further and personalize the results. In a personalized system, the scoring function would take a user’s recent ratings and use the KeyValueStore API to find products similar to the products that the user had rated. By combining the ratings and the product similarities stored in the products table, the application can predict the ratings that the user would give related items and offer recommendations of the products with the highest predicted ratings. By limiting both the number of ratings used and the number of similar products per rated product, the system can easily handle this operation as the user is interacting with the application.

Conclusion

In this article, we’ve seen, at a high level, how Kiji can be used to develop a recommendation system that refreshes recommendations in real time. By leveraging HBase to do low latency processing, using Avro to store complex data types, and processing data using MapReduce and Scalding, applications can provide relevant recommendations to users in a real-time context. For those who are interested in seeing an example of this system, there is code for a very similar application located on the WibiData Github.

About the Author

Jon Natkins (@nattyice) is a field engineer at WibiData where he is focused on helping users build Big Data Applications on Kiji and WibiEnterprise. Prior to WibiData, Jon worked in software engineer roles for Cloudera and Vertica Systems.

Pages

Tuesday, 16 September 2014

Top 45 Big Data Tools for Developers

Friday, 15 August 2014

Building a Real-time, Personalized Recommendation System with Kiji

Kiji, Entity-Centric Data, and the 360º View

Developing Models for Use in Real Time

Individuals Versus Populations

Scoring Against a Model in Real Time

Deploying Models to a Production System

Putting It All Together

Conclusion

About the Author