Review: Splunk Reigns In Error Log Collection, Detection
Splunk Professional Server 2.1, unveiled this week, marks another milestone in error log collection and detection. The Splunk search engine now can scale across multiple data centers and geographies, and it provides unlimited indexing by clustering multiple servers.
The software also comes with new C++, SOAP and REST APIs, as well as command-line interfaces for automating searches with scripting technologies.
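The REST interface in particular lends itself to quick scripting. The sketch below is purely illustrative -- the endpoint path, parameter names and response shape are assumptions, not Splunk's documented API -- but it shows the general shape of automating a search from a script:

```python
# Hypothetical sketch of scripting a search against a REST search API.
# The endpoint, query parameters and JSON response layout are assumed
# for illustration; consult the product's API documentation.
import json
import urllib.parse
import urllib.request

BASE_URL = "http://splunk-server:8000/api/search"   # assumed endpoint

def run_search(query: str, max_results: int = 100) -> list:
    """Submit a search string and return matching events."""
    params = urllib.parse.urlencode({"q": query, "count": max_results})
    with urllib.request.urlopen(f"{BASE_URL}?{params}") as resp:
        return json.loads(resp.read())["events"]   # assumed response shape

if __name__ == "__main__":
    for event in run_search("ERROR 500"):
        print(event)
```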
Essentially, the Splunk engine enables operators, administrators and developers to find out what's going on in an IT infrastructure. Because Splunk is relatively new, most companies aren't aware of the technology and -- with so much data generated in real time from complex systems -- are still using simple tracking solutions to pinpoint system failures.
Today, that process is manual: operators must retrace all connected transactions by crawling through many log files. The tracing is slow, even when following application server and packaged middleware logs, since every piece of technology generates its own format.
Different versions of the same technology also sometimes generate different log formats, making the process even more difficult.
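To see why that hurts, consider the same failure logged by two systems. In the sketch below, both log lines and both formats are invented for illustration, but they show how each format needs its own hand-built parser:

```python
# Illustration of log-format sprawl: the same failed transaction,
# logged two different ways. Lines and patterns are invented examples.
import re

samples = [
    "2006-05-01 12:04:31 ERROR [OrderService] txn=9912 connection refused",
    "May  1 12:04:31 appsrv01 middleware[442]: E-1033 txn 9912 conn refused",
]

patterns = [
    re.compile(r"^(?P<ts>\S+ \S+) ERROR \[(?P<src>\w+)\] txn=(?P<txn>\d+)"),
    re.compile(r"^(?P<ts>\w+ +\d+ \S+) \S+ \S+: E-\d+ txn (?P<txn>\d+)"),
]

for line in samples:
    for pat in patterns:
        m = pat.match(line)
        if m:
            print(m.groupdict())   # each format needs its own parser
            break
```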
The amount of data exacerbates this problem. One server can generate more than 100 Mbytes of log data a day, and a multitiered system 10 times that amount -- roughly a gigabyte per system. Multiplied across hundreds or thousands of systems, an enterprise can easily generate a terabyte of log data in one day.
Operators typically have to collect feeds from SNMP traps, ports, JMS messages, audit logs inside database tables, and application server logs. Even for the most skilled operators, a manual approach can take hours to isolate a fault.
Another approach is to use parsing technologies to trace information, but home-grown systems that parse log files don't scale well.
To add to the frustration, once a fault has been pinned down to an application or a set of modules, developers often must join the identification process, because application-specific log data is almost always written only for developers hunting specific errors. Most application-specific logs aren't standardized, given the variety of products and technologies in each system, and pressing deadlines leave developers little time to compose log messages that everyone can understand and trace. If that's not managed correctly, chasing errors manually becomes expensive, often diverting critical IT resources for a large part of every workday.
Splunk aims to solve that problem with its search engine, which indexes any kind of log files in real time.
The engine uses free-form indexing to match fields between any freely formatted files. It also employs learning technologies to analyze and trace machine- and system-generated data.
What's more, Splunk turns logs into interactive events, rather than simply displaying documents. The engine correlates the series of events behind a transaction by automatically normalizing time stamps and parsing and indexing keywords. Instead of hunting for log files, operators can concentrate on the events within them.
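As a rough mental model -- a simplification, not Splunk's actual internals -- that normalization and indexing step might look like this:

```python
# Simplified sketch of event normalization and keyword indexing.
# Timestamp formats, log lines and keywords are invented examples.
from collections import defaultdict
from datetime import datetime

# Two assumed timestamp formats, normalized to one representation.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%b/%Y:%H:%M:%S"]

def parse_time(text: str) -> datetime:
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text}")

index = defaultdict(list)   # keyword -> list of (time, event) pairs

def add_event(ts_text: str, message: str) -> None:
    ts = parse_time(ts_text)
    for keyword in message.lower().split():
        index[keyword].append((ts, message))

add_event("2006-05-01 12:04:31", "ERROR 500 payment gateway timeout")
add_event("01/May/2006:12:04:32", "ERROR 500 retry failed")
print(sorted(index["500"]))   # correlated events, in time order
```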
As with any search engine, Splunk lets operators type in error numbers to begin a search. Splunk then dynamically groups events into sections by matching segments of the original logs where the errors were reported.
Splunk supports regular expressions for more advanced searching, so operators can exclude events containing any unwanted combination of errors and narrow a search.
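For instance, an operator might drop events matching known-benign error codes before examining the rest. The sketch below is generic Python with invented error codes, not Splunk's search syntax:

```python
# Generic sketch: use a regular expression to exclude known-benign
# errors from a result set before inspection. Codes are invented.
import re

NOISE = re.compile(r"ERR-(1001|1002)\b")   # known harmless codes

events = [
    "12:04:31 ERR-1001 cache miss",
    "12:04:32 ERR-2050 transaction rollback",
    "12:04:33 ERR-1002 session renewed",
]

suspects = [e for e in events if not NOISE.search(e)]
print(suspects)   # only ERR-2050 remains for investigation
```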
Since time is the easiest way to trace problems, operators can quickly match events based on time proximity. But because many errors often occur at once, identifying a time window isn't enough. Operators also must sift through many error numbers to figure out if one or more problems are occurring in a given time window.
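A simplified illustration of that time-proximity matching -- the threshold and events here are invented -- treats events closer together than some window as one incident:

```python
# Sketch of grouping events by time proximity: events separated by less
# than a threshold are treated as one incident. Purely illustrative.
from datetime import datetime, timedelta

events = [
    (datetime(2006, 5, 1, 12, 4, 31), "ERR-2050 rollback"),
    (datetime(2006, 5, 1, 12, 4, 33), "ERR-3100 queue full"),
    (datetime(2006, 5, 1, 12, 9, 2), "ERR-2050 rollback"),
]

WINDOW = timedelta(seconds=30)
groups, current = [], [events[0]]
for prev, nxt in zip(events, events[1:]):
    if nxt[0] - prev[0] <= WINDOW:
        current.append(nxt)       # close in time: same incident
    else:
        groups.append(current)    # gap too large: start a new incident
        current = [nxt]
groups.append(current)

print(len(groups), "incidents in the window")   # prints: 2 incidents
```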
Splunk's automatic event classification is the key to a search. The classification process is based on keywords used by developers or generated by systems.
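As a simplified stand-in for that classification -- not Splunk's actual learning technology -- imagine masking the volatile parts of each line and grouping whatever remains:

```python
# Simplified stand-in for keyword-based event classification: mask
# volatile tokens (the timestamp, IP and thread-ID shapes here are
# assumed), then group lines by the signature that remains.
import re
from collections import defaultdict

VOLATILE = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "<TIME>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"thread-\d+"), "<THREAD>"),
]

def signature(line: str) -> str:
    for pattern, token in VOLATILE:
        line = pattern.sub(token, line)
    return line

classes = defaultdict(list)
for line in [
    "2006-05-01 12:04:31 thread-88 refused connection from 10.0.0.5",
    "2006-05-01 12:07:02 thread-12 refused connection from 10.0.9.77",
]:
    classes[signature(line)].append(line)

print(len(classes))   # 1: both lines classify as the same event type
```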
During this process, the search engine can ignore general log context such as IPs, thread IDs and time stamps. Operators also can set scheduled search jobs with alerts that trigger e-mails when the search conditions are met.
That's an extremely useful feature, since once operators identify a type of event, they can create a search job that looks for that event automatically.
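In spirit, such a scheduled search job resembles the following sketch, in which the mail host, addresses and search helper are hypothetical placeholders:

```python
# Sketch of a scheduled search job with an e-mail alert. The mail host,
# addresses and search helper are hypothetical placeholders.
import smtplib
import time
from email.message import EmailMessage

def run_search(query: str) -> list:
    # Stand-in for the hypothetical REST search call sketched earlier.
    return []

def alert(events: list) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Splunk alert: {len(events)} matching events"
    msg["From"] = "splunk-alerts@example.com"   # invented addresses
    msg["To"] = "oncall@example.com"
    msg.set_content("\n".join(events))
    with smtplib.SMTP("mail.example.com") as smtp:   # assumed mail host
        smtp.send_message(msg)

while True:
    hits = run_search("ERR-2050")   # the event type identified earlier
    if hits:
        alert(hits)
    time.sleep(300)                 # re-run the saved search every 5 minutes
```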
If operators disagree with how Splunk creates multiline events, they can override the results by creating new expressions or changing the tool's properties inside configuration files. During setup of the Splunk software, some formats -- such as time zones or special date formats -- must be taught to Splunk.
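A generic illustration of that kind of multiline assembly -- the line-break rule here is invented, not Splunk's configuration syntax -- might look like this:

```python
# Generic sketch of multiline event assembly: a line that starts with a
# timestamp begins a new event; anything else (stack traces, wrapped
# text) folds into the previous event. The rule below is the kind of
# expression an operator might override in configuration.
import re

NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2} ")   # assumed line-break rule

def assemble(lines):
    events, current = [], []
    for line in lines:
        if NEW_EVENT.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

raw = [
    "2006-05-01 12:04:31 ERROR NullPointerException",
    "    at com.example.Order.process(Order.java:42)",
    "2006-05-01 12:04:32 INFO retry scheduled",
]
print(len(assemble(raw)))   # 2 events; the trace folds into the first
```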
Splunk also supports event tagging so that operators can associate messages with events as new errors are identified and classified.
Splunk Professional Server 2.1
Distribution: Direct to VARs
Note: "Recommended" status is earned with a score of 8 stars out of 10.
Tagging is a huge help to operators because they don't have to search through codes or chase developers every time errors occur.
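In miniature -- with invented signatures, codes and labels -- tagging amounts to maintaining a mapping from classified events to names that later searches can use:

```python
# Miniature sketch of event tagging: associate a classified event
# signature with a human-readable tag, then search by the tag instead
# of the raw code. Signatures, codes and labels are invented.
tags = {}   # event signature -> tag

def tag_event(signature: str, label: str) -> None:
    tags[signature] = label

def search_by_tag(events: list, label: str) -> list:
    return [e for e in events if tags.get(e) == label]

tag_event("ERR-2050 transaction rollback", "billing-outage")

events = ["ERR-2050 transaction rollback", "ERR-1001 cache miss"]
print(search_by_tag(events, "billing-outage"))   # finds the tagged event
```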
Splunk Professional Server 2.1 carries a starting price of $2,500, with a limit of about 500 Mbytes of indexed data per day.
The company provides on-site, live phone support, as well as Web-based support depending on partnership level. In addition, users can access a free online knowledge base and support forums.
Splunk also offers developer training at a cost of $2,000 for two days per student. User and operator training costs $1,000 for one day per student. Level-one support certification training runs $1,000 for one day per student.