Wednesday, 6 August 2014

OpenTSDB and HBase rough performance test



To evaluate the technological choices available for implementing a charting solution over hundreds of millions of points, we decided to try OpenTSDB and check its results against its underlying HBase store.

The point of this test is to get a rough idea of whether this technology would be appropriate for our needs. We planned the following tests:

  • fastest data retrieval to get 5000 points out of 10 million points,
  • fastest data retrieval to get 5000 points out of 200 million points.
We use these points to generate JS charts. This benchmark did not test scalability; we only used 1-8 threads to gather data and see how concurrency impacts performance.

OpenTSDB v2.0 Benchmark

From the OpenTSDB site, the description is:
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.
Retrieval of 5000 out of 10 million points
System, configuration and data retrieval procedure
The benchmark machine is Linux (Ubuntu 12.04 64-bit) with 4 Cores and 10GB of RAM.
With OpenTSDB v2.0 we used HBase version 0.94.5. We disabled compression (COMPRESSION = NONE) because we had problems with COMPRESSION = lzo on Ubuntu: from time to time we received errors on database creation.
We inserted the data into OpenTSDB through its telnet-style socket interface and retrieved it through the OpenTSDB HTTP API.

We generated an OpenTSDB database with 10 million records by inserting into a single metric: a long timestamp, an int value, and a string tag named “sel”.
The insert operation is made with one thread. For data retrieval we used 1, 2, 4 and 8 threads per run; in our test case every thread runs the same operation.

5000 rows selected out of 10 million rows

| Database | Operation | Total rows | Threads | Selected rows | Run time (Select + Fetch) |
| --- | --- | --- | --- | --- | --- |
| OpenTSDB | Insert | 10000000 | 1 | 0 | 127305 ms |
| OpenTSDB | Select+Fetch | 10000000 | 1 | 5000 | 2224 ms (no cache) |
| OpenTSDB | Select+Fetch | 10000000 | 1 | 5000 | 161 ms |
| OpenTSDB | Select+Fetch | 10000000 | 2 | 5000 | 146 ms |
| OpenTSDB | Select+Fetch | 10000000 | 4 | 5000 | 237 ms |
| OpenTSDB | Select+Fetch | 10000000 | 8 | 5000 | 228 ms |

threads – the number of threads that ran at the same time
first run – the first run, made without any cache; the other runs use the cache
run time – the total run time of select+fetch
Problems encountered
While using OpenTSDB we encountered the following problems:
  • At this moment the data can be retrieved only as ASCII (raw data) or as a PNG image; the JSON option is not yet implemented,
  • We failed to run the test case with 200 million points inserted into a metric: even when running the OpenTSDB Java instance with 10GB of RAM (-Xmx10240m -Xms10240m -XX:MaxPermSize=10g) we always received an OutOfMemory error. The error appeared in the OpenTSDB logs, not in HBase or our Java process,
  • If you insert 2 points with the same date (in seconds) into the same metric, all queries that include the duplicate date fail with an exception (net.opentsdb.core.IllegalDataException: Found out of order or duplicate data),
  • The connection to the HBase server dropped suddenly several times,
  • Not an error but maybe a limitation: when we tried inserting 10 million metrics we got a “[New I/O worker #1] UniqueId: Failed to lock the `MAXID_ROW’ row” error.
Conclusions for this test case
OpenTSDB beats MySql and MongoDB in every test: it is 2-4X faster than MySql with or without cache, and 7-328X faster than MongoDB.
The problems encountered in the current version show that it can’t be used in production without fixes.
Retrieval of 5000 out of 200 million points
As stated in the “Problems encountered” section of the previous test, it was not possible to run this test for 200 million points: even when running the OpenTSDB Java instance with 10GB of RAM (-Xmx10240m -Xms10240m -XX:MaxPermSize=10g) we always received an OutOfMemory error, reported in the OpenTSDB logs rather than by HBase or our Java process.

Code used for tests

Insert Code:
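The insert code itself was not preserved in this post; below is a minimal Python sketch of the single-threaded insert over the telnet-style socket protocol described above (the metric name test.metric, the host/port and the sample values are assumptions, not values from the original test):

```python
import socket

def format_put(metric, timestamp, value, tags):
    # One line of OpenTSDB's telnet-style protocol:
    # put <metric> <unix-timestamp> <value> <tag>=<value> ...
    tag_str = " ".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
    return "put %s %d %d %s\n" % (metric, timestamp, value, tag_str)

def insert_points(host, port, lines):
    # Stream all "put" lines over a single connection, matching the
    # one-threaded insert used in this test.
    with socket.create_connection((host, port)) as sock:
        for line in lines:
            sock.sendall(line.encode("ascii"))

# One point: a long timestamp, an int value and the string tag "sel".
line = format_put("test.metric", 1357214400, 42, {"sel": "ubuntu"})
```

The real run streamed 10 million such lines from one thread.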
Get Code, no cache (for the no-cache runs we used the current date as the end time, so OpenTSDB doesn’t use its cache):
Get Code, with cache (for the cached runs we ran the same query several times):
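The retrieval code is likewise missing; as a hedged sketch, building the HTTP API query (the /q endpoint with &ascii output) might look like the Python below — the metric name and tag value are assumptions. A moving end time (the current date) is what keeps OpenTSDB from answering from its cache, while repeating a fixed-range query exercises the cache:

```python
def build_query_url(host, port, start, metric, tags, end=None):
    # OpenTSDB /q endpoint; &ascii requests the raw text output,
    # since the JSON option was not implemented yet at the time.
    tag_str = "{" + ",".join("%s=%s" % (k, v) for k, v in sorted(tags.items())) + "}"
    url = "http://%s:%d/q?start=%s&m=sum:%s%s&ascii" % (host, port, start, metric, tag_str)
    if end is not None:
        # Passing the current date as end defeats the cache; a fixed
        # range repeated several times hits the cache.
        url += "&end=" + end
    return url

url = build_query_url("localhost", 4242, "2013/01/03-12:00:00",
                      "test.metric", {"sel": "ubuntu"})
# The data would then be fetched with e.g. urllib.request.urlopen(url).
```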

HBase Test

We made this test mostly as a sanity check on the previous OpenTSDB performance results.
On HBase we inserted, under a rowKey (in this case “ubuntu” or “another”), a String family, a String qualifier and a String value.
| Database | Operation | Total rows | Threads | Selected rows | Run time (Select + Fetch) |
| --- | --- | --- | --- | --- | --- |
| HBase | Insert | 10000000 | 8 | 0 | 4285229 ms |
| HBase | Select+Fetch | 10000000 | 1 | 5000 | – (no cache) |
| HBase | Select+Fetch | 10000000 | 1 | 5000 | 134 ms |
| HBase | Select+Fetch | 10000000 | 2 | 5000 | 184 ms |
| HBase | Select+Fetch | 10000000 | 4 | 5000 | 337 ms |
| HBase | Select+Fetch | 10000000 | 8 | 5000 | 257 ms |
Insert Code:
Get code:
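The HBase code is also missing from the post; as a rough Python illustration of the family/qualifier/value triple described above (the original test used the Java HBase client — the table name, row keys and the Thrift-based happybase client mentioned in the comments are assumptions):

```python
def make_cells(family, qualifier, value):
    # HBase stores byte values keyed by "family:qualifier"; this mirrors
    # the String family / qualifier / value triple used in the test.
    return {("%s:%s" % (family, qualifier)).encode(): str(value).encode()}

cells = make_cells("data", "point", 42)

# Against a live cluster this payload would be written and scanned back,
# e.g. with the happybase client (an assumption; the original used Java):
#   conn = happybase.Connection("localhost")
#   table = conn.table("perftest")
#   table.put(b"ubuntu", cells)                     # insert under rowKey "ubuntu"
#   rows = list(table.scan(row_prefix=b"ubuntu"))   # select + fetch
```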

OpenTSDB v1.1.0 Benchmark

Retrieval of 5000 out of 10 million points
System, configuration and data retrieval procedure
The benchmark machine is Linux (Ubuntu 12.04 64-bit) with 4 Cores and 10GB of RAM.
For LZO compression we built hadoop-lzo and copied the library to the HBase instance. We also created the tsdb tables with COMPRESSION=lzo.

As before, we inserted the data through the socket interface and retrieved it through the OpenTSDB HTTP API.
5000 rows selected out of 10 million rows with LZO
| Database | Operation | Total rows | Threads | Selected rows | Run time (Select + Fetch) |
| --- | --- | --- | --- | --- | --- |
| OpenTSDB 1.1.0 | Insert | 10000000 | 1 | 0 | 113651 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 1 | 5000 | 2895 ms (no cache) |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 1 | 5000 | 97 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 2 | 5000 | 140 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 4 | 5000 | 128 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 8 | 5000 | 207 ms |

threads – the number of threads that ran at the same time
run time – the total run time of select+fetch
5000 rows selected out of 10 million rows with Date Range

We inserted 10 million points (5000 points in 2013 and the rest before 2013), then queried for the 5000 points over the range 2013/01/03-12:00:00 to the current date.
| Database | Operation | Total rows | Threads | Selected rows | Run time (Select + Fetch) |
| --- | --- | --- | --- | --- | --- |
| OpenTSDB 1.1.0 | Insert | 10000000 | 1 | 0 | 124185 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 1 | 5000 | 136 ms (no cache) |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 1 | 5000 | 126 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 2 | 5000 | 170 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 4 | 5000 | 179 ms |
| OpenTSDB 1.1.0 | Select+Fetch | 10000000 | 8 | 5000 | 227 ms |
Problems encountered
While using OpenTSDB 1.1.0 we encountered the following problems:
  • When we tried to insert 200 million points with the default parameters (./tsdb tsd --port=4242 --staticroot=staticroot --cachedir="$tsdtmp") we got an OutOfMemory exception after a while,
  • When we tried to insert 200 million points after modifying the tsdb startup to have access to more RAM (java -Xmx10240m -Xms10240m -XX:MaxPermSize=10g -enableassertions -enablesystemassertions -classpath /root/opentsdb-1.1.0/third_party/hbase/asynchbase-1.4.1.jar:/root/opentsdb-1.1.0/third_party/guava/guava-13.0.1.jar:/root/opentsdb-1.1.0/third_party/slf4j/log4j-over-slf4j-1.7.2.jar:/root/opentsdb-1.1.0/third_party/logback/logback-classic-1.0.9.jar:/root/opentsdb-1.1.0/third_party/logback/logback-core-1.0.9.jar:/root/opentsdb-1.1.0/third_party/netty/netty-3.6.2.Final.jar:/root/opentsdb-1.1.0/third_party/slf4j/slf4j-api-1.7.2.jar:/root/opentsdb-1.1.0/third_party/suasync/suasync-1.3.1.jar:/root/opentsdb-1.1.0/third_party/zookeeper/zookeeper-3.3.6.jar:/root/opentsdb-1.1.0/tsdb-1.1.0.jar:/root/opentsdb-1.1.0/src net.opentsdb.tools.TSDMain --port=4242 --staticroot=staticroot --cachedir=/tmp/tsd/tsd) we often got the following exception:
    2013-04-03 18:07:18,965 ERROR [New I/O worker #10] CompactionQueue: Failed to write a row to re-compact
    org.hbase.async.RemoteException: org.apache.hadoop.hbase.RegionTooBusyException: region is flushing
    at org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:2592)
    …,
  • If we tried to query the data from point 2 we got java.lang.ArrayIndexOutOfBoundsException: null, and in the opentsdb process:
    java.lang.OutOfMemoryError: GC overhead limit exceeded,
  • If in the same metric there are two points with identical dates, we got the following exception (and didn’t get any data from OpenTSDB):
    Error 2
    Duplicate data cell:
    Request failed: Internal Server Error
    net.opentsdb.core.IllegalDataException: Found out of order or duplicate data: cell=Cell([56, 87], [0, 0, 0, 0, 0, 1, 12, -112]), delta=901.
Conclusions for this test case
Version 1.1.0 is more stable than 2.0: it didn’t crash, but it gave the exceptions above.
It seems that the LZO option doesn’t make the insert or no-cache retrieval faster for this test case; some improvement can be seen on data retrieval with cache.
We got a surprisingly good time on the date range query with no cache.
Code used for tests
Insert Code:
Get Code (for one thread):
Get Code (with cache for one thread):

MySql vs MongoDB performance benchmark



 


Problem

We’re searching for the fastest solution to solve the following two use cases:
  • fastest data retrieval to get 5000/50000 points out of 10 million points,
  • fastest data retrieval to get 5000/50000/500000 points out of 200 million points.
We use these points to generate JS charts.
For testing we chose MongoDB vs MySql. There is no complex relation between tables/objects; the test case is very simple.
Besides the select time we also monitored the fetch time, since we need the data to be fetched. We also tested with and without cache (where possible) – our specific project has the possibility to cache data. This benchmark did not test scalability in detail, but running the test on 8-20 threads doesn’t seem to change the situation very much.

System and config

The benchmark machine is Linux (Ubuntu 12.04 64-bit) with 4 Cores and 10GB of RAM.
MySql cache config (in my.cnf):
query_cache_limit = 1024M
query_cache_type = 1
query_cache_size = 1512M

On Mongo we used the default config with journal = true (we didn’t notice any considerable difference in runtime with journal = false, so we went for the safer option).

Test results

I) Retrieval of 5000/50000 out of 10 million points – lower database load
We generated a random database with 10 million rows. Every row has 5 fields: 2 integers and 3 random strings.
The insert operation is made on 8 threads; the inserts are split among the threads.
For select and fetch we also used threads (1, 2, 4 and 8 threads per run), but in our test case every thread runs the same operation.
On select and fetch we selected the rows (5000 and 50000 rows) out of the total of 10 million and made a similar fetch operation for both MySql and MongoDB.
The results are streamed from shell scripts into CSV files.
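As a sketch, generating one such row and splitting the inserts among the 8 threads might look like this in Python (the field ranges and the 10-character string length are assumptions; the original generator code is not shown):

```python
import random
import string

def random_row(rng):
    # 5 fields per row: 2 integers and 3 random strings.
    def rand_str():
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(10))
    return (rng.randint(0, 10**9), rng.randint(0, 10**9),
            rand_str(), rand_str(), rand_str())

def split_among_threads(total_rows, threads):
    # The inserts are split among the threads: each thread gets an equal
    # share, with the remainder going to the last thread.
    share = total_rows // threads
    counts = [share] * threads
    counts[-1] += total_rows - share * threads
    return counts

rng = random.Random(1)
row = random_row(rng)                        # one row of test data
counts = split_among_threads(10_000_000, 8)  # rows per insert thread
```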

5000 rows selected out of 10 million rows
| Database | Operation | Total rows | Threads | Selected rows | Run time (select + fetch) | CPU usage (4 cores) | Select time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mongo | Insert | 10000000 | 8 | 0 | 411121 ms | 137% | 0 |
| MySQL | Insert | 10000000 | 8 | 0 | 1130493 ms | 149% | 0 |
| Mongo | Select+Fetch | 10000000 | 1 | 5000 | 17411 ms | 105% | 3 ms (first run) |
| MySQL | Select+Fetch | 10000000 | 1 | 5000 | 5836 ms | 109% | 5369 ms (first run) |
| Mongo | Select+Fetch | 10000000 | 1 | 5000 | 6450 ms | 116% | 2 ms |
| MySQL | Select+Fetch | 10000000 | 1 | 5000 | 512 ms | 208% | 66 ms |
| Mongo | Select+Fetch | 10000000 | 2 | 5000 | 12507 ms | 110% | 3 ms |
| MySQL | Select+Fetch | 10000000 | 2 | 5000 | 565 ms | 236% | 69 ms |
| Mongo | Select+Fetch | 10000000 | 4 | 5000 | 28129 ms | 106% | 2 ms |
| MySQL | Select+Fetch | 10000000 | 4 | 5000 | 592 ms | 255% | 72 ms |
| Mongo | Select+Fetch | 10000000 | 8 | 5000 | 75047 ms | 64% | 1 ms |
| MySQL | Select+Fetch | 10000000 | 8 | 5000 | 759 ms | 275% | 66 ms |

50000 rows selected out of 10 million rows
| Database | Operation | Total rows | Threads | Selected rows | Run time (select + fetch) | CPU usage (4 cores) | Select time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mongo | Insert | 10000000 | 8 | 0 | 410866 ms | 137% | 0 |
| MySQL | Insert | 10000000 | 8 | 0 | 1150706 ms | 147% | 0 |
| Mongo | Select+Fetch | 10000000 | 1 | 50000 | 13049 ms | 118% | 1 ms (first run) |
| MySQL | Select+Fetch | 10000000 | 1 | 50000 | 6128 ms | 117% | 5599 ms (first run) |
| Mongo | Select+Fetch | 10000000 | 1 | 50000 | 6955 ms | 138% | 2 ms |
| MySQL | Select+Fetch | 10000000 | 1 | 50000 | 959 ms | 212% | 447 ms |
| Mongo | Select+Fetch | 10000000 | 2 | 50000 | 12445 ms | 142% | 3 ms |
| MySQL | Select+Fetch | 10000000 | 2 | 50000 | 1110 ms | 264% | 533 ms |
| Mongo | Select+Fetch | 10000000 | 4 | 50000 | 30251 ms | 133% | 2 ms |
| MySQL | Select+Fetch | 10000000 | 4 | 50000 | 1483 ms | 307% | 676 ms |
| Mongo | Select+Fetch | 10000000 | 8 | 50000 | 82642 ms | 99% | 3 ms |
| MySQL | Select+Fetch | 10000000 | 8 | 50000 | 2220 ms | 346% | 1433 ms |

(The source listed “500000” in the selected-rows column here; per the test description only 5000 and 50000 rows were selected from the 10 million set, so this table uses 50000.)

On Select + Fetch with multiple threads we ran the same operation on each thread, e.g. on 8 threads we made 8 select + fetch operations (one on each thread).
threads – the number of threads that ran at the same time
CPU usage – the CPU usage of the Java process
first run – the first run, made without any cache; the other runs use the cache
run time – the total runtime of select+fetch
select time – the average select time


Conclusions for this test case
On the first run (no cache on MySql): the total Mongo fetch + select time is 17.4s (3ms select time), while MySql takes 5.8s (5.3s select time). Notice the 3 ms Mongo select time! This is probably caused by lazy processing: the select returns quickly and the real work happens during fetch.
On the next runs (the MySql cache kicks in): the total Mongo fetch + select time is 6.9s (2ms select time), while MySql takes 0.9s (0.4s select time). Notice the big impact of the cache on MySql!

For a project of this scale that also has the possibility to cache the data requests, it seems that overall MySql is much better for read operations.
Even if it were not possible to cache the data, MySql would still be the winner: it is 2X-5X faster for reading.
Notice the good insert speed on Mongo: it is 2-3X faster than MySql.

The 50000 test gives similar results.
II) Retrieval of 5000/50000/500000 out of 200 million points – higher database load
We generated a random database with 200 million rows. Every row has 5 fields: 2 integers and 3 random strings.
The insert operation is made on 8 threads; the inserts are split among the threads.
For select and fetch we also used threads (1, 2, 4 and 8 threads per run), but in our test case every thread runs the same operation.
On select and fetch we selected the rows (5000, 50000 and 500000 rows) out of the total of 200 million and made a similar fetch operation for both MySql and Mongo.
The results are streamed from shell scripts into CSV files.

5000 selected rows out of 200 million rows
| Database | Operation | Total rows | Threads | Selected rows | Run time (select + fetch) | CPU usage (4 cores) | Select time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mongo | Insert | 200000000 | 8 | 0 | 8283056 ms | 134% | 0 |
| MySQL | Insert | 200000000 | 8 | 0 | 22270948 ms | 147% | 0 |
| Mongo | Select+Fetch | 200000000 | 1 | 5000 | 185969 ms | 100% | 4 ms (first run) |
| MySQL | Select+Fetch | 200000000 | 1 | 5000 | 325609 ms | 100% | 324876 ms (first run) |
| Mongo | Select+Fetch | 200000000 | 1 | 5000 | 170406 ms | 100% | 0 ms |
| MySQL | Select+Fetch | 200000000 | 1 | 5000 | 797 ms | 131% | 64 ms |
| Mongo | Select+Fetch | 200000000 | 2 | 5000 | 366705 ms | 100% | 3 ms |
| MySQL | Select+Fetch | 200000000 | 2 | 5000 | 1088 ms | 151% | 71 ms |
| Mongo | Select+Fetch | 200000000 | 4 | 5000 | 626023 ms | 100% | 3 ms |
| MySQL | Select+Fetch | 200000000 | 4 | 5000 | 846 ms | 158% | 85 ms |
| Mongo | Select+Fetch | 200000000 | 8 | 5000 | 1564526 ms | 73% | 2 ms |
| MySQL | Select+Fetch | 200000000 | 8 | 5000 | 1048 ms | 181% | 38 ms |

50000 selected rows out of 200 million rows
| Database | Operation | Total rows | Threads | Selected rows | Run time (select + fetch) | CPU usage (4 cores) | Select time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mongo | Insert | 200000000 | 8 | 0 | 8366874 ms | 134% | 0 |
| MySQL | Insert | 200000000 | 8 | 0 | 22592623 ms | 148% | 0 |
| Mongo | Select+Fetch | 200000000 | 1 | 50000 | 191197 ms | 100% | 2 ms (first run) |
| MySQL | Select+Fetch | 200000000 | 1 | 50000 | 323260 ms | 100% | 322279 ms (first run) |
| Mongo | Select+Fetch | 200000000 | 1 | 50000 | 172113 ms | 100% | 3 ms |
| MySQL | Select+Fetch | 200000000 | 1 | 50000 | 1374 ms | 120% | 144 ms |
| Mongo | Select+Fetch | 200000000 | 2 | 50000 | 349728 ms | 100% | 3 ms |
| MySQL | Select+Fetch | 200000000 | 2 | 50000 | 991 ms | 154% | 113 ms |
| Mongo | Select+Fetch | 200000000 | 4 | 50000 | 596883 ms | 100% | 2 ms |
| MySQL | Select+Fetch | 200000000 | 4 | 50000 | 1160 ms | 180% | 134 ms |
| Mongo | Select+Fetch | 200000000 | 8 | 50000 | 1446121 ms | 75% | 2 ms |
| MySQL | Select+Fetch | 200000000 | 8 | 50000 | 1296 ms | 201% | 196 ms |

500000 selected rows out of 200 million rows
| Database | Operation | Total rows | Threads | Selected rows | Run time (select + fetch) | CPU usage (4 cores) | Select time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mongo | Insert | 200000000 | 8 | 0 | 8388447 ms | 134% | 0 |
| MySQL | Insert | 200000000 | 8 | 0 | 23010922 ms | 148% | 0 |
| Mongo | Select+Fetch | 200000000 | 1 | 500000 | 188207 ms | 101% | 2 ms (first run) |
| MySQL | Select+Fetch | 200000000 | 1 | 500000 | 321247 ms | 100% | 320165 ms (first run) |
| Mongo | Select+Fetch | 200000000 | 1 | 500000 | 172975 ms | 101% | 2 ms |
| MySQL | Select+Fetch | 200000000 | 1 | 500000 | 1265 ms | 162% | 460 ms |
| Mongo | Select+Fetch | 200000000 | 2 | 500000 | 371833 ms | 103% | 3 ms |
| MySQL | Select+Fetch | 200000000 | 2 | 500000 | 1514 ms | 188% | 511 ms |
| Mongo | Select+Fetch | 200000000 | 4 | 500000 | 626729 ms | 104% | 0 ms |
| MySQL | Select+Fetch | 200000000 | 4 | 500000 | 1768 ms | 230% | 711 ms |
| Mongo | Select+Fetch | 200000000 | 8 | 500000 | 1545454 ms | 78% | 2 ms |
| MySQL | Select+Fetch | 200000000 | 8 | 500000 | 2403 ms | 293% | 1374 ms |
Conclusions for this test case
On the first run (no cache on MySql): the total Mongo fetch + select time is 186s (4ms select time), while MySql takes 325s (324s select time). Again the Mongo select is fast: 4 ms. Probably Mongo does almost nothing at select time – the lazy processing appears again and the fetch is slower.

On the next runs (the MySql cache kicks in): the total Mongo fetch + select time is 170s (0ms select time), while MySql takes 0.7s (0.064s select time). Notice the big impact of the cache on MySql!

For a project of this scale that also has the possibility to cache the data requests, it seems that overall MySql is still much better for read operations.
If it is not possible to cache the data, Mongo would be almost 1.3X-2X faster than MySql.
Notice the good insert speed on Mongo: it is 2-3X faster than MySql.

The 50000 and 500000 tests give similar results.

Source code used for tests

MySQL Code:
database drop and create:
database insert code executed on 8 threads:
mysqlgen3.argu – the number of rows (e.g. 10 million) / the number of threads (e.g. 8)
mysqlgen3.nrselect – the number of selected rows
select + fetch:
Mongo Code:
database insert code executed on 8 threads:
select + fetch:
On both MySQL and Mongo we used threads for select+fetch and insert. An example of the thread setup:
threadCount is the number of threads, and the MyThread class contains the code presented above.
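The thread harness itself is not shown either; a minimal Python equivalent of running the same operation on threadCount threads and timing the total run could look like this (the original was Java, so this is only an illustration, with a stand-in operation instead of the real select + fetch):

```python
import threading
import time

def run_same_op_on_threads(op, thread_count):
    # Every thread runs the same select+fetch operation, as in the
    # benchmark; we time from the first start to the last join.
    threads = [threading.Thread(target=op) for _ in range(thread_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (time.perf_counter() - start) * 1000.0  # total run time in ms

# Stand-in operation; the real MyThread ran the select + fetch code.
results = []
elapsed_ms = run_same_op_on_threads(lambda: results.append(sum(range(1000))), 8)
```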