4. Storing data in the Amazon Cloud (HBase)
4.1 HBase as Key Value store
Initially I thought of using a relational database for permanently storing the collected sensor values, mainly because I thought it would be easy. If the Raspberry Pi and the cloud server ran the same database system and were connected through a database link, updating would be as simple as ‘insert into select * from ‘. Using SQLite made this impossible because it does not support remote tables. In addition, storing 1,000 values per second would quickly increase the number of rows (over 31 billion for one year), possibly going beyond the limits of a Relational Database Management System (RDBMS). Finally, because this was mostly a learning project and I had some experience with big data technology, such as Hadoop and HBase, I decided to go for a Key Value store.
I decided to use HBase as the storage system, mainly because I had some experience with using HBase in combination with Python, and secondly because it is the underlying system of OpenTSDB. OpenTSDB is an open source project for storing time series data, which is almost the essence of what I was trying to achieve. I decided to use standard HBase instead of OpenTSDB to better understand what was happening in the storage layer and to have more control over the row key design.
4.2 HBase row key design
One of the most important parts of using HBase is the row key design. Lars George explains this in detail in his book HBase: The Definitive Guide. I chose this one:
<device id>_<port id>_<reverse epoch>_<reverse milliseconds>
For example: 40b5af01_rx000A01_8571346790_9184822.
The key design uses the concepts device and port. Both are XBee concepts. A simple combined <sensor id>, similar to OpenTSDB, might be a better alternative because it is a more general approach. On the other hand, splitting the key up into the ID of the device that sends the value and the ID of the sensor on that device is not a bad concept. In this case the port id can be seen as the sensor id. Both device id and port id are fixed-length alphanumeric IDs. The time parts of the key are reversed so that the newest value of each sensor sorts first, which lets the scan function of HBase quickly return the latest value of each sensor. The sensor table in HBase only uses one column and one column family.
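As a minimal sketch of how such a key could be built in Python: the post does not spell out the exact "reverse" formula, so the code below assumes it means subtracting from an all-nines number of the same width. The function name `build_row_key`, the two constants, and the input timestamp are hypothetical; the inputs were back-calculated from the example key under that assumption.

```python
# Assumption: "reverse" = subtract from an all-nines value of the same width.
EPOCH_MAX = 9_999_999_999   # 10 digits, the width used in the example key
FRACTION_MAX = 9_999_999    # 7 digits, the width used in the example key


def build_row_key(device_id: str, port_id: str, epoch_s: int, fraction: int) -> str:
    """Build <device id>_<port id>_<reverse epoch>_<reverse sub-second part>."""
    reverse_epoch = EPOCH_MAX - epoch_s
    reverse_fraction = FRACTION_MAX - fraction
    # Zero-padding keeps every key the same length, so string order
    # matches numeric order.
    return f"{device_id}_{port_id}_{reverse_epoch:010d}_{reverse_fraction:07d}"


older = build_row_key("40b5af01", "rx000A01", 1428653209, 815177)
newer = build_row_key("40b5af01", "rx000A01", 1428653210, 0)

print(older)          # reproduces the example key from the text
print(newer < older)  # True: the newer value sorts first
```

Because HBase stores rows sorted by key, a scan that starts at the `<device id>_<port id>_` prefix and stops after one row would return the most recent value for that sensor.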
4.3 Uploading data to HBase from the Raspberry Pi
The Raspberry Pi inserts the values directly into the HBase table. Every second it looks up the latest timestamp for each sensor in the HBase table and inserts every value from the SQLite table on the Pi that has a higher timestamp. If for any reason the internet connection between the Raspberry Pi and the cloud server breaks, all recorded values on the Raspberry Pi will automatically be inserted when the connection is reestablished. Only the values that have not yet been removed by the ‘delete’ thread on the Raspberry Pi will be uploaded. Currently the connection can be down for 2 minutes without data loss.
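The "everything newer than the latest uploaded timestamp" step can be sketched as follows. This is only an illustration: the table name `sensor_values`, its columns, and the function `values_to_upload` are hypothetical, and an in-memory SQLite database stands in for the buffer on the Pi.

```python
import sqlite3


def values_to_upload(conn, sensor_id, last_uploaded_ts):
    """Return local rows for one sensor that are newer than the cloud copy."""
    cur = conn.execute(
        "SELECT ts, value FROM sensor_values "
        "WHERE sensor_id = ? AND ts > ? ORDER BY ts",
        (sensor_id, last_uploaded_ts),
    )
    return cur.fetchall()


# In-memory stand-in for the SQLite table on the Raspberry Pi.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_values (sensor_id TEXT, ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO sensor_values VALUES (?, ?, ?)",
    [
        ("rx000A01", 100, 1.0),
        ("rx000A01", 101, 1.5),
        ("rx000A01", 102, 2.0),
        ("rx000A02", 102, 9.0),
    ],
)

# Suppose HBase already holds everything up to ts = 100 for this sensor.
pending = values_to_upload(conn, "rx000A01", 100)
print(pending)  # [(101, 1.5), (102, 2.0)]
```

Since the query only depends on the last timestamp stored in HBase, a dropped connection needs no special handling: the next successful run simply picks up everything that is still in the local table.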
The connection type is a ‘Thrift’ connection. Thrift has the advantage of being a binary protocol, which reduces overhead compared to, for example, REST, and there is a nice Python wrapper for HBase Thrift. Values are sent in batches with a maximum of 5,000 values per batch to limit memory overhead on the Raspberry Pi. For security the Thrift connection runs through an SSH tunnel. Autossh sets up the connection between the Raspberry Pi and the cloud server. The Python script runs as a daemon, following the excellent instructions by Stephen Phillips http://blog.scphillips.com/posts/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/ .
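The batching could look roughly like the sketch below. It assumes the Python wrapper is happybase (the post does not name it) and that `table` behaves like a happybase `Table`, whose `batch()` returns a context manager with a `put(row, data)` method. The helper names `chunked` and `upload`, the column `v:value`, and the table name `sensor` are hypothetical.

```python
from itertools import islice

MAX_BATCH = 5000  # batch limit mentioned in the text


def chunked(items, size=MAX_BATCH):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def upload(table, rows, column=b"v:value"):
    """Send (row_key, value) pairs to HBase, one batch per chunk.

    Assumption: `table.batch()` returns a context manager that queues
    `put(row, data)` calls and sends them when the `with` block exits.
    """
    sent = 0
    for chunk in chunked(rows):
        with table.batch() as batch:
            for row_key, value in chunk:
                batch.put(row_key.encode(), {column: str(value).encode()})
        sent += len(chunk)
    return sent


# Hypothetical usage through the local end of the SSH tunnel
# (9090 is the default HBase Thrift port):
#
#   import happybase
#   connection = happybase.Connection("localhost", 9090)
#   upload(connection.table("sensor"), rows)
```

Holding at most one chunk of 5,000 values in memory at a time is what keeps the footprint on the Pi small, even after a longer outage.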
autossh: ssh tunnel for thrift
4.4 Messaging alternative
An alternative, of course, would have been to use a messaging system. Messaging systems, however, are much harder to develop and particularly more difficult to debug. At least for me they are.
4.5 Installing HBase on Amazon server
The Amazon server I use is an m1.large instance running Ubuntu. This instance has 8 gigabytes of memory, which is required to run HBase, and 800 gigabytes of storage. This storage makes it possible to save 10 years of raw data (compressed). I use spot pricing to reduce costs and so far I have not paid more than 20 USD per month for this server. This does, of course, show that this is mainly a research project. To save money on your energy bill you would at least have to make up for the monthly server fee.
In my installation of HBase the data is stored on the Hadoop file system HDFS, which is very common. For the installation of Hadoop I followed the excellent tutorial of Aravindu Sandela for installing Hadoop 2.0 on Ubuntu http://bigdatahandler.com/hadoop-hdfs/installing-single-node-hadoop-2-2-0-on-ubuntu/ . For the installation of HBase I followed the tutorial on the Apache HBase website for running HBase in pseudo-distributed mode. Of course both HBase and Hadoop are designed to run on a cluster of servers, so running them on a single machine is a bit useless, but for this project it works very well.
4.6 LZO compression in HBase
If I stored the raw sensor values at the intended collection rate of 1,000 values per second, the 800 gigabytes would only have given me one year of sensor values. This is more than sufficient for this project. The OpenTSDB project, however, recommends using LZO compression on the HBase tables. In addition, I use the FAST_DIFF data_block_encoding option. This HBase feature only stores the part of the row key that differs from the previous one. Because my row key design carries a lot of redundant information (both device id and sensor id), this greatly reduces the amount of required storage. Together, both measures reduced the storage by 90%, to 10% of the originally required storage. This makes it possible to save 10 years of raw sensor information on a single server.
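To illustrate why storing only the differing part of each key helps with this key design, here is a toy prefix encoder in Python. It is not HBase's actual FAST_DIFF on-disk format, only a sketch of the idea; the function names are hypothetical and the keys follow the row key layout from section 4.2.

```python
import os


def prefix_encode(sorted_keys):
    """Replace each key by (shared_prefix_length, remaining_suffix)."""
    encoded, previous = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the full keys from the (shared_length, suffix) pairs."""
    keys, previous = [], ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = [
    "40b5af01_rx000A01_8571346789_9999999",
    "40b5af01_rx000A01_8571346790_9184822",
    "40b5af01_rx000A01_8571346790_9184823",
]
encoded = prefix_encode(keys)

raw_chars = sum(len(k) for k in keys)
stored_chars = sum(len(suffix) for _, suffix in encoded)
print(encoded)
print(raw_chars, stored_chars)
```

Neighbouring keys of the same sensor share the device id, the port id, and most of the reversed timestamp, so only a few trailing characters per row need to be stored.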