We deal with fairly large amounts of webserver log data, so we also save HDFS space and job I/O by using the hadoop-lzo package. It provides fast compression while retaining our ability to query the data through Hive.
If you are only interested in compression, and have Hadoop and Hive configured appropriately, you can even mix compressed and uncompressed data in separate partitions of a Hive table. A normal table definition will work:
CREATE EXTERNAL TABLE foo (
    columnA string,
    columnB string
)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION '/path/to/hive/tables/foo';
One big advantage of LZO, though, is its ability to be split in map/reduce jobs. This is done by creating an index of the LZO file with the LzoIndexer tool of the hadoop-lzo project. To actually use the index, you will need to use a special input format for your Hive table:
CREATE EXTERNAL TABLE foo (
    columnA string,
    columnB string
)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS
    INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/hive/tables/foo';
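Building the index itself is a separate step run against the LZO files in HDFS. A sketch of the invocation — the jar location and the partition path are illustrative, not universal:

hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /path/to/hive/tables/foo/date=2011-01-01

This writes a .index file alongside each .lzo file. For large numbers of files, the project also ships a DistributedLzoIndexer that does the same work as a map/reduce job.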
Now to actually come to the point. In my case, I had already created the table, and was trying to add indexing after the fact. Hive permits changing input format with an alter statement:
ALTER TABLE foo
SET FILEFORMAT
    INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
But this alters only future partitions, not existing ones, which retain their TextInputFormat. So when I ran my Hive queries, instead of the LZO index file being used for splitting the input, it wound up being read as table data. My results were mostly correct, but some result rows were garbage.
I fixed this by dropping and recreating the table and partitions with the correct input format. Because I use EXTERNAL tables, the data itself was preserved.
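The fix can be sketched as the following sequence (the partition value is illustrative); because the table is EXTERNAL, the DROP leaves the underlying files untouched:

DROP TABLE foo;

CREATE EXTERNAL TABLE foo (
    columnA string,
    columnB string
)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS
    INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/hive/tables/foo';

ALTER TABLE foo ADD PARTITION (date='2011-01-01') LOCATION '/path/to/hive/tables/foo/date=2011-01-01';

Each existing partition has to be re-added with its own ALTER TABLE ... ADD PARTITION statement.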
This is not a big deal, but I have lost the ability to mix compressed and uncompressed data in the table. The Hive language manual claims I can alter partition metadata, which would be another way to handle this, but so far I have not been able to make that work in versions 0.5 and 0.6.
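For reference, the Hive DDL documentation describes a per-partition form of the SET FILEFORMAT statement, which is the metadata change I was after. A sketch, with an illustrative partition value — as noted above, I could not get this to work on 0.5 or 0.6, so treat it as something to try on newer releases:

ALTER TABLE foo PARTITION (date='2011-01-01')
SET FILEFORMAT
    INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";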
Thanks to Dmitriy and Johan from Twitter for helping me understand all this.
The original hadoop-gpl-compression project:
Discussion of Hive and table attributes: