Archive for the ‘lijit’ Category

Hive tables, partitions and LZO compression

Thursday, February 24th, 2011

At Lijit we’ve been working with lots of the projects in the Hadoop ecosystem.  In particular, we’re using Hive quite a bit, since it abstracts map/reduce into a familiar SQL-like language.

We deal with fairly large amounts of webserver log data, so we are also saving HDFS space and job I/O by using the hadoop-lzo package. It gives us fast compression while still letting us query the data through Hive.

If you are only interested in compression, and have Hadoop and Hive configured appropriately, you can even mix compressed and uncompressed data in separate partitions of a Hive table.  A normal table definition will work:

CREATE EXTERNAL TABLE foo (
                       columnA string,
                       columnB string )
       PARTITIONED BY (date string)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
       LOCATION '/path/to/hive/tables/foo';
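
As a rough sketch of the mixed layout, each partition can be registered with its own location, where one location holds LZO-compressed files and another holds plain text. The partition values and directory names below are made up for illustration, and this assumes the LZO codec is already registered in your Hadoop configuration so that .lzo files are decompressed by file extension:

```sql
-- Hypothetical partition whose directory holds plain tab-delimited text files.
ALTER TABLE foo ADD PARTITION (date = '2011-02-22')
    LOCATION '/path/to/hive/tables/foo/2011-02-22';

-- Hypothetical partition whose directory holds .lzo files.
ALTER TABLE foo ADD PARTITION (date = '2011-02-23')
    LOCATION '/path/to/hive/tables/foo/2011-02-23';
```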

One big advantage of LZO, though, is its ability to be split in map/reduce jobs. This is done by creating an index of the LZO file with the LzoIndexer tool of the hadoop-lzo project. To actually use the index, you will need to use a special input format for your Hive table:

CREATE EXTERNAL TABLE foo (
         columnA string,
         columnB string )
    PARTITIONED BY (date string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    LOCATION '/path/to/hive/tables/foo';

Now to actually come to the point. In my case, I had already created the table, and was trying to add indexing after the fact. Hive permits changing input format with an alter statement:

ALTER TABLE foo
    SET FILEFORMAT
        INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

But this affects only partitions created afterward; existing partitions retain their TextInputFormat. So when I ran my Hive queries, instead of the LZO index file being used to split the input, it wound up being read as table data. My results were mostly correct, but some result rows were garbage.
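
One way to check which input format a given partition actually carries is to describe that partition rather than the table. This is only a sketch: the partition value is hypothetical, and the layout of the output varies between Hive versions, but the extended output should include the partition's storage descriptor with its input format:

```sql
-- Table-level metadata: reflects the ALTER TABLE above.
DESCRIBE EXTENDED foo;

-- Partition-level metadata for a hypothetical existing partition:
-- look for the input format recorded in its storage descriptor.
DESCRIBE EXTENDED foo PARTITION (date = '2011-02-23');
```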

I fixed this by dropping and recreating the table and partitions with the correct input format. Because I use EXTERNAL tables, the data itself was preserved.
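
In outline, the fix looked something like the following. Treat it as a sketch rather than a transcript of what I ran: the partition value and its directory are placeholders, and the ADD PARTITION statement has to be repeated for every partition the table had.

```sql
-- Dropping an EXTERNAL table removes only the metastore entry;
-- the files under LOCATION are left in place.
DROP TABLE foo;

-- Recreate the table with the LZO input format from the start.
CREATE EXTERNAL TABLE foo (
         columnA string,
         columnB string )
    PARTITIONED BY (date string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    LOCATION '/path/to/hive/tables/foo';

-- Re-register each partition (hypothetical value and path shown);
-- new partitions inherit the table's input format.
ALTER TABLE foo ADD PARTITION (date = '2011-02-23')
    LOCATION '/path/to/hive/tables/foo/2011-02-23';
```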

While this is not a big deal, I have lost the ability to mix compressed and uncompressed data in the table. The Hive language manual claims I can alter partition metadata, which would be another way to deal with this, but so far I’ve not been able to make that work in versions 0.5 and 0.6.
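
For reference, the partition-level form described in the language manual looks roughly like this. The partition value is hypothetical, and as noted above I could not get this to take effect in 0.5 or 0.6, so treat it as the documented syntax and not as a tested recipe; other Hive releases may accept a slightly different grammar here.

```sql
-- Documented form: change the file format of a single existing partition.
ALTER TABLE foo PARTITION (date = '2011-02-23')
    SET FILEFORMAT
        INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
```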

Thanks to Dmitriy and Johan from Twitter for helping me understand all this.

hadoop-lzo:

https://github.com/kevinweil/hadoop-lzo

The original hadoop-gpl-compression project:
http://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1

Hive language manual:

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL

Discussion of Hive and table attributes:
https://issues.apache.org/jira/browse/HIVE-957

Moab

Tuesday, December 18th, 2007

The posting binge continues. Lots of people search me for “moab”, so I thought I’d meet my readers’ expectations and write a Moab post, complete with link to a flickr set.

Of course, when speaking of MTB or Moab (or crafting a post for search results), I also need to be sure to link to my buddy Brian’s site SingleTrackRides.

You mean somebody might actually read this?

Monday, December 17th, 2007

I noticed today that Todd has added me to his blogroll. And somebody’s been searching me on physical therapy. That must be my physical therapist Larry Meyer up at BCSM. (I’m recovering from a double-osteotomy to correct a valgus malalignment. I have way better pictures than that Wikipedia article.)

So my reaction? Uh oh; I guess I’d better do something more here than just post ridiculous YouTube videos and random text.

The Lijit Cocktail

Tuesday, August 21st, 2007

One of the bloggers over at drinkoftheweek.com who uses our service dreamed up a cocktail in our honor. We had to give him an iPhone, though:
http://www.drinkoftheweek.com/blog/this-weeks-drink-lijit-cocktail/

We tried ‘em out yesterday. Just a little bit too sweet for my taste, but it’s hard to argue with a drink named after you.

We’d like to call it the Lijito, but we’re not sure if that’s something derogatory in Portuguese or something, so we might just wind up calling it the Lijitini.

Random text post

Friday, April 20th, 2007

Here’s some random text for The Google to index:

I thought, “sldfkjdlkjd wwerrrjrjs ttlljjjgllkkttj sldfe sdfw?” But naturally, fjwjejrb jboegj and wllejrbbbjelti3! So instead I wnne85gnns dlfkjwrng yot!

Shortly thereafter, wnern6xioa hgqqp38ng alasjtq8p6 sgklgfnqp lanffsan ouq dlfn sda gqpq dnfgga. But still, there is the matter of akl e7a z fngfn lqn fqi ore.

“IuthlzJKSf hnaa jasdj tjla5!” I said. “A hafaljsda nfqo ietyalf hld sh q5a.”

No, that will never work. Aj awfahqeol q89 ahlfqp3984 bdfalkf.