a_brit on "RE: Still room for growth in U.S. home electronics"
Kettle, Toaster, Microwave, TV, Washing Machine, Tumble Dryer, PC, DVD Player, Stereo
Okay, those are 9 of the most common I imagine; there are still another 11 before getting to the average household.
Publication date: 2008-09-09
Long Term Data Archiving Formats, Storage, and Architecture
I'm thinking about ways to store archival data for the long term
and wanted to solicit anyone who's been down this road for some
input, advice, warnings, etc.
Background
Essentially I'm dealing with a system where "live" and "recently
live" data is stored in a set of replicated MySQL servers and queried
in real-time. As time goes on, however, older "expired" data is
moved to a smaller set of replicated "archive" servers that also
happen to be MySQL.
This is problematic for a few reasons, but rather than be all
negative, I'll express what I'm trying to do in the form of some
goals.
Goals
There are a few high-level things I'd like this archive to
handle based on current and future needs:
Be able to store data for the foreseeable future. That means
hundreds of millions of records and, ultimately, billions.
Fast access to a small set of records. In other words, I like
having MySQL and indexes that'll get me what I want in a hurry
without having to write a lot of code. The archive needs to be
able to handle real-time queries quickly. It does this today and
needs to continue to work.
Future-proof file/data format(s). One problem with simply using
MySQL is that there will be schema changes over time. A column may
be added or dropped or renamed. That change can't easily be
implemented retroactively on a larger data set in a big table or
whatnot. But if you don't, then the code needs to be willing to deal
with those changes, NULLs appearing, etc.
Fault tolerance. In other words, the data has to live in more
than one place.
Support for large scans on the data. This can be to do full-text
style searches, looking for patterns that can't easily be indexed,
computing statistics, etc.
It's worth noting that data is added to the archive on a constant
basis and it is queried regularly in a variety of ways. But there
are no deletes or updates occurring. It's a write-heavy system most
of the time.
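To make the schema-drift goal concrete, here is a minimal sketch of read-side code that tolerates added, dropped, and renamed columns. The field names (`id`, `title`, `body`, `source`) and the `headline` to `title` rename are hypothetical, chosen only for illustration; they are not the real schema.

```python
# Hypothetical current schema: every record is normalized to these fields.
CURRENT_FIELDS = {"id": None, "title": None, "body": None, "source": None}

# Hypothetical rename history: old column name -> current column name.
RENAMES = {"headline": "title"}


def normalize(record):
    """Map an archived record (a dict) onto the current field set.

    Columns added after the record was written come back as None,
    renamed columns are mapped to their current name, and columns
    that have since been dropped are ignored.
    """
    out = dict(CURRENT_FIELDS)
    for key, value in record.items():
        key = RENAMES.get(key, key)
        if key in out:
            out[key] = value
    return out
```

With something like this at the read boundary, old records never need to be rewritten when the schema changes; only the rename map and the default values are maintained.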
Pieces of a Possible Solution
I'm not sure that a single tool or piece of infrastructure will
ever solve all the needs, but I'm thinking there may be several open
source solutions that can be put to use.
You'll notice that this involves duplicating data, but storage is
fairly cheap. And each piece is especially good at solving one or
more of our needs.
MySQL. I still believe there's a need for having a copy of the
data either denormalized or in a star schema in a set of replicated
MySQL instances using MyISAM. The transactional overhead of InnoDB
isn't really needed here. To keep things manageable one might create
tables per month or quarter or year. Down the road maybe Drizzle
makes sense?
Sphinx. I've been experimenting with Sphinx for indexing large
amounts of textual data (with some numeric attributes) and it works
remarkably well. This would be very useful instead of building MySQL
full-text indexes or doing painful LIKE queries.
Hadoop/HDFS and Flat Files or a simple record structure. To
facilitate fast batch processing of large chunks of data, it'd be
nice to have everything stored in HDFS as part of a small Hadoop
cluster where one can use Hadoop Streaming to run jobs over the
entire data set. But what's a good future-proof file format that's
efficient? We could use something like XML (duh), JSON, or even
Protocol Buffers. And it may make sense to compress the data with
gzip too. Maybe put a month's worth of data per file and compress?
Even Pig could be handy down the road.
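The "month per file, gzip-compressed" idea could look roughly like the sketch below, which writes one JSON object per line. It is an illustration only: the `archive-YYYY-MM.jsonl.gz` naming and the function names are assumptions, and it targets a local directory rather than HDFS, where files would more likely be written once and then copied in.

```python
import gzip
import json
import os


def month_path(root, year, month):
    """Build the (hypothetical) file name for one month of archive data."""
    return os.path.join(root, f"archive-{year:04d}-{month:02d}.jsonl.gz")


def append_records(root, year, month, records):
    """Append records to the month's file, one JSON object per line.

    Each call adds a new gzip member to the file; the gzip module
    reads concatenated members back as one continuous stream.
    """
    path = month_path(root, year, month)
    with gzip.open(path, "at", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, sort_keys=True) + "\n")
    return path


def scan(path):
    """Yield every record in a month file, in the order written."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)
```

Line-delimited JSON keeps each record self-describing, so a field added later simply appears in newer lines without touching older files.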
While it involves some data duplication, I believe these pieces
could do a good job of handling a wide variety of use cases:
real-time simple queries, full-text searching, and more intense
searching or statistical processing that can't be pre-indexed.
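For the statistical-processing case, Hadoop Streaming runs any program that reads lines on stdin and writes tab-separated key/value lines on stdout. Below is a minimal sketch of a mapper and reducer in that style, assuming records are stored as one JSON object per line and that each record has a `source` field; both of those are assumptions for the example.

```python
import json
import sys
from itertools import groupby


def mapper(lines):
    """Emit "<source>\t1" for every JSON record; skip blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # "source" is a hypothetical field; records without it are
        # counted under "unknown".
        yield f"{rec.get('source', 'unknown')}\t1"


def reducer(lines):
    """Sum the counts per key. Input must already be sorted by key,
    which is what the streaming framework provides to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Run as "script.py map" or "script.py reduce" with data on stdin.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same two functions can be tried locally by piping a file through the map step, `sort`, and the reduce step before running anything on a cluster.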
So what else is there to consider here? Other tools or
considerations when dealing with a growing archive of data whose
structure may grow and change over time?
I'm mostly keeping discussion of storage hardware out of this,
since it's not the piece I really deal with (a big disk is a big
disk for most of my purposes), but if you have thoughts on that,
feel free to say so. So far I'm only thinking 64-bit Linux boxes
with RAID for MySQL and non-RAID for HDFS and Sphinx.
Publication date: 2008-09-09
a_brit on "RE: Google to digitize old newspapers"
Will this be attempted internationally or just USA?
Publication date: 2008-09-09