This repository has been archived by the owner on Sep 18, 2021. It is now read-only.

Journal files not erased as they roll over #124

Open

kperi opened this issue Jul 3, 2013 · 4 comments

Comments


kperi commented Jul 3, 2013

Quoting @andrewclegg and @robey from this post: #16

"We have queues where there are hundreds of gigabytes of old journal files for queues that currently have items=0. Obviously this takes up a lot of disk space, but it also means Kestrel takes ages to start up, as it replays all those old operations.

How can we force Kestrel to compact them?"


robey commented Jul 9, 2013

Sorry for the delay -- I was out of town for a while.

There are two scripts built with kestrel that should help here:

$ ./dist/kestrel/scripts/qdump.sh --help

That will dump out a (sort of) human-readable list of the operations in a journal file, without the associated data. It might be useful for figuring out what's messed up, although if the journal is "hundreds of gigabytes", it might be too time-consuming to matter.

$ ./dist/kestrel/scripts/qpack.sh --help

This takes a set of journal files and packs them into their minimal operations (usually one ADD for each item that's currently in the queue). You should take the bad server out of rotation, run this script on its journal files, and then start it back up again with only the new journal file. (You can delete the old ones after it's finished packing the new one.) It might be helpful to "qdump" the new file before starting, in case there's something wonky.
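As a rough sketch of that recovery pass (the journal directory /var/spool/kestrel and the queue name my_queue are hypothetical here, and the argument order for qpack.sh is an assumption; verify it against --help before running):

$ # with kestrel stopped on the bad server:
$ ./dist/kestrel/scripts/qpack.sh --help   # confirm the real usage first
$ ./dist/kestrel/scripts/qpack.sh /var/spool/kestrel/my_queue.* /var/spool/kestrel/my_queue.packed
$ # sanity-check the packed file before restarting, in case something's wonky:
$ ./dist/kestrel/scripts/qdump.sh /var/spool/kestrel/my_queue.packed | tail
$ # once it looks sane, delete the old journals and restart with only the new file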


kperi commented Jul 15, 2013

Hi,
Thanks for the reply.

I've tried qdump on some of the files and I am getting something like:

$ qdump.sh my_queue.1373865238490
00000000 ADD 2601
00000a3e ADD 2641
000014a4 REM
000014a5 REM
000014a6 ADD 2597
00001ee0 ADD 2616
0000292d ADD 2437
000032c7 REM
000032c8 ADD 2572
00003ce9 ADD 2624
...............
0199cf REM
000199d0 REM
000199d1 ADD 2649
0001a43f ADD 2649
0001aead ADD 2631
0001b909 REM
0001b90a REM
0001b90b REM
0001b90c REM
Exception in thread "main" java.util.NoSuchElementException: queue empty
        at scala.collection.mutable.Queue.dequeue(Queue.scala:65)
        at net.lag.kestrel.tools.QueueDumper.dumpItem(QDumper.scala:102)
        at net.lag.kestrel.tools.QueueDumper$$anonfun$apply$2.apply(QDumper.scala:47)
        at net.lag.kestrel.tools.QueueDumper$$anonfun$apply$2.apply(QDumper.scala:45)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:451)
        at net.lag.kestrel.tools.QueueDumper.apply(QDumper.scala:45)
        at net.lag.kestrel.tools.QDumper$$anonfun$main$1.apply(QDumper.scala:201)
        at net.lag.kestrel.tools.QDumper$$anonfun$main$1.apply(QDumper.scala:199)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
        at scala.collection.immutable.List.foreach(List.scala:76)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:30)
        at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:44)
        at net.lag.kestrel.tools.QDumper$.main(QDumper.scala:199)
        at net.lag.kestrel.tools.QDumper.main(QDumper.scala)

Other files in the same queue don't seem to have a problem with qdump, i.e. I get a full list of ops and qdump exits without exceptions.
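A quick way to separate the files that dump cleanly from the ones that throw (paths and queue name here are hypothetical, and this assumes the script propagates the JVM's exit status) is to check qdump's exit status in a loop:

$ for f in /var/spool/kestrel/my_queue.*; do
>   ./dist/kestrel/scripts/qdump.sh "$f" > /dev/null 2>&1 || echo "broken: $f"
> done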


Additionally, the overall queue + journal size was reduced from 154G to 500M just after I started pushing new data to the queue.

Thanks,
Kostas

@andrewclegg

We've just had the same situation occur that @kperi described above.

I'm trying to rescue the situation without losing any data, but when I try to run qpack.sh on the out-of-control journal files, I get errors like these:

Packing journals...
Packing: 2.2M   2.2M
Exception in thread "main" net.lag.kestrel.BrokenItemException: java.io.IOException: Unexpected EOF
        at net.lag.kestrel.Journal.readJournalEntry(Journal.scala:414)
        at net.lag.kestrel.Journal.next$1(Journal.scala:422)
        at net.lag.kestrel.Journal$$anonfun$next$1$1.apply(Journal.scala:427)
        at net.lag.kestrel.Journal$$anonfun$next$1$1.apply(Journal.scala:427)
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1060)
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1052)
        at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:952)
        at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:952)
        at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:941)
        at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:946)
        at scala.collection.TraversableOnce$FlattenOps$$anon$1.hasNext(TraversableOnce.scala:391)
        at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:451)
        at net.lag.kestrel.JournalPacker.apply(JournalPacker.scala:73)
        at net.lag.kestrel.tools.QPacker$.main(QPacker.scala:65)
        at net.lag.kestrel.tools.QPacker.main(QPacker.scala)
Caused by: java.io.IOException: Unexpected EOF
        at net.lag.kestrel.Journal.readBlock(Journal.scala:443)
        at net.lag.kestrel.Journal.readJournalEntry(Journal.scala:382)
        ... 16 more

This suggests one possible reason for the journals not getting packed automatically: the journal packer thread is failing on corrupt journal files because of this exception.

I couldn't find any direct evidence of this in the log files, but then, the disks were full due to this issue, so maybe the logger couldn't write the appropriate stacktrace. A bit speculative, but it's all I've got so far...
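One way to test that theory once the disk has free space again (the log path is an assumption; it depends on your deployment) would be to grep the kestrel log for the packer's exception:

$ grep -B1 'BrokenItemException' /var/log/kestrel/kestrel.log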

@andrewclegg

Note: there were a lot of timestamped journal files with size 0, but I removed all of these before trying to run qpack.sh.
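For anyone else in the same spot, the zero-byte journals can be listed and removed with find (directory and queue name are hypothetical):

$ find /var/spool/kestrel -name 'my_queue.*' -empty -print     # list the empty journal files
$ find /var/spool/kestrel -name 'my_queue.*' -empty -delete    # then remove them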
