Thursday, December 19, 2013

btrfs on Linux 3.12.5

I got tired of ZFS, so I tried installing btrfs. After installing Linux 3.12.5, all I did was run:

cd ~/btrfs-progs/; make; make install

cd ~/linux-btrfs/fs/btrfs/; make -C ~/linux-3.12.5/ M=${PWD} modules &&
    make -C ~/linux-3.12.5/ M=${PWD} modules_install


That was it. Oh, one thing: the definition of truncate_pagecache() had changed (presumably the oldsize argument that was dropped around 3.12), so the build failed at first and I patched my copy locally. The mainline btrfs appears to use the new form already.
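For reference, a quick smoke test after the install - this assumes the machine is already running the 3.12.5 kernel, that no btrfs filesystem is currently mounted (so the module can be reloaded), and that btrfs-progs installed into the default path:

depmod -a              # pick up the freshly installed btrfs.ko
rmmod btrfs 2>/dev/null
modprobe btrfs
dmesg | tail -n 3      # the Btrfs banner should show up here
btrfs --version        # userland tools from btrfs-progs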






Saturday, December 14, 2013

Creating Raw Device Mappings for a ZFS array

I created RDMs so that the newly purchased hard disks can be used from OpenIndiana on ESXi as a ZFS RAID-Z array.


First, log in to ESXi over SSH and check the device file names.

# ls -1 /dev/disks/
(snip)
t10.ATA_____WDC_WD40EFRX2D68WT0N0_________________________SERIAL001
t10.ATA_____WDC_WD40EFRX2D68WT0N0_________________________SERIAL002
t10.ATA_____WDC_WD40EFRX2D68WT0N0_________________________SERIAL003
t10.ATA_____WDC_WD40EFRX2D68WT0N0_________________________SERIAL004
t10.ATA_____WDC_WD40EFRX2D68WT0N0_________________________SERIAL005
(snip)


I wrote a small shell script (hardly worth calling it a script) to create the RDM VMDKs, and ran it.

#!/bin/sh
DEV_PREFIX=/vmfs/devices/disks/t10.ATA_____WDC_WD40EFRX2D68WT0N0_________________________
VMFS_DIR=/vmfs/volumes/160gb/

vmkfstools -z ${DEV_PREFIX}SERIAL001 ${VMFS_DIR}/array3a_SERIAL001.vmdk
vmkfstools -z ${DEV_PREFIX}SERIAL002 ${VMFS_DIR}/array3b_SERIAL002.vmdk
vmkfstools -z ${DEV_PREFIX}SERIAL003 ${VMFS_DIR}/array3c_SERIAL003.vmdk
vmkfstools -z ${DEV_PREFIX}SERIAL004 ${VMFS_DIR}/array3d_SERIAL004.vmdk
vmkfstools -z ${DEV_PREFIX}SERIAL005 ${VMFS_DIR}/array3e_SERIAL005.vmdk


Incidentally, the previous-generation array used RDM file names like array2_SERIAL00X.vmdk, but that naming is not great when you want to tell multiple hard disks apart in the VM configuration (in list views the latter half of the name gets truncated, so the disks become indistinguishable), so this time I went with array3[abcde]_SERIAL00X.vmdk.
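To double-check that each descriptor points at the disk you expect, vmkfstools can query an RDM mapping; a quick loop over the files just created (same paths as in the script above):

VMFS_DIR=/vmfs/volumes/160gb

# print which physical device each RDM descriptor maps to
for f in ${VMFS_DIR}/array3?_SERIAL00?.vmdk; do
    echo "== ${f}"
    vmkfstools -q "${f}"
done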


Creating VMFS5 from an SSH session on ESXi

For some reason my local vSphere Client throws an error when I try to create a datastore on this ESXi host, while the vSphere Client on another machine works fine, so the client installation is probably broken somehow. Reinstalling it didn't fix the problem, so I logged in to ESXi and did the work there directly.




# fdisk /dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXX

***
*** The fdisk command is deprecated: fdisk does not handle GPT partitions.  Please use partedUtil
***


The number of cylinders for this disk is set to 19457.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXX: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

                                                                            
Device Boot      Start         End      Blocks  Id System
/dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXXp1       1        3891    31249408  83 Linux
/dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXXp2    3891        4377     3906560  82 Linux swap
/dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXXp3    4377       19457   121131360+ 8e Linux LVM

Command (m for help): d
Partition number (1-4): 1

Command (m for help): d
Partition number (1-4): 2

Command (m for help): d
Selected partition 3

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): Value is out of range
Partition number (1-4): 1
First cylinder (1-19457, default 1): Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-19457, default 19457): Using default value 19457

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fb
Changed system type of partition 1 to fb (VMFS)

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table
~ # vmkfstools --createfs vmfs5 --blocksize 1m /dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXX:1 --setfsname 160gb
create fs deviceName:'/dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXX:1', fsShortName:'vmfs5', fsName:'160gb'
deviceFullPath:/dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXX:1 deviceFile:t10.ATA_____ST3160815AS_________________________________________6RA9XXXX:1
VMFS5 file system creation is deprecated on a BIOS/MBR partition on device 't10.ATA_____ST3160815AS_________________________________________6RA9XXXX:1'
Checking if remote hosts are using this device as a valid file system. This may take a few seconds...
Creating vmfs5 file system on "t10.ATA_____ST3160815AS_________________________________________6RA9XXXX:1" with blockSize 1048576 and volume label "160gb".
Successfully created new volume: 52abd28e-24203c58-ed35-009c02995e4c
~ #


Done. It keeps telling me not to use MBR, but this is just a temporary working datastore, so it will do. Rescanned the storage from the vSphere Client and the datastore showed up.
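That said, both warnings point at the supported route: partedUtil and a GPT label. A rough sketch of the equivalent on the same device - START and END are placeholders for the usable sector range that partedUtil reports, and the long GUID is the VMFS partition type:

DISK=/dev/disks/t10.ATA_____ST3160815AS_________________________________________6RA9XXXX

# current partition table and the usable sector range
partedUtil getptbl ${DISK}
partedUtil getUsableSectors ${DISK}

# write a GPT label with a single VMFS partition covering START..END
partedUtil setptbl ${DISK} gpt "1 START END AA31E02A400F11DB9590000C2911D1B8 0"

# then create the datastore as before
vmkfstools --createfs vmfs5 --blocksize 1m --setfsname 160gb ${DISK}:1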


[Screenshot: Screen_shot_20131214_at_10518_pm]


Thursday, December 12, 2013

A Perl one-liner that eats memory but doesn't get deduplicated

I was in a situation where I just wanted to burn through memory, and found a Perl one-liner that gobbles up memory. ESXi might deduplicate that memory via transparent page sharing, though, so I rearranged it into a form that seems less likely to be deduplicated.

perl -e '$hn=`hostname`; while(1){$i++;$h{$i}=$i . " $hn" . "x"x4000 }'
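One way to run it and actually watch the memory disappear - nothing special assumed beyond vmstat being available in the guest:

# start the memory eater in the background and remember its PID
perl -e '$hn=`hostname`; while(1){$i++;$h{$i}=$i . " $hn" . "x"x4000 }' &
EATER_PID=$!

# watch free memory shrink, then clean up once you have seen enough
vmstat 5
kill ${EATER_PID}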



Tuesday, December 10, 2013

Resolving "out of memory" situation on ZFS import, without actual DRAM

This entry has several screenshots in Japanese, and you may not be able to read them - my apologies for the inconvenience.


You won't go to heaven if you have ever enabled ZFS's deduplication feature.


When dedup is set to "on", ZFS keeps the DDT (de-duplication table), which tracks the deduplicated data blocks, in the ARC and on the media. Reads and writes generally work great, but the tragedy comes when you try to delete a huge amount of deduplicated data: ZFS may stall if the system doesn't have enough DRAM to hold the DDT. Sometimes even deleting 1GB of files on a system with 8GB of DRAM can be challenging.
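For what it's worth, you can get a rough idea of how much the DDT has grown before it bites you; both commands below are standard ZFS tooling (pool name array2 as used later in this post):

# one-line summary: number of DDT entries and their on-disk / in-core size
zpool status -D array2

# detailed DDT histogram (can take a while on a big pool)
zdb -DD array2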


Even when ZFS has stopped due to out-of-memory, rebooting the system usually works for me: the pool gets imported, the partial transaction is completed, and the volume becomes available again. I have run into this kind of issue several dozen times(!), but I have never lost any data on ZFS, and I still believe ZFS is reliable.


And... it happened again....


The other day I ran "zfs destroy array2/backup" to free an unneeded filesystem holding 4GB of data, but it drove the system out of memory and stopped ZFS again. As I mentioned, rebooting should be the remedy - but not this time.


"Argh!
I will lose my data!"


Anyway I tried to recover from this situation...


■ Update OpenIndiana system

I had been running the OpenIndiana 151a7 kernel for a while, and I suspect this kernel has a ZFS issue that causes a deadlock when ZFS is starved for free memory. This is just a guess, but updating the kernel might help, so I updated the OpenIndiana packages with the pkg command.
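Roughly what that looks like on OpenIndiana - a sketch, since package versions and boot-environment names differ per system:

# refresh the catalogs and pull all updates into a new boot environment
pkg refresh
pkg image-update

# the update creates a new boot environment; check it, then reboot into it
beadm list
init 6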

■ Over-commit the vRAM using ESXi

My first ZFS-based system ran with less than 4GB of DRAM. These days I allocate 16 gigabytes of DRAM to the guest on ESXi, but that does not seem to be enough for this situation. So I decided to over-commit the DRAM as far as I could, because the system touches a lot of memory but I did not see much random access across the entire memory region.




Yes, I gave it 32 gigabytes of memory (the host actually has only 24GB). Fortunately, over-committing vRAM on ESXi is quite easy: the memory size can be set to 32GB even if the host only has 8GB or 16GB of physical memory. The guest's swap file sits on a Fusion-io ioDrive2 flash tier.

■ Imported successfully

I was fairly confident that more memory would resolve this, and the guess turned out to be right - the pool was finally imported successfully. According to ESXi, the OpenIndiana VM consumed 18.91GB of vRAM.



Anyway, once the ZFS pool has been imported successfully, it no longer needs that much vRAM. After restarting the guest it required only 1.9GB of vRAM.
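If you want to watch where the guest's memory actually goes while the import grinds through the DDT, the stock illumos tools are enough; a sketch, run as root inside the OpenIndiana guest:

# kernel memory breakdown, including the ZFS/ARC share
echo ::memstat | mdb -k

# current and target ARC size
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c

# confirm the pool finished importing
zpool status array2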



■ ZFS Dedupe? ... No...

Now I believe we should avoid ZFS's deduplication feature, or you will run into this kind of trouble when deleting data from the pool.



I have struggled with ZFS dedup several times now, and I am going to build the new ZFS pool with dedup=ABSOLUTELY off.
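For the record, dedup is already off by default on a new pool, but it is cheap to double-check and pin down; the pool name array3 here is just a guess based on the RDM naming earlier:

# dedup is a per-filesystem property; make sure nothing has it enabled
zfs get -r dedup array3

# setting it off only affects new writes; existing DDT entries remain
# until the deduplicated blocks themselves are freed
zfs set dedup=off array3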