Building a Lustre File System and Implementing MDT (Metadata Server) Redundancy

by 서진우 · February 21, 2008

Written: January 31, 2008
Author: 서진우

1. Installing Lustre

- Downloading Lustre

Download version 1.6 from http://www.sun.com/software/products/lustre/get.jsp. The packages used here are:

e2fsprogs-1.40.2.cfs5-0redhat.i386.rpm
e2fsprogs-devel-1.40.2.cfs5-0redhat.i386.rpm
kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2.i686.rpm
lustre-1.6.4.2-2.6.9_55.0.9.EL_lustre.1.6.4.2smp.i686.rpm
lustre-ldiskfs-3.0.4-2.6.9_55.0.9.EL_lustre.1.6.4.2smp.i686.rpm
lustre-modules-1.6.4.2-2.6.9_55.0.9.EL_lustre.1.6.4.2smp.i686.rpm

- Installing Lustre

The following packages must already be installed before you begin.

# rpm -q libxml2
# rpm -q python
# rpm -q PyXML

Install the Lustre packages.

# rpm -Uvh kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2.i686.rpm
# rpm -Uvh lustre-1.6.4.2-2.6.9_55.0.9.EL_lustre.1.6.4.2smp.i686.rpm
# rpm -Uvh lustre-modules-1.6.4.2-2.6.9_55.0.9.EL_lustre.1.6.4.2smp.i686.rpm
# rpm -Uvh lustre-ldiskfs-3.0.4-2.6.9_55.0.9.EL_lustre.1.6.4.2smp.i686.rpm
# rpm -Uvh e2fsprogs-1.40.2.cfs5-0redhat.i386.rpm e2fsprogs-devel-1.40.2.cfs5-0redhat.i386.rpm

Reboot.

Boot into the newly installed kernel-lustre-smp-2.6.9-55.0.9.EL_lustre kernel.
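
After the reboot it is worth confirming that the machine really came up on the Lustre kernel. The following is a minimal sketch, assuming the Lustre-patched kernel can be recognized by the string "lustre" in its release name (as in the package name above). The helper name is_lustre_kernel is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel release string
# looks like a Lustre-patched kernel (i.e. it contains "lustre").
is_lustre_kernel() {
    case "$1" in
        *lustre*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the kernel that is currently running.
if is_lustre_kernel "$(uname -r)"; then
    echo "running a Lustre kernel: $(uname -r)"
else
    echo "not a Lustre kernel: $(uname -r)"
fi
```

If the second message appears, check the default entry in the boot loader configuration before going further.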

2. Basic Lustre Configuration

A Lustre setup is divided broadly into MDS servers and OST servers. The hosts used for the Lustre file system in this article are listed below.

# vi /etc/hosts
----------------------------------------------------------------------
192.168.123.60 vnode00 # MDS, MGS host
192.168.123.61 vnode01 # OST host
192.168.123.62 vnode02 # OST host
192.168.123.63 vnode03 # OST host
----------------------------------------------------------------------

First, add the following to /etc/modprobe.conf on every server.

# vi /etc/modprobe.conf
----------------------------------------------------------------------
# Networking options, see /sys/module/lnet/parameters
options lnet networks=tcp
# alias lustre llite -- remove this line from existing modprobe.conf
# (the llite module has been renamed to lustre)
# end Lustre modules
----------------------------------------------------------------------

Then configure the MDS server. The general form is:

mkfs.lustre --fsname=<filesystem_name> --mdt --mgs <mds_device>

vnode00> mkfs.lustre --fsname=testfs --mdt --mgs /dev/hda1
vnode00> mkdir -p /mnt/test/mdt
vnode00> mount -t lustre /dev/hda1 /mnt/test/mdt
vnode00> cat /proc/fs/lustre/devices
----------------------------------------------------------------------
0 UP mgs MGS MGS 13
1 UP mgc MGC192.168.123.60@tcp 80af37db-5a85-d057-4750-8b6f2817c170 5
2 UP mdt MDS MDS_uuid 3
3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 11
----------------------------------------------------------------------

Apply the same configuration on every OST server, as shown below.

vnode01> mkfs.lustre --fsname=testfs --ost --mgsnode=vnode00@tcp0 /dev/hda1
vnode01> mkdir -p /mnt/test/ost0
vnode01> mount -t lustre /dev/hda1 /mnt/test/ost0
vnode01> mkdir /lustrefs
vnode01> mount -t lustre vnode00@tcp0:/testfs /lustrefs

After mounting on every OST server as above, check the result with df.

vnode01> df -Th
----------------------------------------------------------------------
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 ext3 14G 2.6G 11G 20% /
none tmpfs 125M 0 125M 0% /dev/shm
/dev/hda1 lustre 5.0G 234M 4.5G 5% /mnt/test/ost0
vnode00@tcp0:/testfs
lustre 20G 975M 18G 6% /lustrefs
----------------------------------------------------------------------

This completes the basic setup.

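3. Advanced Lustre Configuration

- Separating the MDT and MGS partitions

The MDT, which stores the metadata, and the MGS can be placed on separate partitions. When many
files are written at the same time, load on the metadata server can degrade overall performance.
Putting the MDT data and the MGS data on separate partitions spreads out the concentrated I/O.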

vnode00> mkfs.lustre --mgs /dev/hda1
vnode00> mkdir -p /mnt/test/mgs
vnode00> mount -t lustre /dev/hda1 /mnt/test/mgs
vnode00> mkfs.lustre --fsname=testfs --mdt --mgsnode=vnode00@tcp0 /dev/hda2
vnode00> mkdir -p /mnt/test/mdt
vnode00> mount -t lustre /dev/hda2 /mnt/test/mdt

- When the MGS node has multiple network interfaces

If the MGS node and the OST nodes are connected over two or more network channels, the OST server
can connect to the MGS server over both of them, as follows.

vnode01> mount -t lustre vnode00@tcp0,1@elan:/testfs /lustrefs

- Reformatting with mkfs.lustre

vnode00> mkfs.lustre --fsname=testfs --mdt --mgs --reformat /dev/hda1

- Adding an OST server

mkfs.lustre --fsname=testfs --ost --mgsnode=vnode00@tcp0 /dev/sdaX
mkdir -p /mnt/test/ostX
mount -t lustre /dev/sdaX /mnt/test/ostX

- Checking OST status

On the MDT server:

[root@vnode00 ~]# cat /proc/fs/lustre/lov/testfs-mdtlov/target_obd
0: testfs-OST0000_UUID ACTIVE
1: testfs-OST0001_UUID ACTIVE
2: testfs-OST0002_UUID ACTIVE
3: testfs-OST0003_UUID ACTIVE
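
With many OSTs it is easier to list only the ones that are not ACTIVE. The following is a rough sketch, assuming every line of target_obd has the "index: UUID state" layout shown above. The function name list_inactive_osts and the sample input are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read target_obd-style lines on stdin
# ("<index>: <uuid> <state>") and print the UUID of every target
# whose state is anything other than ACTIVE.
list_inactive_osts() {
    awk 'NF >= 3 && $3 != "ACTIVE" { print $2 }'
}

# Demo with made-up sample text. On a real MDS you would instead run:
#   list_inactive_osts < /proc/fs/lustre/lov/testfs-mdtlov/target_obd
list_inactive_osts <<'EOF'
0: testfs-OST0000_UUID ACTIVE
1: testfs-OST0001_UUID INACTIVE
2: testfs-OST0002_UUID ACTIVE
EOF
```

An empty result means every target listed was ACTIVE.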

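- Mounting automatically at boot

Run the following command on all servers (here via ensh) to list the mounted Lustre targets and their labels.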

[root@vnode00 ~]# ensh mount -l -t lustre
### executing in vnode00
vnode00 /dev/hda1 on /mnt/test/mdt type lustre (rw) [testfs-MDT0000]
### executing in vnode01
vnode01 /dev/hda1 on /mnt/test/ost0 type lustre (rw) [testfs-OST0000]
vnode01 vnode00@tcp0:/testfs on /lustrefs type lustre (rw)
### executing in vnode02
vnode02 /dev/hda1 on /mnt/test/ost0 type lustre (rw) [testfs-OST0001]
### executing in vnode03
vnode03 vnode00@tcp0:/testfs on /lustrefs type lustre (rw)
### executing in vnode04
vnode04 /dev/hda1 on /mnt/test/ost0 type lustre (rw) [testfs-OST0003]

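On each server, append an entry to /etc/fstab that uses the label shown in brackets above as the volume name, as follows.

vnode00> vi /etc/fstab
----------------------------------------------------------------------
.
.
LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev 1 0
----------------------------------------------------------------------

vnode01> vi /etc/fstab
----------------------------------------------------------------------
.
.
LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev 1 0
vnode00@tcp0:/testfs /lustrefs lustre defaults,_netdev 1 0
----------------------------------------------------------------------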
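
The fstab entries for the server targets can be derived mechanically from the mount listing. The following is a sketch only, assuming the "mount -l -t lustre" line format shown earlier (device, " on ", mount point, " type lustre ", then the label in square brackets). The function name fstab_lines_from_mount is hypothetical, and its output should be reviewed before it is appended to /etc/fstab.

```shell
#!/bin/sh
# Hypothetical helper: read "mount -l -t lustre" output on stdin and
# print an fstab line for each entry that carries a [label].
# Client mounts have no label and are skipped.
fstab_lines_from_mount() {
    sed -n 's/^.* on \([^ ]*\) type lustre .*\[\(.*\)]$/LABEL=\2 \1 lustre defaults,_netdev 1 0/p'
}

# Demo using two lines in the format shown above.
fstab_lines_from_mount <<'EOF'
/dev/hda1 on /mnt/test/ost0 type lustre (rw) [testfs-OST0000]
vnode00@tcp0:/testfs on /lustrefs type lustre (rw)
EOF
```

Only the first sample line produces output, because the client mount of vnode00@tcp0:/testfs has no label and must be added to fstab by hand.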

- Stopping a Lustre server

The Lustre services are started and stopped automatically on mount and umount.

# umount -f /mnt/test/ost0

- Mounting with an inactive OST

vnode01> mount -o exclude=testfs-OST0000 -t lustre vnode00:/testfs /lustrefs
vnode01> cat /proc/fs/lustre/lov/testfs-clilov-*/target_obd

- Removing an OST server

Run the following command on the MGS server.

vnode00> lctl conf_param testfs-OST0001.osc.active=0

- Restoring an OST server

vnode00> lctl conf_param testfs-OST0001.osc.active=1

- Removing the file system

To remove a Lustre file system and create a different file system on the device, use the following.

# mkfs.lustre --writeconf --reformat …

4. Configuring Lustre Failover

MDS : vnode00 <-> vnode00_1
OST : vnode01 <-> vnode01_1

vnode00> mkfs.lustre --fsname=testfs --mdt --mgs --failnode=vnode00_1@tcp0 /dev/hda1
vnode00> mount -t lustre /dev/hda1 /mnt/test/mdt
vnode01> mkfs.lustre --fsname=testfs --ost --failnode=vnode01_1 --mgsnode=vnode00@tcp0 --mgsnode=vnode00_1@tcp0 /dev/hda1
vnode01> mount -t lustre /dev/hda1 /mnt/test/ost0
client> mount -t lustre vnode00@tcp0:vnode00_1@tcp0:/testfs /mnt/testfs
vnode00> umount /mnt/test/mdt
vnode00_1> mount -t lustre /dev/hda1 /mnt/test/mdt
vnode00_1> cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status

Note that /dev/hda1 must either be on shared storage or be replicated with DRBD.

A heartbeat configuration is also needed so that mount and umount are performed automatically when a server dies.
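
In the client mount command above, the primary and failover MGS nodes are joined with ":" in front of the file system name. A small sketch of how that string is assembled follows. The function name build_mount_target is made up for this example, and it only builds the string without mounting anything.

```shell
#!/bin/sh
# Hypothetical helper: build_mount_target <fsname> <mgs_nid>...
# Joins the MGS NIDs with ":" and appends ":/<fsname>".
build_mount_target() {
    fsname=$1
    shift
    target=""
    for nid in "$@"; do
        target="${target}${nid}:"
    done
    printf '%s/%s\n' "$target" "$fsname"
}

# Primary MGS first, failover MGS second.
build_mount_target testfs vnode00@tcp0 vnode00_1@tcp0
# prints vnode00@tcp0:vnode00_1@tcp0:/testfs
```

The result can be passed to mount, for example: mount -t lustre "$(build_mount_target testfs vnode00@tcp0 vnode00_1@tcp0)" /mnt/testfs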

- Configuring Lustre failover with DRBD and Heartbeat

Host roles

vnode00 : MDT master
vnode01 : MDT slave

vnode02 : OST1 master
vnode03 : OST1 slave

vnode04 : OST2 master
vnode05 : OST2 slave

Install DRBD on all servers (the 0.7.x series).

vnode00 > vi /etc/drbd.conf
----------------------------------------------------------------------
#
# please have a a look at the example configuration file in
# /usr/share/doc/drbd/drbd.conf
#
skip {
As you can see, you can also comment chunks of text
with a 'skip[optional nonsense]{ skipped text }' section.
This comes in handy, if you just want to comment out
some 'resource <some name> {…}' section:
just precede it with 'skip'.

The basic format of option assignment is
<option name><linear whitespace><value>;

It should be obvious from the examples below,
but if you really care to know the details:

<option name> :=
valid options in the respective scope
<value> := <num>|<string>|<choice>|…
depending on the set of allowed values
for the respective option.
<num> := [0-9]+, sometimes with an optional suffix of K,M,G
<string> := (<name>|\"([^\"\\\n]*|\\.)*\")+
<name> := [/_.A-Za-z0-9-]+
}

resource drbd0 {
protocol C;
incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
startup {
degr-wfc-timeout 120;
}

disk {
on-io-error detach;
}

net {
# sndbuf-size 512k;
# timeout 60; # 6 seconds (unit = 0.1 seconds)
# connect-int 10; # 10 seconds (unit = 1 second)
# ping-int 10; # 10 seconds (unit = 1 second)
# max-buffers 2048;
# max-epoch-size 2048;
# ko-count 4;
# on-disconnect reconnect;

}
syncer {
rate 10M;
group 1;
al-extents 257;
}

on vnode00 {
device /dev/drbd0;
disk /dev/hda1;
address 192.168.123.60:7788;
meta-disk internal;
}

on vnode01 {
device /dev/drbd0;
disk /dev/hda1;
address 192.168.123.61:7788;
meta-disk internal;
}
}
----------------------------------------------------------------------

vnode00 > vi /etc/ha.d/haresources

vnode00 192.168.123.76 drbddisk Filesystem::/dev/drbd0::/lustremdt::lustre xfs

vnode02 /etc/drbd.conf

#
# please have a a look at the example configuration file in
# /usr/share/doc/drbd/drbd.conf
#
skip {
As you can see, you can also comment chunks of text
with a 'skip[optional nonsense]{ skipped text }' section.
This comes in handy, if you just want to comment out
some 'resource <some name> {…}' section:
just precede it with 'skip'.

The basic format of option assignment is
<option name><linear whitespace><value>;

It should be obvious from the examples below,
but if you really care to know the details:

<option name> :=
valid options in the respective scope
<value> := <num>|<string>|<choice>|…
depending on the set of allowed values
for the respective option.
<num> := [0-9]+, sometimes with an optional suffix of K,M,G
<string> := (<name>|\"([^\"\\\n]*|\\.)*\")+
<name> := [/_.A-Za-z0-9-]+
}

resource drbd0 {
protocol C;
incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
startup {
degr-wfc-timeout 120;
}

disk {
on-io-error detach;
}

net {
# sndbuf-size 512k;
# timeout 60; # 6 seconds (unit = 0.1 seconds)
# connect-int 10; # 10 seconds (unit = 1 second)
# ping-int 10; # 10 seconds (unit = 1 second)
# max-buffers 2048;
# max-epoch-size 2048;
# ko-count 4;
# on-disconnect reconnect;

}
syncer {
rate 10M;
group 1;
al-extents 257;
}

on vnode02 {
device /dev/drbd0;
disk /dev/hda1;
address 192.168.123.62:7788;
meta-disk internal;
}

on vnode03 {
device /dev/drbd0;
disk /dev/hda1;
address 192.168.123.63:7788;
meta-disk internal;
}
}

vnode02 /etc/ha.d/haresources
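----------------------------------------------------------------------
vnode02 192.168.123.77 drbddisk Filesystem::/dev/drbd0::/lustreost::lustre xfs
----------------------------------------------------------------------

vnode02 /etc/ha.d/ha.cf
----------------------------------------------------------------------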
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 5
hopfudge 1
udpport 1002
auto_failback off
udp eth0
node vnode02
node vnode03
----------------------------------------------------------------------

[root@vnode00 ~]# drbdadm primary all
[root@vnode00 ~]# mkfs.lustre --writeconf --reformat --fsname=testfs --mdt --mgs --failnode=vnode01@tcp0 /dev/drbd0
[root@vnode00 ~]# mount -t lustre /dev/drbd0 /lustremdt/

[root@vnode02 ~]# drbdadm primary all
[root@vnode02 ~]# mkfs.lustre --writeconf --reformat --fsname=testfs --ost --failnode=vnode03 --mgsnode=vnode00@tcp0 --mgsnode=vnode01@tcp0 /dev/drbd0
[root@vnode02 ~]# mount -t lustre /dev/drbd0 /lustreost

[root@vnode04 ~]# drbdadm primary all
[root@vnode04 ~]# mkfs.lustre --writeconf --reformat --fsname=testfs --ost --failnode=vnode05 --mgsnode=vnode00@tcp0 --mgsnode=vnode01@tcp0 /dev/drbd0
[root@vnode04 ~]# mount -t lustre /dev/drbd0 /lustreost

client > mount -t lustre vnode00@tcp0:vnode01@tcp0:/testfs /lustrefs
client > df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 14G 2.7G 11G 21% /
none 125M 0 125M 0% /dev/shm
vnode00@tcp0:vnode01@tcp0:/testfs
9.6G 413M 8.8G 5% /lustrefs

lctl conf_param testfs-MDT0000.sys.timeout=10
cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status

cat /proc/fs/lustre/health_check_timeout

If the DRBD data on the two servers falls out of sync, run the following on both servers:

# drbdadm disconnect drbd0
# drbdadm connect drbd0

or

/etc/rc.d/init.d/drbd reload

Then, on both servers, run

cat /proc/drbd

and check the cs: state.

You will see cs:StandAlone or cs:WFConnection.

In that case, run drbdadm connect all on the node that is in StandAlone.

If both nodes are StandAlone, run connect on both.

If both nodes are WFConnection, run disconnect and then connect on both.

The normal state is cs:Connected.
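
The recovery rules above amount to a small decision table, which can be sketched as follows. The function names drbd_cs_state and drbd_action are invented for this example, the sample /proc/drbd line is illustrative only, and the script only prints a suggestion without running drbdadm.

```shell
#!/bin/sh
# Hypothetical helper: pull the first cs:<State> value out of
# /proc/drbd-style text read from stdin.
drbd_cs_state() {
    sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: drbd_action <this_node_state> <peer_state>
# Prints the step suggested by the rules described above.
drbd_action() {
    case "$1/$2" in
        Connected/Connected)       echo "nothing to do" ;;
        StandAlone/StandAlone)     echo "connect on both" ;;
        WFConnection/WFConnection) echo "disconnect, then connect, on both" ;;
        StandAlone/*)              echo "connect on this node" ;;
        */StandAlone)              echo "connect on the peer" ;;
        *)                         echo "no rule for this combination" ;;
    esac
}

# Demo: this node's state parsed from an illustrative line, peer given by hand.
state=$(echo " 0: cs:StandAlone st:Primary/Unknown" | drbd_cs_state)
drbd_action "$state" WFConnection
```

On a real node the state would come from drbd_cs_state < /proc/drbd, and the peer's state has to be collected from the other server.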
