Nutch & Hadoop Installation Guide
Nutch (crawler) using the Hadoop distributed file system (HDFS, formerly NDFS) and MapReduce.
Wouldn't it be nice to have this written up while working through the Nutch and Hadoop setup~
Nutch uses the nightly build, and Hadoop is based on version 0.8.
Download Nutch
Download Hadoop
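The exact archive names depend on which builds were downloaded; as a rough sketch (the filenames below are placeholders, not actual release names):
#tar xzf nutch-nightly.tar.gz
#tar xzf hadoop-0.8.0.tar.gz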
Installing Hadoop
Of course, install the JDK and Ant first and set up their environment variables (see the references).
1. Move into the extracted hadoop directory
2. #ant package
(If the build stops somewhere, comment out the offending part?)
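If ant fails right away, it is worth confirming that the JDK and Ant installed above are actually the ones on the PATH; a quick sanity check:
#java -version
#ant -version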
Installing Nutch
#cd <nutch extracted directory>
#vi build.properties
dist.dir=/nutch/search
#ant package
(If the build stops somewhere, comment out the offending part?)
Setting up the Nutch/Hadoop global filesystem
#mkdir /nutch
#mkdir /nutch/search
#mkdir /nutch/filesystem
#mkdir /nutch/local
#mkdir /nutch/home
! The parent of the /nutch directory can be anywhere (for multiple servers it is best to use the same layout on every server). Here it goes directly under the root directory.
#groupadd users
#useradd -d /nutch/home -g users nutch
#chown -R nutch:users /nutch
#passwd nutch
(enter the password for the nutch user when prompted)
! If creating a group is too much trouble, just use the existing staff group. (Check the user permissions.)
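To confirm the ownership and that the nutch account works, something along these lines can be used (output will vary by system):
#ls -ld /nutch
#su - nutch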
Configuring the Hadoop environment
#cd /nutch/search/conf
#vi hadoop-env.sh
export HADOOP_HOME=/nutch/search
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
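JAVA_HOME must point at a real JDK install or the start scripts will fail; a simple check against the path used above:
#ls /usr/java/jdk1.5.0_06/bin/java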
Generating an ssh key
#cd $HOME
#ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/onnet/.ssh/id_rsa): [enter]
Enter passphrase (empty for no passphrase): [enter]
Enter same passphrase again: [enter]
#cd .ssh
#cp id_rsa.pub authorized_keys
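Passwordless login to localhost should now work without a password prompt; a quick check (the chmod is only needed if the permissions are too open):
#chmod 600 ~/.ssh/authorized_keys
#ssh localhost
For a multi-server setup, the same id_rsa.pub also has to be appended to ~/.ssh/authorized_keys of the nutch user on every slave.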
Editing hadoop-site.xml
#vi /nutch/search/conf/hadoop-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hostname:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hostname:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>
Define mapred.map.tasks to be the number of slave hosts.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>
Define mapred.reduce.tasks to be the number of slave hosts.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/nutch/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/nutch/filesystem/data</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/nutch/filesystem/mapreduce/local</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
! Check the hostname and the directory/file URIs.
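For example, if the master host were named master01 (a hypothetical name), the two values above would read:
<value>master01:9000</value>
<value>master01:9001</value>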
Formatting the Hadoop namenode filesystem
#bin/hadoop namenode -format
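If the format succeeds, the dfs.name.dir configured above should now exist on the local disk; a quick check:
#ls /nutch/filesystem/name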
Starting Hadoop
#cd /nutch/search
#bin/start-all.sh
Enter the user password when prompted.
! If "localhost: ssh_exchange_identification: Connection closed by remote host" appears:
#vi conf/slaves
change localhost to the actual hostname
Stopping Hadoop
#bin/stop-all.sh
Checking the Hadoop logs after startup
#ls /nutch/search/logs/
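Besides the logs, the running daemons can be listed with the JDK's jps tool (assuming the Sun JDK is on the PATH); on a single-host setup you would expect to see roughly NameNode, DataNode, JobTracker, and TaskTracker:
#jps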
Creating the URL list for the Nutch crawl
#cd /nutch/search
#mkdir urls
#vi urls/urllist.txt
http://lucene.apache.org
http://kangho.egloos.com
You should now have a urls/urllist.txt file with lines pointing to the Apache Lucene site and any other sites you added. Next we add that directory to the filesystem; later the Nutch crawl will use this file as its list of URLs to crawl. To add the urls directory to the filesystem, run the following command:
! After checking in the logs that the namenode and datanode are running:
#cd /nutch/search
#bin/hadoop dfs -put urls urls
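To confirm that the urls directory actually landed in the distributed filesystem:
#bin/hadoop dfs -ls
#bin/hadoop dfs -ls urls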
Configuring the Nutch crawl URL filter (regular expressions)
#cd /nutch/search
#vi conf/crawl-urlfilter.txt
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*apache.org/
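Note that urllist.txt above also contains kangho.egloos.com; assuming the default deny-everything-else rule at the bottom of crawl-urlfilter.txt is left in place, that URL will be dropped unless a matching line is added. A sketch of the extra line:
+^http://([a-z0-9]*\.)*egloos.com/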
Running the Nutch crawl
#cd /nutch/search
#bin/nutch crawl urls -dir crawled -depth 3
Checking crawl status
#bin/nutch readdb crawled/crawldb -stats
#bin/hadoop dfs -du
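The crawl output itself lives under the crawled directory in the distributed filesystem; listing it gives a rough picture of what was produced (typically crawldb, linkdb, segments, and index directories, though the exact layout depends on the Nutch build):
#bin/hadoop dfs -ls crawled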