Nutch (crawler) using the Hadoop (formerly NDFS) distributed file system (HDFS) and MapReduce

While setting up Nutch with Hadoop, it seemed worth writing the steps down once.

For Nutch, a nightly build is fine; Hadoop 0.8 is used as the baseline here.

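Download Nutch

Download Hadoop

Installing Hadoop

Naturally, the JDK and Ant must be installed and their environment configured first (see reference).

1. Change to the directory where Hadoop was unpacked.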

2.  #ant package

(If the build stops somewhere, comment out the failing part?)

Installing Nutch

#cd <directory where Nutch was unpacked>



#ant package

(If the build stops somewhere, comment out the failing part?)

Setting up the Nutch/Hadoop global filesystem layout

mkdir /nutch

mkdir /nutch/search

mkdir /nutch/filesystem

mkdir /nutch/local

mkdir /nutch/home

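! The parent directory of /nutch can be anywhere (with multiple servers, it is best to use the same path on every server). Here it goes directly under the root directory.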
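The directory layout above can also be created in one loop. This is a minimal sketch: `NUTCH_ROOT` is a hypothetical variable (the notes use /nutch), and it defaults to a scratch directory so the sketch can be tried without root privileges.

```shell
# Assumption: NUTCH_ROOT is a hypothetical variable; these notes use /nutch.
# It defaults to a scratch location so nothing under / is touched by accident.
NUTCH_ROOT="${NUTCH_ROOT:-$(mktemp -d)/nutch}"

# Create the four subdirectories used in these notes.
for dir in search filesystem local home; do
    mkdir -p "$NUTCH_ROOT/$dir"
done

# Show the result.
ls "$NUTCH_ROOT"
```

To reproduce the layout from the notes, run it as root with `NUTCH_ROOT=/nutch`, and use the same path on every server.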

groupadd users

useradd -d /nutch/home -g users nutch

chown -R nutch:users /nutch

passwd nutch

(Enter the nutch user's password when prompted.)

! If creating a group is too much trouble, just use staff instead. (Check the user's permissions.)

Hadoop environment setup

#cd /nutch/search/conf

#vi hadoop-env.sh

export HADOOP_HOME=/nutch/search

export JAVA_HOME=/usr/java/jdk1.5.0_06


export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
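Before going further, the exports above can be sanity-checked. The sketch below repeats the values from these notes and only reports whether a `java` binary is present under `JAVA_HOME`; the JDK path is specific to the author's machine and will likely differ elsewhere.

```shell
# Values copied from the notes; adjust JAVA_HOME to the local JDK location.
export HADOOP_HOME=/nutch/search
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

# Report whether the JDK is where JAVA_HOME says it is.
if [ -x "$JAVA_HOME/bin/java" ]; then
    echo "java found under $JAVA_HOME"
else
    echo "no bin/java under $JAVA_HOME, check the path"
fi

echo "slaves file: $HADOOP_SLAVES"
```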

Generating the ssh key

#cd $HOME

#ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/home/onnet/.ssh/id_rsa): [enter]

Enter passphrase (empty for no passphrase): [enter]

Enter same passphrase again: [enter]

#cd .ssh

#cp id_rsa.pub authorized_keys

Editing hadoop-site.xml


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->






    The name of the default file system. Either the literal string

    "local" or a host:port for NDFS.







    The host and port that the MapReduce job tracker runs at. If

    "local", then jobs are run in-process as a single map and

    reduce task.







    define mapred.map tasks to be number of slave hosts







    define mapred.reduce tasks to be number of slave hosts
























! Check the host names, and check the directory/file URIs.
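As a rough illustration of a complete hadoop-site.xml for a single-machine setup, the sketch below writes one to a scratch file. The property names are standard Hadoop settings of that era, but every value is an assumption: the host `localhost`, the ports 9000 and 9001, the task counts, the replication factor, and the paths under /nutch/filesystem are placeholders, not the author's actual settings. `OUT` and `MASTER` are hypothetical variables introduced only for this sketch.

```shell
# Assumption: OUT is a hypothetical scratch path; review the file before
# copying it to /nutch/search/conf/hadoop-site.xml.
OUT="${OUT:-$(mktemp -d)/hadoop-site.xml}"
# Assumption: single machine, so the master host is localhost.
MASTER="${MASTER:-localhost}"

cat > "$OUT" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>${MASTER}:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>${MASTER}:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/nutch/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/nutch/filesystem/data</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/nutch/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/nutch/filesystem/mapreduce/local</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "wrote $OUT"
```

With more than one slave host, the two task counts would follow the number of slaves, as the descriptions above suggest, and `MASTER` would be the real host name rather than localhost.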

Formatting the Hadoop namenode filesystem

#bin/hadoop namenode -format

hadoop Start

#cd /nutch/search

#bin/start-all.sh

Enter the user's password when prompted.

! If the output shows "localhost: ssh_exchange_identification: Connection closed by remote host":

#vi conf/slaves

Change the host name.

hadoop Stop

#bin/stop-all.sh

Checking the Hadoop logs after startup

#ls /nutch/search/logs/
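Listing the log directory only shows that log files exist. A small helper along the following lines can point to the files worth opening. `scan_logs` is a hypothetical function written for this sketch, not something Hadoop provides, and matching on "exception" or "error" is only a rough heuristic.

```shell
# Assumption: scan_logs is a hypothetical helper, not part of Hadoop.
# It prints the names of files under a directory that mention an
# exception or error (case-insensitive).
scan_logs() {
    log_dir="$1"
    if [ ! -d "$log_dir" ]; then
        echo "no log directory: $log_dir"
        return 1
    fi
    # -r: recurse, -l: file names only, -i: ignore case, -E: extended regex
    grep -rliE 'exception|error' "$log_dir" || echo "no errors found in $log_dir"
}

# /nutch/search/logs is the location used in these notes.
scan_logs "${HADOOP_LOG_DIR:-/nutch/search/logs}" || true
```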

Creating the URL list for the Nutch crawl

#cd /nutch/search

#mkdir urls

#vi urls/urllist.txt

You should now have a urls/urllist.txt file with a single line pointing to the Apache Lucene site. Next, add that directory to the filesystem; the Nutch crawl will later use this file as its list of URLs to crawl. To add the urls directory to the filesystem, run the following command:

After confirming from the logs that the name node and data node are running:

#cd /nutch/search

#bin/hadoop dfs -put urls urls
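The URL list step above can be done without an editor. In this sketch `WORKDIR` is a hypothetical stand-in for /nutch/search, and the seed URL is an assumption, since the notes only say that the single line points at the Apache Lucene site.

```shell
# Assumption: WORKDIR is a hypothetical stand-in for /nutch/search.
WORKDIR="${WORKDIR:-$(mktemp -d)}"
# Assumption: the exact seed URL is not given in the notes.
SEED_URL="http://lucene.apache.org/"

# Create the urls directory and write the single-line URL list.
mkdir -p "$WORKDIR/urls"
printf '%s\n' "$SEED_URL" > "$WORKDIR/urls/urllist.txt"
cat "$WORKDIR/urls/urllist.txt"

# With the name node and data node running, the directory would then be
# copied into the DFS from inside $WORKDIR:
#   bin/hadoop dfs -put urls urls
```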


Configuring the Nutch crawl URL filter (regular expressions)

#cd /nutch/search

#vi conf/crawl-urlfilter.txt

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
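In crawl-urlfilter.txt a leading `+` marks a pattern whose matches are accepted. Before crawling, the pattern itself can be tried against a few sample URLs. The sketch below uses `grep -E` as a stand-in for Nutch's Java regex engine, which behaves the same for a pattern this simple. `apache.org` in place of MY.DOMAIN.NAME and the helper name `check_url` are assumptions for illustration.

```shell
# Assumption: apache.org stands in for MY.DOMAIN.NAME.
# This is the filter line without its leading "+".
PATTERN='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: report whether a URL would pass the pattern.
check_url() {
    if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
        echo "accept $1"
    else
        echo "reject $1"
    fi
}

check_url "http://lucene.apache.org/nutch/"
check_url "http://apache.org/"
check_url "http://www.example.com/"
```

The first two URLs match, with one subdomain and with none, while the third is outside the domain and is rejected.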



Running the Nutch crawl

#cd /nutch/search

#bin/nutch crawl urls -dir crawled -depth 3

Checking the crawl status

#bin/nutch readdb crawled/crawldb -stats

#bin/hadoop dfs -du


