Nutch & Hadoop Installation Guide

Nutch (the crawler) running on the Hadoop distributed file system (HDFS, formerly NDFS) and MapReduce

These are notes I put together while setting up Nutch on Hadoop.

For Nutch, use a nightly build; Hadoop 0.8 was used as the baseline here.

Download Nutch

Download Hadoop

Installing Hadoop

Naturally, install the JDK and Ant first and configure their environment (see references).

1. Change into the unpacked hadoop directory

2.  #ant package

(If the build stops somewhere, comment that part out?)
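Put together, the Hadoop build step might look like this (a minimal sketch; the tarball name, version, and unpack location are assumptions):

#cd /usr/local/src

#tar xzf hadoop-0.8.0.tar.gz

#cd hadoop-0.8.0

#ant package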

Installing Nutch

#cd <unpacked nutch directory>

#vi build.properties

      dist.dir=/nutch/search

#ant package

(If the build stops somewhere, comment that part out?)
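Put together, the Nutch build sequence might look like this (a sketch; the unpacked nightly directory name is an assumption, and /nutch/search is created in the next step):

#cd /usr/local/src/nutch-nightly

#echo dist.dir=/nutch/search > build.properties

#ant package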

Setting up the Nutch/Hadoop global filesystem directories

mkdir /nutch

mkdir /nutch/search

mkdir /nutch/filesystem

mkdir /nutch/local

mkdir /nutch/home

! The parent of /nutch can be any directory (for multi-server setups, it is best to use the same layout on every server). Here it goes directly under the root directory.
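The same layout in a single command:

#mkdir -p /nutch/search /nutch/filesystem /nutch/local /nutch/home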

groupadd users

useradd -d /nutch/home -g users nutch

chown -R nutch:users /nutch

passwd nutch

(enter nutchuserpassword when prompted; passwd does not take the password as an argument)

! If creating a group is a hassle, just use the existing staff group (check the user permissions).

Configuring the Hadoop environment

#cd /nutch/search/conf

#vi hadoop-env.sh

export HADOOP_HOME=/nutch/search

export JAVA_HOME=/usr/java/jdk1.5.0_06

export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
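A quick sanity check of these settings (a sketch, assuming the paths above):

#source /nutch/search/conf/hadoop-env.sh

#$JAVA_HOME/bin/java -version

#cat $HADOOP_SLAVES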

Generating SSH keys

#cd $HOME

#ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/home/onnet/.ssh/id_rsa): [enter]

Enter passphrase (empty for no passphrase): [enter]

Enter same passphrase again: [enter]

#cd .ssh

#cp id_rsa.pub authorized_keys
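Before starting the daemons, confirm that passwordless SSH to localhost actually works (the chmod may be needed depending on your sshd configuration):

#chmod 600 $HOME/.ssh/authorized_keys

#ssh localhost date

If the date is printed without a password prompt, the key setup is done.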

Editing hadoop-site.xml

#vi /nutch/search/conf/hadoop-site.xml

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

  <name>fs.default.name</name>

  <value>hostname:9000</value>

  <description>

    The name of the default file system. Either the literal string

    "local" or a host:port for NDFS.

  </description>

</property>

<property>

  <name>mapred.job.tracker</name>

  <value>hostname:9001</value>

  <description>

    The host and port that the MapReduce job tracker runs at. If

    "local", then jobs are run in-process as a single map and

    reduce task.

  </description>

</property>

<property>

  <name>mapred.map.tasks</name>

  <value>2</value>

  <description>

    Define mapred.map.tasks to be the number of slave hosts.

  </description>

</property>

<property>

  <name>mapred.reduce.tasks</name>

  <value>2</value>

  <description>

    Define mapred.reduce.tasks to be the number of slave hosts.

  </description>

</property>

<property>

  <name>dfs.name.dir</name>

  <value>/nutch/filesystem/name</value>

</property>

<property>

  <name>dfs.data.dir</name>

  <value>/nutch/filesystem/data</value>

</property>

<property>

  <name>mapred.system.dir</name>

  <value>/nutch/filesystem/mapreduce/system</value>

</property>

<property>

  <name>mapred.local.dir</name>

  <value>/nutch/filesystem/mapreduce/local</value>

</property>

<property>

  <name>dfs.replication</name>

  <value>1</value>

</property>

</configuration>

! Check the hostname, and check the directory/file URIs.
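A quick way to verify that the hostname used above resolves on this machine:

#hostname

#ping -c 1 `hostname`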

Formatting the Hadoop namenode filesystem

#bin/hadoop namenode -format

Starting Hadoop

#cd /nutch/search

#bin/start-all.sh

Enter the user password when prompted.

! If "localhost: ssh_exchange_identification: Connection closed by remote host" is printed:

#vi conf/slaves

Change localhost to the actual hostname.
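conf/slaves holds one hostname per line. For a single-machine setup it might look like this (yourhostname is a placeholder for the name used in hadoop-site.xml):

yourhostname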

Stopping Hadoop

#bin/stop-all.sh

Checking the Hadoop logs after startup

#ls /nutch/search/logs/
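Log files are named per daemon, roughly hadoop-<user>-<daemon>-<hostname>.log, so the namenode log can be followed with something like:

#tail -f /nutch/search/logs/hadoop-nutch-namenode-*.log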

Creating the URL list for the Nutch crawl

#cd /nutch/search

#mkdir urls

#vi urls/urllist.txt

http://lucene.apache.org

http://kangho.egloos.com

You should now have a urls/urllist.txt file whose lines point to the Apache Lucene site and the other site above. Next we add that directory to the filesystem; later, the Nutch crawl will use this file as its list of URLs to crawl. To add the urls directory to the filesystem, run the following commands:

After confirming in the logs that the name node and data nodes are running:

#cd /nutch/search

#bin/hadoop dfs -put urls urls
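To confirm the upload, list the directory back out of the DFS:

#bin/hadoop dfs -ls

#bin/hadoop dfs -ls urls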

  

Configuring the Nutch crawl URL filter (regular expressions)

#cd /nutch/search

#vi conf/crawl-urlfilter.txt

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://([a-z0-9]*\.)*apache.org/
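Only URLs matching a + pattern will be crawled, so the second URL in urllist.txt above needs its own filter line, for example:

+^http://([a-z0-9]*\.)*egloos.com/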

  

Running the Nutch crawl

#cd /nutch/search

#bin/nutch crawl urls -dir crawled -depth 3

Checking the crawl status

#bin/nutch readdb crawled/crawldb -stats

#bin/hadoop dfs -du
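Once the crawl finishes, a quick command-line search can confirm the index is usable (a sketch; you may first need to point searcher.dir at the crawled directory in nutch-site.xml):

#bin/nutch org.apache.nutch.searcher.NutchBean apache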

