Blocking archive.org and archive.is in nginx

July 20, 2016

Last modified: May 28, 2018

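archive.org and archive.is are sites that preserve past versions of web pages.
Someone may archive a page even though its owner does not want that, or a robot may visit and record it, so I blocked both.
(Of course, if someone collects the pages by hand, there is nothing to be done about it.)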

Removing data that has already been captured

For archive.org, blocking it through robots.txt is enough; for archive.is, there is no way to do it.

How to block

archive.org

Using robots.txt (create a robots.txt file at the site root, put the following in it, and save):
User-agent: ia_archiver
Disallow: /
Using the nginx configuration (I don't know how to do this on other servers):
Add the following inside the server { block.

if ($http_user_agent ~* (Wayback\ Machine\ Live\ Record)) {
    return 403;
}
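
One way to check that the user-agent rule is active is to send a request that carries the Wayback Machine user agent and look at the status code. The following is a minimal sketch, assuming curl is installed; the function name check_wayback_block and the URL http://example.com/ are placeholders, not part of the original setup. The user-agent string is the one that appears in the archive.org log line in this post.

```shell
# User-agent string copied from the archive.org log entry in this post.
WAYBACK_UA='Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)'

# Print only the HTTP status code returned for the given URL when the
# request is sent with the Wayback Machine user agent.
# (check_wayback_block is a hypothetical helper name.)
check_wayback_block() {
    curl -s -o /dev/null -w '%{http_code}\n' -A "$WAYBACK_UA" "$1"
}

# Example (replace the placeholder URL with your own site):
# check_wayback_block http://example.com/
```

If the rule is in place, the call should print 403; a 200 would suggest the if block is not in the server that handles the request.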

archive.is

Blocking by IP address is the only option.
In nginx, add the following inside the http { block.

deny 78.47.86.0/24;
deny 46.166.139.0/24;
deny 178.62.195.0/24;

Added 2017.06.12) Checking the IP addresses also turned up 178.62.195.5, so the whole 178.62.195.0/24 range (0-255) was added.

Blocking with firewalld

I use CentOS 7, so this section is written for that operating system.

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address=178.62.195.0/24 service name=http drop'

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address=78.47.86.0/24 service name=http drop'

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address=46.166.139.0/24 service name=http drop'

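(If you also serve https, run the https versions of these commands after the http ones.
Just change http to https and enter each command again.
Because the rules are added with --permanent, run firewall-cmd --reload afterwards so they take effect.)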
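
Typing six nearly identical commands is error-prone, so the rules can be generated in a loop instead. The following is a rough sketch, not a tested deployment script: it only prints the firewall-cmd commands (the three ranges and the rich-rule text are the ones used in this post; the function name print_rules is made up for illustration), followed by the firewall-cmd --reload that permanent rules need before they take effect.

```shell
#!/bin/sh
# Ranges from the deny list in this post; http and https per the note above.
RANGES="78.47.86.0/24 46.166.139.0/24 178.62.195.0/24"
SERVICES="http https"

# Print one firewall-cmd invocation per range/service pair, then a reload.
# Nothing is executed here; the output is meant to be reviewed first.
print_rules() {
    for range in $RANGES; do
        for service in $SERVICES; do
            printf "firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=%s service name=%s drop'\n" "$range" "$service"
        done
    done
    # Rules added with --permanent are only applied after a reload.
    echo "firewall-cmd --reload"
}

print_rules
```

After reviewing the printed commands, they can be pasted into a root shell or piped to sh as root.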

Added 2017.06.12) Blocking with firewalld on CentOS 7

Conclusion

With this in place, archive.is and archive.org can no longer fetch the site, and their requests show up in the server log as follows.
(archive.is) 78.47.86.130 - - [20/Jul/2016:21:03:17 +0900] "GET / HTTP/1.1" 403 584 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
(archive.org) 207.241.225.226 - - [20/Jul/2016:21:03:28 +0900] "GET /robots.txt HTTP/1.1" 403 182 "-" "Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)"
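
To pull such entries out of a larger log, the access log can be filtered by the blocked ranges. This is a small sketch under two assumptions: the log uses the default "combined" format (client IP in field 1, status code in field 9, as in the lines above), and the log path in the example is a guess that may differ on your server. The function name archive_is_hits is hypothetical.

```shell
# Print "IP STATUS" for every request coming from the blocked archive.is
# ranges. Assumes the "combined" log format: $1 = client IP, $9 = status.
archive_is_hits() {
    awk '$1 ~ /^(78\.47\.86|46\.166\.139|178\.62\.195)\./ { print $1, $9 }' "$1"
}

# Example (the path is an assumption; use your own access_log location):
# archive_is_hits /var/log/nginx/access.log
```

Every line printed should end in 403 if the nginx deny rules are working; with the firewalld rules active, these requests should not reach nginx at all.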

If you don't want these requests logged, set access_log to off.