Python Lecture 8. Crawling Naver News

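Looking at how Naver News addresses are structured for each category

shows that the category is carried by one variable, which these notes call sid1 (it appears as the query parameter sid in the article URLs below).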

1. Downloading the whole page

https://n.news.naver.com/mnews/article/088/0000756435?sid=100 (Maeil Shinmun)
https://n.news.naver.com/mnews/article/586/0000038374?sid=100 (Sisa Journal)
https://n.news.naver.com/mnews/article/123/0002273936?sid=100 (Hankook Kyungje TV)
https://n.news.naver.com/mnews/article/005/0001523282?sid=100 (Kookmin Ilbo)

100 -> category -> variable name: sid1

088 -> newspaper -> variable name: oid

0000756435 -> article number -> variable name: aid

All four addresses point to articles in the politics category.
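
To make the mapping above concrete, here is a minimal sketch that pulls the three pieces out of one of the example addresses using only the standard library. The function name split_article_url is made up for this illustration, and the sketch assumes the address always has the shape /mnews/article/<oid>/<aid>?sid=<category> seen in the examples.

```python
from urllib.parse import urlsplit, parse_qs


def split_article_url(url):
    """Return (oid, aid, sid) from an article address shaped like the examples."""
    parts = urlsplit(url)
    # The path looks like /mnews/article/<oid>/<aid>
    segments = parts.path.strip("/").split("/")
    oid, aid = segments[-2], segments[-1]
    # The query string carries the category, e.g. ?sid=100
    sid = parse_qs(parts.query).get("sid", [None])[0]
    return oid, aid, sid


print(split_article_url(
    "https://n.news.naver.com/mnews/article/088/0000756435?sid=100"))
# ('088', '0000756435', '100')
```

All three values come back as strings, which matters later because the article number keeps its leading zeros only as a string.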

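In the address, the number right after article (088) appears to be the newspaper's number.

To check this, let's look up article number 5 for 088.

https://n.news.naver.com/mnews/article/088/0000000005?sid=100

It is indeed a Maeil Shinmun article, posted in 2004.

Changing the parameter to sid=101 brings up Maeil Shinmun's article number 5 in the economy section.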
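
Switching the category by hand works for one address; for many addresses it may be easier to rewrite the query parameter in code. The following is a small sketch using only urllib.parse. The helper name with_category is an assumption made for this example, not something from the lecture.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def with_category(url, sid):
    """Return the same article address with its sid query parameter replaced."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["sid"] = [str(sid)]  # overwrite (or add) the category
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(with_category(
    "https://n.news.naver.com/mnews/article/088/0000000005?sid=100", 101))
# https://n.news.naver.com/mnews/article/088/0000000005?sid=101
```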

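1. Practice: scraping about five articles

The program is working correctly if the size of the list comes out as 2 (of the three article numbers tried, the last one should not exist).

Whether we iterate over article numbers or over newspapers, some numbers in the middle are missing because the article was deleted, and there is no way to know in advance where the last number is.

So exceptions have to be handled with try/except.

If the logic should stop as soon as an error occurs, put the try outside the for loop;

if it must not stop, put the try inside the loop.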
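
The difference between the two placements can be seen without touching the network. In this sketch, fetch is a hypothetical stand-in for a request that fails for one article number; the two functions differ only in where the try sits.

```python
def fetch(aid):
    """Hypothetical stand-in for a network call; article 2 is 'missing'."""
    if aid == 2:
        raise ValueError(f"article {aid} is missing")
    return f"page {aid}"


def stop_on_first_error(aids):
    pages = []
    try:  # try wraps the whole loop
        for aid in aids:
            pages.append(fetch(aid))
    except ValueError:
        pass  # the loop has already been abandoned at this point
    return pages


def skip_errors(aids):
    pages = []
    for aid in aids:
        try:  # try wraps a single iteration
            pages.append(fetch(aid))
        except ValueError:
            continue  # move on to the next article number
    return pages


print(stop_on_first_error([1, 2, 3]))  # ['page 1']
print(skip_errors([1, 2, 3]))          # ['page 1', 'page 3']
```

For crawling article numbers with gaps, the second form is the one that fits, since a deleted article should not end the run.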

# BeautifulSoup is not used in this step

import requests

pages = []  # HTML of every article that was fetched successfully

# aid = 1 -> increment it, convert it to "0000000001", then put it in the URL
aid = ["0000000001", "0000000002", "1000000003"]

for a in aid:
    # We cannot know in advance what kind of exception may occur
    # (for example, a typo in the address), so the try goes inside the loop.
    try:
        html = requests.get(
            f"https://n.news.naver.com/mnews/article/005/{a}?sid=100")
        print(html)
        if html.status_code == 200:
            pages.append(html.text)
    except Exception as e:
        print(f"skipped {a}: {e}")

print(len(pages))

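2. Crawl politics (sid1=100) / Kookmin Ilbo (005) from 0000000001 to the end with Python

Write a program that stores each html.text in a list and prints the size of the list

-> until status_code comes back as 500

0000000001 is a string, so adding 1 to it concatenates the two and produces 00000000011.

We have to convert 0000000001 to a number, add 1, and then turn the resulting 2 back into 0000000002.

Searching for how to pad a string with zeros -> use format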

aid = 1  # increment this, convert it to "0000000002", then put it in the URL
aid_string = format(aid, '010')
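
The round trip described above (string to number, add 1, back to a zero-padded string) can be sketched as a small helper. The name next_aid is made up for this example; format with '010', f-strings, and str.zfill are all standard Python and give the same padding.

```python
def next_aid(aid_string):
    """Turn '0000000001' into '0000000002' while keeping the 10-digit width."""
    number = int(aid_string) + 1  # int() drops the leading zeros
    return format(number, "010")  # pad back to 10 digits with zeros


print("0000000001" + "1")      # 00000000011 (concatenation, not addition)
print(next_aid("0000000001"))  # 0000000002
print(next_aid("0000000099"))  # 0000000100
print(f"{7:010d}", str(7).zfill(10))  # 0000000007 0000000007
```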

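How can we tell where the last number is?

We have to keep fetching inside a while loop.

Keep going until errors occur 30 times in a row, then take the article that was added to the list most recently,

parse it with BeautifulSoup to check its date,

and stop if it is a recent article.

This is getting complicated.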

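import requests
from bs4 import BeautifulSoup

pages = []  # HTML of every article fetched successfully

aid = 1  # incremented each pass and zero-padded to 10 digits for the URL

error = 0  # number of consecutive failed requests

while True:

    aid_string = format(aid, '010')  # e.g. 1 -> "0000000001" (type str)

    # We cannot know in advance what kind of exception may occur
    try:
        html = requests.get(
            f"https://n.news.naver.com/mnews/article/005/{aid_string}?sid=100")

        if html.status_code == 200:
            pages.append(html.text)
            error = 0
        else:
            # Missing articles came back as 500 in these notes; any other
            # non-200 status is counted as a miss too so the loop can end.
            error = error + 1
            print(error)

    except Exception:
        error = error + 1  # a failed request also counts as a miss

    if error >= 30:
        # Parse the last article that was stored (not the error page)
        # and print its date to check whether it is a recent one.
        if pages:
            soup = BeautifulSoup(pages[-1], 'html.parser')
            date_el = soup.select_one(".media_end_head_info_datestamp_time")
            if date_el is not None:
                print(date_el.text)
        break

    aid = aid + 1  # advance even after an error, otherwise the loop never ends

    if aid % 1000 == 0:
        print(aid_string)

print(len(pages))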

[Source]

https://cafe.naver.com/metacoding


Metacoding YouTube
https://www.youtube.com/c/%EB%A9%94%ED%83%80%EC%BD%94%EB%94%A9
