BeautifulSoup - Extracting a Specific Tag's Value

by 청원뿔세포 2022. 7. 20.
import requests
from bs4 import BeautifulSoup
  • Import the requests and BeautifulSoup modules.
URL = "https://lolchess.gg/"
  • Set the URL of the page to scrape.
  • This example uses lolchess.gg, a Teamfight Tactics (TFT) stats site.
get_URL = requests.get(URL)
print(get_URL, type(get_URL))
<Response [200]> <class 'requests.models.Response'>
soup = BeautifulSoup(get_URL.text, "html.parser")
print(str(soup)[:100])
print(type(soup))
<!DOCTYPE html>

<html data-locale="en-US" lang="en">
<head>
<title>TFT Stats, Leaderboards, League 
<class 'bs4.BeautifulSoup'>
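  • requests.get fetches the page and returns a Response object; [200] means the request succeeded.
  • BeautifulSoup then parses the response body (get_URL.text) into a searchable tree.

    (The result is of type <class 'bs4.BeautifulSoup'>, and because the content is very long, only the first 100 characters of its string form are printed.)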
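The parsing step can also be tried offline. The following is a minimal sketch that swaps the live response for a small inline HTML string; `sample_html` is a made-up stand-in for `get_URL.text`, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for get_URL.text; the real page is far longer.
sample_html = """<!DOCTYPE html>
<html lang="en">
<head><title>Sample page</title></head>
<body><p>Hello</p></body>
</html>"""

# Same call as in the post: parse the HTML text with the built-in parser.
soup = BeautifulSoup(sample_html, "html.parser")

# str(soup) serializes the parsed tree back to HTML, so slicing it
# is a cheap way to peek at the start of a long document.
preview = str(soup)[:15]
print(preview)
print(type(soup))

# Individual tags are reachable as attributes of the parsed tree.
page_title = soup.title.string
print(page_title)
```

Slicing the string form only affects what is printed; `soup` itself still holds the whole document.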

  • In the browser's developer tools (F12) you can see where the part you want sits in the HTML.
  • Let's extract the hyperlink of the "천상계 덱" (top-tier decks) menu item as text.

container = soup.find_all("div", {"class":"container"})
print(container)
[<div class="container">
<ul>
<li>
<a href="//dak.gg/pubg" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/ico_pubg.png" srcset="//cdn.lolchess.gg/images/family/ico_pubg@2x.png 2x"/>
<span>PUBG</span>
</a>
</li>
<li>
<a href="//dak.gg/bser" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/logo-game-bser.png"/>
<span>Eternal Return</span>
</a>
</li>
<li>
<a href="//dak.gg/warzone" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/logo-wz.png" srcset="//cdn.lolchess.gg/images/family/logo-wz@2x.png 2x"/>
<span>CoD: Warzone</span>
</a>
</li>
<li>
<a href="//dak.gg/valorant" rel="noopener noreferrer" target="_blank">
<img alt="" src="//cdn.lolchess.gg/images/family/logo-valorant.svg" width="16"/>
<span>Valorant</span>
</a>
</li>
<li>
<a href="//poro.gg" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/ico_lol.png" srcset="//cdn.lolchess.gg/images/family/ico_lol@2x.png 2x"/>
<span>League of Legends</span>
</a>
</li>
<li class="active">
<a href="//lolchess.gg">
<img src="//cdn.lolchess.gg/images/family/ico_tft_lolchess.png"/>
<span>TeamFight Tactics</span>
</a>
</li>
<li>
<a href="//dak.gg/lor" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/ico_lor.png"/>
<span>LoR</span>
</a>
</li>
<li>
<a href="//dak.gg/apex" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/symbol-apexlegends@2x.png"/>
<span>Apex Legends</span>
</a>
</li>
</ul>
</div>, <div class="brand-site brand-site-tft container">
<a class="logo logo-[]" href="/">
<img alt="LoLChess.GG" src="//cdn.lolchess.gg/images/common/lolchessgg-logo-bkbg@2x.png"/>
</a>
<div class="search-box" id="gnb-search-box">
<form action="/search" class="search">
<div class="btn-group">
<input name="region" type="hidden" value="na"/>
<button aria-expanded="false" aria-haspopup="true" class="btn btn-sm dropdown-toggle" data-toggle="dropdown" type="button">
<span>NA</span>
</button>
<div class="dropdown-menu">
<a class="dropdown-item" data-region="br" href="#">BR</a>
<a class="dropdown-item" data-region="eune" href="#">EUNE</a>
<a class="dropdown-item" data-region="euw" href="#">EUW</a>
<a class="dropdown-item" data-region="jp" href="#">JP</a>
<a class="dropdown-item" data-region="kr" href="#">KR</a>
<a class="dropdown-item" data-region="lan" href="#">LAN</a>
<a class="dropdown-item" data-region="las" href="#">LAS</a>
<a class="dropdown-item" data-region="na" href="#">NA</a>
<a class="dropdown-item" data-region="oce" href="#">OCE</a>
<a class="dropdown-item" data-region="tr" href="#">TR</a>
<a class="dropdown-item" data-region="ru" href="#">RU</a>
</div>
</div>
<input maxlength="30" name="name" placeholder="Search Summoner Name" required="" type="text" value="">
<button type="submit">
<i class="fas fa-search"></i>
</button>
</input></form>
</div>
</div>, <div class="container">
<ul>
<li class="guide">
<a href="/patch-notes">Guides</a>
</li>
<li>
<a href="https://lolchess.gg/meta">Team Comps</a>
</li>
<li>
<a href="https://lolchess.gg/decks" style="color: orange;">Meta Trends</a>
</li>
<li class="new">
<a href="https://lolchess.gg/statistics/items">Item Trends</a>
</li>
<li class="bar"><span></span></li>
<li>
<a href="https://lolchess.gg/leaderboards">Leaderboards</a>
</li>
<li>
<a href="https://lolchess.gg/favorites">Favorites</a>
</li>
<li class="bar"><span></span></li>
<li class="toggle_set">
<a class="" href="/tft/7.0">
                        SET 7
                    </a>
</li>
<li>
<a href="https://lolchess.gg/guide/augments">Augments</a>
</li>
<li>
<a href="https://lolchess.gg/champions/set7">Champions</a>
</li>
<li>
<a href="https://lolchess.gg/synergies/set7">Traits</a>
</li>
<li>
<a href="https://lolchess.gg/items/set7">Items</a>
</li>
<li class="">
<a href="https://lolchess.gg/cheatsheet/set7">Cheat Sheet</a>
</li>
<li class="bar"><span></span></li>
<li class="builder">
<a href="https://lolchess.gg/builder/set7">Builder</a>
</li>
<li class="simulator">
<a href="/simulator">Synergy Builder</a>
</li>
</ul>
</div>, <div class="container">
<div class="locale">
<a data-lang="en_US" href="#">
<i aria-hidden="true" class="fa fa-globe"></i>
<span>English</span>
<i class="fas fa-caret-down"></i>
</a>
<ul class="dropdown-menu">
<li>
<a data-lang="ko_KR" href="https://lolchess.gg/?hl=ko-KR">
                    한국어
                </a>
</li>
<li>
<a data-lang="en_US" href="https://lolchess.gg/?hl=en-US">
                    English
                </a>
</li>
<li>
<a data-lang="ja_JP" href="https://lolchess.gg/?hl=ja-JP">
                    日本語
                </a>
</li>
<li>
<a data-lang="vi_VN" href="https://lolchess.gg/?hl=vi-VN">
                    Tiếng Việt
                </a>
</li>
<li>
<a data-lang="de_DE" href="https://lolchess.gg/?hl=de-DE">
                    Deutsch
                </a>
</li>
</ul>
</div>
<p class="copyright">
                © LoLCHESS.GG. All Rights Reserved. <a href="mailto:tft@lolchess.gg">TFT@LoLCHESS.GG</a>
</p>
<p>
                lolchess.gg is hosted by PlayXP Inc.
                lolchess.gg isn’t endorsed by Riot Games
                and doesn’t reflect the views or opinions of Riot Games
                or anyone officially involved in producing or managing League of Legends.
                League of Legends and Riot Games are trademarks
                or registered trademarks of Riot Games, Inc.
                League of Legends © Riot Games, Inc.
            </p>
<div>
<small>
</small>
</div>
</div>]
  • More div tags have the class container than just the one we want.
  • We need to extract only the part we want.
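One reason the result holds more than expected: BeautifulSoup treats class as a multi-valued attribute, so a search for container also matches a div whose class list merely includes container, such as the "brand-site brand-site-tft container" div in the output above. A minimal sketch on made-up HTML (the markup and the id values are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Made-up markup; the ids exist only to make the matches easy to tell apart.
html = """
<div class="container" id="family"><ul><li>PUBG</li></ul></div>
<div class="brand-site container" id="brand"></div>
<div class="footer" id="footer"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every div whose class list contains "container".
matches = soup.find_all("div", {"class": "container"})
matched_ids = [div["id"] for div in matches]
print(matched_ids)

# Each tag's "class" attribute is stored as a list of class names.
brand_classes = matches[1]["class"]
print(brand_classes)
```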
print(type(container))
<class 'bs4.element.ResultSet'>
  • container is a ResultSet, a type that does not support the find_all method, so we loop over it and handle the elements one at a time.
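To see this on a small scale, the sketch below (made-up HTML) shows that a ResultSet behaves like a list of tags: calling find_all on the ResultSet itself raises AttributeError, while each element inside it supports the call.

```python
from bs4 import BeautifulSoup

# Made-up markup with two container divs, one of them empty.
html = """
<div class="container"><ul><li>A</li><li>B</li></ul></div>
<div class="container"></div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find_all("div", {"class": "container"})

# A ResultSet is a list subclass, so it offers list behavior only.
is_list = isinstance(container, list)
print(is_list)

# Calling find_all on the ResultSet itself fails.
try:
    container.find_all("li")
    error_raised = False
except AttributeError:
    error_raised = True
print(error_raised)

# Each element is a Tag, which does support find_all.
counts = [len(div.find_all("li")) for div in container]
print(counts)
```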
# Loop over the ResultSet and search each div for its li tags.
for i, div in enumerate(container):
    menu_bar = div.find_all("li")
    print(i, menu_bar)
0 [<li>
<a href="//dak.gg/pubg" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/ico_pubg.png" srcset="//cdn.lolchess.gg/images/family/ico_pubg@2x.png 2x"/>
<span>PUBG</span>
</a>
</li>, <li>
<a href="//dak.gg/bser" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/logo-game-bser.png"/>
<span>Eternal Return</span>
</a>
</li>, <li>
<a href="//dak.gg/warzone" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/logo-wz.png" srcset="//cdn.lolchess.gg/images/family/logo-wz@2x.png 2x"/>
<span>CoD: Warzone</span>
</a>
</li>, <li>
<a href="//dak.gg/valorant" rel="noopener noreferrer" target="_blank">
<img alt="" src="//cdn.lolchess.gg/images/family/logo-valorant.svg" width="16"/>
<span>Valorant</span>
</a>
</li>, <li>
<a href="//poro.gg" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/ico_lol.png" srcset="//cdn.lolchess.gg/images/family/ico_lol@2x.png 2x"/>
<span>League of Legends</span>
</a>
</li>, <li class="active">
<a href="//lolchess.gg">
<img src="//cdn.lolchess.gg/images/family/ico_tft_lolchess.png"/>
<span>TeamFight Tactics</span>
</a>
</li>, <li>
<a href="//dak.gg/lor" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/ico_lor.png"/>
<span>LoR</span>
</a>
</li>, <li>
<a href="//dak.gg/apex" rel="noopener noreferrer" target="_blank">
<img src="//cdn.lolchess.gg/images/family/symbol-apexlegends@2x.png"/>
<span>Apex Legends</span>
</a>
</li>]
1 []
2 [<li class="guide">
<a href="/patch-notes">Guides</a>
</li>, <li>
<a href="https://lolchess.gg/meta">Team Comps</a>
</li>, <li>
<a href="https://lolchess.gg/decks" style="color: orange;">Meta Trends</a>
</li>, <li class="new">
<a href="https://lolchess.gg/statistics/items">Item Trends</a>
</li>, <li class="bar"><span></span></li>, <li>
<a href="https://lolchess.gg/leaderboards">Leaderboards</a>
</li>, <li>
<a href="https://lolchess.gg/favorites">Favorites</a>
</li>, <li class="bar"><span></span></li>, <li class="toggle_set">
<a class="" href="/tft/7.0">
                        SET 7
                    </a>
</li>, <li>
<a href="https://lolchess.gg/guide/augments">Augments</a>
</li>, <li>
<a href="https://lolchess.gg/champions/set7">Champions</a>
</li>, <li>
<a href="https://lolchess.gg/synergies/set7">Traits</a>
</li>, <li>
<a href="https://lolchess.gg/items/set7">Items</a>
</li>, <li class="">
<a href="https://lolchess.gg/cheatsheet/set7">Cheat Sheet</a>
</li>, <li class="bar"><span></span></li>, <li class="builder">
<a href="https://lolchess.gg/builder/set7">Builder</a>
</li>, <li class="simulator">
<a href="/simulator">Synergy Builder</a>
</li>]
3 [<li>
<a data-lang="ko_KR" href="https://lolchess.gg/?hl=ko-KR">
                    한국어
                </a>
</li>, <li>
<a data-lang="en_US" href="https://lolchess.gg/?hl=en-US">
                    English
                </a>
</li>, <li>
<a data-lang="ja_JP" href="https://lolchess.gg/?hl=ja-JP">
                    日本語
                </a>
</li>, <li>
<a data-lang="vi_VN" href="https://lolchess.gg/?hl=vi-VN">
                    Tiếng Việt
                </a>
</li>, <li>
<a data-lang="de_DE" href="https://lolchess.gg/?hl=de-DE">
                    Deutsch
                </a>
</li>]
  • Because the fetched HTML is the English version, the "천상계 덱" item appears as "Meta Trends".
  • The code above prints the index alongside each result.
  • The part we want is at index 2.
  • Now let's find the hyperlink.
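Hard-coding index 2 works for this snapshot of the page, but the position can shift if the site adds or removes a container. As an alternative sketch (not the approach used in this post), the container can be picked by its content instead. The HTML below is a trimmed, made-up stand-in for the page, and `find_container_with_link` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed, made-up stand-in for the real page.
html = """
<div class="container"><ul><li><a href="//dak.gg/pubg"><span>PUBG</span></a></li></ul></div>
<div class="container"><ul>
<li><a href="https://lolchess.gg/meta">Team Comps</a></li>
<li><a href="https://lolchess.gg/decks">Meta Trends</a></li>
</ul></div>
"""

def find_container_with_link(soup, link_text):
    """Return (index, tag) of the first container div holding a link
    whose visible text equals link_text, or None if there is none."""
    containers = soup.find_all("div", {"class": "container"})
    for index, div in enumerate(containers):
        for a in div.find_all("a"):
            # strip=True removes the whitespace around the link text.
            if a.get_text(strip=True) == link_text:
                return index, div
    return None

soup = BeautifulSoup(html, "html.parser")
result = find_container_with_link(soup, "Meta Trends")
if result is not None:
    index, div = result
    print(index)
```

The comparison is an exact match on the link text, so it depends on the page language; the Korean version of the page would need "천상계 덱" instead.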
# Collect every a tag inside the menu container (index 2).
href = container[2].find_all("a")
for j, link in enumerate(href):
    print(j, link)
0 <a href="/patch-notes">Guides</a>
1 <a href="https://lolchess.gg/meta">Team Comps</a>
2 <a href="https://lolchess.gg/decks" style="color: orange;">Meta Trends</a>
3 <a href="https://lolchess.gg/statistics/items">Item Trends</a>
4 <a href="https://lolchess.gg/leaderboards">Leaderboards</a>
5 <a href="https://lolchess.gg/favorites">Favorites</a>
6 <a class="" href="/tft/7.0">
                        SET 7
                    </a>
7 <a href="https://lolchess.gg/guide/augments">Augments</a>
8 <a href="https://lolchess.gg/champions/set7">Champions</a>
9 <a href="https://lolchess.gg/synergies/set7">Traits</a>
10 <a href="https://lolchess.gg/items/set7">Items</a>
11 <a href="https://lolchess.gg/cheatsheet/set7">Cheat Sheet</a>
12 <a href="https://lolchess.gg/builder/set7">Builder</a>
13 <a href="/simulator">Synergy Builder</a>
  • "Meta Trends" is found at index 2.
  • To extract only the href value, look it up in the tag's .attrs dictionary: .attrs['href'].
print(href[2].attrs['href'])
https://lolchess.gg/decks
  • Following that link leads to the page that recommends top-tier decks, as expected.
  • Besides href, other attributes such as style, src, or title can be read the same way with .attrs['attribute name'].
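A short sketch of attribute access on made-up anchors modeled on the output above. It also shows Tag.get, which returns None instead of raising KeyError when an attribute is missing, and urllib.parse.urljoin from the standard library for turning relative or protocol-relative href values into absolute URLs. `BASE_URL` is assumed to be the page the HTML came from.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumption: the HTML was fetched from this URL.
BASE_URL = "https://lolchess.gg/"
html = """
<a href="/patch-notes">Guides</a>
<a href="https://lolchess.gg/decks" style="color: orange;">Meta Trends</a>
<a href="//dak.gg/pubg">PUBG</a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .attrs is a dict of the tag's attributes; tag["href"] is shorthand for it.
decks_href = links[1].attrs["href"]
decks_style = links[1]["style"]
print(decks_href, decks_style)

# .get returns None for a missing attribute instead of raising KeyError.
missing_style = links[0].get("style")
print(missing_style)

# Resolve every href against the page URL.
absolute = [urljoin(BASE_URL, a["href"]) for a in links]
print(absolute)
```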
