How do I extract text from a web page (BeautifulSoup)?
29.12.2024, 15:02. Views: 1447. Replies: 2
I am parsing an HTML file that consists of repeated blocks with the same structure. At the moment I extract the text from them with this code:
| Python |
| title = link_tag.img['alt'] if link_tag.img and 'alt' in link_tag.img.attrs else "Без названия" |
|
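The same fallback can probably be written a bit more defensively with Tag.get, which returns a default when the attribute is missing. This is only a sketch: the two-link HTML string and the helper name title_from_alt are made up for illustration, and link_tag is assumed to be an <a> tag that may or may not contain an <img>.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link sample: the first image has an alt, the second does not.
html = (
    '<a href="/video-1"><img src="thumb1.jpg" alt="Sample title"></a>'
    '<a href="/video-2"><img src="thumb2.jpg"></a>'
)
soup = BeautifulSoup(html, "html.parser")


def title_from_alt(link_tag, default="Без названия"):
    """Return the alt text of the first <img> inside link_tag, or the default."""
    img = link_tag.find("img")
    if img is None:
        return default
    # Tag.get behaves like dict.get: the default is used only if alt is absent.
    return img.get("alt", default)


titles = [title_from_alt(a) for a in soup.find_all("a")]
print(titles)
```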
How do I extract the title not from the alt attribute but from the text that is actually displayed on the web page? That is, from here:
| HTML5 |
| >ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО</a> |
|
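One way that might work, assuming the title link is the <a> inside the element marked data-testid="video_card_title" (as in the sample block in this post): call get_text() on that link instead of reading the alt of the image. The trimmed HTML string below is a hand-made stand-in for the real file, not the file itself.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one block; only the title part is kept.
html = """
<div data-testid="video_card_title">
  <a href="/video-227419342_456239589"
     title="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО">ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the wrapper by its data-testid, then the link inside it.
title_box = soup.find("div", attrs={"data-testid": "video_card_title"})
title_link = title_box.find("a") if title_box else None

# get_text(strip=True) returns the visible text with surrounding whitespace removed.
title = title_link.get_text(strip=True) if title_link else "Без названия"
print(title)
```

In this sample the title attribute of the link carries the same string, so title_link.get("title") would presumably give the same result.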
The file consists of blocks like this:
| HTML5 |
| <div data-testid="grid-item" class="vkitGridItem__root--6OepO"><div class="vkitVideoCardLayout__card--i7sFT" role="button" tabindex="0"><div class="vkitVideoCardLayout__videoContainer--0nunZ vkitOverlay__rootAfter--2nYMH vkuiAspectRatio vkuiAspectRatio--mode-stretch vkuiRootComponent" style="--vkui_internal--aspect_ratio: 1.7777777777777777;"><div><a class="vkitVideoCardThumb__thumb--OBLna vkuiTappable vkuiTappable--hasPointer-none vkuiClickable__resetLinkStyle vkuiClickable__host vkuiClickable__realClickable vkui-focus-visible vkuiRootComponent" href="/video-227419342_456239589" data-testid="video_card_thumb"><div class="vkitVideoCardPreview__preview--fV4kv" role="button" tabindex="0"><div class="vkitVideoCardPreviewImage__container--a3XbD vkitVideoCardPreviewImage__fullSize--tKnNd vkitVideoCardPreviewImage__containerVisible--0l7hr"><img class="vkitVideoCardPreviewImage__img--arJyY vkitVideoCardPreviewImage__fullSize--tKnNd" src="https://sun9-80.userapi.com/9RZ-XZEVgw_dMajQAhtK7fraxNX4sh0SVOhaIw/n5NdwVIEzms.jpg" alt="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО" loading="lazy"><div class="vkitVideoCardOverlayIcon__overlayIcon--Edafr"><svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--36 vkuiIcon--w-36 vkuiIcon--h-36 vkuiIcon--play_36" width="36" height="36" viewBox="0 0 36 36" style="width: 36px; height: 36px;"><use xlink:href="#play_36" style="fill: currentcolor;"></use></svg></div></div><video loop="" crossorigin="anonymous" aria-label="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО" class="vkitVideoCardTrailerPlayer__trailer--ckOd3" src="https://vkvd446.okcdn.ru/?expires=1735685081527&srcIp=185.2.104.125&srcAg=CHROME_YA&ms=185.226.52.161&type=1&subId=7298639071975&sig=DjtyT3l3cNQ&ct=19&urls=45.136.21.181&clientType=13&appId=512000384397&zs=14&id=7458883832551"></video><div class="vkitVideoCardPreview__footer--D2CvT"></div><span class="vkui--vkBase--light vkuiTokensClassProvider--default-color 
vkitVideoCardBadge__badge--FW4fM vkitVideoCardBadge__durationBadge--D7q8C vkuiFootnote vkuiTypography vkuiTypography--normalize vkuiTypography--weight-2 vkuiRootComponent" data-testid="video_card_duration">31:51</span></div><span aria-hidden="true" class="vkuiTappable__stateLayer vkuiTappable__ripple"></span></a><div class="vkitVideoCardControls__controls--lVxzZ"><svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--16 vkuiIcon--w-16 vkuiIcon--h-16 vkuiIcon--clock_outline_16 vkitVideoCardIconControl__controlIcon--qWrWU" width="16" height="16" viewBox="0 0 16 16" type="0" data-testid="video_card_watch_later_button" style="width: 16px; height: 16px;"><use xlink:href="#clock_outline_16" style="fill: currentcolor;"></use></svg><div class="vkitVideoCardExpandingControl__tappable--Sp36W vkuiTappable vkuiTappable--hasPointer-none vkuiClickable__host vkuiRootComponent" aria-expanded="false"><div class="vkitVideoCardExpandingControl__container--QlWRD"><svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--16 vkuiIcon--w-16 vkuiIcon--h-16 vkuiIcon--add_16" width="16" height="16" viewBox="0 0 16 16" style="width: 16px; height: 16px;"><use xlink:href="#add_16" style="fill: currentcolor;"></use></svg></div></div></div></div></div><div class="vkitVideoCardInfoLayout__container--pkEj6"><a class="vkitVideoCardInfoLayout__avatar--ijqSA" href="/@club227419342" data-testid="video_card_author"><div class="vkuiAvatar vkuiInternalRichAvatar vkuiImageBase vkuiImageBase--loaded vkuiClickable__host vkuiRootComponent" style="width: 36px; height: 36px;"><img class="vkuiImageBase__img vkuiImageBase__img--objectFit-cover" src="https://sun3-20.userapi.com/s/v1/ig2/1pDMfViYvWVdp2tJFln-y1z4qG97uUaMKVOMXf_ZASlK_hmBWNTiYhmToxGZPlypJLmHkjKu1dGW-l3B1PHvA93a.jpg?quality=95&crop=229,39,1093,1093&as=32x32,48x48,72x72,108x108,160x160,240x240,360x360,480x480,540x540,640x640,720x720,1080x1080&ava=1&u=K6ZIOFxIIvhbCKMmJOMV3UVwrxKwxEm1wBUO6ls2Ddw&cs=50x50"><div 
class="vkuiImageBase__children"></div><div aria-hidden="true" class="vkuiImageBase__border"></div></div></a><div class="vkitVideoCardInfoLayout__content--BTSZK"><div data-testid="video_card_title" class="vkitVideoCardInfoLayout__info--ptxDJ vkitVideoCardInfoLayout__infoWithAction--SKdKO"><a class="vkitTextClamp__root--23kcq vkitVideoCardInfoLayout__title--fcKfk vkitVideoCardInfoLayout__titleLink--B644E vkuiSubhead vkuiSubhead--sizeY-compact vkuiTypography vkuiTypography--normalize vkuiTypography--weight-2 vkuiRootComponent" href="/video-227419342_456239589" title="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО" style="--vkui_internal--textclamp-lines: 2;">ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО</a><div class="vkitVideoCardInfoLayout__action--AfE5m"><svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--24 vkuiIcon--w-24 vkuiIcon--h-24 vkuiIcon--more_horizontal_24 vkitVideoCardMenu__icon--C5gWM" width="24" height="24" viewBox="0 0 24 24" data-testid="video_card_more" aria-expanded="false" style="width: 24px; height: 24px;"><use xlink:href="#more_horizontal_24" style="fill: currentcolor;"></use></svg></div></div><span class="vkitTextClamp__root--23kcq vkitTextClamp__rootSingleLine--Ib0gz vkitVideoCardOwners__owner--Q4jJc vkuiFootnote vkuiTypography vkuiTypography--normalize vkuiRootComponent" data-testid="video_card_owner" style="--vkui_internal--textclamp-lines: 1;"><a class="vkitVideoCardOwnerItem__link--506KO vkuiTappable vkuiTappable--hasPointer-none vkuiClickable__resetLinkStyle vkuiClickable__host vkuiClickable__realClickable vkui-focus-visible vkuiRootComponent" href="/@club227419342">Max48TV</a></span><div class="vkitTextClamp__root--23kcq vkitTextClamp__rootSingleLine--Ib0gz vkitVideoCardInfoLayout__additionalInfo--kCmhz vkuiFootnote vkuiTypography vkuiTypography--normalize vkuiRootComponent" data-testid="video_card_additional_info" style="--vkui_internal--textclamp-lines: 1;">724 просмотра · 3 месяца 
назад</div></div></div></div></div> |
|
Added 1 minute later
Here is the same sample HTML block, formatted:
| HTML5 |
| <div data-testid="grid-item" class="vkitGridItem__root--6OepO">
<div class="vkitVideoCardLayout__card--i7sFT" role="button" tabindex="0">
<div class="vkitVideoCardLayout__videoContainer--0nunZ vkitOverlay__rootAfter--2nYMH vkuiAspectRatio vkuiAspectRatio--mode-stretch vkuiRootComponent" style="--vkui_internal--aspect_ratio: 1.7777777777777777;">
<div>
<a class="vkitVideoCardThumb__thumb--OBLna vkuiTappable vkuiTappable--hasPointer-none vkuiClickable__resetLinkStyle vkuiClickable__host vkuiClickable__realClickable vkui-focus-visible vkuiRootComponent" href="/video-227419342_456239589" data-testid="video_card_thumb">
<div class="vkitVideoCardPreview__preview--fV4kv" role="button" tabindex="0">
<div class="vkitVideoCardPreviewImage__container--a3XbD vkitVideoCardPreviewImage__fullSize--tKnNd vkitVideoCardPreviewImage__containerVisible--0l7hr">
<img class="vkitVideoCardPreviewImage__img--arJyY vkitVideoCardPreviewImage__fullSize--tKnNd" src="https://sun9-80.userapi.com/9RZ-XZEVgw_dMajQAhtK7fraxNX4sh0SVOhaIw/n5NdwVIEzms.jpg" alt="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО" loading="lazy">
<div class="vkitVideoCardOverlayIcon__overlayIcon--Edafr">
<svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--36 vkuiIcon--w-36 vkuiIcon--h-36 vkuiIcon--play_36" width="36" height="36" viewBox="0 0 36 36" style="width: 36px; height: 36px;">
<use xlink:href="#play_36" style="fill: currentcolor;"/>
</svg>
</div>
</div>
<video loop="" crossorigin="anonymous" aria-label="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО" class="vkitVideoCardTrailerPlayer__trailer--ckOd3" src="https://vkvd446.okcdn.ru/?expires=1735685081527&srcIp=185.2.104.125&srcAg=CHROME_YA&ms=185.226.52.161&type=1&subId=7298639071975&sig=DjtyT3l3cNQ&ct=19&urls=45.136.21.181&clientType=13&appId=512000384397&zs=14&id=7458883832551"/>
<div class="vkitVideoCardPreview__footer--D2CvT"/>
<span class="vkui--vkBase--light vkuiTokensClassProvider--default-color vkitVideoCardBadge__badge--FW4fM vkitVideoCardBadge__durationBadge--D7q8C vkuiFootnote vkuiTypography vkuiTypography--normalize vkuiTypography--weight-2 vkuiRootComponent" data-testid="video_card_duration">31:51</span>
</div>
<span aria-hidden="true" class="vkuiTappable__stateLayer vkuiTappable__ripple"/>
</a>
<div class="vkitVideoCardControls__controls--lVxzZ">
<svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--16 vkuiIcon--w-16 vkuiIcon--h-16 vkuiIcon--clock_outline_16 vkitVideoCardIconControl__controlIcon--qWrWU" width="16" height="16" viewBox="0 0 16 16" type="0" data-testid="video_card_watch_later_button" style="width: 16px; height: 16px;">
<use xlink:href="#clock_outline_16" style="fill: currentcolor;"/>
</svg>
<div class="vkitVideoCardExpandingControl__tappable--Sp36W vkuiTappable vkuiTappable--hasPointer-none vkuiClickable__host vkuiRootComponent" aria-expanded="false">
<div class="vkitVideoCardExpandingControl__container--QlWRD">
<svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--16 vkuiIcon--w-16 vkuiIcon--h-16 vkuiIcon--add_16" width="16" height="16" viewBox="0 0 16 16" style="width: 16px; height: 16px;">
<use xlink:href="#add_16" style="fill: currentcolor;"/>
</svg>
</div>
</div>
</div>
</div>
</div>
<div class="vkitVideoCardInfoLayout__container--pkEj6">
<a class="vkitVideoCardInfoLayout__avatar--ijqSA" href="/@club227419342" data-testid="video_card_author">
<div class="vkuiAvatar vkuiInternalRichAvatar vkuiImageBase vkuiImageBase--loaded vkuiClickable__host vkuiRootComponent" style="width: 36px; height: 36px;">
<img class="vkuiImageBase__img vkuiImageBase__img--objectFit-cover" src="https://sun3-20.userapi.com/s/v1/ig2/1pDMfViYvWVdp2tJFln-y1z4qG97uUaMKVOMXf_ZASlK_hmBWNTiYhmToxGZPlypJLmHkjKu1dGW-l3B1PHvA93a.jpg?quality=95&crop=229,39,1093,1093&as=32x32,48x48,72x72,108x108,160x160,240x240,360x360,480x480,540x540,640x640,720x720,1080x1080&ava=1&u=K6ZIOFxIIvhbCKMmJOMV3UVwrxKwxEm1wBUO6ls2Ddw&cs=50x50">
<div class="vkuiImageBase__children"/>
<div aria-hidden="true" class="vkuiImageBase__border"/>
</div>
</a>
<div class="vkitVideoCardInfoLayout__content--BTSZK">
<div data-testid="video_card_title" class="vkitVideoCardInfoLayout__info--ptxDJ vkitVideoCardInfoLayout__infoWithAction--SKdKO">
<a class="vkitTextClamp__root--23kcq vkitVideoCardInfoLayout__title--fcKfk vkitVideoCardInfoLayout__titleLink--B644E vkuiSubhead vkuiSubhead--sizeY-compact vkuiTypography vkuiTypography--normalize vkuiTypography--weight-2 vkuiRootComponent" href="/video-227419342_456239589" title="ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО" style="--vkui_internal--textclamp-lines: 2;">ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО</a>
<div class="vkitVideoCardInfoLayout__action--AfE5m">
<svg aria-hidden="true" display="block" class="vkuiIcon vkuiIcon--24 vkuiIcon--w-24 vkuiIcon--h-24 vkuiIcon--more_horizontal_24 vkitVideoCardMenu__icon--C5gWM" width="24" height="24" viewBox="0 0 24 24" data-testid="video_card_more" aria-expanded="false" style="width: 24px; height: 24px;">
<use xlink:href="#more_horizontal_24" style="fill: currentcolor;"/>
</svg>
</div>
</div>
<span class="vkitTextClamp__root--23kcq vkitTextClamp__rootSingleLine--Ib0gz vkitVideoCardOwners__owner--Q4jJc vkuiFootnote vkuiTypography vkuiTypography--normalize vkuiRootComponent" data-testid="video_card_owner" style="--vkui_internal--textclamp-lines: 1;">
<a class="vkitVideoCardOwnerItem__link--506KO vkuiTappable vkuiTappable--hasPointer-none vkuiClickable__resetLinkStyle vkuiClickable__host vkuiClickable__realClickable vkui-focus-visible vkuiRootComponent" href="/@club227419342">Max48TV</a>
</span>
<div class="vkitTextClamp__root--23kcq vkitTextClamp__rootSingleLine--Ib0gz vkitVideoCardInfoLayout__additionalInfo--kCmhz vkuiFootnote vkuiTypography vkuiTypography--normalize vkuiRootComponent" data-testid="video_card_additional_info" style="--vkui_internal--textclamp-lines: 1;">724 просмотра · 3 месяца назад</div>
</div>
</div>
</div>
</div> |
|
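Since every block carries data-testid markers, a loop over the grid-item blocks could collect the visible title together with the other fields. The following is a rough sketch under that assumption; parse_cards and text_by_testid are hypothetical helper names, and the sample string is a heavily trimmed version of the block above with the long class lists removed.

```python
from bs4 import BeautifulSoup


def text_by_testid(card, testid, default=""):
    """Visible text of the first element in card with the given data-testid."""
    node = card.find(attrs={"data-testid": testid})
    return node.get_text(" ", strip=True) if node else default


def parse_cards(html):
    """Return one dict per data-testid="grid-item" block."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for card in soup.find_all("div", attrs={"data-testid": "grid-item"}):
        title_box = card.find(attrs={"data-testid": "video_card_title"})
        link = title_box.find("a") if title_box else None
        cards.append({
            "title": link.get_text(strip=True) if link else "Без названия",
            "href": link.get("href") if link else None,
            "owner": text_by_testid(card, "video_card_owner"),
            "duration": text_by_testid(card, "video_card_duration"),
            "info": text_by_testid(card, "video_card_additional_info"),
        })
    return cards


# Trimmed stand-in for one block from the file.
sample = """
<div data-testid="grid-item">
  <span data-testid="video_card_duration">31:51</span>
  <div data-testid="video_card_title">
    <a href="/video-227419342_456239589">ГРУЗИМСЯ НА ВОРОНЕЖ, СЛЕПАЯ ЗОНА НА ГРУЗОВИКЕ, МКАД НАКИПЕЛО</a>
  </div>
  <span data-testid="video_card_owner"><a href="/@club227419342">Max48TV</a></span>
  <div data-testid="video_card_additional_info">724 просмотра · 3 месяца назад</div>
</div>
"""

for item in parse_cards(sample):
    print(item)
```

Selecting by data-testid rather than by class is a guess at what is more stable here: the class names in the sample end in what look like generated suffixes (for example --6OepO), which may change between page builds.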