Admin Tools
Editing file: clean.cpython-37.pyc
Compiled from: /opt/alt/python37/lib64/python3.7/site-packages/lxml/html/clean.py

Module docstring:

    A cleanup tool for HTML.

    Removes unwanted tags and content. See the `Cleaner` class for details.

Public names:

    clean_html, clean, Cleaner, autolink, autolink_html, word_break, word_break_html

Module-level patterns and XPath expressions:

    expression\s*\(.*?\)
    @\s*import
    ^data:image/.+;base64
    (?:javascript|jscript|livescript|vbscript|data|about|mocha):
    [\s\x00-\x08\x0B\x0C\x0E-\x19]+
    \[if[\s\n\r]+.*?][\s\n\r]*>
    descendant-or-self::*[@style]
    descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']

class Cleaner

    Instances clean the document of each of the possible offending
    elements. The cleaning is controlled by attributes; you can
    override attributes in a subclass, or set them in the constructor.

    ``scripts``:
        Removes any ``<script>`` tags.

    ``javascript``:
        Removes any Javascript, like an ``onclick`` attribute. Also removes
        stylesheets as they could contain Javascript.

    ``comments``:
        Removes any comments.

    ``style``:
        Removes any style tags.

    ``inline_style``:
        Removes any style attributes. Defaults to the value of the ``style`` option.

    ``links``:
        Removes any ``<link>`` tags.

    ``meta``:
        Removes any ``<meta>`` tags.

    ``page_structure``:
        Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

    ``processing_instructions``:
        Removes any processing instructions.

    ``embedded``:
        Removes any embedded objects (flash, iframes).

    ``frames``:
        Removes any frame-related tags.

    ``forms``:
        Removes any form tags.

    ``annoying_tags``:
        Tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

    ``remove_tags``:
        A list of tags to remove. Only the tags will be removed;
        their content will get pulled up into the parent tag.

    ``kill_tags``:
        A list of tags to kill. Killing also removes the tag's content,
        i.e. the whole subtree, not just the tag itself.

    ``allow_tags``:
        A list of tags to include (default include all).

    ``remove_unknown_tags``:
        Remove any tags that aren't standard parts of HTML.

    ``safe_attrs_only``:
        If true, only include 'safe' attributes (specifically the list
        from the feedparser HTML sanitisation web site).

    ``safe_attrs``:
        A set of attribute names to override the default list of attributes
        considered 'safe' (when safe_attrs_only=True).

    ``add_nofollow``:
        If true, then any <a> tags will have ``rel="nofollow"`` added to them.

    ``host_whitelist``:
        A list or set of hosts that you can use for embedded content
        (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
        You can also implement/override the method
        ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
        implement more complex rules for what can be embedded.
        Anything that passes this test will be shown, regardless of
        the value of (for instance) ``embedded``.

        Note that this parameter might not work as intended if you do not
        make the links absolute before doing the cleaning.

        Note that you may also need to set ``whitelist_tags``.

    ``whitelist_tags``:
        A set of tags that can be included with ``host_whitelist``.
        The default is ``iframe`` and ``embed``; you may wish to
        include other tags like ``script``, or you may want to
        implement ``allow_embedded_url`` for more control. Set to None to
        include all tags.

    This modifies the document *in place*.

    __init__(self, **kw)
        Sets each keyword argument as an attribute. An unrecognised name
        raises TypeError("Unknown parameter: %s=%r").

    __call__(self, doc)
        Cleans the document.
        Raises ValueError("It does not make sense to pass in both
        allow_tags and remove_unknown_tags") when both are set.

    allow_follow(self, anchor)
        Override to suppress rel="nofollow" on some anchors.

    allow_element(self, el)
        Decide whether an element is configured to be accepted or rejected.

        :param el: an element.
        :return: true to accept the element or false to reject/discard it.

    allow_embedded_url(self, el, url)
        Decide whether a URL that was found in an element's attributes or text
        is configured to be accepted or rejected.

        :param el: an element.
        :param url: a URL found on the element.
        :return: true to accept the URL and false to reject it.

    kill_conditional_comments(self, doc)
        IE conditional comments basically embed HTML that the parser
        doesn't normally see. We can't allow anything like that, so
        we'll kill any comments that could be conditional.
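
The ``host_whitelist`` and ``whitelist_tags`` options, together with the ``allow_embedded_url`` docstring, describe a check on the tag, the URL scheme, and the host. The following is a minimal standalone sketch of that kind of check, using only the standard library. The function name ``is_embeddable_url`` and its signature are hypothetical and are not part of lxml; the exact behaviour of the library method may differ.

```python
from urllib.parse import urlsplit


def is_embeddable_url(tag, url, host_whitelist,
                      whitelist_tags=frozenset({"iframe", "embed"})):
    """Hypothetical sketch of a host-whitelist check for embedded content.

    tag            -- tag name of the element carrying the URL (assumed input)
    url            -- the URL found on the element
    host_whitelist -- collection of allowed host names
    whitelist_tags -- tags the whitelist applies to; None means all tags
    """
    # Only elements whose tag is listed may be whitelisted at all.
    if whitelist_tags is not None and tag not in whitelist_tags:
        return False

    parts = urlsplit(url)

    # Reject anything that is not plain http or https (e.g. javascript:).
    if parts.scheme not in ("http", "https"):
        return False

    # Compare the host case-insensitively and without any port suffix.
    host = parts.netloc.lower().split(":", 1)[0]
    return host in host_whitelist


if __name__ == "__main__":
    allowed = {"www.youtube.com"}
    print(is_embeddable_url("iframe", "https://www.youtube.com/embed/x", allowed))
    print(is_embeddable_url("iframe", "javascript:alert(1)", allowed))
```

Because the host comparison needs a network location, a relative URL such as ``/embed/x`` is rejected by this sketch, which matches the docstring's note that links should be made absolute before cleaning.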