html_to_text

This simple library helps you extract useful text from HTML documents.

Installing

To install html_to_text, run:

pip install html_to_text

Usage:

First, prepare your HTML document by removing excess tags from it. To do this, use the html_cleaner object returned by the get_html_cleaner function. The get_html_cleaner function takes three parameters:

remove_without_content: A set of tags to strip while keeping their content.
remove_with_content: A set of tags to remove together with their content.
convert_charrefs: If True, all character references are automatically converted to the corresponding Unicode characters (default True).

>>> from html_to_text import get_html_cleaner
>>> html = """
...   <html>
...       <head>
...           <title>Example</title>
...           <script>Some scripts</script>
...           <style>Some style</style>
...       </head>
...       <body>
...           <div class="content">
...               <h1>This is h1 example.</h1>
...               <p><b>This</b> is <b><s>some</s> text</b> information. This is some <span>text</span>
...               information.</p>
...               <h2>This is h2 example.</h2>
...               <p>This is some text information. This is some text information.</p>
...           </div>
...           <div class="nav">
...               <ul>
...                   <li><a href="#">page 1</a></li>
...                   <li><a href="#">page 2</a></li>
...                   <li><a href="#">page 3</a></li>
...               </ul>
...           </div>
...       </body>
...   </html>
... """
...
>>> cleaner = get_html_cleaner(
...   remove_without_content={'b', 's', 'span'},
...   remove_with_content={'style', 'script'}
... )
...
>>> cleaner.feed(html)
>>> print(cleaner.data)
<html>
  <head>
    <title>Example</title>
    
    
  </head>
  <body>
    <div class="content">
      <h1>This is h1 example.</h1>
      <p>This is some text information. This is some text information.</p>
      <h2>This is h2 example.</h2>
      <p>This is some text information. This is some text information.</p>
    </div>
    <div class="nav">
      <ul>
        <li><a href="#">page 1</a></li>
        <li><a href="#">page 2</a></li>
        <li><a href="#">page 3</a></li>
      </ul>
    </div>
  </body>
</html>
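
To make the difference between the two tag sets concrete, here is a minimal sketch built only on the standard library's html.parser. It is not html_to_text's own implementation: the TagStripper class and its attribute names are hypothetical, and the sketch always converts character references (it does not expose convert_charrefs).

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical stand-in for the cleaner; not html_to_text's own code."""

    def __init__(self, remove_without_content=(), remove_with_content=()):
        super().__init__(convert_charrefs=True)
        self.unwrap = set(remove_without_content)  # tag is stripped, text is kept
        self.drop = set(remove_with_content)       # tag and its text are discarded
        self.depth = 0   # > 0 while inside a tag that is dropped with its content
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1
        elif self.depth == 0 and tag not in self.unwrap:
            # Re-emit the start tag exactly as written, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no content to keep or drop.
        if self.depth == 0 and tag not in self.unwrap and tag not in self.drop:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0 and tag not in self.unwrap:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    @property
    def data(self):
        return "".join(self.parts)


stripper = TagStripper(
    remove_without_content={'b', 's', 'span'},
    remove_with_content={'style', 'script'},
)
stripper.feed('<p><b>This</b> is <s>some</s> <span>text</span>.</p><style>p {}</style>')
stripper.close()
print(stripper.data)  # <p>This is some text.</p>
```

The depth counter is what separates the two behaviours: tags in remove_without_content are simply not re-emitted, while tags in remove_with_content also suppress everything until their closing tag.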

To extract useful text from an HTML document, use the parser object returned by the get_parser function. This function takes the following parameters:

tags_to_save: A set of tags to save.
tags_to_remove: A set of tags to remove.
punctuation: Punctuation marks.
min_allowed_weight: Minimum allowed weight for a chunk (HTML block).
save_attrs: If True, tag attributes are saved (default False).
tag_class: Tag class.
tag_link: Link tag (default 'a').
chunk_class: Chunk class.
tag_wrapper: Wrapper for tags.
chunks_wrapper: Wrapper for chunks (blocks of HTML).
save_chunks_wrapper: Wrapper for 'save' chunks.
splitter: An HTMLSplitter instance, which splits an HTML document into chunks (small blocks of HTML).
chunks_cleaner: An HTMLChunksCleaner instance, which removes tags from chunks and calculates the length of links.
save_chunks_cleaner: An HTMLChunksCleaner instance, which removes tags from chunks.

>>> from html_to_text import get_parser
>>> parser = get_parser(
...   tags_to_save={'title', 'h1', 'h2'},
...   tags_to_remove={'h1', 'h2', 'script', 'style'},
...   min_allowed_weight=2.3
... )
...
>>> parser.feed(cleaner.data)
>>> print(parser.data)
This is some text information. This is some text information. This is some text information. This is 
some text information.
>>> print(parser.saved_tags)
{'h1': ['This is h1 example.'], 'title': ['Example'], 'h2': ['This is h2 example.']}
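
The shape of saved_tags above (tag name mapped to a list of text strings) can be illustrated with a small standard-library sketch. This is an assumption about the idea behind tags_to_save, not the library's actual code: the TagTextCollector class is hypothetical, and it ignores chunk weights, punctuation, and every other get_parser option.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside selected tags."""

    def __init__(self, tags_to_save):
        super().__init__(convert_charrefs=True)
        self.tags_to_save = set(tags_to_save)
        self.open_tags = []   # stack of currently open tags of interest
        self.buffers = []     # one text buffer per open tag of interest
        self.saved_tags = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_save:
            self.open_tags.append(tag)
            self.buffers.append([])

    def handle_data(self, data):
        # Every open tag of interest receives the text, so nesting works.
        for buf in self.buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            text = "".join(self.buffers.pop()).strip()
            self.saved_tags[tag].append(text)


collector = TagTextCollector({'title', 'h1', 'h2'})
collector.feed('<title>Example</title><h1>This is h1 example.</h1><p>Body text.</p>')
collector.close()
print(dict(collector.saved_tags))
```

Text outside the selected tags (the paragraph here) is simply never buffered, which is why only the title and heading end up in the result.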
