HTML::Paragraphs --- inserts paragraph markers and transforms HTML documents


NAME

HTML::Paragraphs --- inserts paragraph markers and transforms HTML documents


REQUIRED LIBRARIES

HTML::Transform


SYNOPSIS

        use HTML::Paragraphs;
        my $p = HTML::Paragraphs->new;
        $p->parse(\<<STOP);
        <h1>Test document</h1>
        Paragraph markers will be automatically inserted
        into this text document.
        This saves you some headache.
        STOP


DESCRIPTION

HTML::Paragraphs is an HTML parser/transformation module that detects paragraphs and inserts <p>...</p> tags automatically. If you run a document such as

        <H1>Test document</H1>
        Here we have one
        paragraph.

        And here we have another.

through the paragraph transformer, the result will be:

        <H1>Test document</H1>
        <P>Here we have one
        paragraph.</P>
        <P>And here we have another.</P>

HTML::Paragraphs is a subclass of HTML::Transform. See the HTML::Transform manpage for a full description of the functions supported by this module. This document describes only the differences between HTML::Paragraphs and HTML::Transform.

$p = HTML::Paragraphs->new ()
Creates and returns a new paragraph parsing object.

$p->parse DOC, ARGS
$p->parse DOC
HTML::Paragraphs->parse DOC, ARGS
HTML::Paragraphs->parse DOC
Transforms the HTML document DOC according to the parser's transformation rules. The default behavior is to insert <P>...</P> around all paragraphs in the document. If you want to add additional rules, you must create a new class or use the set_handler() function.

See the HTML::Transform manpage for more information.

$p->paragraph_split TEXT, ATTR, ATTRSEQ, ORIGTEXT
This parser rule implements the actual paragraph splitting. It is called with a block of text and by default puts <p>...</p> around each paragraph in the text (each block separated by double newlines). If you want to change the behavior, you can override the function.
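
The splitting rule described above can be sketched in plain Perl. This is only an illustration of the idea and is not the module's implementation: the helper name wrap_paragraphs is hypothetical, and the sketch ignores tags, attributes, and the ATTR, ATTRSEQ and ORIGTEXT arguments entirely.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Paragraphs): split TEXT at blank
# lines and wrap each resulting paragraph in <p>...</p>.
sub wrap_paragraphs {
    my ($text) = @_;
    my @paragraphs =
        grep { length }                                  # drop empty chunks
        map  { my $p = $_; $p =~ s/^\s+|\s+$//g; $p }    # trim each chunk
        split /\n\s*\n/, $text;                          # split at blank lines
    return join "\n", map { "<p>$_</p>" } @paragraphs;
}

print wrap_paragraphs(
    "Here we have one\nparagraph.\n\nAnd here we have another.\n"
), "\n";
```

A single newline inside a paragraph is kept as it is; only a blank line starts a new paragraph.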

$self->set_block TAGS
Specifies that the tags in the list TAGS are block level tags. Block level tags are tags such as <h1> and <table> that should not be enclosed in <p>...</p> blocks.

To understand the difference between block and non-block tags, note that

        <b>Bold text</b>

should be converted to

        <p><b>Bold text</b></p>

while

        <h1>Header</h1>

should not be converted to

        <p><h1>Header</h1></p>

HTML::Paragraphs recognizes all block level tags in the HTML standard, so you do not need to call set_block() for those tags.
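
The distinction between block and non-block tags can be sketched as a simple lookup. This is a standalone illustration under stated assumptions, not the module's code: the %is_block table holds only a small, illustrative subset of tags, and wrap_unless_block is a hypothetical helper.

```perl
use strict;
use warnings;

# Illustrative subset of block level tags (an assumption for this sketch;
# the list used by HTML::Paragraphs itself is not reproduced here).
my %is_block = map { $_ => 1 }
    qw(h1 h2 h3 p pre table ul ol dl blockquote address);

# Hypothetical helper: wrap FRAGMENT in <p>...</p> unless it starts with
# a block level tag.
sub wrap_unless_block {
    my ($fragment) = @_;
    my ($tag) = $fragment =~ /^\s*<([A-Za-z][A-Za-z0-9]*)/;
    return $fragment if defined $tag && $is_block{ lc $tag };
    return "<p>$fragment</p>";
}

# Registering a custom tag plays the role that set_block() has in the module.
$is_block{program} = 1;

print wrap_unless_block("<b>Bold text</b>"), "\n";
print wrap_unless_block("<h1>Header</h1>"), "\n";
```

In this sketch tag names are compared case-insensitively, so <H1> and <h1> are treated alike.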

$self->block TAG
Returns true if TAG is a block level tag. If you create a subclass of this class you can override block() to return the right value for the tags you have defined as an alternative to calling set_block().

The block() implementation in HTML::Paragraphs returns the correct value for all standard HTML tags so you probably want to call it for the tags you do not handle.

        sub block {
                my ($self, $tag) = @_;
                return ($tag eq "myblock") || $self->SUPER::block($tag);
        }

$self->set_block_container TAGS
Specifies that the tags in the list TAGS are block containers, i.e. tags that can contain <P>...</P> blocks. A typical example is <TD>...</TD>, since each table cell can contain several paragraphs.

HTML::Paragraphs can correctly handle all tags in the HTML standard, so you only need to call set_block_container() for the tags you have defined.

$self->block_container TAG
Returns true if TAG is a block container tag. If you create a subclass you can override this function to specify which tags are block containers, as an alternative to calling the set_block_container() function.

You should probably call the method in the superclass for all tags you do not handle.

$self->p TEXT, ATTR, ATTRSEQ, ORIGTEXT
This function is called for each paragraph in the text. The default behavior is to put <p>...</p> around each paragraph in the text (each block separated by double newlines).


USAGE GUIDE

One of the most annoying things about writing HTML documents is having to insert <P>...</P> tags around each paragraph in the document. HTML::Paragraphs lets you avoid this hassle. You can pass it a document such as

        <H1>My story</H1>
        I was born many years ago. I was very small
        then, so I don't remember very much of it.

        Later I grew up. I don't remember much of
        that either, but it seemed to involve a lot
        of ants.

and paragraph markers will be added automatically:

        <H1>My story</H1>
        <p>I was born many years ago. I was very small
        then, so I don't remember very much of it.</p>
        <p>Later I grew up. I don't remember much of
        that either, but it seemed to involve a lot
        of ants.</p>

If you want to do additional parsing, you have to create a subclass of HTML::Paragraphs. See the HTML::Transform manpage for more information on this.

Note that you should not use the autofix() function together with HTML::Paragraphs; the two interfere destructively.

To be fair, when you have created a subclass and introduced your own tags, the insertion of paragraph markers is not completely automatic. You have to provide the parser with some information about the tags in the document. The reason is that you want different tags to be treated differently. Consider the example:

        <H1>Test</H1>
        <B>This is a test</B>

You want this to be transformed to:

        <H1>Test</H1>
        <P><B>This is a test</B></P>

So the <H1> tag needs to be treated differently from the <B> tag. For the tags in the HTML standard this does not pose a problem, because we can enumerate them and define their behavior, but for the tags you define in your rule document, the parser will not know how to treat them unless you tell it.

You specify that a tag is a block level tag by calling set_block() or overriding block(). Block level tags are tags that should not be wrapped up in <P> tags. For example, <H1> is a block level tag, since you do not want it replaced with

        <P><H1>Header</H1></P>

Other typical block level tags are: <ADDRESS>, <BLOCKQUOTE>, <PRE>, <DL>, <OL>, <UL> and <P> (since you do not want <P><P>...</P></P>). If you have defined your own tag that behaves like these tags, you need to override block() or call set_block(). Note that the implementation of block() in HTML::Paragraphs handles all the standard HTML tags, so you probably want to call it in your overriding method.

The second thing you need to do is call set_block_container() or override block_container(). This method should return true for every tag that can contain <P>...</P> blocks. For example, you will probably want to do paragraph parsing inside <TD> tags, to make sure that

        <TD>
        Paragraph 1.

        Paragraph 2.
        </TD>

is replaced by

        <TD>
        <P>Paragraph 1.</P>
        <P>Paragraph 2.</P>
        </TD>

But you probably do not want to do paragraph parsing inside <PRE> tags. Just as before, call the superclass method block_container() to get the default behavior for all the standard tags.
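
The effect of a block container can be sketched as follows. This is a minimal standalone illustration, not the module's implementation: the %is_container table is an assumed, illustrative subset, and transform_element is a hypothetical helper that receives a tag name and its text content.

```perl
use strict;
use warnings;

# Illustrative subset of block containers (an assumption for this sketch).
my %is_container = map { $_ => 1 } qw(td th li blockquote);

# Hypothetical helper: split CONTENT into <p>...</p> blocks only when TAG
# is a block container; otherwise leave the content untouched.
sub transform_element {
    my ($tag, $content) = @_;
    return "<$tag>$content</$tag>" unless $is_container{ lc $tag };

    my @paragraphs = grep { /\S/ } split /\n\s*\n/, $content;
    s/^\s+|\s+$//g for @paragraphs;    # trim surrounding whitespace
    return "<$tag>\n"
         . join("\n", map { "<p>$_</p>" } @paragraphs)
         . "\n</$tag>";
}

print transform_element("TD",  "\nParagraph 1.\n\nParagraph 2.\n"), "\n";
print transform_element("PRE", "line 1\n\nline 2"), "\n";
```

The <PRE> content passes through unchanged because "pre" is not in the container table, which mirrors the behavior described above.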

Using set_block() and set_block_container(), your code may look like this:

        $p->set_block(qw(program block));
        $p->set_block_container("block");

Using overrides, your code may look like this:

        sub block {
                my ($self, $tag) = @_;
                # Custom block level tags first, then the standard ones.
                return 1 if grep { $_ eq $tag } qw(program block);
                return $self->SUPER::block($tag);
        }
        sub block_container {
                my ($self, $tag) = @_;
                return ($tag eq "block") || $self->SUPER::block_container($tag);
        }


VERSION HISTORY

Current version: 1.0 beta 3

Changes since 1.0 beta 1


SEE ALSO

the HTML::Transform manpage


AUTHOR

Niklas Frykholm, niklas@kagi.com

This program can be used and distributed freely.
