how to improve PDF to HTML conversion #10

@mrchristian

Currently the Semantic Climate project converts PDFs to HTML.

The content is the IPCC Climate report AR6, and we need to improve its markup for further semantic annotation, reuse, and presentation. From a typesetting perspective, and to free us from a destructive reliance on PDF (note that we can get PDF-like results in a non-destructive way using Vivliostyle), I (@mrchristian) would like to produce HTML that renders better in Vivliostyle than the current output does.

The output needs improvement. It currently contains a number of elements that may not be needed, e.g., page numbers and inline styles.
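As a starting point for discussion, here is a minimal sketch of one such cleanup: dropping inline `style` attributes while leaving everything else in place. It uses only Python's standard `html.parser` module. The names `StyleStripper` and `strip_inline_styles` are hypothetical and not part of the current tooling. Page numbers are not handled here, because removing them depends on how the converter marks them up.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit the HTML it is fed, dropping only `style` attributes."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _open_tag(self, tag, attrs, closing=">"):
        kept = [(k, v) for k, v in attrs if k != "style"]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._open_tag(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_inline_styles(html_text):
    """Return html_text with every inline `style` attribute removed."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

For example, `strip_inline_styles('<p style="color:red" class="a">Hi</p>')` keeps the `class` attribute and the text but drops the `style` attribute. Tag and attribute names come back lowercased, which is how `html.parser` reports them.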

The objective would be to improve the output with tooling that can integrate with the current workflow.

The suggestion would be to create a way to evaluate the process by collating information on the issue:

  1. Current tooling
  2. Condition of the source PDFs
  3. Problems with outputs
  4. List of parts and markup whose integrity we need to retain
  5. Define what we want in our target outputs
  6. Do we want other output formats for richer markup and other interoperability?
  7. List and evaluate tools
  8. Consult experts in the field: pandoc, le-tex, fidus, vivlio, css-rocks, etc

This research can be conducted in a wiki page on the Semantic Climate repository.
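To make item 3 (problems with outputs) measurable, a small audit script could produce numbers to paste into that wiki page. The sketch below is only an assumption about what might be worth counting: tags used, elements carrying inline styles, and text nodes consisting solely of digits (possible page numbers, although figure or table numbers would match too). `MarkupAudit` and `audit` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupAudit(HTMLParser):
    """Collect simple statistics about an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()        # how often each tag name occurs
        self.styled = 0              # elements with a style attribute
        self.numeric_only_text = 0   # text nodes made up only of digits

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if any(name == "style" for name, _ in attrs):
            self.styled += 1

    def handle_data(self, data):
        # Heuristic only: a digits-only text node may be a page number.
        if data.strip().isdigit():
            self.numeric_only_text += 1


def audit(html_text):
    """Return a dict of counts describing the markup in html_text."""
    parser = MarkupAudit()
    parser.feed(html_text)
    parser.close()
    return {
        "tags": dict(parser.tags),
        "styled_elements": parser.styled,
        "numeric_only_text_nodes": parser.numeric_only_text,
    }
```

Running `audit` on the Chapter 8 HTML before and after any change to the tooling would give a rough, repeatable way to compare outputs.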

Here are sample files:

PDF source - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.pdf

HTML full text - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.html

Tasks

  1. Link to current PDF to HTML tooling.
  2. Consult Single Source Publishing Community https://github.com/singlesourcepub/community/discussions and others: le-tex, pandoc, css rocks?
