Pdf to odt/docx conversion has me weeping!
from Maroon@lemmy.world to selfhosted@lemmy.world on 19 Jun 02:37
https://lemmy.world/post/31653955

You will see that I have posted about this before asking for suggestions on which software I can use to convert PDF to docx/odt.

I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw on them to teach my students. I often have to take the text and modify it or build on it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft was stored as a PDF. Converting them to text has become the bane of my life.

I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep saying pandoc, but pandoc does not convert PDF to any other format. For pandoc, PDF can only be the output format.

Is there a magic open source solution that I have missed out?

#selfhosted

anamethatisnt@sopuli.xyz on 19 Jun 02:57

Would an alternative be to simply edit the PDFs?

The German software FlexiPDF still lets you buy a yearly version for a one-off sum, and it offers a free trial with a watermark so you can check whether it works well enough for you before you buy.
www.softmaker.com/en/products/flexipdf

Botzo@lemmy.world on 19 Jun 03:17

pdf2docx.readthedocs.io seems to fit the bill. I can’t vouch for it.
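For anyone who wants to try it, here is a minimal sketch of what a conversion with pdf2docx could look like. It assumes the package is installed (`pip install pdf2docx`); the helper names `docx_path_for` and `convert_pdf_to_docx` are made up for illustration, and how well the result preserves layout depends entirely on the PDF.

```python
from pathlib import Path


def docx_path_for(pdf_path: str) -> Path:
    """Return the output path: same folder and name, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf_to_docx(pdf_path: str) -> Path:
    """Convert one PDF to DOCX with pdf2docx and return the output path."""
    # Imported lazily so the path helper works without pdf2docx installed.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    converter = Converter(pdf_path)
    try:
        converter.convert(str(out_path))  # converts all pages by default
    finally:
        converter.close()
    return out_path
```

Calling `convert_pdf_to_docx("old_grant_draft.pdf")` (a placeholder file name) would write `old_grant_draft.docx` next to the source file.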

PDF is such a curse. I say this as a person currently tasked with deploying new mysteriously complex enterprise PDF conversion software for technical documents. The rabbit hole is so deep.

mesamunefire@piefed.social on 19 Jun 06:45

As a dev: the reason PDF is so strange is that it's a compound format. It can be just images strung together. It can also be pure text with fonts, etc.

If you open the file as a text file, you can see this. It's many different formats in a trenchcoat.
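A rough way to see this for yourself, sketched in Python with only the standard library: scan the raw bytes for a few well-known PDF name tokens. The function name `describe_pdf_bytes` is hypothetical, and the counts are only a hint, because many PDFs keep their objects inside compressed streams that a plain byte scan cannot see.

```python
import re


def describe_pdf_bytes(data: bytes) -> dict:
    """Count a few object types by scanning raw PDF bytes for name tokens."""
    return {
        # The first bytes of a PDF carry a version header such as %PDF-1.7.
        "header": data[:8].decode("latin-1").strip(),
        # Font dictionaries declare /Type /Font.
        "fonts": len(re.findall(rb"/Type\s*/Font\b", data)),
        # Embedded raster images are XObjects with /Subtype /Image.
        "images": len(re.findall(rb"/Subtype\s*/Image\b", data)),
        # Each "stream" keyword opens a binary payload ("endstream" is skipped).
        "streams": len(re.findall(rb"\bstream\b", data)),
    }
```

With a real file you would pass `Path("some.pdf").read_bytes()`. Many images and no fonts suggests a scan; fonts and no images suggests real text.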

Botzo@lemmy.world on 19 Jun 09:13

Yeah, also a dev here. I’d be so happy if they’d parted ways with the 90s legacy bits at some point. Just glad there are enough parsing libraries that I’ll never need to care (right? Please tell me I’m right!).

mesamunefire@piefed.social on 19 Jun 21:11

I hope you're right too lol.

observantTrapezium@lemmy.ca on 19 Jun 07:07

It’s a curse because it’s used for things other than what it’s intended to. It’s doing a good job representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.

Botzo@lemmy.world on 19 Jun 09:00

This is probably my first time ever using it for an appropriate purpose as this team’s technical docs are destined for the press (and digital distribution). They just have no idea how to software, so I was brought in to build bridges between and ultimately simplify all their tools.

Treczoks@lemmy.world on 19 Jun 13:33

It is not a curse. It does exactly what it is intended to do: create an archive of a document that is universally reproducible.

It is a very well designed cul-de-sac for exactly this purpose. Using it for anything else is calling for trouble.

whimsy@lemmy.zip on 19 Jun 03:26

Maybe LibreOffice Draw can help you out? It has PDF editing capabilities.

Treczoks@lemmy.world on 19 Jun 13:35

If you ever need to edit a PDF that way, just use Inkscape. It is way better than LO Draw for that.

JASN_DE@feddit.org on 19 Jun 03:30

I haven’t tested that part of it yet, but the self-hostable StirlingPDF offers conversion from PDF to a number of formats.

The rest I use it for works fine, so maybe that could be an option.

fossilesque@mander.xyz on 19 Jun 03:40

StirlingPDF does this. I’ll dm you the one I host for my writing group.

observantTrapezium@lemmy.ca on 19 Jun 06:45

I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.

What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
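A minimal sketch of that render-then-OCR route, assuming the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed and on the PATH. The function names, the 300 dpi default, and the `page` file prefix are illustrative choices, not what the commenter actually used.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """Build the pdftoppm call that renders each page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_command(image_path: str, lang: str = "eng") -> list[str]:
    """Build the tesseract call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path: str, work_dir: str, lang: str = "eng") -> str:
    """Render every page to an image, OCR each one, and join the text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    prefix = str(work / "page")
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work.glob("page-*.png")):
        result = subprocess.run(
            ocr_command(str(image), lang),
            check=True,
            capture_output=True,
            text=True,
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

This gives plain text only. Headings, lists, and tables still have to be rebuilt by hand, which matches the "no one size fits all" experience.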

fossilesque@mander.xyz on 19 Jun 07:07

StirlingPDF is basically one size fits all.

observantTrapezium@lemmy.ca on 19 Jun 07:19

Interesting, I’ll keep it in mind next time I have to deal with this problem (hopefully never but who knows).

A few years ago I was in contact with researchers who were developing an AI tool to parse PDFs (I think they cared about extracting data rather than converting to editable formats). From their material I got the impression that it’s extremely difficult to do right using traditional algorithms.

fossilesque@mander.xyz on 19 Jun 07:52

bizdelnick@lemmy.ml on 19 Jun 09:08

There is no solution. It is impossible to convert a PDF to an editable format correctly. The exception is a “hybrid PDF” that has an embedded editable document. If you need to edit PDFs that you created yourself, store them in hybrid format.

cmnybo@discuss.tchncs.de on 19 Jun 11:42

The only real solution is to always keep your source files. PDFs are not intended to be edited.

Treczoks@lemmy.world on 19 Jun 12:52

The problem lies in the PDFs themselves. They contain objects that represent lines of glyphs, if you are lucky. A conversion tool can guess which of those lines belong together and produce the text.

It cannot know any intentions behind it, though. Take a numbered list. The first line is two line objects: the number plus the . or the ), and the first line of text. The conversion tool can now guess. As the line blocks with the numbers are all left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF is giving any hints.
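To make the guessing concrete, here is a toy sketch of the kind of heuristic a converter might apply. The `Fragment` type, the coordinates, and the rule itself are all assumptions for illustration; real tools use far more elaborate logic and still get it wrong.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned run of text, as a converter might see it (hypothetical)."""
    x: float
    y: float
    text: str


# Matches markers such as "1." or "12)".
LIST_MARKER = re.compile(r"^\d+[.)]$")


def group_rows(fragments, tolerance=2.0):
    """Group fragments whose baselines fall into the same vertical band."""
    rows = defaultdict(list)
    for frag in fragments:
        rows[round(frag.y / tolerance)].append(frag)
    # PDF y grows upward, so sort descending to read top to bottom.
    ordered = sorted(rows.items(), reverse=True)
    return [sorted(row, key=lambda f: f.x) for _, row in ordered]


def guess_structure(fragments):
    """Guess 'numbered list', 'table', or 'plain text' from layout alone."""
    two_part = [row for row in group_rows(fragments) if len(row) == 2]
    if not two_part:
        return "plain text"
    # The only evidence is what the left-hand fragment looks like.
    if all(LIST_MARKER.match(row[0].text) for row in two_part):
        return "numbered list"
    return "table"
```

A two-column table whose first column happens to contain "1.", "2.", "3." would be reported as a numbered list here, which is exactly the ambiguity described above.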

And that is the easy part. This assumes that the document either uses default fonts, or keeps its embedded fonts untouched. If they use embedded fonts and a PDF optimizer that only embeds the used characters and renumbers them, any copy or conversion tool is bound to fail.
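A toy model of why renumbering breaks copying, assuming a subsetter that hands out glyph codes in order of first use. All function names are invented for illustration. In a real PDF the reverse table, when present, lives in the font's ToUnicode map; if the optimizer drops it, the tool only has the meaningless codes.

```python
def subset_font(text: str) -> dict[str, int]:
    """Assign glyph codes 1, 2, 3... in order of first use."""
    codes: dict[str, int] = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes


def encode(text: str, codes: dict[str, int]) -> list[int]:
    """What ends up in the page content: glyph codes, not characters."""
    return [codes[ch] for ch in text]


def naive_copy(glyph_codes: list[int]) -> str:
    """Copy without a reverse map: codes are misread as character codes."""
    return "".join(chr(code) for code in glyph_codes)


def mapped_copy(glyph_codes: list[int], to_unicode: dict[int, str]) -> str:
    """Copy with a reverse map: each code is translated back to its character."""
    return "".join(to_unicode[code] for code in glyph_codes)


codes = subset_font("banana")
stored = encode("banana", codes)
reverse = {code: ch for ch, code in codes.items()}
print(repr(naive_copy(stored)))    # control characters, i.e. gibberish
print(mapped_copy(stored, reverse))  # banana
```

The page still renders correctly in both cases, because drawing only needs the glyph shapes. Only text extraction depends on the reverse map.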

Same with protected PDFs where you simply cannot copy the text from the start.

And then there are PDFs that just consist of scanned pages. Here you would need OCR software to get something readable out of them.

PDF is an archival, output format, the end of a process. Not something to work from.

Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.

ChaoticNeutralCzech@feddit.org on 19 Jun 19:31

Renumbering characters during font minimization? I haven’t encountered that, it would break searching and copying.

Anyway, PDFs for example don’t even say whether a line of text is left-aligned, centered, or justified. They usually store the coordinates of the first character and then the spacing to each subsequent one, unless the font defines it.
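Since alignment is not stored, a converter has to infer it from where each line starts and ends. A minimal sketch of such a guess, with a made-up function name, margins, and tolerance:

```python
def guess_alignment(lines, left_margin, right_margin, tolerance=1.5):
    """Guess paragraph alignment from (x_start, x_end) pairs, one per line."""

    def near(a, b):
        return abs(a - b) <= tolerance

    # The last line of a justified paragraph is usually short, so skip it
    # when checking the edges (unless it is the only line).
    body = lines[:-1] if len(lines) > 1 else lines
    starts_left = all(near(x0, left_margin) for x0, _ in body)
    ends_right = all(near(x1, right_margin) for _, x1 in body)
    centre = (left_margin + right_margin) / 2
    centred = all(near((x0 + x1) / 2, centre) for x0, x1 in lines)

    # Justified lines are also centred, so test for justified first.
    if starts_left and ends_right:
        return "justified"
    if starts_left:
        return "left"
    if ends_right:
        return "right"
    if centred:
        return "centered"
    return "unknown"
```

A first-line indent, a single-line paragraph, or slightly ragged spacing is enough to fool this, which is why converted documents so often come out with odd paragraph formatting.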

And what if the document contains text boxes, or other Word objects? Well, the text is separate from the underlying rectangle (if there is one) and it’s up to the conversion tool to guess if it’s part of the main text layer.

Sorry, it’s really hard to edit PDFs. You might want to use Inkscape for editing the graphical parts. If you also need to edit paragraphs, I suggest recreating the document by pasting them into Word/LibreOffice, and importing any graphical shapes as SVGs (use Inkscape for the conversion, then you can try Word’s “Graphic > Convert to Shapes” feature).

Really, every software that outputs PDF should treat it as an export process, hopefully making it clearer that “saving as PDF” is visually lossless but structurally lossy and messy.

Treczoks@lemmy.world on 20 Jun 14:05

The compressing and renumbering seems to be more common with embedded Chinese fonts; space-wise it makes a lot of sense. But yes, mark and copy text, paste it into Word or Writer, and you get gibberish. Can’t verify the search, though. And, of course, Google Translate can’t do anything with it, either.