[java + pdf] Interpreting a file

**Leandro** · 21-11-2011, 17:37

Hi everyone,
sorry if this is borderline off-topic, but I'm not sure where else to post.
I need to read some PDFs and extract data from them. I'm using PDFBox for Java, but I'm finding it hard to understand and apply the API.
The idea is that I can locate the data by position: for example, the first row holds the name, the second the VAT number (PIVA), and the third holds three values side by side in tabular form (one at the start of the row, one in the middle and one on the right). After these three rows the same kind of data starts again for another user, and so on for n pages.
My main problem is telling one row from another (the first from the second, the second from the third, and so on).
Does anyone know where I can study the COS structure (specifically the Streams) to get information such as the row number, the position, and so on?
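One way to picture the "which row am I on" problem is to group text fragments by their vertical coordinate. The sketch below is plain Java and does not call PDFBox: `Fragment`, `RowGrouper` and the `tolerance` parameter are hypothetical names introduced here, and the x/y values are assumed to have been obtained elsewhere (for instance from whatever position objects the PDF library exposes). It also assumes that y grows downward; if it does not, the row order would need to be reversed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for one piece of text and its position on the page. */
final class Fragment {
    final float x;
    final float y;
    final String text;

    Fragment(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class RowGrouper {
    /**
     * Groups fragments into rows. A fragment joins the current row when its y
     * differs from the row's first fragment by at most the given tolerance.
     * Rows come back ordered by ascending y, fragments in a row by ascending x.
     */
    static List<List<Fragment>> groupIntoRows(List<Fragment> fragments, float tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            // A jump in y larger than the tolerance starts a new row.
            if (!current.isEmpty() && Math.abs(f.y - current.get(0).y) > tolerance) {
                rows.add(current);
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            rows.add(current);
        }
        // Left-to-right order inside each row.
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        }
        return rows;
    }
}
```

With the layout described above, rows 1 and 2 of each user would come out as single-fragment rows and row 3 as three fragments ordered left, centre, right. The right tolerance depends on the font size and would have to be tuned against real files.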

For now I'm reading this: The Cos Object Tree

Together with this example (which ships with the Apache PDFBox package):

code:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.pdfbox.examples.pdmodel;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;

import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import org.apache.pdfbox.pdmodel.common.PDStream;

import org.apache.pdfbox.util.PDFOperator;


/**
 * This is an example that will replace a string in a PDF with a new one.
 *
 * The example is taken from the pdf file format specification.
 *
 * @author Ben Litchfield
 * @version $Revision: 1.3 $
 */
public class ReplaceString
{
    /**
     * Constructor.
     */
    public ReplaceString()
    {
        super();
    }

    /**
     * Locate a string in a PDF and replace it with a new string.
     *
     * @param inputFile The PDF to open.
     * @param outputFile The PDF to write to.
     * @param strToFind The string to find in the PDF document.
     * @param message The message to write in the file.
     *
     * @throws IOException If there is an error writing the data.
     * @throws COSVisitorException If there is an error writing the PDF.
     */
    public void doIt( String inputFile, String outputFile, String strToFind, String message)
        throws IOException, COSVisitorException
    {
        // the document
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load( inputFile );
            List pages = doc.getDocumentCatalog().getAllPages();
            for( int i=0; i<pages.size(); i++ )
            {
                PDPage page = (PDPage)pages.get( i );
                PDStream contents = page.getContents();
                PDFStreamParser parser = new PDFStreamParser(contents.getStream() );
                parser.parse();
                List tokens = parser.getTokens();
                for( int j=0; j<tokens.size(); j++ )
                {
                    Object next = tokens.get( j );
                    if( next instanceof PDFOperator )
                    {
                        PDFOperator op = (PDFOperator)next;
                        //Tj and TJ are the two operators that display
                        //strings in a PDF
                        if( op.getOperation().equals( "Tj" ) )
                        {
                            //Tj takes one operand and that is the string
                            //to display, so let's update that operand
                            COSString previous = (COSString)tokens.get( j-1 );
                            String string = previous.getString();
                            string = string.replaceFirst( strToFind, message );
                            previous.reset();
                            previous.append( string.getBytes("ISO-8859-1") );
                        }
                        else if( op.getOperation().equals( "TJ" ) )
                        {
                            COSArray previous = (COSArray)tokens.get( j-1 );
                            for( int k=0; k<previous.size(); k++ )
                            {
                                Object arrElement = previous.getObject( k );
                                if( arrElement instanceof COSString )
                                {
                                    COSString cosString = (COSString)arrElement;
                                    String string = cosString.getString();
                                    string = string.replaceFirst( strToFind, message );
                                    cosString.reset();
                                    cosString.append( string.getBytes("ISO-8859-1") );
                                }
                            }
                        }
                    }
                }
                //now that the tokens are updated we will replace the
                //page content stream.
                PDStream updatedStream = new PDStream(doc);
                OutputStream out = updatedStream.createOutputStream();
                ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
                tokenWriter.writeTokens( tokens );
                page.setContents( updatedStream );
            }
            doc.save( outputFile );
        }
        finally
        {
            if( doc != null )
            {
                doc.close();
            }
        }
    }

    /**
     * This will open a PDF and replace a string if it finds it.
     *
     * see usage() for commandline
     *
     * @param args Command line arguments.
     */
    public static void main(String[] args)
    {
        ReplaceString app = new ReplaceString();
        try
        {
            if( args.length != 4 )
            {
                app.usage();
            }
            else
            {
                app.doIt( args[0], args[1], args[2], args[3] );
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    /**
     * This will print out a message telling how to use this example.
     */
    private void usage()
    {
        System.err.println( "usage: " + this.getClass().getName() +
            " <input-file> <output-file> <search-string> <Message>" );
    }
}
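A side note on the example above: `String.replaceFirst` treats its first argument as a regular expression and its second as a replacement pattern, so a search string containing characters such as `.` or `(` does not match literally. A minimal sketch of a literal variant, using only `java.util.regex` (the class and method names `LiteralReplace` and `replaceFirstLiteral` are made up for illustration and are not part of PDFBox):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    /**
     * Replaces the first literal occurrence of search in input.
     * Pattern.quote stops regex metacharacters in search from being interpreted;
     * Matcher.quoteReplacement does the same for '$' and '\' in replacement.
     */
    static String replaceFirstLiteral(String input, String search, String replacement) {
        return input.replaceFirst(Pattern.quote(search), Matcher.quoteReplacement(replacement));
    }
}
```

Whether this matters depends on the strings being searched for; plain alphanumeric text behaves the same either way.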

Thanks, everyone.
Leandro

**Leandro** · 22-11-2011, 15:40

up

**Leandro** · 22-11-2011, 17:51

Text Extraction PDFBox

code:

// Load the document (no need to create an empty PDDocument first)
// http://pdfbox.apache.org/apidocs/org...DDocument.html
PDDocument f = PDDocument.load( path );

// http://pdfbox.apache.org/apidocs/org...tStripper.html
PDFTextStripper ts = new PDFTextStripper();

// The text of the whole document, including characters such as \n etc.
String textToParse = ts.getText(f);

// Release the document once the text has been extracted
f.close();
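Once the text is available as a single string with line breaks, the three-row layout described in the first post can be rebuilt with plain string handling. The following is only a sketch under assumptions that are not confirmed by the thread: every user occupies exactly three non-blank lines, the extracted lines arrive in reading order, and the three values on the third line contain no internal whitespace. `UserRecord` and `RecordParser` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical container for one user's three rows. */
final class UserRecord {
    final String name;
    final String vatNumber;
    final List<String> columns;

    UserRecord(String name, String vatNumber, List<String> columns) {
        this.name = name;
        this.vatNumber = vatNumber;
        this.columns = columns;
    }
}

final class RecordParser {
    /** Splits the extracted text into lines and groups them three at a time. */
    static List<UserRecord> parse(String text) {
        List<String> lines = new ArrayList<>();
        // \R matches any line terminator (\n, \r\n, ...); blank lines are skipped.
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }

        List<UserRecord> records = new ArrayList<>();
        // A trailing group of fewer than three lines is ignored.
        for (int i = 0; i + 2 < lines.size(); i += 3) {
            // Assumption: the tabular values are separated by whitespace only.
            List<String> columns = Arrays.asList(lines.get(i + 2).split("\\s+"));
            records.add(new UserRecord(lines.get(i), lines.get(i + 1), columns));
        }
        return records;
    }
}
```

Calling `RecordParser.parse(textToParse)` would then give one `UserRecord` per user. If the real files contain headers, footers or page numbers, those lines would have to be filtered out before grouping.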

Have fun
Leandro

Solved