Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migration of unmigrated content due to installation of a new plugin

Introduction

he extension mechanism of the dLibra server is based on the Java Plugin Framework (JPF) library. The basic element of that mechanism is the JPF plugin description file:

Code Block
xml
xml
<?xml version="1.0" ?>
<!DOCTYPE plugin PUBLIC "-//JPF//Java Plug-in Manifest 0.7" "http://jpf.sourceforge.net/plugin_0_7.dtd">
<plugin id="pl.psnc.dlibra.content" version="$Revision: 1.2 $"
	vendor="PSNC">
	<extension-point id="extraction.TextualContentExtractor">
		<parameter-def id="class" />
		<parameter-def id="order" />
	</extension-point>
</plugin>

The file shown above only defines one server extension point described below.

Note
titleUwaga

Interfejsy programistyczne wyszczególnione w poniższych opisach znajdują się w bibliotece programistycznej

dlproj
os
1dlibra-server-extension-api
os

.

The extraction.TextualContentExtractor Extension Point

The extraction.TextualContentExtractor extension set is for extracting text from files with publication content. The extension makes it possible to index publication content regardless of its format. In the case of text document formats, such as HTML, text can be accessed almost instantly. In the case of other formats, files must be prepared to make text extraction possible. Only then will extensions be able to extract the text and pass it on to be indexed.

That extension has two parameters:

  • class – the name of the class which implements the programming interface of the extension, and
  • order – the parameter which is responsible for extension selection order (which makes it possible to determine which extension will be used when more than one extension can handle the given content format).

The programming interface (Java language) for that extension is

dljdoc
os
1dlibra-server-extension-api
2pl.psnc.dlibra.content.extraction.TextualContentExtractor
os
. For more information about it, see the programming documentation (JavaDocs).

The dLibra system is provided with a pre-installed set of extensions of that type. The set includes:

  • text extraction from simple formats, such as CHM, HTML, RTF, or TXT (
    dlproj
    os
    1dlibra-server-extension-tce-basic
    os
    ),
  • text extraction from the DjVu format (
    dlproj
    os
    1dlibra-server-extension-tce-djvu
    os
    ),
  • text extraction from formats supported by an external mechanism, LIUS (
    dlproj
    os
    1dlibra-server-extension-tce-lius
    os
    ), and
  • text extraction from the PDF format (
    dlproj
    os
    1dlibra-server-extension-tce-pdf
    os
    ).

Those extensions are briefly described below.

The Basic Extension

The basic extension has a set of classes which implement interface

dljdoc
os
1dlibra-server-extension-api
2pl.psnc.dlibra.content.extraction.TextualContentExtractor
os
and make it possible to extract text from files in the following formats:

  • CHM - class
    dljdoc
    os
    1dlibra-server-extension-tce-basic
    2pl.psnc.dlibra.content.extraction.CHMTextualContentExtractor
    os
  • HTML - class
    dljdoc
    os
    1dlibra-server-extension-tce-basic
    2pl.psnc.dlibra.content.extraction.HTMLTextualContentExtractor
    os
  • RTF - class
    dljdoc
    os
    1dlibra-server-extension-tce-basic
    2pl.psnc.dlibra.content.extraction.RTFTextualContentExtractor
    os
  • TXT - class
    dljdoc
    os
    1dlibra-server-extension-tce-basic
    2pl.psnc.dlibra.content.extraction.TXTTextualContentExtractor
    os

The DjVu extension

The DjVu extension makes it possible to extract text from the text layer of files in the DjVu format (if they have such a layer). That task is done by class  

dljdoc
os
1dlibra-server-extension-tce-djvu
2pl.psnc.dlibra.content.extraction.DjVuTextualContentExtractor
os

The LIUS Extension

The LIUS extension makes use of the (external) LIUS (Lucene Index Update and Search) library which allows, among other things, text extraction from files with the following formats: MsWord, MsExcel, MsPowerPoint, RTF, PDF, XML, HTML, TXT, OpenOffice, ZIP, MP3, VCard, Latex, and JavaBeans. The extension class which extracts text from those file formats is 

dljdoc
os
1dlibra-server-extension-tce-lius
2pl.psnc.dlibra.content.extraction.LIUSTextualContentExtractor
os
.

The PDF extension

Extracting text from files in the PDF format is based on the (external) PDFBox library. The class which implements interface 

dljdoc
os
1dlibra-server-extension-api
2pl.psnc.dlibra.content.extraction.TextualContentExtractor
os
i and extracts text from files of that type is
dljdoc
os
1dlibra-server-extension-tce-pdf
2pl.psnc.dlibra.content.extraction.PDFTextualContentExtractor
os
.