Introduction

The extension mechanism of the dLibra server is based on the Java Plugin Framework (JPF) library. The basic element of that mechanism is the JPF plugin description file:

<?xml version="1.0" ?>
<!DOCTYPE plugin PUBLIC "-//JPF//Java Plug-in Manifest 0.7" "http://jpf.sourceforge.net/plugin_0_7.dtd">
<plugin id="pl.psnc.dlibra.content" version="$Revision: 1.2 $"
	vendor="PSNC">
	<extension-point id="extraction.TextualContentExtractor">
		<parameter-def id="class" />
		<parameter-def id="order" />
	</extension-point>
</plugin>

The file shown above defines only one server extension point, which is described below.

Note

The programming interfaces listed in the descriptions below are located in the dlibra-server-extension-api programming library.

The extraction.TextualContentExtractor Extension Point

The extraction.TextualContentExtractor extension set is for extracting text from files with publication content. The extension makes it possible to index publication content regardless of its format. In the case of text document formats, such as HTML, text can be accessed almost instantly. In the case of other formats, files must be prepared to make text extraction possible. Only then will extensions be able to extract the text and pass it on to be indexed.

That extension has two parameters:

class – the name of the class which implements the programming interface of the extension, and
order – the parameter which is responsible for extension selection order (which makes it possible to determine which extension will be used when more than one extension can handle the given content format).
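
As a rough illustration of how these two parameters might be supplied, the sketch below shows a JPF plugin manifest that contributes an extension to this extension point. The plugin id, vendor, library path, class name, and order value are all hypothetical. The text above does not state whether lower or higher order values take precedence, nor which additional steps dLibra may require to deploy such a plugin, so treat this as a sketch under those assumptions.

```xml
<?xml version="1.0" ?>
<!DOCTYPE plugin PUBLIC "-//JPF//Java Plug-in Manifest 0.7" "http://jpf.sourceforge.net/plugin_0_7.dtd">
<!-- Hypothetical plugin; id, version, and vendor are placeholders. -->
<plugin id="org.example.dlibra.tce.custom" version="1.0.0"
	vendor="Example">
	<requires>
		<!-- The plugin that declares the extension point. -->
		<import plugin-id="pl.psnc.dlibra.content" />
	</requires>
	<runtime>
		<!-- Assumed location of the compiled classes inside the plugin. -->
		<library id="custom-extractor" path="classes/" type="code" />
	</runtime>
	<extension plugin-id="pl.psnc.dlibra.content"
		point-id="extraction.TextualContentExtractor" id="customExtractor">
		<!-- Hypothetical class implementing TextualContentExtractor. -->
		<parameter id="class"
			value="org.example.dlibra.CustomTextualContentExtractor" />
		<!-- Placeholder value; the ordering direction is an assumption. -->
		<parameter id="order" value="10" />
	</extension>
</plugin>
```

The extension element refers to the extension point by the id of the declaring plugin (pl.psnc.dlibra.content) and the id of the point itself, and its two parameter elements correspond to the class and order definitions in the description file shown in the introduction.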

The programming interface (Java language) for that extension is pl.psnc.dlibra.content.extraction.TextualContentExtractor (in the dlibra-server-extension-api library). For more information about it, see the programming documentation (JavaDocs).

The dLibra system is provided with a pre-installed set of extensions of that type. The set includes:

text extraction from simple formats, such as CHM, HTML, RTF, or TXT (dlibra-server-extension-tce-basic),
text extraction from the DjVu format (dlibra-server-extension-tce-djvu),
text extraction from formats supported by an external mechanism, LIUS (dlibra-server-extension-tce-lius), and
text extraction from the PDF format (dlibra-server-extension-tce-pdf).

Those extensions are briefly described below.

The Basic Extension

The basic extension (dlibra-server-extension-tce-basic) has a set of classes which implement the pl.psnc.dlibra.content.extraction.TextualContentExtractor interface (from dlibra-server-extension-api) and make it possible to extract text from files in the following formats:

CHM - class pl.psnc.dlibra.content.extraction.CHMTextualContentExtractor
HTML - class pl.psnc.dlibra.content.extraction.HTMLTextualContentExtractor
RTF - class pl.psnc.dlibra.content.extraction.RTFTextualContentExtractor
TXT - class pl.psnc.dlibra.content.extraction.TXTTextualContentExtractor

The DjVu Extension

The DjVu extension makes it possible to extract text from the text layer of files in the DjVu format (if they have such a layer). That task is done by the pl.psnc.dlibra.content.extraction.DjVuTextualContentExtractor class (in the dlibra-server-extension-tce-djvu library).

The LIUS Extension

The LIUS extension makes use of the (external) LIUS (Lucene Index Update and Search) library which allows, among other things, text extraction from files in the following formats: MsWord, MsExcel, MsPowerPoint, RTF, PDF, XML, HTML, TXT, OpenOffice, ZIP, MP3, VCard, Latex, and JavaBeans. The extension class which extracts text from those file formats is pl.psnc.dlibra.content.extraction.LIUSTextualContentExtractor (in the dlibra-server-extension-tce-lius library).

The PDF Extension

Extracting text from files in the PDF format is based on the (external) PDFBox library. The class which implements the pl.psnc.dlibra.content.extraction.TextualContentExtractor interface (from dlibra-server-extension-api) and extracts text from files of that type is pl.psnc.dlibra.content.extraction.PDFTextualContentExtractor (in the dlibra-server-extension-tce-pdf library).
