Introduction

The extension mechanism of the dLibra server is based on the Java Plugin Framework (JPF) library. The basic element of that mechanism is the JPF plugin description file:


<?xml version="1.0" ?>
<!DOCTYPE plugin PUBLIC "-//JPF//Java Plug-in Manifest 0.7" "http://jpf.sourceforge.net/plugin_0_7.dtd">
<plugin id="pl.psnc.dlibra.content" version="$Revision: 1.2 $"
	vendor="PSNC">
	<extension-point id="extraction.TextualContentExtractor">
		<parameter-def id="class" />
		<parameter-def id="order" />
	</extension-point>
</plugin>

The file shown above defines only one server extension point, which is described below.

Note

The programming interfaces listed in the descriptions below are located in the dlibra-server-extension-api programming library.

...

The extraction.TextualContentExtractor Extension Point

The extraction.TextualContentExtractor extension set is for extracting text from files with publication content. The extension makes it possible to index publication content regardless of its format. In the case of text document formats, such as HTML, text can be accessed almost instantly. In the case of other formats, files must be suitably prepared to make text extraction possible. Only then will extensions be able to extract the text and pass it on to be indexed.

That extension has two parameters:

class – the name of the class which implements the programming interface of the extension, and
order – the parameter which is responsible for extension selection order (which makes it possible to determine which extension will be used when more than one extension can handle the given content format).
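
As an illustration of how the two parameters might be supplied, the following is a minimal sketch of a JPF manifest for a plugin that contributes to the extension point. The plugin id, version, extension id, class name, and order value are hypothetical and chosen only for the example. How the server interprets the order value is not specified here, and the runtime library declaration a real plugin would need is omitted.

```xml
<?xml version="1.0" ?>
<!DOCTYPE plugin PUBLIC "-//JPF//Java Plug-in Manifest 0.7" "http://jpf.sourceforge.net/plugin_0_7.dtd">
<!-- Hypothetical plugin: the id, version, and class name are illustrative only. -->
<plugin id="com.example.dlibra.tce.sample" version="0.0.1">
	<requires>
		<!-- Import the plugin that declares the extension point. -->
		<import plugin-id="pl.psnc.dlibra.content" />
	</requires>
	<extension plugin-id="pl.psnc.dlibra.content"
		point-id="extraction.TextualContentExtractor"
		id="sampleExtractor">
		<!-- Class implementing the extension's programming interface (hypothetical name). -->
		<parameter id="class" value="com.example.dlibra.SampleTextualContentExtractor" />
		<!-- Selection order; the value 10 is an arbitrary assumption. -->
		<parameter id="order" value="10" />
	</extension>
</plugin>
```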

The programming interface (Java language) for that extension is pl.psnc.dlibra.content.extraction.TextualContentExtractor (in the dlibra-server-extension-api library). For more information about it, see the programming documentation (JavaDocs).

The dLibra system is provided with a pre-installed set of extensions of that type. The set includes:

text extraction from simple formats, such as CHM, HTML, RTF, or TXT (dlibra-server-extension-tce-basic),
text extraction from the DjVu format (dlibra-server-extension-tce-djvu),
text extraction from formats supported by an external mechanism, LIUS (dlibra-server-extension-tce-lius), and
text extraction from the PDF format (dlibra-server-extension-tce-pdf).

Those extensions are briefly described below.

The Basic Extension

The basic extension has a set of classes which implement the pl.psnc.dlibra.content.extraction.TextualContentExtractor interface (from dlibra-server-extension-api) and make it possible to extract text from files in the following formats. All of the classes are located in dlibra-server-extension-tce-basic:

CHM – class pl.psnc.dlibra.content.extraction.CHMTextualContentExtractor
HTML – class pl.psnc.dlibra.content.extraction.HTMLTextualContentExtractor
RTF – class pl.psnc.dlibra.content.extraction.RTFTextualContentExtractor
TXT – class pl.psnc.dlibra.content.extraction.TXTTextualContentExtractor

The DjVu Extension

The DjVu extension makes it possible to extract text from the text layer of files in the DjVu format (if they have such a layer). That task is done by the pl.psnc.dlibra.content.extraction.DjVuTextualContentExtractor class (in dlibra-server-extension-tce-djvu).

The LIUS Extension

The LIUS extension makes use of the external LIUS (Lucene Index Update and Search) library, which allows, among other things, text extraction from files in the following formats: MsWord, MsExcel, MsPowerPoint, RTF, PDF, XML, HTML, TXT, OpenOffice, ZIP, MP3, VCard, Latex, and JavaBeans. The extension class which extracts text from those file formats is pl.psnc.dlibra.content.extraction.LIUSTextualContentExtractor (in dlibra-server-extension-tce-lius).

The PDF Extension

Extracting text from files in the PDF format is based on the external PDFBox library. The class which implements the pl.psnc.dlibra.content.extraction.TextualContentExtractor interface (from dlibra-server-extension-api) and extracts text from files of that type is pl.psnc.dlibra.content.extraction.PDFTextualContentExtractor (in dlibra-server-extension-tce-pdf).
