GSoC 2010: The first results

The goal of the project is to create a parser generator for Strigi. The generator takes two input files: the first describes the metadata format, and the second is a mapping that shows how the metadata tags relate to the ontology. I separated them for two reasons: 1) one file format can carry two or more types of metadata (different versions of a metadata format, or a newer universal format such as XMP); 2) testing a grammar is much simpler when the two are separate.

My first step was to develop a language for describing the metadata format and a language for describing the mapping. I based the languages on PNG, Exif, ID3, Vorbis comment and XMP (the last two only a little). The metadata language is called Tag Language (TL for short), and its main item is the MetadataKey. Here is an example for PNG (file pngchunk.txt):
/* The keyword Binary starts the description and means that each MetadataKey is defined by an offset and a size. I plan to extend the language so that a tag can also be found by keyword; if that turns out to be necessary, such a description will start with the keyword Text.

PngChunk is the name and the prefix of the future extractor class. */
Binary PngChunk;

/* ByteOrder and BitOrder set the byte order and the bit order. If they are not defined, LSB is used. */
ByteOrder=MSB;
BitOrder=MSB;

/* StartMetadataKey is the first key; it identifies the type of metadata. Its offset is counted from the beginning of the file. */
StartMetadataKey IHDR(offset=12, size=4);
/* Checks the value of a key. The check is required for StartMetadataKey and can compare against a string or a hex value. */
checkKey(IHDR)="IHDR";

/* MetadataKey is a region that contains a tag. It is defined by a base key (or the keyword "start"), an offset and a size. If the first parameter is "start", the offset is counted from the beginning of the file. If the first parameter is another MetadataKey, the offset is counted from the end of that base key. */
MetadataKey width(IHDR, offset=0, size=4);
/* Creates a tag named "Width" whose value is the number stored in the MetadataKey. Currently only the string and int32 types are supported; I will add double and int64 later. */
SetTag Width(getNumber(width));

MetadataKey height(width, offset=0, size=4);
SetTag Height(getNumber(height));

/* In the IHDR chunk the byte right after the height is the bit depth, so offset=1 skips it to reach the color type. */
MetadataKey ColorType(height, offset=1, size=1);
SetTag Color_type(getNumber(ColorType));

MetadataKey Compression(ColorType, offset=0, size=1);
SetTag Compression_method(getNumber(Compression));

MetadataKey Filter(Compression, offset=0, size=1);
SetTag Filter_method(getNumber(Filter));

MetadataKey Interlace(Filter, offset=0, size=1);
SetTag Interlace_method(getNumber(Interlace));
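
To make the chaining of keys more concrete, here is a minimal C++ sketch of how the first keys of the PNG description could be evaluated against a byte buffer. It is not the code produced by the translator: KeyRange, keyFromStart, keyAfter, getNumberMsb and checkKeyBytes are hypothetical names introduced only for this illustration, and the sketch assumes ByteOrder=MSB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a MetadataKey: a byte range inside the file.
struct KeyRange {
    std::size_t offset; // absolute position of the first byte
    std::size_t size;   // number of bytes
};

// A key whose offset is counted from the beginning of the file ("start").
KeyRange keyFromStart(std::size_t offset, std::size_t size) {
    return KeyRange{offset, size};
}

// A key whose offset is counted from the end of a base key.
KeyRange keyAfter(const KeyRange& base, std::size_t offset, std::size_t size) {
    return KeyRange{base.offset + base.size + offset, size};
}

// Interpret the bytes of a key as an unsigned number, most significant byte first.
std::uint32_t getNumberMsb(const std::vector<unsigned char>& file, const KeyRange& key) {
    assert(key.size <= 4 && key.offset + key.size <= file.size());
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < key.size; ++i) {
        value = (value << 8) | file[key.offset + i];
    }
    return value;
}

// Compare the bytes of a key with an expected string, in the spirit of checkKey.
bool checkKeyBytes(const std::vector<unsigned char>& file, const KeyRange& key,
                   const std::string& expected) {
    if (key.size != expected.size() || key.offset + key.size > file.size()) {
        return false;
    }
    for (std::size_t i = 0; i < key.size; ++i) {
        if (file[key.offset + i] != static_cast<unsigned char>(expected[i])) {
            return false;
        }
    }
    return true;
}
```

With these helpers, IHDR is keyFromStart(12, 4), width is keyAfter(IHDR, 0, 4) and height is keyAfter(width, 0, 4), which mirrors the way each MetadataKey in the description names the previous one as its base.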

Additional functions:
getNumberByLink(key, offset, size) or getNumberByLink(offset, size) – gets a number by offset and size. If key is not given, the offset is counted from the beginning of the file.
getStringByLink(key, offset, size), getStringByLink(offset, size), getStringByLink(key, offset), getStringByLink(offset) – gets a string by offset. If key is not given, the offset is counted from the beginning of the file. If size is not given, it reads everything up to the end-of-string symbol or the end of the file.
getValueByLink(key, offset, size), getValueByLink(offset, size) – gets the data without type conversion (never used so far, but it exists).
All of these functions are shorthand for a MetadataKey plus getValue (getString, getNumber).
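
As an illustration of the "size is not given" rule, here is a small C++ sketch of the getStringByLink(offset) and getStringByLink(offset, size) variants working on a byte buffer. It is only a sketch under assumptions: the buffer parameter, the constant kUntilTerminator and the choice of the NUL byte as the end-of-string symbol are mine, not taken from the translator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical marker meaning "size is not given".
const std::size_t kUntilTerminator = static_cast<std::size_t>(-1);

// Sketch of getStringByLink(offset) and getStringByLink(offset, size).
// The offset is counted from the beginning of the buffer. Without a size the
// string ends at the first NUL byte (assumed end-of-string symbol) or at the
// end of the buffer; with a size exactly that many bytes are read, or fewer
// if the buffer ends first.
std::string getStringByLink(const std::vector<unsigned char>& file,
                            std::size_t offset,
                            std::size_t size = kUntilTerminator) {
    std::string result;
    for (std::size_t i = offset; i < file.size(); ++i) {
        if (size != kUntilTerminator && i - offset >= size) {
            break; // the requested number of bytes has been read
        }
        if (size == kUntilTerminator && file[i] == '\0') {
            break; // end-of-string symbol
        }
        result.push_back(static_cast<char>(file[i]));
    }
    return result;
}
```

The variants that take a key would differ only in where the offset starts: at the end of the key instead of at the beginning of the file.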

Functions:
shift(key, n) – moves the key by n steps.
getBit(key, n) – gets bit number n inside the key.
getByte(key, n) – gets byte number n inside the key.
(You can find examples of these functions in the files id3tag.txt and exif.txt.)
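
A possible reading of getByte and getBit, sketched in C++. The numbering rules are assumptions on my side: bytes and bits are counted from 0, and the BitOrder setting decides whether bit 0 is the most or the least significant bit of the first byte. The key is represented simply as its bytes, and shift is left out because the size of a "step" is not specified here.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the BitOrder=MSB / BitOrder=LSB setting of the description file.
enum class BitOrder { LSB, MSB };

// Byte number n (counted from 0) inside the bytes of a key.
unsigned char getByte(const std::vector<unsigned char>& key, std::size_t n) {
    return key.at(n); // throws std::out_of_range if n is outside the key
}

// Bit number n (counted from 0) inside the bytes of a key.
// With BitOrder::MSB bit 0 is the most significant bit of the first byte,
// with BitOrder::LSB it is the least significant bit of the first byte.
bool getBit(const std::vector<unsigned char>& key, std::size_t n, BitOrder order) {
    const unsigned char byte = key.at(n / 8);
    const std::size_t inByte = n % 8;
    const std::size_t shiftBy = (order == BitOrder::MSB) ? 7 - inByte : inByte;
    return ((byte >> shiftBy) & 1u) != 0;
}
```

Flag bytes are the typical use case: one key covers the byte and getBit picks out the individual flags.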

And of course I implemented loops (currently only while) and if-else.

I called the mapping language Tiny Format Language, or TFL for short. Here is the example for PNG (file png_triplex.txt):
/* Format is a keyword. It is followed by the prefix for the EndAnalyzer class. */
Format Png;

/* Metadata is a keyword that introduces the list of metadata format names. PNG has only one type of metadata: PngChunk. */
Metadata: PngChunk;


/* Checks the StartMetadataKey. If the check succeeds, triples are created according to the mapping inside this if. */
if(PngChunk)
{
/* The tag names here have to match the tag names created by SetTag in the file pngchunk.txt. */
Width = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#width";
Height = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#height";
Color_type = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#colorDepth";
Compression_method = "http://freedesktop.org/standards/xesam/1.0/core#compressionAlgorithm";
Interlace_method = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#interlaceMode";

}
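
Conceptually the body of the if block is a table from tag names to ontology URIs. The following C++ sketch shows that table as a std::map with a lookup helper. The function names pngChunkMapping and uriForTag are hypothetical, and the first tag is assumed to be spelled Width, as in pngchunk.txt.

```cpp
#include <map>
#include <string>

// Tag name -> ontology URI, copied from the if(PngChunk) block of the mapping.
std::map<std::string, std::string> pngChunkMapping() {
    std::map<std::string, std::string> m;
    m["Width"] = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#width";
    m["Height"] = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#height";
    m["Color_type"] = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#colorDepth";
    m["Compression_method"] = "http://freedesktop.org/standards/xesam/1.0/core#compressionAlgorithm";
    m["Interlace_method"] = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#interlaceMode";
    return m;
}

// Returns the URI mapped to a tag, or an empty string if the tag has no mapping.
std::string uriForTag(const std::map<std::string, std::string>& mapping,
                      const std::string& tag) {
    const std::map<std::string, std::string>::const_iterator it = mapping.find(tag);
    return it == mapping.end() ? std::string() : it->second;
}
```

A tag that the extractor produces but the mapping does not list, such as Filter_method in this example, simply has no URI and would not be reported.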

The second step was to write translators for the TFL and TL languages. Each translator contains a scanner, a parser and a generator of C++ code. To develop the scanner and the parser I used the Coco/R generator. It has no tool for testing and debugging a grammar (like ANTLRWorks for ANTLR), but its documentation is clearer. The TL translator produces an extractor class, which extracts the tags and saves them in a vector of Tag.

/* Files PngChunkextractor.h and PngChunkextractor.cpp, produced by the TL translator */
#ifndef PngChunkExtractor_H_
#define PngChunkExtractor_H_
#include "MetadataKey.h"
#include "Tag.h"
#include "MetadataFunc.h"
#include <fstream>
#include <vector>


class PngChunkExtractor {
public:
    /* std::ifstream cannot be copied, so the stream is held by reference */
    PngChunkExtractor(std::ifstream& _is) : is(_is) {}
    std::vector<Tag> toExtract();
    bool toCheck();
private:
    std::ifstream& is;
};
#endif

The TFL translator produces a class that inherits from the StreamEndAnalyzer class.

/* Files Pngendanalyzer.h and Pngendanalyzer.cpp, produced by the TFL translator */
#include "Pngendanalyzer.h"
#include /* the header name was lost when the post was published */


void PngEndAnalyzerFactory::registerFields(FieldRegister& reg) {
    RF["PngChunk_Color_typeField"]=reg.registerField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#colorDepth");
    addField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#colorDepth");
    RF["PngChunk_Compression_methodField"]=reg.registerField("http://freedesktop.org/standards/xesam/1.0/core#compressionAlgorithm");
    addField("http://freedesktop.org/standards/xesam/1.0/core#compressionAlgorithm");
    RF["PngChunk_HeightField"]=reg.registerField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#height");
    addField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#height");
    RF["PngChunk_Interlace_methodField"]=reg.registerField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#interlaceMode");
    addField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#interlaceMode");
    RF["PngChunk_WidthField"]=reg.registerField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#width");
    addField("http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#width");
}


signed char PngEndAnalyzer::analyze(AnalysisResult& as, InputStream* in) {
    int flag=-1;
    PngChunkExtractor PngChunk(in);
    if(PngChunk.toCheck()){
    std::vector<Tag> tag_v=PngChunk.toExtract();
    flag=0;
    for(int i=0; i /* The rest of the loop header and the lines that followed it were lost when the post was published. Judging by what survives, they read the tag t from tag_v, build name and open a switch on the tag type. */
    case 1: as.addValue(factory->getField(name), t.getNumber()); break;
    case 2: as.addValue(factory->getField(name), t.getString()); break;
    }
    name.clear();
    }
    return flag;
}
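
The idea behind analyze can be sketched in a self-contained way: every extracted tag is turned into a field key and its value is added according to the tag type. This is not the generated code. SketchTag and SketchResult are hypothetical stand-ins for Tag and AnalysisResult, type 1 and type 2 follow the case labels that survive in the listing, and the key format "PngChunk_" + tag name + "Field" is inferred from the names used in registerFields.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for Tag: type 1 holds a number, type 2 holds a string.
struct SketchTag {
    std::string name;
    int type;
    int number;
    std::string text;
};

// Hypothetical stand-in for AnalysisResult: it only records what was added.
struct SketchResult {
    std::vector<std::pair<std::string, std::string> > values;
    void addValue(const std::string& field, int v) {
        values.push_back(std::make_pair(field, std::to_string(v)));
    }
    void addValue(const std::string& field, const std::string& v) {
        values.push_back(std::make_pair(field, v));
    }
};

// Builds a key in the form used by registerFields: "<metadata>_<tag>Field".
std::string fieldKey(const std::string& metadata, const std::string& tag) {
    return metadata + "_" + tag + "Field";
}

// Adds every tag to the result, choosing the value by the tag type.
void addTags(const std::vector<SketchTag>& tags, SketchResult& out) {
    for (std::size_t i = 0; i < tags.size(); ++i) {
        const SketchTag& t = tags[i];
        const std::string key = fieldKey("PngChunk", t.name);
        switch (t.type) {
            case 1: out.addValue(key, t.number); break;
            case 2: out.addValue(key, t.text); break;
            default: break; // unknown types are skipped in this sketch
        }
    }
}
```

In the real analyzer the key would be looked up through the factory to obtain the registered field; the sketch stores the key itself to stay independent of the Strigi classes.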

Next steps:
1. Rewrite MetadataKey to work with InputStream (I think I have done it, but I have not tested it well).
2. Add support for the double type to the TL language.
3. Add triple creation to TFL.
4. Add messages about declaration errors, encoding and the other things that I missed.

Link:
http://gitorious.org/strigi/strigi-grammar/trees/master/Translator

Wednesday, July 14, 2010
