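Download: http://libibase.googlecode.com/
Main features:
Parses HTML
Chinese word segmentation (backward maximum matching, implemented with a trie)
Generates forward documents (in a format I defined myself; that is how it works for now)
Generates an inverted index (stored in blocks, bytecode compression algorithm; body text and snapshots are compressed with zlib)
Retrieval by submitted query string (only the vector space model is implemented; dynamic summaries are not finished yet)
For now there is only one command-line test tool, hibase.
The package ships with a Chinese dictionary of 100,000 words (in the doc directory, in gzip format; decompress it before use).
See the README for usage.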
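For readers who have not met the segmentation approach named in the feature list, here is a minimal sketch of backward maximum matching over a trie. It is not taken from libibase: the names `TrieNode`, `trie_add_reversed` and `segment_backward` are hypothetical, the text is assumed to be UTF-8, and words are stored reversed so that a right-to-left scan of the text walks the trie forward.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Byte-level trie node (hypothetical layout, not libibase's). */
typedef struct TrieNode {
    struct TrieNode *next[256];
    int is_word;
} TrieNode;

static TrieNode *trie_new(void)
{
    TrieNode *n = calloc(1, sizeof(TrieNode));
    assert(n != NULL);
    return n;
}

static void trie_free(TrieNode *node)
{
    int i;
    if (node == NULL) return;
    for (i = 0; i < 256; i++) trie_free(node->next[i]);
    free(node);
}

/* Insert a word with its bytes in reverse order. */
static void trie_add_reversed(TrieNode *root, const char *word)
{
    size_t i = strlen(word);
    while (i > 0) {
        unsigned char c = (unsigned char)word[--i];
        if (root->next[c] == NULL) root->next[c] = trie_new();
        root = root->next[c];
    }
    root->is_word = 1;
}

/* Byte length of the longest dictionary word that ends at offset `end`
 * (exclusive) in text; 0 if no word ends there. */
static size_t longest_match_ending_at(const TrieNode *root,
                                      const char *text, size_t end)
{
    size_t best = 0, len = 0;
    while (len < end) {
        root = root->next[(unsigned char)text[end - 1 - len]];
        if (root == NULL) break;
        len++;
        if (root->is_word) best = len;
    }
    return best;
}

/* Segment text from right to left. starts[] receives the byte offset at
 * which each token begins, rightmost token first. Returns the token count. */
static size_t segment_backward(const TrieNode *root, const char *text,
                               size_t *starts, size_t max_tokens)
{
    size_t end = strlen(text), count = 0;
    while (end > 0 && count < max_tokens) {
        size_t len = longest_match_ending_at(root, text, end);
        if (len == 0) {
            /* No dictionary word ends here: emit one UTF-8 character,
             * stepping back over continuation bytes (10xxxxxx). */
            len = 1;
            while (len < end &&
                   ((unsigned char)text[end - len] & 0xC0) == 0x80)
                len++;
        }
        end -= len;
        starts[count++] = end;
    }
    return count;
}
```

With a dictionary holding 研究, 研究生, 生命 and 起源, the text 研究生命起源 is split from the right into 起源, 生命, 研究, whereas forward maximum matching would consume 研究生 first and leave 命 stranded.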
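Next comes testing and optimization. I used a lot of macros when writing it, so compilation is still a bit slow.... heh
Anyone who wants to learn together can add me on MSN/GTalk: sounos@gmail.com
While I'm at it, here is a usage example:
I used wget to download the chinaunix home page into /data/html; my dictionary is under /data/dict.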
./hibase --basedir=/tmp --dict=/data/dict/dict.txt --add --doc=/data/html/index.html --url=http://www.chinaunix.net/ --date="Thu, 03 Jul 2008 10:12:18 GMT" --charset="gbk" --query --request="chinaunix" --topN=1000
parsing document[http://www.chinaunix.net/] time used:16825 microseconds
adding document[http://www.chinaunix.net/] time used:47955 microseconds
parse query time used:36
read hits[1] posting time used:1897
Caculated 1 documents time used:22
read 1 documents content time used:1404
(0) title[ChinaUnix.net = 全球最大的Linux/Unix应用与开发者社区 = IT人的网上家园]
summary[(null)]
url[http://www.chinaunix.net/]
size[84892]date[Thu, 03 Jul 2008 10:12:18 GMT]
search [chinaunix] time used:3502
[ Last edited by redor on 2008-7-4 21:08 ]
cugb_cat replied on 2008-07-03 16:02:36
Nice~
pengjay replied on 2008-07-03 17:06:35
Awesome
benjiam replied on 2008-07-03 17:16:54
It does not compile.
What encoding was charcode.h written in?
Under VC the strings are not recognized. It looks like a character-encoding problem; gb2312, utf-8 and unicode all fail.
redor replied on 2008-07-03 17:57:18
Quote: originally posted by benjiam on 2008-7-3 17:16 (http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8730113&ptid=1187515)
It does not compile.
What encoding was charcode.h written in?
Under VC the strings are not recognized. It looks like a character-encoding problem; gb2312, utf-8 and unicode all fail.
It is UTF-8 at the moment. I have never built it under VC, and I expect that would be rough....
Let me just paste the file here:
#include <stdio.h>
#include <string.h>
#ifndef _CHARCODE_H
#define _CHARCODE_H
#define CHARCODE_NUM 252
typedef struct _CHARCODE
{
char *dec;
char *code;
char *chr;
char *desc;
}CHARCODE;
/* dec: decimal character reference, code: named entity,
 * chr: replacement text (UTF-8), desc: description */
static CHARCODE charcodelist[] =
{
{"&#160;", "&nbsp;", "\xC2\xA0", "no-break space"},
{"&#161;", "&iexcl;", "¡", "inverted exclamation mark"},
{"&#162;", "&cent;", "¢", "cent sign"},
{"&#163;", "&pound;", "£", "pound sign"},
{"&#164;", "&curren;", "¤", "currency sign"},
{"&#165;", "&yen;", "¥", "yen sign = yuan sign"},
{"&#166;", "&brvbar;", "¦", "broken bar = broken vertical bar"},
{"&#167;", "&sect;", "§", "section sign"},
{"&#168;", "&uml;", "¨", "diaeresis = spacing diaeresis"},
{"&#169;", "&copy;", "©", "copyright sign"},
{"&#170;", "&ordf;", "ª", "feminine ordinal indicator"},
{"&#171;", "&laquo;", "«", "left-pointing double angle quotation mark = left pointing guillemet"},
{"&#172;", "&not;", "¬", "not sign"},
{"&#173;", "&shy;", "\xC2\xAD", "soft hyphen = discretionary hyphen"},
{"&#174;", "&reg;", "®", "registered sign = registered trade mark sign"},
{"&#175;", "&macr;", "¯", "macron = spacing macron = overline = APL overbar"},
{"&#176;", "&deg;", "°", "degree sign"},
{"&#177;", "&plusmn;", "±", "plus-minus sign = plus-or-minus sign"},
{"&#178;", "&sup2;", "²", "superscript two = superscript digit two = squared"},
{"&#179;", "&sup3;", "³", "superscript three = superscript digit three = cubed"},
{"&#180;", "&acute;", "´", "acute accent = spacing acute"},
{"&#181;", "&micro;", "µ", "micro sign"},
{"&#182;", "&para;", "¶", "pilcrow sign = paragraph sign"},
{"&#183;", "&middot;", "·", "middle dot = Georgian comma = Greek middle dot"},
{"&#184;", "&cedil;", "¸", "cedilla = spacing cedilla"},
{"&#185;", "&sup1;", "¹", "superscript one = superscript digit one"},
{"&#186;", "&ordm;", "º", "masculine ordinal indicator"},
{"&#187;", "&raquo;", "»", "right-pointing double angle quotation mark = right pointing guillemet"},
{"&#188;", "&frac14;", "¼", "vulgar fraction one quarter = fraction one quarter"},
{"&#189;", "&frac12;", "½", "vulgar fraction one half = fraction one half"},
{"&#190;", "&frac34;", "¾", "vulgar fraction three quarters = fraction three quarters"},
{"&#191;", "&iquest;", "¿", "inverted question mark = turned question mark"},
{"&#192;", "&Agrave;", "À", "latin capital letter A with grave = latin capital letter A grave"},
{"&#193;", "&Aacute;", "Á", "latin capital letter A with acute"},
{"&#194;", "&Acirc;", "Â", "latin capital letter A with circumflex"},
{"&#195;", "&Atilde;", "Ã", "latin capital letter A with tilde"},
{"&#196;", "&Auml;", "Ä", "latin capital letter A with diaeresis"},
{"&#197;", "&Aring;", "Å", "latin capital letter A with ring above = latin capital letter A ring"},
{"&#198;", "&AElig;", "Æ", "latin capital letter AE = latin capital ligature AE"},
{"&#199;", "&Ccedil;", "Ç", "latin capital letter C with cedilla"},
{"&#200;", "&Egrave;", "È", "latin capital letter E with grave"},
{"&#201;", "&Eacute;", "É", "latin capital letter E with acute"},
{"&#202;", "&Ecirc;", "Ê", "latin capital letter E with circumflex"},
{"&#203;", "&Euml;", "Ë", "latin capital letter E with diaeresis"},
{"&#204;", "&Igrave;", "Ì", "latin capital letter I with grave"},
{"&#205;", "&Iacute;", "Í", "latin capital letter I with acute"},
{"&#206;", "&Icirc;", "Î", "latin capital letter I with circumflex"},
{"&#207;", "&Iuml;", "Ï", "latin capital letter I with diaeresis"},
{"&#208;", "&ETH;", "Ð", "latin capital letter ETH"},
{"&#209;", "&Ntilde;", "Ñ", "latin capital letter N with tilde"},
{"&#210;", "&Ograve;", "Ò", "latin capital letter O with grave"},
{"&#211;", "&Oacute;", "Ó", "latin capital letter O with acute"},
{"&#212;", "&Ocirc;", "Ô", "latin capital letter O with circumflex"},
{"&#213;", "&Otilde;", "Õ", "latin capital letter O with tilde"},
{"&#214;", "&Ouml;", "Ö", "latin capital letter O with diaeresis"},
{"&#215;", "&times;", "×", "multiplication sign"},
{"&#216;", "&Oslash;", "Ø", "latin capital letter O with stroke = latin capital letter O slash"},
{"&#217;", "&Ugrave;", "Ù", "latin capital letter U with grave"},
{"&#218;", "&Uacute;", "Ú", "latin capital letter U with acute"},
{"&#219;", "&Ucirc;", "Û", "latin capital letter U with circumflex"},
{"&#220;", "&Uuml;", "Ü", "latin capital letter U with diaeresis"},
{"&#221;", "&Yacute;", "Ý", "latin capital letter Y with acute"},
{"&#222;", "&THORN;", "Þ", "latin capital letter THORN"},
{"&#223;", "&szlig;", "ß", "latin small letter sharp s = ess-zed"},
{"&#224;", "&agrave;", "à", "latin small letter a with grave = latin small letter a grave"},
{"&#225;", "&aacute;", "á", "latin small letter a with acute"},
{"&#226;", "&acirc;", "â", "latin small letter a with circumflex"},
{"&#227;", "&atilde;", "ã", "latin small letter a with tilde"},
{"&#228;", "&auml;", "ä", "latin small letter a with diaeresis"},
{"&#229;", "&aring;", "å", "latin small letter a with ring above = latin small letter a ring"},
{"&#230;", "&aelig;", "æ", "latin small letter ae = latin small ligature ae"},
{"&#231;", "&ccedil;", "ç", "latin small letter c with cedilla"},
{"&#232;", "&egrave;", "è", "latin small letter e with grave"},
{"&#233;", "&eacute;", "é", "latin small letter e with acute"},
{"&#234;", "&ecirc;", "ê", "latin small letter e with circumflex"},
{"&#235;", "&euml;", "ë", "latin small letter e with diaeresis"},
{"&#236;", "&igrave;", "ì", "latin small letter i with grave"},
{"&#237;", "&iacute;", "í", "latin small letter i with acute"},
{"&#238;", "&icirc;", "î", "latin small letter i with circumflex"},
{"&#239;", "&iuml;", "ï", "latin small letter i with diaeresis"},
{"&#240;", "&eth;", "ð", "latin small letter eth"},
{"&#241;", "&ntilde;", "ñ", "latin small letter n with tilde"},
{"&#242;", "&ograve;", "ò", "latin small letter o with grave"},
{"&#243;", "&oacute;", "ó", "latin small letter o with acute"},
{"&#244;", "&ocirc;", "ô", "latin small letter o with circumflex"},
{"&#245;", "&otilde;", "õ", "latin small letter o with tilde"},
{"&#246;", "&ouml;", "ö", "latin small letter o with diaeresis"},
{"&#247;", "&divide;", "÷", "division sign"},
{"&#248;", "&oslash;", "ø", "latin small letter o with stroke = latin small letter o slash"},
{"&#249;", "&ugrave;", "ù", "latin small letter u with grave"},
{"&#250;", "&uacute;", "ú", "latin small letter u with acute"},
{"&#251;", "&ucirc;", "û", "latin small letter u with circumflex"},
{"&#252;", "&uuml;", "ü", "latin small letter u with diaeresis"},
{"&#253;", "&yacute;", "ý", "latin small letter y with acute"},
{"&#254;", "&thorn;", "þ", "latin small letter thorn"},
{"&#255;", "&yuml;", "ÿ", "latin small letter y with diaeresis"},
{"&#402;", "&fnof;", "ƒ ", "latin small f with hook = function = florin"},
{"&#913;", "&Alpha;", "Α ", "greek capital letter alpha"},
{"&#914;", "&Beta;", "Β ", "greek capital letter beta"},
{"&#915;", "&Gamma;", "Γ ", "greek capital letter gamma"},
{"&#916;", "&Delta;", "Δ ", "greek capital letter delta"},
{"&#917;", "&Epsilon;", "Ε ", "greek capital letter epsilon"},
{"&#918;", "&Zeta;", "Ζ ", "greek capital letter zeta"},
{"&#919;", "&Eta;", "Η ", "greek capital letter eta"},
{"&#920;", "&Theta;", "Θ ", "greek capital letter theta"},
{"&#921;", "&Iota;", "Ι ", "greek capital letter iota"},
{"&#922;", "&Kappa;", "Κ ", "greek capital letter kappa"},
{"&#923;", "&Lambda;", "Λ ", "greek capital letter lambda"},
{"&#924;", "&Mu;", "Μ ", "greek capital letter mu"},
{"&#925;", "&Nu;", "Ν ", "greek capital letter nu"},
{"&#926;", "&Xi;", "Ξ ", "greek capital letter xi"},
{"&#927;", "&Omicron;", "Ο ", "greek capital letter omicron"},
{"&#928;", "&Pi;", "Π ", "greek capital letter pi"},
{"&#929;", "&Rho;", "Ρ ", "greek capital letter rho"},
{"&#931;", "&Sigma;", "Σ ", "greek capital letter sigma"},
{"&#932;", "&Tau;", "Τ ", "greek capital letter tau"},
{"&#933;", "&Upsilon;", "Υ ", "greek capital letter upsilon"},
{"&#934;", "&Phi;", "Φ ", "greek capital letter phi"},
{"&#935;", "&Chi;", "Χ ", "greek capital letter chi"},
{"&#936;", "&Psi;", "Ψ ", "greek capital letter psi"},
{"&#937;", "&Omega;", "Ω ", "greek capital letter omega"},
{"&#945;", "&alpha;", "α ", "greek small letter alpha"},
{"&#946;", "&beta;", "β ", "greek small letter beta"},
{"&#947;", "&gamma;", "γ ", "greek small letter gamma"},
{"&#948;", "&delta;", "δ ", "greek small letter delta"},
{"&#949;", "&epsilon;", "ε ", "greek small letter epsilon"},
{"&#950;", "&zeta;", "ζ ", "greek small letter zeta"},
{"&#951;", "&eta;", "η ", "greek small letter eta"},
{"&#952;", "&theta;", "θ ", "greek small letter theta"},
{"&#953;", "&iota;", "ι ", "greek small letter iota"},
{"&#954;", "&kappa;", "κ ", "greek small letter kappa"},
{"&#955;", "&lambda;", "λ ", "greek small letter lambda"},
{"&#956;", "&mu;", "μ ", "greek small letter mu"},
{"&#957;", "&nu;", "ν ", "greek small letter nu"},
{"&#958;", "&xi;", "ξ ", "greek small letter xi"},
{"&#959;", "&omicron;", "ο ", "greek small letter omicron"},
{"&#960;", "&pi;", "π ", "greek small letter pi"},
{"&#961;", "&rho;", "ρ ", "greek small letter rho"},
{"&#962;", "&sigmaf;", "ς ", "greek small letter final sigma"},
{"&#963;", "&sigma;", "σ ", "greek small letter sigma"},
{"&#964;", "&tau;", "τ ", "greek small letter tau"},
{"&#965;", "&upsilon;", "υ ", "greek small letter upsilon"},
{"&#966;", "&phi;", "φ ", "greek small letter phi"},
{"&#967;", "&chi;", "χ ", "greek small letter chi"},
{"&#968;", "&psi;", "ψ ", "greek small letter psi"},
{"&#969;", "&omega;", "ω ", "greek small letter omega"},
{"&#977;", "&thetasym;", "ϑ ", "greek small letter theta symbol"},
{"&#978;", "&upsih;", "ϒ ", "greek upsilon with hook symbol"},
{"&#982;", "&piv;", "ϖ ", "greek pi symbol"},
{"&#8226;", "&bull;", "•", "bullet = black small circle"},
{"&#8230;", "&hellip;", "…", "horizontal ellipsis = three dot leader"},
{"&#8242;", "&prime;", "′", "prime = minutes = feet"},
{"&#8243;", "&Prime;", "″", "double prime = seconds = inches"},
{"&#8254;", "&oline;", "‾", "overline = spacing overscore"},
{"&#8260;", "&frasl;", "⁄", "fraction slash"},
{"&#8472;", "&weierp;", "℘", "script capital P = power set = Weierstrass p"},
{"&#8465;", "&image;", "ℑ", "blackletter capital I = imaginary part"},
{"&#8476;", "&real;", "ℜ", "blackletter capital R = real part symbol"},
{"&#8482;", "&trade;", "™", "trade mark sign"},
{"&#8501;", "&alefsym;", "ℵ", "alef symbol = first transfinite cardinal"},
{"&#8592;", "&larr;", "←", "leftwards arrow"},
{"&#8593;", "&uarr;", "↑", "upwards arrow"},
{"&#8594;", "&rarr;", "→", "rightwards arrow"},
{"&#8595;", "&darr;", "↓", "downwards arrow"},
{"&#8596;", "&harr;", "↔", "left right arrow"},
{"&#8629;", "&crarr;", "↵", "downwards arrow with corner leftwards = carriage return"},
{"&#8656;", "&lArr;", "⇐", "leftwards double arrow"},
{"&#8657;", "&uArr;", "⇑", "upwards double arrow"},
{"&#8658;", "&rArr;", "⇒", "rightwards double arrow"},
{"&#8659;", "&dArr;", "⇓", "downwards double arrow"},
{"&#8660;", "&hArr;", "⇔", "left right double arrow"},
{"&#8704;", "&forall;", "∀", "for all"},
{"&#8706;", "&part;", "∂", "partial differential"},
{"&#8707;", "&exist;", "∃", "there exists"},
{"&#8709;", "&empty;", "∅", "empty set = null set = diameter"},
{"&#8711;", "&nabla;", "∇", "nabla = backward difference"},
{"&#8712;", "&isin;", "∈", "element of"},
{"&#8713;", "&notin;", "∉", "not an element of"},
{"&#8715;", "&ni;", "∋", "contains as member"},
{"&#8719;", "&prod;", "∏", "n-ary product = product sign"},
{"&#8721;", "&sum;", "∑", "n-ary summation"},
{"&#8722;", "&minus;", "−", "minus sign"},
{"&#8727;", "&lowast;", "∗", "asterisk operator"},
{"&#8730;", "&radic;", "√", "square root = radical sign"},
{"&#8733;", "&prop;", "∝", "proportional to"},
{"&#8734;", "&infin;", "∞", "infinity"},
{"&#8736;", "&ang;", "∠", "angle"},
{"&#8743;", "&and;", "∧", "logical and = wedge"},
{"&#8744;", "&or;", "∨", "logical or = vee"},
{"&#8745;", "&cap;", "∩", "intersection = cap"},
{"&#8746;", "&cup;", "∪", "union = cup"},
{"&#8747;", "&int;", "∫", "integral"},
{"&#8756;", "&there4;", "∴", "therefore"},
{"&#8764;", "&sim;", "∼", "tilde operator = varies with = similar to"},
{"&#8773;", "&cong;", "≅", "approximately equal to"},
{"&#8776;", "&asymp;", "≈", "almost equal to = asymptotic to"},
{"&#8800;", "&ne;", "≠", "not equal to"},
{"&#8801;", "&equiv;", "≡", "identical to"},
{"&#8804;", "&le;", "≤", "less-than or equal to"},
{"&#8805;", "&ge;", "≥", "greater-than or equal to"},
{"&#8834;", "&sub;", "⊂", "subset of"},
{"&#8835;", "&sup;", "⊃", "superset of"},
{"&#8836;", "&nsub;", "⊄", "not a subset of"},
{"&#8838;", "&sube;", "⊆", "subset of or equal to"},
{"&#8839;", "&supe;", "⊇", "superset of or equal to"},
{"&#8853;", "&oplus;", "⊕", "circled plus = direct sum"},
{"&#8855;", "&otimes;", "⊗", "circled times = vector product"},
{"&#8869;", "&perp;", "⊥", "up tack = orthogonal to = perpendicular"},
{"&#8901;", "&sdot;", "⋅", "dot operator"},
{"&#8968;", "&lceil;", "⌈", "left ceiling = apl upstile"},
{"&#8969;", "&rceil;", "⌉", "right ceiling"},
{"&#8970;", "&lfloor;", "⌊", "left floor = apl downstile"},
{"&#8971;", "&rfloor;", "⌋", "right floor"},
{"&#9001;", "&lang;", "〈", "left-pointing angle bracket = bra"},
{"&#9002;", "&rang;", "〉", "right-pointing angle bracket = ket"},
{"&#9674;", "&loz;", "◊", "lozenge"},
{"&#9824;", "&spades;", "♠", "black spade suit"},
{"&#9827;", "&clubs;", "♣", "black club suit = shamrock"},
{"&#9829;", "&hearts;", "♥", "black heart suit = valentine"},
{"&#9830;", "&diams;", "♦", "black diamond suit"},
{"&#34;", "&quot;", "\"", "quotation mark = APL quote"},
{"&#38;", "&amp;", "& ", "ampersand"},
{"&#60;", "&lt;", "< ", "less-than sign"},
{"&#62;", "&gt;", "> ", "greater-than sign"},
{"&#338;", "&OElig;", "Œ ", "latin capital ligature OE"},
{"&#339;", "&oelig;", "œ ", "latin small ligature oe"},
{"&#352;", "&Scaron;", "Š ", "latin capital letter S with caron"},
{"&#353;", "&scaron;", "š ", "latin small letter s with caron"},
{"&#376;", "&Yuml;", "Ÿ ", "latin capital letter Y with diaeresis"},
{"&#710;", "&circ;", "ˆ ", "modifier letter circumflex accent"},
{"&#732;", "&tilde;", "˜ ", "small tilde"},
{"&#8194;", "&ensp;", "\xE2\x80\x82", "en space"},
{"&#8195;", "&emsp;", "\xE2\x80\x83", "em space"},
{"&#8201;", "&thinsp;", "\xE2\x80\x89", "thin space"},
{"&#8204;", "&zwnj;", "\xE2\x80\x8C", "zero width non-joiner"},
{"&#8205;", "&zwj;", "\xE2\x80\x8D", "zero width joiner"},
{"&#8206;", "&lrm;", "\xE2\x80\x8E", "left-to-right mark"},
{"&#8207;", "&rlm;", "\xE2\x80\x8F", "right-to-left mark"},
{"&#8211;", "&ndash;", "–", "en dash"},
{"&#8212;", "&mdash;", "—", "em dash"},
{"&#8216;", "&lsquo;", "‘", "left single quotation mark"},
{"&#8217;", "&rsquo;", "’", "right single quotation mark"},
{"&#8218;", "&sbquo;", "‚", "single low-9 quotation mark"},
{"&#8220;", "&ldquo;", "“", "left double quotation mark"},
{"&#8221;", "&rdquo;", "”", "right double quotation mark"},
{"&#8222;", "&bdquo;", "„", "double low-9 quotation mark"},
{"&#8224;", "&dagger;", "†", "dagger"},
{"&#8225;", "&Dagger;", "‡", "double dagger"},
{"&#8240;", "&permil;", "‰", "per mille sign"},
{"&#8249;", "&lsaquo;", "‹", "single left-pointing angle quotation mark"},
{"&#8250;", "&rsaquo;", "›", "single right-pointing angle quotation mark"},
{"&#8364;", "&euro;", "€", "euro sign"}
};
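#ifndef CHARCODE_FIND
/* Find the entry whose decimal or named reference is a prefix of _s.
 * Afterwards _n is the entry index, or CHARCODE_NUM if nothing matched;
 * _m is the length of the matched reference and is only meaningful when
 * a match was found. Entity names are case-sensitive (&Agrave; is not
 * &agrave;), so the comparison uses strncmp rather than strncasecmp. */
#define CHARCODE_FIND(_s, _m, _n) \
{ \
_m = 0;_n = 0; \
while(_n < CHARCODE_NUM) \
{ \
_m = strlen(charcodelist[_n].dec); \
if(strncmp(_s, charcodelist[_n].dec, _m) == 0) \
{ \
break; \
} \
_m = strlen(charcodelist[_n].code); \
if(strncmp(_s, charcodelist[_n].code, _m) == 0) \
{ \
break; \
} \
_n++; \
} \
}
#endif
#endif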
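To show how a lookup like CHARCODE_FIND might be used while parsing HTML, here is a small self-contained sketch. It is an illustration only: the three-entry `entities` table stands in for charcodelist, and `entity_find` and `decode_entities` are hypothetical helper names, not functions from libibase.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical three-entry stand-in for charcodelist. */
typedef struct {
    const char *dec;   /* decimal character reference */
    const char *code;  /* named entity */
    const char *chr;   /* replacement text */
} Entity;

static const Entity entities[] = {
    {"&#38;", "&amp;", "&"},
    {"&#60;", "&lt;",  "<"},
    {"&#62;", "&gt;",  ">"},
};
#define ENTITY_NUM (sizeof(entities) / sizeof(entities[0]))

/* Return the index of the reference that s starts with and store its
 * length in *len; return -1 (and *len = 0) if none matches. */
static int entity_find(const char *s, size_t *len)
{
    size_t i;
    for (i = 0; i < ENTITY_NUM; i++) {
        *len = strlen(entities[i].dec);
        if (strncmp(s, entities[i].dec, *len) == 0) return (int)i;
        *len = strlen(entities[i].code);
        if (strncmp(s, entities[i].code, *len) == 0) return (int)i;
    }
    *len = 0;
    return -1;
}

/* Copy src to dst, replacing every known reference with its text.
 * dst must hold strlen(src) + 1 bytes; that is enough here because no
 * replacement in the table is longer than the reference it replaces. */
static void decode_entities(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src) {
        size_t len = 0;
        int i = (*src == '&') ? entity_find(src, &len) : -1;
        if (i >= 0) {
            size_t n = strlen(entities[i].chr);
            memcpy(dst, entities[i].chr, n);
            dst += n;
            src += len;
        } else {
            *dst++ = *src++;   /* ordinary byte or unknown reference */
        }
    }
    *dst = '\0';
}
```

Unknown references are copied through unchanged, so text such as `&unknown;` survives decoding instead of being dropped.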
wilbur8415 replied on 2008-07-03 19:28:56
Nice
Good stuff
[ Last edited by wilbur8415 on 2008-7-3 19:58 ]
tyc611 replied on 2008-07-03 22:20:46
What is an "inverted index library"? What does that mean? Could the OP explain?
77h2_eleven replied on 2008-07-03 23:03:23
My undergraduate graduation project was a search engine. I just used lucene directly. (I think that is how it is spelled; I have forgotten it all.)
redor replied on 2008-07-04 08:28:16
Quote: originally posted by tyc611 on 2008-7-3 22:20 (http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8731584&ptid=1187515)
What is an "inverted index library"? What does that mean? Could the OP explain?
It implements an inverted index and is released as a library.... It is not a complete search solution; it is only responsible for indexing data and for retrieval....
To build a complete search engine you have to develop the other parts yourself, such as data download, a daemon service, and so on....
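As background to the answer above: an inverted index maps each term to a posting list, the ascending list of IDs of the documents that contain it. The sketch below shows one common way to store such a list compactly, by writing the gaps between IDs with a variable-byte code. This is an assumption about what "bytecode compression" in the first post refers to, not a description of libibase's on-disk format, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v as 7-bit groups, lowest group first; a set high bit means
 * "more bytes follow". Returns the number of bytes written (1 to 5). */
static size_t vbyte_put(uint32_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Read one value written by vbyte_put; returns the bytes consumed. */
static size_t vbyte_get(const uint8_t *in, uint32_t *v)
{
    size_t n = 0;
    int shift = 0;
    *v = 0;
    for (;;) {
        uint8_t b = in[n++];
        *v |= (uint32_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break;
        shift += 7;
    }
    return n;
}

/* Encode a strictly ascending list of document IDs as gaps.
 * out must have room for 5 * count bytes. Returns bytes written. */
static size_t postings_encode(const uint32_t *doc_ids, size_t count,
                              uint8_t *out)
{
    size_t i, n = 0;
    uint32_t prev = 0;
    for (i = 0; i < count; i++) {
        assert(i == 0 || doc_ids[i] > prev);
        n += vbyte_put(doc_ids[i] - prev, out + n);
        prev = doc_ids[i];
    }
    return n;
}

/* Decode up to max_ids document IDs from len bytes; returns the count. */
static size_t postings_decode(const uint8_t *in, size_t len,
                              uint32_t *doc_ids, size_t max_ids)
{
    size_t n = 0, count = 0;
    uint32_t prev = 0, gap;
    while (n < len && count < max_ids) {
        n += vbyte_get(in + n, &gap);
        prev += gap;
        doc_ids[count++] = prev;
    }
    return count;
}
```

For the posting list 3, 130, 131, 70000 the gaps are 3, 127, 1 and 69869, which take 1 + 1 + 1 + 3 = 6 bytes instead of the 16 bytes needed for four raw 32-bit IDs.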