String splitting problem

1# Posted on 2017-11-01 15:26
Given a string and a word list, I need to split the string as described below. Could an expert suggest an approach?

Now, please decompose OOV words into the mixture of W* and S*, which means this word can be constructed with multiple words and segments. Every word still has length >= 3.
1) This means for the OOV, first search the 27k valid word list, and find the maximum match word W1.
2) Then for the remaining segments (e.g., S1 W1 S2), search the word list again to find the maximum word W2 (e.g., S1 W1 S3 W2 S4).
3) Repeat the process until you cannot find any matching word with length >= 3.

Test:
$string = "WHAT'SATDSYOURATIP";

Word list:
WHAT'S
YOUR
TIP

Output:
WHAT'S
ATDS
YOUR
A
TIP




2# Posted on 2017-11-01 23:44 (last edited by rubyish on 2017-11-01 19:59)

Reply to 1# TrishaTie

"Suggest an approach."

Steps 1, 2 and 3 are the approach ~~
1) This means for the OOV, first search the 27k valid word list, and find the maximum match word W1.
2) Then for the remaining segments (e.g., S1 W1 S2), search the word list again to find the maximum word W2 (e.g., S1 W1 S3 W2 S4).
3) Repeat the process until you cannot find any matching word with length >= 3.

Following steps 1, 2 and 3:

output:
OOV:
WHAT'SATDSYOURATIP

decompose:
WHAT'S ATDS YOUR A TIP
----------------------------------------------------------------
OOV:
Repeattheprocessuntilyoucannotfindanymatchingwordwithlength

decompose:
Repeat the process until you cannot find any matching word with length
----------------------------------------------------------------
OOV:
NowpleasedecomposeOOVwordsintothemixtureofandwhichmeansthiswordcanbeconstructedwithmultiplewords

decompose:
Now please decompose OOV words into the mixture of and which means this word can be constructed with multiple words
----------------------------------------------------------------



biru.pl
#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

# the 27k valid word list
my @the27k = qw[
  Every Repeat Then This again and any can cannot constructed decompose find first for has into length list match matching maximum means mixture multiple please process remaining search segments still TIP the this until valid WHAT'S which with word words you YOUR
];

# minimum word length
my $length = 3;

# the maximum match word: sort longest first, so the first hit is a longest one
@the27k = sort { length($b) <=> length($a) } @the27k;

while (<DATA>) {
    chomp;
    say "OOV:\n$_\n";
    my @words = decompose($_);
    say 'decompose:';
    say "@words";
    say '-' x 64;
}

# ____________________SUB____________________

sub decompose {
    my $segment = shift;

    # length < 3: too short to contain a word, keep it as a segment
    if ( length($segment) < $length ) {
        return $segment;
    }
    my ( $index, $find ) = ( -1, '' );

    # search the 27k valid word list
    for my $word (@the27k) {
        $index = index( $segment, $word );
        next if $index < 0;
        $find = $word;
        last;
    }

    # you cannot find any matching word with length >=3
    return $segment if $index < 0;

    # Then for the remaining segments, search the word list again
    my $S1 = substr $segment, 0, $index;
    my $S2 = substr $segment, $index + length($find);

    # Repeat the process.
    # The last expression evaluated is the sub's return value: a flat list.
    # Note: a segment equal to '0' is false in Perl and would be dropped;
    # testing length($S1) instead of $S1 guards against that.
    $S1 ? decompose($S1) : (),
    $find,
    $S2 ? decompose($S2) : ();

}
__DATA__
WHAT'SATDSYOURATIP
Repeattheprocessuntilyoucannotfindanymatchingwordwithlength
NowpleasedecomposeOOVwordsintothemixtureofandwhichmeansthiswordcanbeconstructedwithmultiplewords


3# Posted on 2017-11-02 20:42
@rubyish, thank you very much!

I ran it locally and it does exactly what I wanted.

However, I don't quite understand the last part:
    # Repeat the process
    $S1 ? decompose($S1) : (),
    $find,
    $S2 ? decompose($S2) : ();

Should I read it as a choice between decompose($S1) and the part inside 《》? Please explain.

$S1 ? decompose($S1) : 《 (), $find, $S2 ? decompose($S2) : ()》;

4# Posted on 2017-11-03 00:43
Reply to 3# TrishaTie

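"Should I read it as a choice between decompose($S1) and the part inside 《》?"
no ~~

For example:
3 * 5 + 6 + 4 * 5 = ( 3 * 5 ) + 6 + ( 4 * 5 )
!= 3 * ( 5 + 6 + 4 ) * 5

Precedence: [?:] > [,], so each ?: is grouped first and the commas then join the three results into one list.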

$S1 ? decompose($S1) : (),
$find,
$S2 ? decompose($S2) : ();

is equivalent to

($S1 ? decompose($S1) : ()),
$find,
($S2 ? decompose($S2) : ());

which is equivalent to

my @S1 = $S1 ? decompose($S1) : ();
my @S2 = $S2 ? decompose($S2) : ();
return @S1, $find, @S2;
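
The grouping can be checked with a small self-contained script. This is only a sketch: the subs `pieces_implicit` and `pieces_explicit` are made-up names for the illustration and are not part of biru.pl. Brackets are put around the side pieces so it is visible which operand ended up where, and an empty side shows how `()` simply disappears from the returned list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: relies on ?: binding tighter than the comma.
sub pieces_implicit {
    my ( $left, $mid, $right ) = @_;

    # Parsed as ( $left ? ... : () ), $mid, ( $right ? ... : () )
    return $left ? "[$left]" : (), $mid, $right ? "[$right]" : ();
}

# Hypothetical helper: the same thing written out step by step.
sub pieces_explicit {
    my ( $left, $mid, $right ) = @_;
    my @l = $left  ? ("[$left]")  : ();
    my @r = $right ? ("[$right]") : ();
    return @l, $mid, @r;
}

for my $args ( [ 'ATDS', 'YOUR', 'ATIP' ], [ '', "WHAT'S", 'ATDS' ] ) {
    my @a = pieces_implicit(@$args);
    my @b = pieces_explicit(@$args);
    print "@a | @b\n";
}
```

Both subs return the same list for each input, and in the second call the empty left side contributes no element at all.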


5# Posted on 2017-11-03 09:27 (from mobile)
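rubyish posted on 2017-11-03 00:43:
Reply to 3# TrishaTie

no ~~

Thank you, rubyish, for replying in the middle of the night. I see now: I got confused at the $find part and misread the whole statement.
Thanks again.
And do get to bed earlier, haha.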

6# Posted on 2017-11-03 14:21
Reply to 5# TrishaTie

There's a time difference.

7# Posted on 2017-11-05 18:52
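What I actually wanted to ask is this: when my word list has more than 20,000 entries and there are many words to look up, isn't this approach very slow? I reimplemented it in C#, and processing even part of the data takes an extremely long time. If I store the word list in a hash and then walk through the characters of each word, would that save time?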
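
One possible reading of "store the word list in a hash" is sketched below, under stated assumptions. The words become hash keys, and for each segment every substring is probed from the longest candidate length down to 3, so the first hit is a maximum match. The names `%is_word`, `$MIN`, `$max_len`, `longest_word` and `decompose_hash` are hypothetical, and ties between equally long words are resolved by leftmost position, which the task description does not specify. The number of hash probes per segment then depends on the segment length and the longest word length rather than on the size of the word list. Whether that is faster in practice depends on the data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MIN = 3;    # minimum word length from the task description

# Toy word list; the real one would be loaded from the 27k-word file.
my %is_word = map { $_ => 1 } qw(WHAT'S YOUR TIP);

# Longest key length, so we never probe substrings longer than any word.
my ($max_len) = sort { $b <=> $a } map { length($_) } keys %is_word;

# Return (offset, word) of a longest word inside $segment, or an empty list.
sub longest_word {
    my ($segment) = @_;
    my $n   = length($segment);
    my $top = $n < $max_len ? $n : $max_len;

    for ( my $len = $top ; $len >= $MIN ; $len-- ) {
        for my $pos ( 0 .. $n - $len ) {
            my $cand = substr( $segment, $pos, $len );
            return ( $pos, $cand ) if $is_word{$cand};
        }
    }
    return;
}

# Same recursion as the first script in the thread: split around the match.
sub decompose_hash {
    my ($segment) = @_;
    my ( $pos, $word ) = longest_word($segment);
    return $segment unless defined $word;

    my $left  = substr( $segment, 0, $pos );
    my $right = substr( $segment, $pos + length($word) );

    return (
        ( length($left)  ? decompose_hash($left)  : () ),
        $word,
        ( length($right) ? decompose_hash($right) : () ),
    );
}

print join( ' ', decompose_hash("WHAT'SATDSYOURATIP") ), "\n";
```

On the test string from the first post this prints `WHAT'S ATDS YOUR A TIP`, the same result as the list-scanning version.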

8# Posted on 2017-11-07 23:06 (last edited by rubyish on 2017-11-07 19:12)

Try a hash ~~
#!/usr/bin/perl
# version 26, subversion 1 (v5.26.1)
use strict;
use warnings;
use 5.010;

my @list = qw[
  Book Both Camel Cite Every Extraction Language Larry Originally Page Perl Permanent Practical Programming Related Repeat Reporting Since Special TIP Then This Though Unix Upload VALID WHAT'S Wall What Wikidata YOUR acronym again and another any are backronyms became because began book borrow bumped but can cannot change changes constructed continue cover decompose compose developed development different documentation documented easier eventually evolved facto file find first for from general has here ideas identify including independently information into item its known language languages length lengthy liberally link links list major make man many mark match matching maximum means mixture multiple not number officially one only originally page pages please process processing programmers published purpose redesign reference remaining report revisions robot same scripting search segments separate single still teams that the then there this time undergone until use various version was well which with word words you your sword
];

my $lexicon = from( \@list );

while (<DATA>) {
    chomp;
    say "OOV:\n$_\n";
    say 'decompose:';
    decompose($_);
    say '-' x 64;
}

# ____________________SUB____________________

# Build a trie out of nested hashes, one level per character.
# The key '@' marks the end of a word.
sub from {
    my $words   = shift;
    my $lexicon = {};

    for my $word (@$words) {
        my @chara = split '', $word;
        my $dit   = $lexicon;

        for my $k (@chara) {
            $dit = $dit->{$k} //= {};
        }

        $dit->{'@'} = 1;
    }

    return $lexicon;
}

# $sigil[$i] = length of the longest listed word starting at position $i
# (0 if no word starts there).
sub decompose {
    my $unknown = shift;
    my @unknown = split '', $unknown;
    my $fate    = $#unknown;
    my @sigil   = (0) x @unknown;

    for my $i ( 0 .. $fate ) {
        my $dit = $lexicon;

        for my $j ( $i .. $fate ) {
            $dit = $dit->{ $unknown[$j] };
            last unless $dit;

            if ( $dit->{'@'} ) {
                $sigil[$i] = $j - $i + 1;
            }
        }
    }

    say join ' ', D_( $unknown, \@sigil, 0, [ 0, $#sigil ] );
}

# Pick the longest word inside the range (leftmost on ties),
# then recurse on the parts to its left and right.
# $need is true when the right edge of the range has moved, so that
# lengths recorded in $sigil may run past it and must be recomputed.
sub D_ {
    my ( $unknown, $sigil, $need, $range ) = @_;
    my ( $r1, $r2 ) = @$range;
    my ( $fine, $indes ) = ( 0, 0 );

    modify( $unknown, $sigil, $r1, $r2 ) if $need;

    for my $i ( $r1 .. $r2 ) {
        next if $sigil->[$i] <= $fine;

        $fine  = $sigil->[$i];
        $indes = $i;
    }

    # no word in this range: return it as one segment
    if ( !$fine ) {
        return substr $unknown, $r1, $r2 - $r1 + 1;
    }

    my $find = substr $unknown, $indes, $fine;

    # the last expression evaluated is the return value (a flat list)
    $r1 <= $indes - 1
      ? D_( $unknown, $sigil, 1, [ $r1, $indes - 1 ] )
      : (),
    $find,
    $indes + $fine <= $r2
      ? D_( $unknown, $sigil, 0, [ $indes + $fine, $r2 ] )
      : ();
}

# For every start position whose recorded word runs past $r2,
# recompute the longest word that fits inside the range.
sub modify {
    my ( $unknown, $sigil, $r1, $r2 ) = @_;

    for my $i ( $r1 .. $r2 ) {
        next unless $sigil->[$i];
        next if $i + $sigil->[$i] - 1 <= $r2;

        my $dit = $lexicon;
        $sigil->[$i] = 0;

        for my $j ( $i .. $r2 ) {
            $dit = $dit->{ substr( $unknown, $j, 1 ) };
            last unless $dit;

            if ( $dit->{'@'} ) {
                $sigil->[$i] = $j - $i + 1;
            }
        }
    }
}

__DATA__
WHAT'SATDSYOURATIP
themixtureofandwhichmeansthiswordcan
ThoughPerlisnotofficiallyanacronymtherearevariousbackronymsinuseincludingPracticalExtractionandReportingLanguagePerlwasoriginallydevelopedbyLarryWallin1987asageneralpurposeUnixscriptinglanguagetomakereportprocessingeasierSincethenithasundergonemanychangesandrevisionsPerl6whichbeganasaredesignofPerl5in2000eventuallyevolvedintoaseparatelanguageBothlanguagescontinuetobedevelopedindependentlybydifferentdevelopmentteamsandliberallyborrowideasfromoneanother

