String splitting problem

#1 | Posted 2017-11-01 15:26

Given a string and a word list, I need to split the string as described below. Could anyone suggest an approach?

Now, please decompose OOV words into the mixture of W* and S*, which means this word can be constructed with multiple words and segments. Every word still has length >=3.
1) This means for the OOV, first search the 27k valid word list, and find the maximum match word W1.
2) Then for the remaining segments (e.g., S1 W1 S2), search the word list again to find the maximum word W2 (e.g., S1 W1 S3 W2 S4).
3) Repeat the process until you cannot find any matching word with length >=3.

Test:
$string = "WHAT'SATDSYOURATIP";

Word list:
WHAT'S
YOUR
TIP

Output:
WHAT'S
ATDS
YOUR
A
TIP
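
Step 1 on the test data can be sketched as follows. This is only a minimal illustration of "find the maximum match word": `longest_match` is a hypothetical helper name, and breaking ties between equally long words by list order is an assumption, since the problem statement does not say which one wins.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the longest word (length >= $min) that occurs
# in $segment together with its position, or an empty list if none occurs.
sub longest_match {
    my ( $segment, $words, $min ) = @_;
    my ( $best, $at ) = ( '', -1 );
    for my $word (@$words) {
        next if length($word) < $min || length($word) <= length($best);
        my $pos = index( $segment, $word );
        ( $best, $at ) = ( $word, $pos ) if $pos >= 0;
    }
    return $at < 0 ? () : ( $best, $at );
}

my $string = "WHAT'SATDSYOURATIP";
my @words  = ( "WHAT'S", 'YOUR', 'TIP' );

my ( $w1, $at ) = longest_match( $string, \@words, 3 );
my $s1 = substr( $string, 0, $at );               # segment before W1 (empty here)
my $s2 = substr( $string, $at + length($w1) );    # segment after W1
print "W1=$w1 S1='$s1' S2='$s2'\n";    # W1=WHAT'S S1='' S2='ATDSYOURATIP'
```

Steps 2 and 3 then apply the same search to each non-empty remaining segment.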

#2 | Posted 2017-11-01 23:44

Last edited by rubyish on 2017-11-01 19:59

Reply to 1# TrishaTie

"Could anyone suggest an approach?"

Steps 1, 2 and 3 are the approach ~~

1) This means for the OOV, first search the 27k valid word list, and find the maximum match word W1.
2) Then for the remaining segments (e.g., S1 W1 S2), search the word list again to find the maximum word W2 (e.g., S1 W1 S3 W2 S4).
3) Repeat the process until you cannot find any matching word with length >=3.

Following steps 1, 2 and 3:

output:

OOV:
WHAT'SATDSYOURATIP
decompose:
WHAT'S ATDS YOUR A TIP
----------------------------------------------------------------
OOV:
Repeattheprocessuntilyoucannotfindanymatchingwordwithlength
decompose:
Repeat the process until you cannot find any matching word with length
----------------------------------------------------------------
OOV:
NowpleasedecomposeOOVwordsintothemixtureofandwhichmeansthiswordcanbeconstructedwithmultiplewords
decompose:
Now please decompose OOV words into the mixture of and which means this word can be constructed with multiple words
----------------------------------------------------------------


biru.pl

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# the 27k valid word list (a small stand-in for it here)
my @the27k = qw[
    Every Repeat Then This again and any can cannot constructed decompose
    find first for has into length list match matching maximum means mixture
    multiple please process remaining search segments still TIP the this
    until valid WHAT'S which with word words you YOUR
];
my $length = 3;    # minimum word length

# the maximum match word: longest first, ignoring words shorter than $length
@the27k = sort { length($b) <=> length($a) }
    grep { length($_) >= $length } @the27k;

while (<DATA>) {
    chomp;
    say "OOV:\n$_";
    my @words = decompose($_);
    say 'decompose:';
    say "@words";
    say '-' x 64;
}

# ____________________SUB____________________
sub decompose {
    my $segment = shift;

    # length < 3
    if ( length($segment) < $length ) {
        return $segment;
    }
    my ( $index, $find ) = ( -1, '' );

    # search the 27k valid word list
    for my $word (@the27k) {
        $index = index( $segment, $word );
        next if $index < 0;
        $find = $word;
        last;
    }

    # you cannot find any matching word with length >=3
    return $segment if $index < 0;

    # Then for the remaining segments, search the word list again
    my $S1 = substr $segment, 0, $index;
    my $S2 = substr $segment, $index + length($find);

    # Repeat the process
    # (caveat: a segment that is exactly "0" is false in Perl and would be dropped)
    $S1 ? decompose($S1) : (),
    $find,
    $S2 ? decompose($S2) : ();
}
__DATA__
WHAT'SATDSYOURATIP
Repeattheprocessuntilyoucannotfindanymatchingwordwithlength
NowpleasedecomposeOOVwordsintothemixtureofandwhichmeansthiswordcanbeconstructedwithmultiplewords

#3 | Posted 2017-11-02 20:42

@rubyish, thank you very much!

I ran it locally and it does exactly what I wanted.

I did not quite follow the last part, though.
    # Repeat the process
    $S1 ? decompose($S1) : (),
    $find,
    $S2 ? decompose($S2) : ();

Can I read it as a choice between decompose($S1) and the part inside 《》? Please explain.

$S1 ? decompose($S1) : 《 (), $find, $S2 ? decompose($S2) : ()》;

#4 | Posted 2017-11-03 00:43

Reply to 3# TrishaTie

"Can I read it as a choice between decompose($S1) and the part inside 《》?"

no ~~

For example:
3 * 5 + 6 + 4 * 5 = ( 3 * 5 ) + 6 + ( 4 * 5 )
!= 3 * ( 5 + 6 + 4 ) * 5

Precedence: [?:] > [,], so each ?: is grouped first and the commas then join the three results into one list.

$S1 ? decompose($S1) : (),
$find,
$S2 ? decompose($S2) : ();

is equivalent to

($S1 ? decompose($S1) : ()),
$find,
($S2 ? decompose($S2) : ());

which is equivalent to

my @S1 = $S1 ? decompose($S1) : ();
my @S2 = $S2 ? decompose($S2) : ();
return @S1, $find, @S2;
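
The grouping can be checked with a small self-contained script. This is only a sketch: `wrap` is a hypothetical stand-in for decompose() that puts its argument in brackets, and `pieces` is a made-up name for the three-item list expression discussed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for decompose(): just wraps its argument in brackets.
sub wrap { return "[$_[0]]" }

sub pieces {
    my ( $S1, $find, $S2 ) = @_;

    # Three list items separated by commas; each ?: is evaluated on its own.
    return ( $S1 ? wrap($S1) : (), $find, $S2 ? wrap($S2) : () );
}

print join( ' ', pieces( 'AB', 'WORD', 'CD' ) ), "\n";    # [AB] WORD [CD]
print join( ' ', pieces( '',   'WORD', 'CD' ) ), "\n";    # WORD [CD]
print join( ' ', pieces( 'AB', 'WORD', '' ) ),   "\n";    # [AB] WORD
```

If the comma bound tighter than ?:, the first call would return only [AB]; WORD appears in every line, so $find is always part of the list.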

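#5 | Posted 2017-11-03 09:27 from mobile

rubyish posted on 2017-11-03 00:43:
Reply to 3# TrishaTie

no ~~

Thanks, rubyish, for replying in the middle of the night. So it was the $find part that confused me, and I misread the whole statement.
Thanks again.
I was going to say: get to bed earlier, haha.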

#6 | Posted 2017-11-03 14:21

Reply to 5# TrishaTie

There is a time difference.

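#7 | Posted 2017-11-05 18:52

What I actually wanted to ask is this: when my word list has more than 20,000 entries and there are many words to look up, won't this approach be very slow? I reimplemented it in C#, and processing just part of the data took an extremely long time. If I store the word list in a hash and then walk through each word character by character, would that save time?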
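
One way to read the hash idea, as a minimal sketch under stated assumptions: the word list goes into a hash once, and each segment is then probed with its own substrings, longest first, so the work per segment no longer depends on how many words the list holds. `build_lexicon` and `decompose_hash` are hypothetical names, ties between equally long matches go to the leftmost position (the problem statement does not specify this), and nothing here has been benchmarked.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MIN = 3;    # minimum word length from the problem statement

# Build the lookup table once; returns (\%lexicon, longest word length).
sub build_lexicon {
    my %lex = map { $_ => 1 } grep { length($_) >= $MIN } @_;
    my $max = 0;
    for my $word ( keys %lex ) {
        $max = length($word) if length($word) > $max;
    }
    return ( \%lex, $max );
}

# Split $segment around the longest substring found in the lexicon, then
# recurse into the pieces on both sides. Ties go to the leftmost position.
sub decompose_hash {
    my ( $segment, $lex, $max ) = @_;
    my $n   = length($segment);
    my $top = $max < $n ? $max : $n;
    for ( my $len = $top ; $len >= $MIN ; $len-- ) {
        for my $i ( 0 .. $n - $len ) {
            my $cand = substr( $segment, $i, $len );
            next unless exists $lex->{$cand};
            my $left  = substr( $segment, 0, $i );
            my $right = substr( $segment, $i + $len );
            return (
                ( length($left)  ? decompose_hash( $left,  $lex, $max ) : () ),
                $cand,
                ( length($right) ? decompose_hash( $right, $lex, $max ) : () ),
            );
        }
    }
    return $segment;    # no word of length >= $MIN fits
}

my ( $lex, $max ) = build_lexicon( "WHAT'S", 'YOUR', 'TIP' );
print join( ' ', decompose_hash( "WHAT'SATDSYOURATIP", $lex, $max ) ), "\n";
# WHAT'S ATDS YOUR A TIP
```

Each segment needs at most (longest word length - 2) x (segment length) hash lookups, whereas scanning the list with index() needs one substring search per word in the list.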

#8 | Posted 2017-11-07 23:06

Last edited by rubyish on 2017-11-07 19:12

Try a hash ~~

#!/usr/bin/perl
# version 26, subversion 1 (v5.26.1)
use strict;
use warnings;
use 5.010;
my @list = qw[
Book Both Camel Cite Every Extraction Language Larry Originally Page Perl Permanent Practical Programming Related Repeat Reporting Since Special TIP Then This Though Unix Upload VALID WHAT'S Wall What Wikidata YOUR acronym again and another any are backronyms became because began book borrow bumped but can cannot change changes constructed continue cover decompose compose developed development different documentation documented easier eventually evolved facto file find first for from general has here ideas identify including independently information into item its known language languages length lengthy liberally link links list major make man many mark match matching maximum means mixture multiple not number officially one only originally page pages please process processing programmers published purpose redesign reference remaining report revisions robot same scripting search segments separate single still teams that the then there this time undergone until use various version was well which with word words you your sword
];
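my $lexicon = from( \@list );
while (<DATA>) {
    chomp;
    say "OOV:\n$_";
    say 'decompose:';
    decompose($_);
    say '-' x 64;
}

# ____________________SUB____________________
# Build a trie (nested hashes, one level per character) from the word list.
# The key '@' marks the end of a word, so the input must not contain '@'.
sub from {
    my $words   = shift;
    my $lexicon = {};
    for my $word (@$words) {
        next if length($word) < 3;    # words must have length >= 3
        my @chara = split '', $word;
        my $dit   = $lexicon;
        for my $k (@chara) {
            $dit = $dit->{$k} //= {};
        }
        $dit->{'@'} = 1;
    }
    return $lexicon;
}

# For each start position, record in @sigil the length of the longest word
# beginning there (0 if none), then split recursively and print the pieces.
sub decompose {
    my $unknown = shift;
    my @unknown = split '', $unknown;
    my $fate    = $#unknown;
    my @sigil   = (0) x @unknown;
    for my $i ( 0 .. $fate ) {
        my $dit = $lexicon;
        for my $j ( $i .. $fate ) {
            $dit = $dit->{ $unknown[$j] };
            last unless $dit;
            if ( $dit->{'@'} ) {
                $sigil[$i] = $j - $i + 1;
            }
        }
    }
    say join ' ', D_( $unknown, \@sigil, 0, [ 0, $#sigil ] );
}

# Split the range [$r1, $r2] around its longest word and recurse on both sides.
# $need is true when the right edge has moved left, so lengths that now run
# past $r2 must be recomputed first.
sub D_ {
    my ( $unknown, $sigil, $need, $range ) = @_;
    my ( $r1, $r2 ) = @$range;
    my ( $fine, $indes ) = ( 0, 0 );    # best length and its start index
    modify( $unknown, $sigil, $r1, $r2 ) if $need;
    for my $i ( $r1 .. $r2 ) {
        next if $sigil->[$i] <= $fine;
        $fine  = $sigil->[$i];
        $indes = $i;
    }

    # no word in this range: return it as one segment
    if ( !$fine ) {
        return substr $unknown, $r1, $r2 - $r1 + 1;
    }
    my $find = substr $unknown, $indes, $fine;
    $r1 <= $indes - 1
        ? D_( $unknown, $sigil, 1, [ $r1, $indes - 1 ] )
        : (),
    $find,
    $indes + $fine <= $r2
        ? D_( $unknown, $sigil, 0, [ $indes + $fine, $r2 ] )
        : ();
}

# Recompute the entries of @$sigil whose word would extend beyond $r2,
# keeping the longest word that still fits inside [$i, $r2].
sub modify {
    my ( $unknown, $sigil, $r1, $r2 ) = @_;
    for my $i ( $r1 .. $r2 ) {
        next unless $sigil->[$i];
        next if $i + $sigil->[$i] - 1 <= $r2;
        my $dit = $lexicon;
        $sigil->[$i] = 0;
        for my $j ( $i .. $r2 ) {
            $dit = $dit->{ substr( $unknown, $j, 1 ) };
            last unless $dit;
            if ( $dit->{'@'} ) {
                $sigil->[$i] = $j - $i + 1;
            }
        }
    }
}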
__DATA__
WHAT'SATDSYOURATIP
themixtureofandwhichmeansthiswordcan
ThoughPerlisnotofficiallyanacronymtherearevariousbackronymsinuseincludingPracticalExtractionandReportingLanguagePerlwasoriginallydevelopedbyLarryWallin1987asageneralpurposeUnixscriptinglanguagetomakereportprocessingeasierSincethenithasundergonemanychangesandrevisionsPerl6whichbeganasaredesignofPerl5in2000eventuallyevolvedintoaseparatelanguageBothlanguagescontinuetobedevelopedindependentlybydifferentdevelopmentteamsandliberallyborrowideasfromoneanother
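
To see what the nested-hash lexicon buys, here is a minimal sketch of its first pass on the three-word test list. `build_trie`, `longest_at` and `$END` are names made up for this illustration; the multi-character 'END' marker is an assumption chosen so that it can never collide with a one-character edge key (the script above uses '@').

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $END = 'END';    # end-of-word marker; edge keys are single characters

# Build a nested-hash trie: one level per character.
sub build_trie {
    my %trie;
    for my $word (@_) {
        my $node = \%trie;
        for my $ch ( split //, $word ) {
            $node = $node->{$ch} //= {};
        }
        $node->{$END} = 1;
    }
    return \%trie;
}

# For every start position, the length of the longest word beginning there
# (0 if none). Each position costs at most one walk down the trie.
sub longest_at {
    my ( $text, $trie ) = @_;
    my @chars = split //, $text;
    my @len   = (0) x @chars;
    for my $i ( 0 .. $#chars ) {
        my $node = $trie;
        for my $j ( $i .. $#chars ) {
            $node = $node->{ $chars[$j] } or last;    # no such edge: stop
            $len[$i] = $j - $i + 1 if $node->{$END};
        }
    }
    return @len;
}

my $trie = build_trie( "WHAT'S", 'YOUR', 'TIP' );
my @len  = longest_at( "WHAT'SATDSYOURATIP", $trie );
print "@len\n";    # 6 0 0 0 0 0 0 0 0 0 4 0 0 0 0 3 0 0
```

The walk from each position stops as soon as no word continues with the next character, so the cost depends on the string and the longest word, not on the number of words in the list. The recursive split then only has to pick the largest entry in a range.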
