java - Fast way to extract hashtags, user mentions, and URLs from tweet text?

February 13, 2023 · 166 views

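I'm trying to find a fast way to get an array of strings for each of the following in a tweet's text: 1) hashtags, 2) user mentions, 3) URLs. I have the tweet texts in a CSV file.

My current approach takes too long to process, and I'd like to know whether I can optimize my code. I'll show the regex for each match type, but to avoid posting a lot of code I'll only show how I match hashtags. URLs and user mentions use the same technique.

Here it is: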

public static String hashtagRegex = "^#\\w+|\\s#\\w+";
public static Pattern hashtagPattern = Pattern.compile(hashtagRegex);

public static String urlRegex = "http+://[\\S]+|https+://[\\S]+";
public static Pattern urlPattern = Pattern.compile(urlRegex);

public static String mentionRegex = "^@\\w+|\\s@\\w+";
public static Pattern mentionPattern = Pattern.compile(mentionRegex);

public static String[] getHashtag(String text) {
    String[] hashtags = new String[0];
    Matcher matcher = hashtagPattern.matcher(text);

    if ( matcher.find() ) {
        hashtags = new String[matcher.groupCount()];
        for ( int i = 0; matcher.find(); i++ ) {
            // Also, I'm getting an ArrayIndexOutOfBoundsException here
            hashtags[i] = matcher.group().replace(" ", "").replace("#", "");
        }
    }

    return hashtags;
}

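Best answer: Matcher#groupCount gives you the number of capturing groups in the pattern, not the number of matches. That is why you get an ArrayIndexOutOfBoundsException (in your case, the pattern has no capturing groups, so the array is created with size zero). You probably want to collect the matches in a List, which grows dynamically, rather than in an array.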
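The List-based approach could look roughly like the sketch below. The class name HashtagExtractor, the method name getHashtags, and the lookbehind-based pattern are assumptions made for illustration, not the asker's code. The loop calls find() only in the while condition, so the first match is not consumed by a separate if check, and the capturing group returns the tag without the leading "#" or space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class HashtagExtractor {
    // (?<!\S) requires start of input or whitespace before '#',
    // so the match never includes the preceding space.
    // Group 1 captures the tag text without the '#'.
    private static final Pattern HASHTAG = Pattern.compile("(?<!\\S)#(\\w+)");

    static List<String> getHashtags(String text) {
        List<String> hashtags = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(text);
        // Each call to find() advances to the next match; no size is needed up front.
        while (matcher.find()) {
            hashtags.add(matcher.group(1));
        }
        return hashtags;
    }
}
```

If an array is really required, the result can be converted at the end with hashtags.toArray(new String[0]). Mentions can be handled the same way by swapping "#" for "@" in the pattern.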

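One (potential) way to speed this up might be to tokenize the text on whitespace and then simply check whether each token starts with a fragment such as http, @, or #. That way you can avoid regular expressions entirely. (I haven't profiled this, so I can't say what the performance impact would be.)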
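A minimal sketch of the tokenizing idea, assuming a hypothetical TweetTokenScanner class that sorts tokens into three lists. It uses java.util.StringTokenizer with its default whitespace delimiters and plain prefix checks. Whether it is actually faster than the regex version is untested here and would need profiling on real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical result holder and scanner; names are illustrative only.
class TweetTokenScanner {
    final List<String> hashtags = new ArrayList<>();
    final List<String> mentions = new ArrayList<>();
    final List<String> urls = new ArrayList<>();

    static TweetTokenScanner scan(String text) {
        TweetTokenScanner result = new TweetTokenScanner();
        // Default delimiters are space, tab, newline, carriage return, and form feed.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("http://") || token.startsWith("https://")) {
                result.urls.add(token);
            } else if (token.length() > 1 && token.charAt(0) == '#') {
                result.hashtags.add(token.substring(1)); // drop the leading '#'
            } else if (token.length() > 1 && token.charAt(0) == '@') {
                result.mentions.add(token.substring(1)); // drop the leading '@'
            }
        }
        return result;
    }
}
```

One difference from the regex version: a token keeps any trailing punctuation (for example "#java," yields "java,"), because nothing here restricts the tag to word characters. If that matters, the token would need to be trimmed after the prefix check.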