Count the number of times a word appears in a BigQuery column

Question 1

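I have a column with long strings and need to count the most frequently used words.

I need something that works like this: https://towardsdatascience.com/very-simple-python-script-for-extracting-most-common-words-from-a-story-1e3570d0b9d0. At least the word-counting part...

It is very important that I can blacklist some words of my choosing so that they are not counted.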

Question 2

Try the simple approach below

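-- words that should not be counted (stored in lowercase)
with blacklist as (
  select 'with' word union all 
  select 'that' union all
  select 'add more as you see needed'
)
select lower(word) word, count(*) frequency
-- \w+ extracts runs of word characters and skips empty matches
from data, unnest(regexp_extract_all(col, r'\w+')) word
where length(word) > 3  
-- compare in lowercase so 'With' and 'with' are both excluded
and lower(word) not in (select word from blacklist)
-- group by the lowercased word (first select column)
group by 1
order by frequency desc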

Mikhail Berlyant · Answer 1 · 2021-11-23T22:40:30

It didn't work... The phrases are in Portuguese, is that the problem? Or maybe I didn't make the right substitution in your code.

Murilo

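), blacklist as (select 'with' word union all select 'that' union all select 'add more as you see needed') select lower(word) word, count() frequency from T0, unnest(regexp_extract_all(T0.column, r'[\w]')) word where length(word) > 3 and word not in (select word from blacklist) group by word order by frequency desc /// I tried this..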

Murilo

Please be more specific: what do you mean by "it didn't work"? Provide example input data, etc.

Mikhail Berlyant

My bad, I get this message: "This query returned no results".

Murilo

Never mind, I had an error in my original query. It works perfectly now, thank you very much.

Murilo

Thanks for confirming. Glad it worked for you. Also consider voting up the answer if it helped :o)

Mikhail Berlyant

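By the way, I looked at the results and the code is cutting words that contain some "Brazilian letters" such as "Ç" and other accented characters. Is there a way to account for those? A word like "informação" is counted as "informa".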

Murilo

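Sure, it's doable; I will check shortly. In the meantime, look through my other answers for how to handle such characters, etc. There should be at least a few relevant answers :o)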

Mikhail Berlyant
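
A possible way to keep accented letters, offered as a sketch and not as the answer author's own solution: BigQuery regular expressions use RE2 syntax, where \w covers only ASCII letters, digits and underscore, which would explain why "informação" gets cut at "ç". Matching on the Unicode letter class \p{L} instead should keep such words whole. The sample sentence, the Portuguese blacklist entries and the `words` CTE name below are made up for illustration; `data` and `col` stand in for the real table and column.

```sql
-- Sketch only: sample data and blacklist entries are illustrative assumptions.
with data as (
  select 'A informação é útil e a informação é pública' as col
),
blacklist as (
  select 'para' as word union all
  select 'como'
),
words as (
  -- \p{L}+ matches runs of Unicode letters, so "ç" and "ã" stay inside the word
  select lower(w) as word
  from data, unnest(regexp_extract_all(col, r'\p{L}+')) as w
)
select word, count(*) as frequency
from words
-- char_length counts characters, not bytes, so accented letters count once
where char_length(word) > 3
  and word not in (select word from blacklist)
group by word
order by frequency desc
```

Lowercasing inside the `words` CTE means the length filter, the blacklist check and the grouping all see the same normalized value. Note that \p{L} excludes digits and underscores, unlike \w; if those matter, a class such as [\p{L}\p{N}_]+ may be closer to the original behavior.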
