Python: Capturing sub-patterns within a pattern using a regular expression

Disclaimer: this is my first post. Feel free to give me feedback on how I should or shouldn't format this question. Thanks!

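I want to extract data from a block of text by capturing anything that matches a date-format pattern followed by a colon. I have successfully used a regular expression to capture the observation date, the colon, and whatever text follows, up to the period before the next date.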

For example:
1999-01-01: 10 birds observed.

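The problem I'm running into is that some of my data contains site names, each followed by a colon, inside the observation data that comes after the observation date and the first colon. This 'sitename:data' sub-pattern can occur zero or more times within the block that follows an observation date.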

For example:
1999-01-01: BS-001: 5 birds observed. All were healthy. BS-002: 5 birds observed, some of which were in poor health.

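What pattern should I use to capture the date and all the text after its colon, including any site names, their colons, and the associated data, up to the period before the next observation date?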

I currently use the following pattern to extract simple observation data by date and observation (where there are not multiple sites):

pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')  

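The code above lets me extract observation dates in a variety of forms. Using the period as part of the pattern is tricky, because the observation data can be one sentence or several.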

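Below is an example of the text I'm trying to search and split. Each new match should begin with an observation date, so the data below should return 3 matches (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):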

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old
AMMA mass, but “bumpier” on outside (membrane and embryo-spacing in
the masses were AMJE-like). BS-443: 3 egg masses observed in vernal
pool habitat. A few egg masses may have been missed due to poor light
conditions. Smith-019: 250 egg masses observed in vernal pool habitat.
Observer searched only portions abutting the road (SW margin of pool).
Many AMJE masses observed attached to herbaceous vegetation and
difficult to differentiate from one another. AMJE egg-mass count is a
rough estimate within area searched. 2017-01-01: 23 individuals
observed. Egg masses were not present. 2018-07-04: BS-440: All
individuals took a break from breeding for the long holiday weekend.

Ideally, the output would look like this:

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old
AMMA mass, but “bumpier” on outside (membrane and embryo-spacing in
the masses were AMJE-like). BS-443: 3 egg masses observed in vernal
pool habitat. A few egg masses may have been missed due to poor light
conditions. Smith-019: 250 egg masses observed in vernal pool habitat.
Observer searched only portions abutting the road (SW margin of pool).
Many AMJE masses observed attached to herbaceous vegetation and
difficult to differentiate from one another. AMJE egg-mass count is a
rough estimate within area searched.

2017-01-01: 23 individuals observed. Egg masses were not present.

2018-07-04: BS-440: All individuals took a break from breeding for the
long holiday weekend.

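Best answer: Basically, it sounds like you want to split the text into fields that start with a date and end just before the next date or the end of the text. Here is one possibility (laid out with comments, as it could be written under the re.VERBOSE flag):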

\d{4}-\d\d-\d\d:           # date with colon
.*?                        # the minimal amount of any characters required to match
(?=                        # positive lookahead (match text but don't consume it)
   \d{4}-\d\d-\d\d:        # date with colon
  |                        # or
   $                      # end of text
)                          # end lookahead
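
As a minimal sketch, the commented pattern can be compiled more or less as written. The name DATE_ENTRY and the short two-entry sample are hypothetical, chosen only for illustration. re.VERBOSE tells the regex engine to ignore the layout whitespace and the # comments, and re.DOTALL is added on the assumption that entries may wrap across lines, as they do in the question's sample.

```python
import re

# The commented pattern, compiled with its layout and comments intact.
# re.VERBOSE: ignore unescaped whitespace and "#" comments in the pattern.
# re.DOTALL:  let "." match newlines so one entry may span several lines.
DATE_ENTRY = re.compile(
    r"""
    \d{4}-\d\d-\d\d:        # date with colon
    .*?                     # as few characters as possible
    (?=                     # positive lookahead (checks, does not consume)
        \d{4}-\d\d-\d\d:    # the next date with colon
      |                     # or
        $                   # end of text
    )                       # end lookahead
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample: two dated entries, hard-wrapped like the real data.
sample = (
    "1999-01-01: BS-001: 5 birds observed.\n"
    "All healthy. 1999-02-01: 10 birds\n"
    "observed."
)

entries = [m.group(0) for m in DATE_ENTRY.finditer(sample)]
for entry in entries:
    print(repr(entry))
```

Because the lookahead only checks for the next date without consuming it, the following match can start exactly at that date, so no text is lost between entries.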

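Use it with re.findall(). If the text contains line breaks, as the sample does, pass re.DOTALL, because by default . does not match a newline and the pattern would then find nothing:

re.findall(r'\d{4}-\d\d-\d\d:.*?(?=\d{4}-\d\d-\d\d:|$)', mytext, re.DOTALL)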

Run against the sample text above:

['2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
  Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk
  old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing
  in the masses were AMJE-like). BS-443: 3 egg masses observed in
  vernal pool habitat. A few egg masses may have been missed due to
  poor light conditions. Smith-019: 250 egg masses observed in
  vernal pool habitat. Observer searched only portions abutting the 
  road (SW margin of pool). Many AMJE masses observed attached
  to herbaceous vegetation and difficult to differentiate from
  one another. AMJE egg-mass count is a rough estimate within
  area searched. ',
 '2017-01-01: 23 individuals observed. Egg masses were not present. ',
 '2018-07-04: BS-440: All individuals took a break from breeding for
  the long holiday weekend.']
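
The matches keep the original line breaks and a trailing space, whereas the desired output in the question shows clean records. One way to tidy them, sketched here under assumptions, is to collapse every run of whitespace in each match. The helper name split_observations is hypothetical, and mytext is an abbreviated excerpt of the sample text, not the full data.

```python
import re

# Same pattern as in the answer; re.DOTALL lets "." cross line breaks.
DATE_ENTRY = re.compile(r"\d{4}-\d\d-\d\d:.*?(?=\d{4}-\d\d-\d\d:|$)", re.DOTALL)


def split_observations(text):
    """Return one whitespace-normalized string per dated entry."""
    # str.split() with no arguments splits on any run of whitespace,
    # newlines included, so " ".join() puts each entry on a single line
    # and drops the trailing space left in front of the next date.
    return [" ".join(entry.split()) for entry in DATE_ENTRY.findall(text)]


# Abbreviated excerpt of the sample text (assumption: same layout as the
# full data, hard-wrapped with dates appearing mid-line).
mytext = (
    "2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.\n"
    "BS-443: 3 egg masses observed in vernal\n"
    "pool habitat. 2017-01-01: 23 individuals\n"
    "observed. Egg masses were not present. 2018-07-04: BS-440: All\n"
    "individuals took a break from breeding for the long holiday weekend.\n"
)

records = split_observations(mytext)
for record in records:
    print(record)
    print()
```

Normalizing after matching keeps the regular expression itself unchanged; the site-name colons (BS-440:, BS-443:) need no special handling because only a full date followed by a colon ends an entry.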