Optimizing a slow XQuery query in BaseX

I have a BaseX XML database that contains a single small XML file. The file basically consists of two structures: PlatformCategory, with 46 instances, and PlatformGeneralType, with 213 instances.

Each PlatformGeneralType refers to a PlatformCategory through an href attribute.

<PlatformGeneralType id="/plib/platformgeneraltypes/pgt1">
  <name>No statement</name>
  <enum>NO_STATEMENT</enum>
  <isOfPlatformCategory href="/plib/platformcategories/pc1"/>
  <readOnly>true</readOnly>
</PlatformGeneralType>

<PlatformCategory id="/plib/platformcategories/pc1">
  <name>No statement</name>
  <enum>NO_STATEMENT</enum>
  <environment>AIR</environment>
  <readOnly>true</readOnly>
</PlatformCategory>

When I run the following query, it takes about six seconds to return a result:

//PlatformGeneralType[isOfPlatformCategory/@href=//PlatformCategory[environment="AIR"]/@id]

What can I do to optimize this query?

Note that I have already run "Optimize All".

Update: the problem with the previous query seems to be resolved. However, the following extended query takes about 44 seconds:

/PLib/PlatformSpecificTypes/PlatformSpecificType
[isOfPlatformGeneralType/@href=/PLib/PlatformGeneralTypes/PlatformGeneralType
    [isOfPlatformCategory/@href=/PLib/PlatformCategories/PlatformCategory
        [environment='AIR']/@id]/@id]

PlatformSpecificType has 8939 instances, with the following structure:

<PlatformSpecificTypes>
    <PlatformSpecificType id="/plib/platformspecifictypes/DataShip.3">
        <name>Meko 360H2</name>
        <lethalityLevel>LOW</lethalityLevel>
        <isOfPlatformGeneralType href="/plib/platformgeneraltypes/pgt62"/>
        <ownedByCountry href="/plib/countries/10"/>
    </PlatformSpecificType>
</PlatformSpecificTypes>

Its query info:

Query:
/PLib/PlatformSpecificTypes/PlatformSpecificType[isOfPlatformGeneralType/@href=/PLib/PlatformGeneralTypes/PlatformGeneralType[isOfPlatformCategory/@href=/PLib/PlatformCategories/PlatformCategory[environment='AIR']/@id]/@id]
Result:
- Hit(s): 3642 Items
- Updated: 0 Items
- Printed: 2048 KB
- Read Locking: local [command_plib]
- Write Locking: none
Timing:
- Parsing: 1.25 ms
- Compiling: 0.71 ms
- Evaluating: 44248.94 ms
- Printing: 37.11 ms
- Total Time: 44288.02 ms
Query plan: (plan markup not preserved)

Database properties:

Database Properties
 Name: command_plib
 Size: 20247 KB
 Nodes: 781606
 Documents: 1
 Binaries: 0
 Timestamp: 2015-06-12-10-12-14

Resource Properties
 Input Path: /home/sceran/Documents/PLIB/command_plib.xml
 Input Size: 21354 KB
 Timestamp: 2015-06-11-15-34-07
 Encoding: UTF-8
 CHOP: true

Indexes
 Up-to-date: true
 TEXTINDEX: true
 ATTRINDEX: true
 FTINDEX: false
 LANGUAGE: English
 STEMMING: true
 CASESENS: true
 DIACRITICS: false
 STOPWORDS: 
 UPDINDEX: false
 AUTOOPTIMIZE: false
 MAXCATS: 100
 MAXLEN: 96

Query info:

Compiling:
- rewriting descendant-or-self step(s)
- rewriting descendant-or-self step(s)
- converting descendant::*:PlatformGeneralType[(*:isOfPlatformCategory/@*:href = root()/descendant::*:PlatformCategory[(*:environment = "AIR")]/@*:id)] to child steps
Query:
//PlatformGeneralType[isOfPlatformCategory/@href=//PlatformCategory[environment="AIR"]/@id]
Optimized Query:
db:open-pre("command_plib",0)/*:PLib/*:PlatformGeneralTypes/*:PlatformGeneralType[(*:isOfPlatformCategory/@*:href = root()/descendant::*:PlatformCategory[(*:environment = "AIR")]/@*:id)]
Result:
- Hit(s): 55 Items
- Updated: 0 Items
- Printed: 12776 Bytes
- Read Locking: local [command_plib]
- Write Locking: none
Timing:
- Parsing: 0.55 ms
- Compiling: 0.3 ms
- Evaluating: 5786.29 ms
- Printing: 1.0 ms
- Total Time: 5788.15 ms
Query plan:
<QueryPlan compiled="true">
  <IterPath>
    <DBNode name="command_plib" pre="0"/>
    <IterStep axis="child" test="*:PLib"/>
    <IterStep axis="child" test="*:PlatformGeneralTypes"/>
    <IterStep axis="child" test="*:PlatformGeneralType">
      <CmpG op="=">
        <CachedPath>
          <IterStep axis="child" test="*:isOfPlatformCategory"/>
          <IterStep axis="attribute" test="*:href"/>
        </CachedPath>
        <IterPath>
          <Root/>
          <IterStep axis="descendant" test="*:PlatformCategory">
            <CmpG op="=">
              <CachedPath>
                <IterStep axis="child" test="*:environment"/>
              </CachedPath>
              <Str value="AIR" type="xs:string"/>
            </CmpG>
          </IterStep>
          <IterStep axis="attribute" test="*:id"/>
        </IterPath>
      </CmpG>
    </IterStep>
  </IterPath>
</QueryPlan>

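Update 2:
I suspect that the structure of PlatformSpecificTypes prevents the indexes from being used. If I change it as follows, moving the reference from the href attribute into the element text, would that improve query performance?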

<PlatformSpecificTypes>
    <PlatformSpecificType id="/plib/platformspecifictypes/DataShip.3">
        <name>Meko 360H2</name>
        <lethalityLevel>LOW</lethalityLevel>
        <isOfPlatformGeneralType>/plib/platformgeneraltypes/pgt62</isOfPlatformGeneralType>
        <ownedByCountry href="/plib/countries/10"/>
    </PlatformSpecificType>
</PlatformSpecificTypes>

Update 3:
I uploaded the XML file in a gist so that you can inspect it.

Now, when I run the following query, it takes about 28 seconds to return a result.

/root/PlSpTys/PlSpTy[isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[isOfPlCt/@href=/root/PlCts/PlCt[environment='AIR']/@id]/@id]

Here is the query info:

Query:
/root/PlSpTys/PlSpTy[isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[isOfPlCt/@href=/root/PlCts/PlCt[environment='AIR']/@id]/@id]
Result:
- Hit(s): 3642 Items
- Updated: 0 Items
- Printed: 257 KB
- Read Locking: local [Output6]
- Write Locking: none
Timing:
- Parsing: 0.66 ms
- Compiling: 0.34 ms
- Evaluating: 28398.32 ms
- Printing: 4.63 ms
- Total Time: 28403.97 ms
Query plan:
<QueryPlan compiled="true">
  <IterPath>
    <DBNode name="Output6" pre="0"/>
    <IterStep axis="child" test="*:root"/>
    <IterStep axis="child" test="*:PlSpTys"/>
    <IterStep axis="child" test="*:PlSpTy">
      <CmpG op="=">
        <CachedPath>
          <IterStep axis="child" test="*:isOfPlGeTy"/>
          <IterStep axis="attribute" test="*:href"/>
        </CachedPath>
        <IterPath>
          <Root/>
          <IterStep axis="child" test="*:root"/>
          <IterStep axis="child" test="*:PlGeTys"/>
          <IterStep axis="child" test="*:PlGeTy">
            <CmpG op="=">
              <CachedPath>
                <IterStep axis="child" test="*:isOfPlCt"/>
                <IterStep axis="attribute" test="*:href"/>
              </CachedPath>
              <IterPath>
                <Root/>
                <IterStep axis="child" test="*:root"/>
                <IterStep axis="child" test="*:PlCts"/>
                <IterStep axis="child" test="*:PlCt">
                  <CmpG op="=">
                    <CachedPath>
                      <IterStep axis="child" test="*:environment"/>
                    </CachedPath>
                    <Str value="AIR" type="xs:string"/>
                  </CmpG>
                </IterStep>
                <IterStep axis="attribute" test="*:id"/>
              </IterPath>
            </CmpG>
          </IterStep>
          <IterStep axis="attribute" test="*:id"/>
        </IterPath>
      </CmpG>
    </IterStep>
  </IterPath>
</QueryPlan>

Can you help me reduce the execution time of this query?

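Best answer:

BaseX does not seem to realize that it should precompute the "inner" part, whose result is static, so the evaluation cost is roughly O(n^2) instead of O(n).

Reformatting your query (which takes about 30 seconds on my machine) makes it easier to read, and shows that the entire right-hand side of the comparison inside the first predicate is static: it does not depend on the PlSpTy element currently being examined: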

/root/PlSpTys/PlSpTy[
  isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[
    isOfPlCt/@href=/root/PlCts/PlCt[
      environment='AIR'
    ]/@id
  ]/@id
]

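Evaluating that right-hand side on its own takes about 9 ms on my machine, which is not much, but it gets expensive if it is run repeatedly. Counting the PlSpTy elements (count(/root/PlSpTys/PlSpTy)) gives 8939, so re-evaluating the inner part once per element would cost about 8939 × 9 ms ≈ 80 s. The observed 30 seconds means that something has already been optimized, but not everything.

What happens if we simply pull this part out of the query and compute it up front?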

let $compare :=
  /root/PlGeTys/PlGeTy[
    isOfPlCt/@href=/root/PlCts/PlCt[
      environment='AIR'
    ]/@id
  ]/@id

return
  /root/PlSpTys/PlSpTy[
    isOfPlGeTy/@href=$compare
  ]

Computation time drops to 16 ms, a quarter of which is spent actually printing the results. I opened a bug report requesting better optimization. (Update: some optimizations have been applied.)
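To check that hoisting the right-hand side does not change the result, here is a minimal, self-contained sketch. The inline document is a made-up miniature of the real data (the ids pc1, pgt1, pst1 and so on are hypothetical), and it uses paths relative to a $doc variable instead of /root so that it runs without a database:

```xquery
(: Miniature stand-in for the real document; all ids and values are made up. :)
let $doc :=
  <root>
    <PlCts>
      <PlCt id="pc1"><environment>AIR</environment></PlCt>
      <PlCt id="pc2"><environment>SEA</environment></PlCt>
    </PlCts>
    <PlGeTys>
      <PlGeTy id="pgt1"><isOfPlCt href="pc1"/></PlGeTy>
      <PlGeTy id="pgt2"><isOfPlCt href="pc2"/></PlGeTy>
    </PlGeTys>
    <PlSpTys>
      <PlSpTy id="pst1"><isOfPlGeTy href="pgt1"/></PlSpTy>
      <PlSpTy id="pst2"><isOfPlGeTy href="pgt2"/></PlSpTy>
      <PlSpTy id="pst3"><isOfPlGeTy href="pgt1"/></PlSpTy>
    </PlSpTys>
  </root>

(: Original shape: the right-hand side is written inside the predicate. :)
let $nested :=
  $doc/PlSpTys/PlSpTy[
    isOfPlGeTy/@href = $doc/PlGeTys/PlGeTy[
      isOfPlCt/@href = $doc/PlCts/PlCt[environment = 'AIR']/@id
    ]/@id
  ]

(: Hoisted shape: the right-hand side is computed once, up front. :)
let $compare :=
  $doc/PlGeTys/PlGeTy[
    isOfPlCt/@href = $doc/PlCts/PlCt[environment = 'AIR']/@id
  ]/@id
let $hoisted := $doc/PlSpTys/PlSpTy[isOfPlGeTy/@href = $compare]

(: Ids selected by the hoisted form, then whether both forms agree. :)
return (
  $hoisted/@id/string(),
  deep-equal($nested, $hoisted)
)
```

Run in BaseX, this should return pst1 and pst3 followed by true: both forms select the same elements, and only the number of times the inner path gets evaluated can differ between them.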
