Features[]

Represents the collection of features extracted from, or associated with, a specific web page. Each feature defines a key element or property of interest within the page.

Pages:
  Features:
    - Name: Any text
      TagName: Any text
      NameRegex: Any text
      ValueRegex: Any text
      NameXpath: Any text
      ValueXpath: Any text
      IncludeHtml: true
      IncludeText: true
      NormalizeWhitespaces: true
      RemoveHtmlTags: true
      RemoveHtmlAttributes: true
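
As a rough illustration, the template above might be filled in as follows for a hypothetical product page. The feature name, the tag name, and both XPath expressions are invented for this sketch and do not come from the schema. The nesting mirrors the template as shown. Reading TagName as a free-form category label rather than an HTML tag is an assumption.

```yaml
Pages:
  Features:
    # Hypothetical feature: the price shown on a product page.
    - Name: Price
      # Assumed to be a free-form category label, not an HTML tag.
      TagName: pricing
      # Illustrative XPath expressions; adjust them to the real page markup.
      NameXpath: "//span[@class='price-label']"
      ValueXpath: "//span[@class='price-value']"
      # All five Boolean properties are marked as required, so each is set.
      IncludeHtml: false
      IncludeText: true
      NormalizeWhitespaces: true
      RemoveHtmlTags: true
      RemoveHtmlAttributes: true
```

In this sketch only the plain text is kept (IncludeText: true, IncludeHtml: false). The optional NameRegex and ValueRegex properties are omitted.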

Properties

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| Name | Text | ✔️ | The name of the website feature. |
| TagName | Text | ✔️ | The tag name used to identify or categorize the website feature. |
| NameRegex | Text | | The regular expression used to identify the name component of a website feature. |
| ValueRegex | Text | | The regular expression used to extract value matches from the HTML content of a website. |
| NameXpath | Text | | The XPath expression used to locate the name of a feature within the website content. |
| ValueXpath | Text | | The XPath expression used to locate and extract the value of a feature within the website's HTML content. |
| IncludeHtml | Boolean | ✔️ | Determines whether the raw HTML of the feature is extracted and added to the feature's output. |
| IncludeText | Boolean | ✔️ | Determines whether the extracted plain text is added to the feature's output. |
| NormalizeWhitespaces | Boolean | ✔️ | When enabled, consecutive whitespace characters are collapsed into a single space, giving cleaner and more uniform output. |
| RemoveHtmlTags | Boolean | ✔️ | Determines whether tags are stripped from the raw HTML content, leaving plain text. |
| RemoveHtmlAttributes | Boolean | ✔️ | Determines whether attributes are stripped from HTML elements, giving cleaner extracted markup. |
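
The optional regex properties could be combined with the required flags roughly as in the fragment below, which is only a sketch. The feature name, tag name, and both patterns are hypothetical. The regex flavor, and whether a capture group or the whole match is returned, are not stated in this reference, so the patterns are kept deliberately simple. Single-quoted YAML scalars are used so that backslashes reach the pattern unchanged.

```yaml
Pages:
  Features:
    # Hypothetical feature: a stock-keeping unit printed as "SKU: ABC-123".
    - Name: Sku
      TagName: identifiers
      # Assumed pattern for the label part of the feature.
      NameRegex: 'SKU'
      # Assumed pattern for the value; single quotes keep \s literal in YAML.
      ValueRegex: 'SKU:\s*[A-Z0-9-]+'
      # Keep the raw HTML as well as the text, without cleanup of the markup.
      IncludeHtml: true
      IncludeText: true
      NormalizeWhitespaces: true
      RemoveHtmlTags: false
      RemoveHtmlAttributes: false
```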

Technical Information

| Property | Value |
| --- | --- |
| Path | Pages[].Features[] |
| Internal Type | WebScrappingModels.WebScrapingFeature |
| Internal Root Type | WebScrapingConfiguration |
| File Extension | .halguru-webscraping.yaml |
| JSON Schema | halguru-webscraping-schema.json |

Last updated: 2025-10-13
Autogenerated: Yes
AI powered: Yes
Core version: 1.66.0