Reference Index

Model Description
.halguru-webscraping.yaml: Represents the configuration settings for a website crawler or scraper.
MaxLevel The maximum level of the website to process, presumably the link depth counted from StartUrl.
MaxPages The maximum number of pages to process for the website.
Pages[].ContainsText Only web pages that contain this text will be processed.
Pages[].ContainsXpath If defined, ContainsText is checked only within the content matched by this XPath.
Pages[].ContentXPath If not empty, only the HTML tags matched by this XPath are processed.
Pages[].Description Provides details or information about the web page or a specific element within the web page.
Pages[].DisabledTags[] List of HTML tags that should be disabled or ignored during processing.
Pages[].Features[].IncludeHtml Determines if the raw HTML representation of a specific web feature is extracted and added to the feature's output during processing.
Pages[].Features[].IncludeText Controls whether the extracted plain text, processed via relevant scraping logic, is added to the resulting feature output.
Pages[].Features[].Name The name of the website feature.
Pages[].Features[].NameRegex The regular expression pattern to identify the name component of a website feature.
Pages[].Features[].NameXpath The XPath expression used to locate the name of a specific feature within the website content.
Pages[].Features[].NormalizeWhitespaces When enabled, consecutive whitespace characters are collapsed into a single space, facilitating cleaner and more standardized output after web scraping.
Pages[].Features[].RemoveHtmlAttributes Determines whether attributes are stripped from HTML elements in the feature's output, giving cleaner extracted data.
Pages[].Features[].RemoveHtmlTags Determines if the raw HTML content will have tags stripped for plain text processing.
Pages[].Features[].TagName The tag name used to identify or categorize the website feature.
Pages[].Features[].ValueRegex The regular expression used to extract specific value matches from the HTML content of a website.
Pages[].Features[].ValueXpath The XPath expression used to locate and extract the value of a specific feature within a website's HTML content.
Pages[].Features[] Represents a collection of features extracted or associated with a specific web page, defining key elements or properties of interest within the page.
Pages[].Files[].Name The name of the website file.
Pages[].Files[].NameRegex The regular expression pattern to identify the name component of a website file.
Pages[].Files[].NameXpath The XPath expression used to locate the name of a specific file within the website content.
Pages[].Files[].TagName The tag name used to identify or categorize the website file.
Pages[].Files[].UrlRegex The regular expression pattern for matching URLs associated with the website file.
Pages[].Files[].UrlXpath The XPath expression used to extract the URL from a website file's content.
Pages[].Files[] Represents a collection of files associated with the webpage for processing or extraction.
Pages[].IncludeHtml Indicates whether the HTML content of the web page should be included in the output during the web scraping process.
Pages[].IncludeText Determines whether the textual content of a web page should be included during the web scraping process.
Pages[].Name The name of the web page or a specific element within the web page.
Pages[].NormalizeWhitespaces Indicates whether whitespaces should be normalized in the text content extracted from an HTML node.
Pages[].RemoveHtmlAttributes Indicates whether HTML attributes should be removed during the web scraping process.
Pages[].RemoveHtmlTags Indicates whether HTML tags should be removed from the content of a web page during processing.
Pages[].TagName The tag name of an HTML element or feature within a website page.
Pages[].UrlContains Filter to process only web pages whose URL contains the specified substring.
Pages[] The collection of website pages configuration.
StartUrl The starting URL for the website.
UrlsStartWith[] The collection of URL prefixes used to filter and process relevant URLs.
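
The descriptions of UrlsStartWith[], Pages[].UrlContains and NormalizeWhitespaces can be read as the small sketch below. It is an illustration under stated assumptions, not the tool's actual implementation: the helper names (url_in_scope, matching_pages, normalize_whitespaces), the example URLs and all setting values are hypothetical, and the matching semantics are inferred only from the descriptions above.

```python
import re

# Hypothetical settings mirroring the documented keys; every value is made up.
config = {
    "StartUrl": "https://example.com/docs/",
    "UrlsStartWith": ["https://example.com/docs/"],
    "MaxPages": 100,
    "Pages": [
        {
            "Name": "article",
            "UrlContains": "/articles/",
            "NormalizeWhitespaces": True,
        },
    ],
}


def url_in_scope(url, prefixes):
    """Assumed UrlsStartWith semantics: keep a URL if it starts with any prefix."""
    return any(url.startswith(prefix) for prefix in prefixes)


def matching_pages(url, pages):
    """Assumed UrlContains semantics: a page entry applies if its substring occurs in the URL.

    An entry without UrlContains is treated as matching every URL (an assumption).
    """
    return [page for page in pages if page.get("UrlContains", "") in url]


def normalize_whitespaces(text):
    """Collapse each run of consecutive whitespace characters into a single space."""
    return re.sub(r"\s+", " ", text)


if __name__ == "__main__":
    url = "https://example.com/docs/articles/1"
    print(url_in_scope(url, config["UrlsStartWith"]))
    print([page["Name"] for page in matching_pages(url, config["Pages"])])
    print(normalize_whitespaces("Title \n\t with   gaps"))
```

Under these assumptions, a URL is first checked against the UrlsStartWith[] prefixes, then matched to a Pages[] entry through UrlContains, and the extracted text is normalized only when that entry enables NormalizeWhitespaces.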

Last updated: 2025-10-13
Autogenerated: Yes
AI powered: Yes
Core version: 1.66.0