Pages[]
The collection of website page configurations.
Pages:
  - Name: Any text
    TagName: Any text
    Description: Any text
    IncludeHtml: true
    IncludeText: true
    NormalizeWhitespaces: true
    RemoveHtmlTags: true
    RemoveHtmlAttributes: true
    UrlContains: Any text
    ContainsText: Any text
    ContainsXpath: Any text
    ContentXPath: Any text
    DisabledTags: []
    Features: []
    Files: []
Properties
| Name | Type | Required | Description |
|---|---|---|---|
| Name | Text | ✔️ | The name of the web page or a specific element within the web page. |
| TagName | Text | ✔️ | The tag name of an HTML element or feature within a website page. |
| Description | Text | | Provides details or information about the web page or a specific element within the web page. |
| IncludeHtml | Boolean | ✔️ | Indicates whether the HTML content of the web page should be included in the output during the web scraping process. |
| IncludeText | Boolean | ✔️ | Indicates whether the textual content of the web page should be included in the output during the web scraping process. |
| NormalizeWhitespaces | Boolean | ✔️ | Indicates whether whitespace should be normalized in the text content extracted from an HTML node. |
| RemoveHtmlTags | Boolean | ✔️ | Indicates whether HTML tags should be removed from the content of a web page during processing. |
| RemoveHtmlAttributes | Boolean | ✔️ | Indicates whether HTML attributes should be removed during the web scraping process. |
| UrlContains | Text | | Only web pages whose URL contains the specified substring are processed. |
| ContainsText | Text | | Only web pages that contain the specified text are processed. |
| ContainsXpath | Text | | If defined, ContainsText is checked only within the content matched by this XPath. |
| ContentXPath | Text | | If not empty, only the HTML tags matched by this XPath are processed. |
| DisabledTags | List | ✔️ | List of HTML tags that should be disabled or ignored during processing. |
| Features | List | ✔️ | The collection of features to extract from, or associated with, the web page. Each feature defines an element or property of interest within the page. |
| Files | List | ✔️ | The collection of files associated with the web page for processing or extraction. |
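The optional filter properties above can be combined in a single page entry. The following is a minimal sketch only: the property names come from the table, but every value (the page name, URL fragment, text, and XPath expressions) is hypothetical, and writing DisabledTags entries as plain tag names is an assumption. How the filters interact is inferred from the descriptions above rather than confirmed.

```yaml
Pages:
  - Name: Product page                 # hypothetical name
    TagName: product                   # hypothetical tag name
    Description: Product detail pages of an example shop
    IncludeHtml: false
    IncludeText: true
    NormalizeWhitespaces: true
    RemoveHtmlTags: true
    RemoveHtmlAttributes: true
    UrlContains: /products/            # process only URLs containing this substring
    ContainsText: Add to cart          # process only pages containing this text
    ContainsXpath: //main              # check ContainsText only inside this XPath
    ContentXPath: //main//article      # take content only from this XPath
    DisabledTags:                      # assumed format: plain tag names
      - script
      - style
    Features: []
    Files: []
```

Leaving UrlContains, ContainsText, ContainsXpath, and ContentXPath out entirely should be valid, since the table marks them as not required.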
| Property | Value |
|---|---|
| Path | Pages[] |
| Internal Type | WebScrappingModels.WebScrapingPage |
| Internal Root Type | WebScrapingConfiguration |
| File Extension | .halguru-webscraping.yaml |
| JSON Schema | halguru-webscraping-schema.json |
Last updated: 2025-10-13
Autogenerated: Yes
AI powered: Yes
Core version: 1.66.0